The Value Added of Machine Learning to Causal Inference: Evidence from Revisited Studies

A new and rapidly growing econometric literature is making advances in the problem of using machine learning methods for causal inference questions. Yet, the empirical economics literature has not started to fully exploit the strengths of these modern methods. We revisit influential empirical studies with causal machine learning methods and identify several advantages of using these techniques. We show that these advantages and their implications are empirically relevant and that the use of these methods can improve the credibility of causal analysis.


Introduction
One of the key goals of empirical research in economics is to estimate the causal effect of a variable of interest on a targeted outcome. To avoid biases in the coefficients of interest due to omitted variables, particularly in observational studies, it is often desirable to include a large number of controls. However, even when the number of raw covariates is relatively small, the inclusion of technical controls (e.g. dummy variables for geographical location, time periods, etc.), interactions and transformations can lead to settings in which the number of covariates is large relative to the sample size.
Machine learning (ML) methods can potentially be useful in such settings. However, standard ML prediction models are aimed at fundamentally different problems than most empirical work in economics. ML methods are designed and optimized for predicting the outcome in a test sample.[1] Thus, a model is selected by optimizing the goodness of fit on a held-out test set. In contrast, in empirical economic research, the goodness of fit of a model is oftentimes reduced when estimating a causal effect, and predictive accuracy is sacrificed in order to learn more deeply about a fundamental relationship that can guide policy decisions and counterfactual predictions (Athey and Imbens, 2019). These fundamental differences will eventually generate biased estimates if standard ML techniques, designed for prediction, are used in the context of causal inference.[2] Nevertheless, a new and rapidly growing econometric literature is making advances on the problem of using ML methods for causal inference questions (see, e.g., Chernozhukov et al., 2018a; Athey et al., 2018; Wager and Athey, 2018; Chernozhukov et al., 2018b). This literature brings in new insights and theoretical results that are novel for both the ML and the econometrics/statistics literatures. Despite these advances, the empirical economics literature has not yet started to fully exploit the strengths of these new modern causal inference methods.
[1] Note that by 'prediction' here, we do not mean 'forecasting'. Rather, we refer to a setting where we observe both the outcome and the features/covariates in a training sample, and the aim is to predict the outcome for each observation in an independent test sample, based on the actual values of the covariates in that test sample.
[2] The main underlying reason is that high-dimensional regression adjustments such as the lasso, ridge, elastic net, etc. shrink the estimated effects by construction, and ignoring this shrinkage will lead to biased treatment effect estimates.
The aim of this paper is to contribute to the general understanding of when and how ML methods add value to economic causal analysis. To this end, we revisit a number of influential papers with causal ML methods and compare the results with the traditional methods used in the original study. We highlight the relevance and additional gains that causal machine learning methods bring to the table relative to the standard econometric approaches. In our analysis, we focus on both the average treatment effect (ATE) and heterogeneous treatment effects (HTE). We provide evidence that causal ML methods can improve the credibility of causal analysis, and can help identify relevant heterogeneity in treatment effects that may be missed with traditional methods.
When interested in the ATE, we employ the double/debiased machine learning (DML) method of Chernozhukov et al. (2017); when the focus is on heterogeneous treatment effects (HTE), we work with the random forest method of Wager and Athey (2018) and with the generic machine learning method for heterogeneous treatment effects developed by Chernozhukov et al. (2018b). These are newly developed causal machine learning methods for the estimation of the ATE and HTE with well-established theoretical properties. We re-examine a set of relatively recent but influential studies that span a variety of topics in applied economics, published in the following journals: The Quarterly Journal of Economics, American Economic Journal: Macroeconomics, and American Economic Journal: Applied Economics. We choose papers for which the full replication data set is available either on the journal's website or on the authors' website. For the ATE, we revisit three observational studies: the study of Djankov et al. (2010) on the effect of corporate taxes on investment and entrepreneurship, the analysis of Alesina et al. (2013) on the long-term effect of plough agriculture on gender norms, and the paper by Nunn and Trefler (2010) on the effect of skill-biased tariffs on long-term economic growth. For the HTE, we select one observational study and one randomized control trial: we extend the observational study by DellaVigna and Kaplan (2007), which investigates the effect of Fox News on the Republican vote share, and the analysis by Loyalka et al. (2019) on the effect of a teacher training randomized intervention on student performance. All these papers include careful econometric analyses of the main research question and mechanisms, which we do not aim to re-examine in full. We instead focus on analyzing the main questions.
Our findings show important differences in the ATE and HTE estimates compared to the traditional methods, both in terms of size of the treatment effect estimates, and in terms of statistical significance. From our results, we derive four main observations about the reasons why causal machine learning methods are relevant for causal analysis and add value relative to the traditional methods. These observations are supported by the theoretical econometrics literature on causal ML (see, for example, Athey et al., 2018;Chernozhukov et al., 2018b;Wager and Athey, 2018).
Firstly, causal ML methods are powerful tools for using data to recover complex interactions among variables and to flexibly estimate the relationship between the outcome, the treatment indicator and the covariates. This feature is key when drawing inference based on the assumption that the treatment is unconfounded, as in most of the revisited studies, since this assumption is not testable. As some covariates can be correlated with both the treatment variable and the outcome, failing to condition on all relevant confounders may lead to biased estimates of the treatment effect. For example, the original analysis in Djankov et al. (2010) shows a negative and significant effect of corporate taxes on investment and entrepreneurship, but the authors show that these results do not survive when conditioning on all the potential controls at once. However, when implementing DML, we obtain larger estimates compared to Djankov et al. (2010), which are often statistically significant. Similarly, our DML results for the effect of plough cultivation on gender roles suggest a larger effect of the plough compared to the findings in Alesina et al. (2013), when we use the instrumental variable strategy employed in the original analysis. Furthermore, our analysis of the effect of skill-biased tariffs on growth suggests a smaller effect compared to Nunn and Trefler (2010), which is often not statistically significant. We thus argue that the DML estimates are more robust to potential nonlinear confounders.[3]
Secondly, causal ML methods allow for the inclusion of a large number of covariates, even when the sample size is relatively small, by assuming that the model is sparse (i.e., only a small number of covariates are relevant) and using regularized regressions. For instance, in the study by Djankov et al. (2010) and in some of the specifications in Alesina et al. (2013) and Nunn and Trefler (2010), the number of "raw" covariates is large compared to the sample size, so taking into account all possible nonlinear terms, such as interactions and transformations, would not be possible with traditional methods. Indeed, only a limited set of prespecified nonlinear terms is included in Alesina et al. (2013), no nonlinear terms other than logarithms are considered in Nunn and Trefler (2010), and no nonlinear terms are included in Djankov et al. (2010). In contrast, by using the DML method we ensure that our results take into account all potentially relevant confounders at once, both linearly and nonlinearly.
Thirdly, the use of causal ML methods allows one to implement systematic model selection. ML methods search for the best functional forms by estimating and comparing a wide range of alternative model specifications; model selection is thus data-driven and fully documented. For example, our results for the effect of corporate taxes, originally explored by Djankov et al. (2010), show that the data-driven model selection implemented by DML, which keeps a smaller set of influential confounding factors from among a large set of potential controls, leads to larger coefficients in absolute value and lower standard errors compared to OLS regressions in which all the covariates are included. With the traditional approach to model selection, uncertainty about the correct specification of the model can lead to choices that are relatively ad hoc; different specifications may lead to different point estimates, which in turn may lead to different policy decisions. Moreover, we further illustrate how these methods are also very useful tools for supplementary analyses and robustness checks. Typically, supplementary analysis is performed by presenting a number of selected regression specifications, while the approach of causal ML methods is more systematic, and ensures that important transformations of covariates that are not considered relevant a priori are not missed. For instance, when revisiting the OLS robustness analysis of Alesina et al. (2013), the DML results show small and insignificant treatment effects, in contrast to the original robustness regressions. Furthermore, our analysis of Nunn and Trefler (2010) shows that the main results are not robust to flexibly controlling for a data-driven function of the covariates.
[3] Flexible estimation has already been considered in the semiparametric econometrics literature (see the review paper of Wooldridge, 2009, and Rubin, 2015). However, in practice, these semiparametric kernel methods quickly break down when they have to deal with more than a few covariates.
Finally, causal machine learning methods prove to be very useful when one is interested in estimating heterogeneous treatment effects. As causal ML methods can handle, in a systematic way, many covariates potentially responsible for effect heterogeneity, relevant heterogeneous effects are less likely to be missed than when different interaction terms are modelled manually. This feature is exemplified by our analysis of the heterogeneous effects of Fox News on the Republican vote share, first explored by DellaVigna and Kaplan (2007), and of the teacher training intervention studied by Loyalka et al. (2019): our results reveal drivers of heterogeneity that were unexplored in the original analyses. In addition, note that causal ML methods tailored for estimating heterogeneous treatment effects provide valid confidence intervals in high-dimensional settings, as opposed to traditional methods, where standard p-values for single hypothesis testing are not reliable. This is due to the multiple hypothesis testing problem, which can occur when researchers search iteratively for treatment effect heterogeneity over a large number of covariates.[4],[5]
The econometric theory literature on adapting standard machine learning techniques to causal inference questions is by now fast growing. See, for example, Chernozhukov et al. (2017), Chernozhukov et al. (2018a), Athey et al. (2018), Farrell et al. (2018) and Colangelo and Lee (2020) for the ATE; and Athey and Imbens (2016), Wager and Athey (2018), Athey and Wager (2019), Chernozhukov et al. (2018b) and Semenova et al. (2018) for the HTE. In the statistics literature, the estimation of HTE with machine learning methods has been the focus of Hill (2011).
In what follows, we present our methodology and main findings on the ATE using double machine learning in Section 2. The methodology and analysis of the HTE via the causal random forest are summarized in Section 3. Section 4 focuses on the methodology and analysis of the HTE using the generic machine learning method. Finally, Section 5 concludes.
[4] While solutions have been proposed to correct for the issue of multiple hypothesis testing (for example, List et al., 2016), when the number of covariates is large, the power of these approaches to detect heterogeneity is low (Athey and Imbens, 2017).
[5] A related issue is the ex-post selection of significant heterogeneous effects. To avoid this problem, in randomized control trials researchers are often required to specify before the experiment which heterogeneous effects they intend to look into, in order to avoid searching for, and only reporting, significant effects. However, this limits the ability of the researcher to find unexpected relevant heterogeneity. Causal ML methods ensure that relevant heterogeneity is not missed, while also providing valid confidence intervals. In addition, in observational studies, where pre-analysis plans are not common practice, causal ML methods can be particularly useful.

Average Treatment Effects: Double Machine Learning
This section contains the analysis of the ATE for the effect of corporate taxes on investment and entrepreneurship (Djankov et al., 2010), the effect of plough agriculture on gender roles (Alesina et al., 2013), and the effect of skill-biased tariffs on growth (Nunn and Trefler, 2010), using the double machine learning method (Chernozhukov et al., 2017).

Methodology: Double Machine Learning
The DML method is suitable in settings with a large number of covariates relative to the sample size (either because the number of raw covariates is large to begin with, or because there is a large number of technical controls), where typical nonparametric kernel or spline methods break down.
The main model specification of the method, in the notation of Chernozhukov et al. (2018a), is the partially linear regression:

Y = D θ0 + g0(X) + U,        E[U | X, D] = 0,        (1)
D = m0(X) + V,               E[V | X] = 0,           (2)

where Y is the outcome, D is the treatment variable of interest, X is a (high-dimensional) vector of controls, and U and V are disturbances.
Equation (1) is the main equation of interest, and the parameter θ0 is the treatment effect we would like to estimate; in this model, θ0 quantifies the average treatment effect. The second equation is not of direct interest, but it keeps track of the dependence of the treatment on the confounders. The covariates are related to the treatment through the function m0(X) and to the outcome variable through the function g0(X). While m0(X) and g0(X) can be nonlinear, the treatment variable, D, enters the model linearly (and additively). In observational studies, the function m0 is typically nonzero, which means that the treatment assignment is not random, but depends on the covariates. The partially linear regression model also extends to a partially linear IV model that allows for an endogenous treatment; we refer to this model as DML-IV.
A first idea one might have for estimating θ0 with ML methods would be to use a prediction-based ML approach and predict Y using D and X to obtain Dθ̂0 + ĝ0(X). This can be done, for example, by an iterative method that alternates between estimating g0 with some ML method and θ0 with OLS. While this 'naive' ML approach will have very good prediction performance, the iterative ML estimator will be heavily biased, with a slower than 1/√n convergence rate. The primary reason for this poor performance is the bias introduced by regularization. In order to optimize prediction and avoid overfitting the data with complex functional forms, ML methods use regularization and shrink the less important coefficients towards zero. This reduces overfitting by decreasing the variance of the estimator, but at the same time it introduces bias. The bias in estimating g0 transfers to the parameter of interest, θ0. The issue is similar to omitted variable bias.
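To make the shrinkage problem concrete, the following simulation is our own illustration (the data-generating process and the lasso penalty are assumptions, not anything from the paper): a single regularized regression of Y on (D, X) lets the penalty on the confounder coefficients leak into the coefficient on D.

```python
# Hypothetical simulation (not from the paper) of regularization bias:
# a single lasso of Y on (D, X) shrinks the confounder coefficients,
# and part of their effect is absorbed by the correlated treatment D.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, theta0 = 500, 50, 1.0
X = rng.normal(size=(n, p))
D = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)              # treatment driven by confounders
Y = theta0 * D + 2.0 * X[:, 0] + X[:, 1] + rng.normal(size=n)

# 'Naive' plug-in approach: one regularized regression for Y on D and X jointly.
naive = Lasso(alpha=0.2).fit(np.column_stack([D, X]), Y)
theta_naive = naive.coef_[0]
print(f"true theta_0: {theta0:.2f}, naive lasso estimate: {theta_naive:.2f}")
```

The same simulated data can then be reused to check that the orthogonalized procedure described next removes most of this bias.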
To overcome the regularization bias, Chernozhukov et al. (2017) propose 'double machine learning', i.e., solving two prediction problems instead of one. First, an ML model is fitted for m0 in the treatment equation, and the effect of X is partialled out from D to obtain the residuals V̂ = D − m̂0(X).
Second, an ML method is fitted for g0 in the outcome equation, and the residuals Ŵ = Y − ĝ0(X) are obtained.[6] Finally, the residuals Ŵ are regressed on the residuals V̂ to obtain the 'debiased' machine learning estimator, θ̂0. It can be shown that, by orthogonalizing D with respect to X and eliminating the effect of confounders by subtracting an estimate of g0, θ̂0 removes the effect of regularization bias.[7] However, θ̂0 is still subject to bias due to overfitting. For instance, when ĝ0 is overfit, it mistakes noise for signal and thus picks up some of the noise U from the outcome equation. If U and V are correlated, the estimation error in ĝ0 will be correlated with V. To break this correlation and avoid bias due to overfitting, one can rely on sample splitting. To this end, the data are partitioned into two subsamples: a main sample and an auxiliary sample. The ML models for the two nuisance functions m0 and g0 are fit on the auxiliary sample, while the residual-on-residual regression to obtain θ̂0 is fit on the main sample.
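The steps just described can be sketched as follows. This is our own toy simulation, not the paper's implementation: the data-generating process, the choice of random forests as nuisance learners, and all tuning values are assumptions.

```python
# Illustrative sketch of DML with one main/auxiliary sample split,
# using random forests for the nuisance functions m_0 and g_0.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n, theta0 = 2000, 1.0
X = rng.normal(size=(n, 5))
D = np.sin(X[:, 0]) + 0.25 * X[:, 1] ** 2 + rng.normal(size=n)  # m_0 is nonlinear
Y = theta0 * D + np.cos(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(size=n)

aux, main = np.arange(n // 2), np.arange(n // 2, n)

# Step 1: fit the treatment nuisance on the auxiliary sample,
# then partial X out of D on the main sample.
m_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[aux], D[aux])
V_hat = D[main] - m_hat.predict(X[main])

# Step 2: fit the outcome nuisance on the auxiliary sample and residualize Y.
g_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[aux], Y[aux])
W_hat = Y[main] - g_hat.predict(X[main])

# Step 3: residual-on-residual regression gives the debiased estimate of theta_0.
theta_hat = (V_hat @ W_hat) / (V_hat @ V_hat)
print(f"DML estimate of theta_0 (true value {theta0}): {theta_hat:.2f}")
```

Any learner with reasonable predictive accuracy for the nuisance functions could replace the random forests in Steps 1 and 2.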
A drawback of sample splitting is that the estimator of the parameter of interest θ0 is obtained using only the main sample, which can lead to a loss of efficiency. However, one can switch the roles of the main and auxiliary samples (a procedure called cross-fitting) and average the results, which leads to a more efficient estimator. In addition, one can perform a K-fold version of the cross-fitting procedure, where the size of each fold is n/K. Each sample partition, or fold, is successively taken as the main sample, while its complement serves as the auxiliary sample. One can then take the average of the estimates over the K folds. To make the results robust to the data partitioning, the splitting into folds is performed S times, and the final DML estimator is the mean (or median) over the splits. The median version is more robust to outliers, and it is the one we use in the applications.
[6] The nuisance functions m0 and g0 can be estimated with a variety of ML methods, such as the lasso, regression trees, random forests, boosting, neural networks, or hybrid methods.
[7] This is because the scaled estimation error, √n(θ̂0 − θ0), now contains a term based on the product of two estimation errors (the estimation errors in m̂0 and in ĝ0), which vanishes faster than the equivalent term obtained from the naive estimator, which depends on the estimation error of ĝ0 alone.
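A K-fold cross-fitted version, repeated over S random partitions with the median taken across splits, might look like the sketch below. Again, the simulated data and the choice of the lasso as nuisance learner are our own assumptions for illustration.

```python
# Illustrative sketch of K-fold cross-fitting, repeated over S random
# partitions, with the median of the per-partition DML estimates.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

def dml_plr(X, D, Y, K=5, seed=0):
    """One K-fold cross-fitted DML estimate of theta_0 in the partially linear model."""
    V_hat, W_hat = np.zeros_like(D), np.zeros_like(Y)
    for train, test in KFold(K, shuffle=True, random_state=seed).split(X):
        # Nuisances are fit on the complement of each fold (auxiliary sample)
        # and residuals are formed on the fold itself (main sample).
        V_hat[test] = D[test] - LassoCV(cv=3).fit(X[train], D[train]).predict(X[test])
        W_hat[test] = Y[test] - LassoCV(cv=3).fit(X[train], Y[train]).predict(X[test])
    return (V_hat @ W_hat) / (V_hat @ V_hat)

rng = np.random.default_rng(2)
n, p, theta0 = 1000, 20, 0.5
X = rng.normal(size=(n, p))
D = X[:, 0] + rng.normal(size=n)
Y = theta0 * D + 2.0 * X[:, 0] + rng.normal(size=n)

S = 5  # number of repeated partitions
estimates = [dml_plr(X, D, Y, K=5, seed=s) for s in range(S)]
theta_med = float(np.median(estimates))  # median over splits is robust to outliers
print(f"median DML estimate (true value {theta0}): {theta_med:.2f}")
```

Replacing `np.median` with `np.mean` gives the mean-over-splits variant mentioned above.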

The Effect of Corporate Taxes on Investment and Entrepreneurship
Description of Original Analysis. The first paper that we revisit using causal machine learning methods investigates the effect of corporate taxes on investment and entrepreneurship (Djankov et al., 2010). This is an observational study that shows a negative effect of corporate taxes on investment and entrepreneurship, by estimating OLS country-level regressions with different measures of corporate tax rates for the year 2004. The sample includes a set of 50-85 countries, depending on the specification. In the original paper, four outcome variables are examined: investment as a percentage of GDP, FDI as a percentage of GDP, business density per 100 people, and the average entry rate. Three measures of corporate taxes are considered: statutory corporate tax rates, the actual first-year corporate income tax liability of a new company, and the tax rate that takes into account actual depreciation schedules going five years forward. The original paper reports the results for several regression specifications with different sets of control variables, to account for potential confounders that correlate with corporate tax rates and are also determinants of the outcomes.[8] Djankov et al. (2010) present regression results where the first three sets of covariates are added separately. A final robustness check includes all control variables (12 in total) in the same regression. In the specifications that include only one set of controls at a time, the paper shows a negative and statistically significant effect of corporate taxes on entrepreneurship and investment. However, when adding all the controls, the relationship is still negative, but the coefficients are smaller in size and no longer statistically significant.
DML Analysis. We revisit the final robustness check of the paper, which includes all four sets of covariates at the same time, using the DML partially linear model. Table 1 presents the results. Columns (1) to (7) report the DML estimates. Compared to the original paper results with the full set of covariates, reported in column (8), the magnitude of the DML coefficients is higher in absolute value, and the standard errors are lower in most regressions. Additionally, the results are statistically significant, at least at the 10% level, in half (42 out of 84) of the regressions. The main difference between our approach and that of Djankov et al. (2010) is that the original paper results are based on the assumption of linearity and additivity of the conditional expectations, while the DML method allows for a more flexible specification. Thus, our findings are more robust to potential nonlinear confounders than the original paper estimates. A researcher might be interested in investigating which nonlinear terms make the estimates differ. However, this can be a challenging task when ML methods (such as neural networks, hybrid methods, etc.) are used to estimate the nuisance functions. What can potentially be done is to analyze the lasso coefficients that are not shrunk to zero and look for nonlinearities among these. As an example, we show in Figure B.1 the most relevant among the nonlinear terms selected by the lasso, for one of the DML regressions reported in Table 1. Here, we note that several nonlinear terms appear in both the treatment nuisance function m̂(·) and in the outcome nuisance function ĝ(·). This is suggestive of the fact that there are nonlinearities that are correlated with both the treatment variable and the outcome. These were missed by the analysis in the original paper, and their omission could lead to biased coefficients on the corporate taxes variables. In this case, controlling for all relevant confounders strengthens the main results of the original analysis: in many cases the DML treatment effect estimates are larger in absolute value and statistically significant.
This empirical example is also useful to illustrate a typical trade-off that the applied researcher might face. On the one hand, the researcher wants to control for as many potential confounders as possible, in order to improve the credibility of the unconfoundedness assumption. On the other hand, naively controlling for a large set of covariates, especially when the sample size is small, can lead to imprecise estimates and larger standard errors. Notice that in this example, the authors implement a "kitchen sink" regression and control for all the covariates at once, resulting in larger standard errors than the ones that we obtain. The DML method helps with this trade-off by improving the credibility of the unconfoundedness assumption (as it captures more flexibly the effect of confounders), while, at the same time, implementing a data-driven variable selection technique that keeps a smaller set of influential confounding factors from among a large set of potential controls, thus resulting in lower standard errors.
[8] The first set of controls includes measures of other taxes; the second set includes measures of the number of other tax payments made and of tax evasion; the third set includes measures of institutions; the fourth set includes measures of inflation. Section A.1 of the Appendix includes more details on the regressions estimated in Djankov et al. (2010) and describes the control variables.

The Effect of Plough Agriculture on Gender Roles
Description of Original Analysis. The study by Alesina et al. (2013) examines the relationship between historical plough agriculture and gender roles. The mechanism is the following: since the plough requires physical strength to be operated, in areas where plough agriculture was widespread, men had an advantage in agriculture over women. This would result in societies in which men worked in farming, whereas women's work would be performed mainly within the home. The division of labour by gender would translate into norms and cultural beliefs about the role of women in society, which would still persist nowadays, even after societies have moved out of agriculture as the main economic activity.
In the paper, the authors present results using country-level and individual-level regressions. We revisit the main question addressed in the original paper, focusing on the country-level results, as the majority of the regressions reported in the original paper are based on these data. For the country-level baseline regressions, estimated with OLS, three contemporary outcome variables are examined as measures of gender roles: female labour force participation, the share of firms with a woman among their principal owners, and the proportion of seats held by women in the national parliament. The treatment variable measures the share of individuals in each country whose ancestors practiced plough agriculture. The baseline regressions control for income, income squared, and for measures of the historical characteristics of the ethnicities living in a country. Continent fixed effects are added in some specifications. As mentioned by Alesina et al. (2013), concerns about potential endogeneity in the baseline regressions arise. It is possible that plough agriculture may have been more common in countries that had less equal gender-role attitudes. This would cause the OLS estimates to be biased away from zero. Moreover, plough agriculture may have been more likely in areas where economic development was historically higher. If historical and contemporary economic development are correlated, and more economically advanced countries tend to have higher female labour force participation and more equal gender roles, the OLS estimates may be biased towards zero.
To tackle these issues, the following two solutions are offered in the paper. First, motivated by the thought that the potential bias may be partly due to observable characteristics, a number of additional controls are included in the regressions. These include both historical and contemporary controls.[11] Second, the authors use an instrumental variable approach, which exploits the fact that plough adoption is correlated with the suitability of the land for cereal crops that would benefit, and crops that would not benefit, from the plough. To this end, two instruments for plough adoption are constructed, based on the analysis by Pryor (1985). The first is the suitability for "plough-positive" cereal crops (i.e., those which benefit most from the plough), and the second is the suitability for "plough-negative" cereal crops (i.e., those which benefit least from the plough).[12]
DML Analysis. In our analysis, we re-examine both the country-level OLS and IV regressions, applying the DML method. For the OLS analysis, we begin by estimating a DML partially linear model that only includes the baseline set of controls as raw covariates. We then revisit the robustness analysis of this specification, by including as raw covariates the largest set of controls used in the robustness checks (this corresponds to Table 7, column 8 of the original paper), to which we also add the continent fixed effects.[13] This amounts to a total of 36 raw covariates. For the IV analysis, as noted in Alesina et al. (2013), the main concern with the instrumental variable strategy is the possibility that areas suitable for different crops could be correlated with geographic characteristics that affect gender norms through channels other than plough adoption (i.e., the exclusion restriction might not hold). Therefore, for the IV analysis, in addition to the baseline controls and in line with the original paper, we consider the geoclimatic characteristics the authors use in their IV robustness checks (Table A14 of the Online Appendix of the original paper). To these variables, we again add the continent fixed effects. Further details on how the DML estimates are obtained and on the tuning choices are described in Section A.2 of the Appendix.
Table B.2 reports the results of the DML partially linear model that replicates the baseline regression. In accordance with the original paper, the treatment effect estimates are negative and statistically significant. They are also close to the original estimates (reproduced for convenience in column 8 of Table B.2) and, reassuringly, fairly stable across the ML methods. We find, however, very different results when carrying out the robustness analysis of this baseline specification with the DML method. Panel A of Table 2 reports the results. While the effect is still negative, albeit much smaller in absolute value, statistical significance is now lost. Interestingly, when Alesina et al. (2013) include all covariates at once (the estimate is reproduced in the last column of our Table 2), the treatment effect becomes smaller in absolute value, compared to when groups of covariates are added separately (see their Table 7, columns 1 to 7), or compared to the baseline specification (reproduced in column 8 of our Table B.2). With DML, the treatment effect of interest not only becomes smaller, but also statistically insignificant.
[11] The additional controls are listed in Section A.2 of the Appendix.
[12] See Alesina et al. (2013) for details on the data used and on how the instruments are constructed. The paper also shows that the two instruments do, indeed, predict plough adoption.
[13] When revisiting the robustness analysis with DML, we include continent fixed effects, even though the original paper did not include them in its most complete robustness checks. As causal ML methods can handle a large number of covariates, we include all the covariates that were considered in the original paper, to ensure that all potential confounders are taken into account.
Our findings up to this point would lead us to (mistakenly) conclude that the negative effect of plough adoption on attitudes towards gender roles may not be as large as suggested by the original analysis, and that the effect is not statistically significant. However, our estimates from the DML partially linear model may still be subject to endogeneity. While flexibly controlling for a large number of covariates can account for the confounding effect of observed characteristics, the remaining concern is that plough adoption may be correlated with unobserved characteristics that also affect the outcome. The instrumental variable approach suggested by Alesina et al. (2013) can alleviate this potential issue. We consider the same instruments as in the paper (described above) and turn to re-evaluating the results by estimating a DML-IV model. Panel B of Table 2 reports the results. As in the original analysis, the estimated coefficients have a negative sign, and they are now statistically significant at the 10% level for most of the ML methods, with the exception of neural networks and the ensemble. It is interesting to note that the magnitude of the coefficients is larger than in the DML partially linear model (both baseline and extended). This is consistent with the original paper, which also finds that the IV coefficients are larger than the OLS estimates. It is further worth noticing that, compared to the IV results of the original paper, our DML-IV findings suggest an even larger effect of plough adoption on female labour force participation. We attribute this to causal machine learning methods being able to control for a large number of covariates in a more flexible way.[14]
14 Overall, when looking at both the robustness analysis and the IV analysis and comparing them to the baseline results, we notice that our estimates move in the same direction as the original paper's estimates, but they move even more, supporting the idea that DML controls more flexibly for relevant covariates. This empirical example is a good illustration of the gains from combining modern ML tools with quasi-experimental methods such as instrumental variables. While causal ML methods can make the unconfoundedness assumption more plausible by flexibly controlling for observed confounders, they cannot account for unobserved confounders. In such settings, the researcher could combine causal ML methods with quasi-experimental methods such as IV, which potentially overcome biases caused by unobserved factors.

14 As explained above, our DML specification differs from the original paper's robustness analysis because it considers nonlinearities and it includes continent fixed effects. Therefore, the differences between the DML and the original estimates could, in principle, be driven by the continent fixed effects, and not by the nonlinearities. The original paper shows that adding the continent fixed effects to the baseline specification leads to very small changes in the OLS estimates (see Table 4 in the original paper), while it results in larger changes in the IV case (see Table 8 in the original paper). However, even in the IV case, including the continent fixed effects only increases the absolute size of the plough coefficient by 3-4 percentage points, while the DML coefficients exceed the OLS and 2SLS estimates by more than double that amount (with the exception of the neural network and ensemble estimates). Thus, we conclude that allowing for a more flexible nuisance function is likely to be driving at least part of the differences between the DML and the 2SLS (and OLS) estimates.
Integrating the two methods could provide powerful tools for the researcher's toolkit.
Furthermore, this empirical paper illustrates how causal machine learning methods can serve as useful tools for the empirical researcher to perform supplementary analyses. In order to support the credibility of the empirical evidence, researchers typically report a number of different model specifications and evaluate the sensitivity of estimates to these alternatives, similar to the above-mentioned robustness checks performed in the original paper. The usual approach to evaluating the variability of estimates across different model specifications can be somewhat ad hoc, rather than a systematic way of implementing sensitivity analysis. In addition, relevant covariates, or interactions of covariates that are not considered important a priori by the researcher, might be missed. Instead, causal machine learning methods use systematic algorithms that compare a wide range of model specifications for the nuisance functions and choose the one that best fits the data. This makes them more robust tools for sensitivity analyses than the current practice in the literature. Indeed, the example discussed here shows that the robustness analysis performed with DML suggests different conclusions compared to the original paper's robustness checks. Thus, we view causal machine learning methods as promising tools for sensitivity analysis in empirical work.

The Effect of Skill-Biased Tariffs on Growth
Description of Original Analysis. The study by Nunn and Trefler (2010) investigates the relationship between skill-biased tariffs, i.e., a tariff structure that disproportionately favours skill-intensive industries, and long-term economic growth. The authors develop a theoretical framework based on Grossman and Helpman (1991) that shows how tariffs that favour skill-intensive industries can lead to a disproportionate expansion of these industries, which in turn leads to higher long-term growth. Furthermore, using both cross-country and industry-level data, the paper provides evidence of a positive relationship between the two variables, and delves into the mechanisms of this relationship. The findings suggest that the mechanisms from the theoretical framework can explain only part of the total correlation between skill-biased tariffs and growth. The paper attributes the remaining part of the correlation to the endogeneity of skill-biased tariffs, and in particular to the relationship between institutions and the skill-bias of tariffs: countries with good institutions tend to protect more skill-intensive industries.
In Nunn and Trefler (2010), three measures of the skill-bias of tariffs in the initial time period are used: 15 the correlation between industry tariffs and industry skill-intensity, and two measures based on the difference between the log average tariff in skill-intensive industries and the log average tariff in unskilled-intensive industries, which use different cut-off values for industry skill-intensity. In the country-level estimates, the outcome is log annual per capita GDP growth, and the regressions include a set of control variables. 16 The country-level regressions include 63 observations.
For the industry-level estimates, the outcome variable is the average annual log change in industry output in each country, and the regressions include all the controls that appear in the country-level regressions, plus industry fixed effects. These regressions include 1004 data points for 59 countries. An additional variable (the initial industry tariff) is included in some specifications to capture a potential mechanism: skill-biased tariffs can shift resources towards skill-intensive industries that generate positive externalities, thus leading to higher long-term growth. Hence, industries that have higher initial tariffs should have higher long-run output. If this channel can explain the effect of skill-bias on growth, the coefficient of the skill-bias of tariffs would decrease in size when this variable is included in the regression.
DML Analysis. We revisit the country and industry-level regressions reported in Table 4 (columns 1, 2 and 4), Table 5 (columns 1, 2 and 4) and Table 6 (columns 1, 3 and 7) of Nunn and Trefler (2010). Further details on how the DML estimates are obtained and on the tuning parameter values are reported in Section A.3 of the Appendix. 15 The initial time period is 1972 for 21 countries, 1980-83 for 30 countries and 1985-87 for 12 countries. The end period is 2000 for most countries, except for 3 of them, for which data ends in 1996. See Nunn and Trefler (2010), Table 1 for a list of the countries included and the respective time periods. 16 Further details on the regressions estimated by Nunn and Trefler (2010) and on the control variables are described in Section A.3 of the Appendix.

Table 3 shows the results of the DML partially linear model using country-level data. The DML treatment effect estimates are considerably smaller than the original paper's, across all ML methods and across the three different treatment variables. Moreover, the estimated effects are not statistically significant, except for the coefficients estimated using the lasso, which are significant at the 10% level. Additionally, we report the DML results using the industry-level data set (Tables B.3 and B.4 show the results with and without the initial industry tariff, respectively). Similarly to the country-level estimates, the industry-level estimates are not statistically significant across all methods, except for the boosting estimates, which are significant at the 10% level.
Overall, the DML results suggest that the correlation between skill-biased tariffs and long-term economic growth is not robust to controlling for an unknown function of the average tariff level, country characteristics, initial production structure, and cohort and region fixed effects. Indeed, the fact that the DML estimates are insignificant points to the presence of nonlinear confounding effects that are not accurately captured by the OLS regressions.
It is worth noting here that the original paper attributes most of the correlation found between the treatment variables and long-term growth to the endogeneity of the skill-biased tariff variables, arising from the fact that skill-biased tariffs are more likely in countries with better institutions. Interestingly, in this example the country-level DML estimates are in line with the notion that the direct effect of the skill bias of tariffs is smaller than what is estimated by the OLS regressions. Finally, our results only concern the relationship between skill-biased tariffs and long-run economic growth, and not the relationship between skill-biased tariffs and institutions, or between institutions and long-run growth, which are examined in the original paper. Thus, our findings are consistent with the alternative mechanism described in Nunn and Trefler (2010), i.e. the existence of a causal relationship between institutions and economic growth.

Heterogeneous Treatment Effects with Causal Random Forest
This section focuses on the analysis of HTE for the effect of Fox News on Republican voting (DellaVigna and Kaplan, 2007) using the causal random forest method (Wager and Athey, 2018).

Methodology: Causal Random Forest
The causal random forest method is an adaptation of the original random forest for prediction, introduced by Breiman (2001), to the problem of causal inference. In this section, we start by briefly presenting the general idea of standard regression trees used for prediction, after which we describe how causal trees and causal random forests work.
The idea of regression trees is to partition (or split) the data into groups based on the values of the covariates. The groups that are eventually obtained are referred to as leaves. First, one starts with the whole data set as one group. Then, for each value of each covariate, the regression tree algorithm forms candidate splits, by placing all observations with a covariate value lower than the current value in the left leaf, and all observations with a covariate value greater than the current value in the right leaf. Among all these candidate splits, the one that is implemented is the one that minimizes an in-sample criterion function, such as the mean squared error (MSE) of the outcome variable within a leaf. 17 For each of the two new leaves, the algorithm repeats the procedure until a stopping rule 18 is reached, resulting in a tree-format partition of the data. When the purpose is prediction, the outcome of an out-of-sample observation can be predicted by determining which terminal leaf the new observation belongs to, based on the values of its covariates, and assigning as its predicted outcome the mean of the outcomes in that leaf.
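As an illustrative sketch of the split search described above (our own toy code, not the implementation used in the papers revisited here), the best split on a single covariate under the within-leaf MSE criterion can be found as follows:

```python
import numpy as np

def best_split(x, y):
    """Search all candidate splits on one covariate and return the
    threshold minimizing the total within-leaf sum of squared errors
    (equivalent to the within-leaf MSE criterion described above)."""
    order = np.argsort(x)
    x_s, y_s = x[order], y[order]
    best_thr, best_sse = None, np.inf
    for i in range(1, len(x_s)):
        if x_s[i] == x_s[i - 1]:
            continue  # no valid threshold between tied covariate values
        left, right = y_s[:i], y_s[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_thr, best_sse = (x_s[i - 1] + x_s[i]) / 2, sse
    return best_thr, best_sse

# A step function in x is recovered exactly: the chosen threshold
# falls between the two clusters of outcome values.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.1, -0.1, 5.0, 5.1, 4.9])
thr, sse = best_split(x, y)  # thr = 6.5
```

Growing a full tree applies this search recursively to each new leaf, over all covariates, until the stopping rule is reached.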
Next, we turn to the causal random forest method of Wager and Athey (2018), which builds on the causal tree method of Athey and Imbens (2016). For the causal tree, first, a fraction p of the sample of N observations is drawn without replacement. Then, the subsample of size n = p · N is further randomly split in half to form a training sample n_tr and an estimation sample n_e. Using only the training sample n_tr, candidate splits are formed for each value of each covariate, and a tree is constructed as described above.
The key difference in the causal case compared to the prediction case is the objective function that is optimized when determining the split to be implemented.
Due to the fundamental problem of causal inference, directly training machine learning methods on the difference Y_i(1) − Y_i(0), i.e., the difference between the outcomes that observation i would have experienced with and without the treatment, is not possible, as we do not observe both outcomes for any individual unit. Thus, instead of minimizing an infeasible MSE, Athey and Imbens (2016) propose to maximize a criterion function that rewards a split that increases the variance of treatment effects across leaves and penalizes a split that increases within-leaf variance. The goal is to accurately estimate treatment effects within leaves, while preserving heterogeneity across leaves. The split is performed if it increases the criterion function, compared to no split. When no more splits can be made, the tree constructed on the first subsample is complete.
The subsequent step involves turning to the estimation sample n_e and, based on the covariates, sorting each observation in this sample into the same tree. Using only the estimation sample, the treatment effect in each leaf l is computed as τ_l = ȳ_{l,t} − ȳ_{l,c}, i.e., the mean outcome difference between treated (t) and control (c) observations within the leaf. The final step consists of returning to the full sample of N observations, examining to which leaf each observation belongs based on the values of its covariates, and assigning that leaf's treatment effect as the predicted treatment effect of the observation. Given that estimates from a single tree can have a high variance, the whole algorithm described above is repeated on B subsamples, yielding B trees that together form a causal random forest. The predicted treatment effect for each unit is the average of the predictions for that observation across the trees.
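To fix ideas, the honest estimation step can be sketched as follows. For illustration we take the tree structure as given (a single split at zero, as if grown on the training half); the function name and simulated data are our own, not the authors' implementation:

```python
import numpy as np

def honest_leaf_effects(leaf_of, x_e, d_e, y_e):
    """Estimate tau_l = mean(y | treated) - mean(y | control) within
    each terminal leaf, using only the estimation sample (honesty)."""
    leaves = np.array([leaf_of(v) for v in x_e])
    tau = {}
    for l in np.unique(leaves):
        in_l = leaves == l
        tau[l] = y_e[in_l & (d_e == 1)].mean() - y_e[in_l & (d_e == 0)].mean()
    return tau

rng = np.random.default_rng(0)
n = 4000
x = rng.uniform(-1, 1, n)                 # single covariate
d = rng.integers(0, 2, n)                 # randomized treatment
y = d * (x >= 0) + rng.normal(0, 0.1, n)  # true effect: 1 if x >= 0, else 0

# Honest split: the first half would be used to grow the tree; the
# second half is reserved for estimating the leaf effects.
x_e, d_e, y_e = x[n // 2:], d[n // 2:], y[n // 2:]
leaf_of = lambda v: int(v >= 0)           # fixed tree with two terminal leaves
tau = honest_leaf_effects(leaf_of, x_e, d_e, y_e)
# tau[1] is close to 1 and tau[0] close to 0
```

A causal forest repeats this on B subsamples and averages the B per-unit predictions.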
Notice that independent samples are used for: i) growing the tree (splitting the data), and ii) estimating treatment effects within each leaf of the tree. This property is called honesty. Honesty leads to two desirable characteristics: it reduces bias from overfitting, and it makes the inference valid, since the asymptotic properties of the treatment effect estimates are the same as if the structure of the tree had been exogenously given. 19 Wager and Athey (2018) establish consistency and the first asymptotic normality results for random forests which are then extended for the causal setting. For valid confidence intervals, a consistent estimator of the asymptotic variance is proposed, based on an infinitesimal jackknife for random forests. Further details regarding the tuning parameters of the causal random forest are provided in Section A.4 of the Appendix.

Application: The Effect of Fox News on the Republican Vote Share
Description of Original Analysis. In this section we revisit and further analyze the study by DellaVigna and Kaplan (2007). This paper examines the impact of media bias on voting outcomes. Specifically, it analyzes the impact of the entry of a conservative cable television channel, Fox News, on the Republican Party's vote share in the United States. To identify the causal effect of Fox News on voting, the authors investigate whether towns where Fox News became available between 1996 and 2000 experienced an increase in the vote share for the Republican Party in Presidential elections during the same time period. The estimation is performed on a data set at the town level, comprising information on 9256 towns.
We consider the main outcome variable, i.e. the change in the vote share for the Republican party between 1996 and 2000. The treatment variable is a dummy indicating whether Fox News had become available between 1996 and 2000. To capture potential confounders, a number of control variables are included in the regressions. 20 DellaVigna and Kaplan (2007) find a positive effect of Fox News on the Republican vote share. Moreover, they explore heterogeneity along a selected set of town characteristics: the number of available cable channels, the share of urban population, and whether the town is in a swing or Republican district. They do this by adding to the regression interaction effects of these covariates with the treatment variable. 21

Causal Forest Analysis. We perform the HTE analysis using the causal random forest method. Exploring heterogeneous effects is important for this study, in order to understand whether there are town or district characteristics that act as effect modifiers. While the average effects are informative about the impact of Fox News on the whole sample, it is often the case that treatment effects are not homogeneous. It is possible that the effect of Fox News was concentrated in some areas only. Understanding better the characteristics of the areas which saw the strongest and weakest responses can shed light on the mechanisms. The aim of this exercise is twofold. First, we take an agnostic view about the nature of heterogeneity, and we investigate whether there are town or district characteristics which are treatment effect modifiers. Second, we examine whether the HTE analysis from the original paper matches the results from the causal ML methods. We focus on one of the two preferred specifications from the original paper: the one that includes district fixed effects. We present results for two versions of the causal random forest, which account for district-level effects in different ways.
In the first set of results, we include in the analysis dummy variables indicating the congressional district where the town is located. In the second set of results, we implement a cluster-robust version of the random forest developed by Athey and Wager (2019), where we treat each district as a separate cluster. The advantage of the cluster-robust causal forest is that it does not assume that clusters have an additive effect on the outcome. Further details on the cluster-robust causal forest and the tuning parameter values used for the analysis are discussed in Section A.4 of the Appendix. 20 Further details on the regressions and on the control variables in DellaVigna and Kaplan (2007) are described in Section A.4 of the Appendix. 21 The findings are reported in Table 6 of the original paper.

We begin by discussing the average treatment effect. The results are presented in Table 4. As in the original analysis, we find a positive and significant effect of Fox News on the Republican vote share, both when including district dummies and when implementing the cluster-robust causal forest; however, the standard error in the clustered forest is larger. Our results suggest that in towns where Fox News became available, the Republican party obtained a vote share higher by 0.65 percentage points on average, compared to towns where Fox News was not available. The ATE estimates are similar to the original paper's estimates, which range between 0.004 and 0.007 (reported in Table 4 of DellaVigna and Kaplan, 2007, columns 4-7).
Next, we want to assess whether the causal forest can recover heterogeneity of treatment effects. As pointed out in Athey and Wager (2019), we can group observations according to whether their estimated out-of-bag conditional average treatment effect (CATE) is above or below the median CATE, and we can estimate the average treatment effect separately for these two subgroups. These are reported in Table 4 as Fox News effect above median and Fox News effect below median. The difference between the two subgroup estimates is large when including district dummies, suggesting that there is potential for heterogeneity, and it is statistically significant, as indicated by the fact that the 95% confidence interval for the difference between the two estimates does not contain zero (see column 1 of Table 4). However, the same heuristic test for the cluster-robust forest does not detect significant heterogeneity in the treatment effect. This could indicate that heterogeneity in the model with district dummy variables is overstated, because the dummy variables cannot appropriately capture the district-specific effects. The cluster-robust causal forest offers a more flexible way to capture district-specific effects, and may be more suitable in this case. 22 Although the results of the test for overall heterogeneity are mixed, it is still possible for heterogeneity to be present along some of the covariates. Hence, we investigate whether any of the included covariates are possible sources of heterogeneity. To do this, for each variable, we split the sample in two parts, based on whether the value of the covariate of interest is below or above the median, and we estimate the average treatment effect for the two subsamples. Table 5 reports the HTE results along the variables that appear to be significant determinants of heterogeneity in both specifications, while Tables B.5 and B.6 report the results for the remaining variables.
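A minimal sketch of this median-split heuristic (our own illustration: we take noisy out-of-bag CATE estimates as given and, for simplicity, simulate a randomized treatment so that subgroup ATEs can be estimated by differences in means):

```python
import numpy as np

def subgroup_ates_by_median(cate_oob, d, y):
    """Split units by whether their out-of-bag CATE estimate is above
    the median, then estimate the ATE within each subgroup by a simple
    difference in means (valid here because treatment is randomized)."""
    above = cate_oob > np.median(cate_oob)
    ate = lambda m: y[m & (d == 1)].mean() - y[m & (d == 0)].mean()
    return ate(above), ate(~above)

rng = np.random.default_rng(1)
n = 20000
x = rng.uniform(-1, 1, n)
d = rng.integers(0, 2, n)
tau_true = (x > 0).astype(float)             # effect 1 above zero, 0 below
y = tau_true * d + rng.normal(0, 0.1, n)
cate_oob = tau_true + rng.normal(0, 0.1, n)  # noisy CATE estimates (stand-in
                                             # for out-of-bag forest predictions)
ate_above, ate_below = subgroup_ates_by_median(cate_oob, d, y)
# ate_above is close to 1 and ate_below close to 0: the gap between
# the two subgroup estimates signals treatment effect heterogeneity
```

In the empirical application the subgroup ATEs are instead estimated with the forest-based doubly robust estimates of Athey and Wager (2019); the difference-in-means step here is a simplification.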
In addition, to gain further insight into which variables are more important for heterogeneity, we compute a measure of variable importance (Athey and Wager, 2019). 23 Tables B.7 and B.8 report the variable importance measure for the covariates included in the district dummy variable specification and for the clustered-robust forest, respectively. We note that for both specifications, the variable importance measure is decreasing smoothly and we do not observe any variable that clearly stands out in terms of importance.
Our results in Table 5 show that three variables appear to be significant determinants of heterogeneity (at least at the 10% level) in both specifications: the change in employment between 1990 and 2000, the share of the population with education level equal to a high school degree, and the 10th decile in the number of cable channels available. We observe that the effect of Fox News on Republican voting is stronger in towns that experienced a smaller increase in the employment rate between 1990 and 2000. This finding may relate to the phenomenon of economic voting, i.e. the fact that voters tend to reward incumbents during periods of economic prosperity (e.g. Fair, 1978; Kramer, 1971; Lewis-Beck and Stegmaier, 2000; Pissarides, 1980). Areas that experienced lower economic growth (and a smaller increase in employment) may have been more easily persuaded to vote Republican in 2000, since prior to the Presidential election of 2000 a Democratic President (Bill Clinton) had been in power for two consecutive mandates. Moreover, we observe a larger effect of Fox News in towns where the share of population with education level equal to a high school degree is below the median. We also find a larger positive effect of Fox News in towns where the 10th decile in the number of cable channels is below the median, while the effect is negative and insignificant in towns where this variable is above the median. 24

Next, we investigate whether the findings regarding heterogeneity from the original paper are confirmed with the causal forest. DellaVigna and Kaplan (2007) found a larger effect of Fox News on the Republican vote share in towns with a smaller number of cable channels available when including district fixed effects. While we do not observe significant heterogeneity along this variable, our results for the 10th decile in the number of cable channels are in line with the findings of the original analysis, and hence suggest that the effect of Fox News diminishes in the presence of higher competition in cable channels. It is also interesting to note that the number of cable channels emerges as the variable with the highest importance score in both specifications, which further points to the importance of this variable for heterogeneity.

22 Athey and Wager (2019) find a similar result in their application, when comparing the causal forest without clustering with the cluster-robust version. 23 See Section A.4 of the Appendix for details on how this measure is constructed.
When investigating heterogeneity along the political orientation of the district, we confirm the findings of DellaVigna and Kaplan (2007): we observe no significantly different effect for swing districts, and we obtain mixed results for Republican districts, as we find a significantly smaller effect of Fox News in Republican districts (at the 10% level) when including district dummies, but not with the cluster-robust forest. 25 However, in contrast to the original analysis, we do not find a significant difference in the effect of Fox News in rural versus urban towns, despite this being the only heterogeneity result that is robust in all specifications in DellaVigna and Kaplan (2007).
In conclusion, our analysis of the HTE of Fox News on Republican voting confirms some of the findings from DellaVigna and Kaplan (2007), namely the presence of heterogeneity along the number of cable channels and the absence of robust heterogeneous effects for districts with different political orientations, but, in contrast to the original paper, it does not show different effects for urban and rural areas. The analysis with the causal forest further uncovers additional heterogeneity that was previously unexplored, such as a larger effect in towns that experienced a smaller increase in the employment rate, and a larger effect in towns with a lower share of population with a high school degree. Finally, including district dummy variables results in the causal forest detecting more heterogeneity in treatment effects compared to the cluster-robust version, both when implementing the overall heterogeneity test and when analysing the HTE in terms of individual covariates. However, the model with district dummy variables could overstate the heterogeneity compared to the cluster-robust forest if the district dummies do not appropriately capture the district-specific effects. This points to the need for a more careful treatment of the issue of clustered observations when employing causal random forests for empirical applications (Athey and Wager, 2019).

Heterogeneous Treatment Effects with Generic Machine Learning
This section focuses on the analysis of HTE for the effect of a teacher training intervention (Loyalka et al., 2019) using the generic machine learning method (Chernozhukov et al., 2018b).

Methodology: Generic Machine Learning
A different causal ML approach for HTE is the generic machine learning method of Chernozhukov et al. (2018b). To make inference possible, the method does not focus directly on the HTEs, but on features of HTEs such as: the best linear predictor of the heterogeneous effects (BLP), the group average treatment effects (GATES) sorted by the groups induced by machine learning proxies, and the average characteristics of the units in the most and least affected groups, or classification analysis (CLAN). The generic machine learning method is thus useful for empirical work as: (1) it allows detection of heterogeneity in the treatment effect, (2) it computes the treatment effect for different groups of observations (such as the least and most affected groups), and (3) it describes which covariates are most strongly correlated with the heterogeneity.
The approach is based on random splitting of the data into an auxiliary and a main sample. The two samples are approximately equal in size. Based on the auxiliary sample, a ML estimator, called proxy predictor, is constructed for the conditional average treatment effect (CATE). Any generic ML method can be used for this approximation (e.g., elastic net, random forest, neural network, etc.). The proxy predictors are possibly biased and consistency is not required. We simply take them as approximations and use them to estimate and make inference on features of the CATE. Based on the main sample and the proxy predictors, we can compute the estimates of interest: BLP, GATES and CLAN, and then make inference relying on many splits of the data in auxiliary and main samples.
We give a brief description of how the method works in practice. Let Y denote the outcome, D the binary treatment indicator, Z the vector of covariates, and p(Z) the propensity score. Using the auxiliary sample, we obtain ML estimators (or proxy predictors) for the baseline conditional average and the conditional average treatment effect. As mentioned above, these are possibly biased predictors and consistency is not required. Then, for each unit in the main sample, we compute the predicted baseline effects, B(Z), and the predicted treatment effects, S(Z). Note that the predicted treatment effects, S(Z), are obtained as the difference between the predictions for the treatment group model and the control group model. Following the notation from Chernozhukov et al. (2018b), the BLP parameters are obtained using the main sample, by estimating the following regression by weighted OLS, with weights 1/(p(Z)(1 − p(Z))):

Y = α'X_1 + β_1 (D − p(Z)) + β_2 (D − p(Z))(S(Z) − E_N[S(Z)]) + ε, (3)

where X_1 contains a constant and the proxy predictors (e.g., B(Z)), and E_N[S(Z)] denotes the sample average of S(Z). Demeaning S(Z) and interacting it with D − p(Z) orthogonalizes this regressor to all other covariates that are functions of Z. The coefficient β_1 gives the average treatment effect, while β_2 quantifies how well the proxy predictor approximates the treatment heterogeneity. If β_2 is different from zero, it means that there exists heterogeneity in the treatment effects.
Once we obtain the predicted treatment effects, we can divide the observations from the main sample into groups G_1, G_2, . . . , G_K, based on their treatment effects. In our empirical applications, we choose K = 5, such that group G_1 contains the observations with the lowest 20% of treatment effects and G_5 contains the observations with the highest 20% of treatment effects. Then, using again the main sample, we obtain the sorted group average treatment effects by estimating the weighted regression:

Y = α'X_1 + Σ_{k=1}^{K} γ_k (D − p(Z)) 1(G_k) + ν, (4)

where 1(G_k) is an indicator function for whether an observation is in group k, and where the weights are the same as in (3). The parameters γ_k give the average effect in each group (GATES). Also, if the difference γ_K − γ_1 is significantly different from zero, we again have evidence for treatment effect heterogeneity between the most and least affected groups. Lastly, we can analyze the properties or characteristics of the most and least affected groups, via Classification Analysis (CLAN). Let g(Y, Z) be a vector of characteristics of an observation. We can compute the average characteristics of the least and most affected groups, i.e., δ_1 = E[g(Y, Z)|G_1] and δ_K = E[g(Y, Z)|G_K]; the parameters of interest are averages of directly observed variables. Similarly to GATES, we can compute and make inference on the difference δ_K − δ_1.
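The weighted BLP regression described above can be sketched on simulated data as follows (a toy illustration with a known propensity score, not the original implementation; the design and all names are ours). The GATES regression is obtained analogously, replacing the interaction term with the group indicators:

```python
import numpy as np

def blp(y, d, p, S, B):
    """Weighted OLS for the BLP: regress Y on X1 = (1, B(Z)),
    D - p(Z), and (D - p(Z)) * (S(Z) - mean(S)), with weights
    1 / (p(Z) * (1 - p(Z))). Returns (beta1, beta2)."""
    X = np.column_stack([np.ones_like(y), B, d - p, (d - p) * (S - S.mean())])
    w = np.sqrt(1.0 / (p * (1 - p)))  # sqrt of weights for weighted OLS
    coef, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return coef[2], coef[3]

rng = np.random.default_rng(2)
n = 50000
z = rng.uniform(-1, 1, n)
p = np.full(n, 0.5)                       # known propensity (randomized design)
d = rng.binomial(1, p)
tau = 1.0 + z                             # true CATE, average effect = 1
y = z + d * tau + rng.normal(0, 0.1, n)   # baseline conditional mean = z
S = tau + rng.normal(0, 0.2, n)           # noisy proxy predictor (stand-in for
                                          # an auxiliary-sample ML prediction)
b1, b2 = blp(y, d, p, S, B=z)
# b1 estimates the ATE (close to 1); b2 > 0 signals genuine heterogeneity
```

Because the proxy S is informative but noisy, b2 falls below one (attenuated towards zero), illustrating that β_2 measures how well the proxy tracks the true heterogeneity rather than the heterogeneity itself.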

Student Performance
Description of Original Analysis. We reanalyze a large-scale randomized experiment that investigates the effect of a teacher professional development (PD) program in China on student achievement and on other student and teacher outcomes. The experiment was first studied by Loyalka et al. (2019). Three hundred mathematics teachers, each employed in a different school across one province, took part in the intervention. The teachers were randomly assigned to one of four treatment arms: PD only; PD plus a continuous follow-up with additional material and tasks for the trainees; PD plus an evaluation of the extent to which the teachers remembered the content of the training sessions; or no PD (control group). The PD intervention consisted of lectures and discussions.
Randomization was implemented at the school level, and in each school one teacher was nominated to participate in the intervention. The main results are obtained by estimating a cross-sectional regression, where the treatment variable is a dummy indicating the treatment arm that the school was assigned to. The data was collected at three points in time: at baseline, midline and endline. Outcomes are measured at midline or endline, and the main outcome of interest is student math achievement. 26 The control variables include student characteristics, teacher characteristics and class size. 27 The original paper finds no significant effect of the PD intervention on students' achievement after one academic year, neither for the PD intervention alone, nor for the PD combined with the follow-up and/or the evaluation treatments. The authors also do not find any effect on other outcomes, such as teacher knowledge or student motivation. The lack of effectiveness of the program is attributed to several factors: the content was too theoretical, the PD was delivered passively, and teachers could face constraints in the implementation of the suggested practices in the schools. Furthermore, the paper analyzes heterogeneous treatment effects, by interacting the treatment variable with a number of student and teacher characteristics: student's household wealth, baseline achievement level, the amount of training the teacher has received prior to the intervention, student and teacher gender, whether the teacher has a college degree and whether the teacher majored in math. The findings suggest that the effect of the treatment on students' achievement can differ by teacher characteristics; however, no heterogeneous effects are found in terms of student characteristics.
Generic ML Analysis. We extend the analysis of HTE conducted in the original paper by implementing the generic machine learning method developed by Chernozhukov et al. (2018b). Exploring heterogeneous treatment effects is particularly relevant for this intervention, because a small and insignificant estimate for the ATE could hide significant heterogeneity. Our aim is to dig deeper into the analysis of heterogeneous treatment effects. First, we investigate whether there is significant heterogeneity in treatment effects; second, we analyze whether causal machine learning methods, by implementing a systematic search for heterogeneity across a large number of covariates, can offer additional insights about the characteristics of those who benefited from the program and those who did not, compared to the traditional methods used in the original paper.
In our analysis, we focus on the main outcome of interest, i.e. student math achievement. Since the results in the original paper are consistently close to zero when comparing the three different treatment arms with the control group, we choose to analyze only one of the treatment arms, corresponding to the PD intervention plus the evaluation. The sample that we use includes 10,006 students in 201 schools. We follow Loyalka et al. (2019) and cluster standard errors at the school level. In addition to the full set of controls included in the original paper, we also add to our analysis other variables that could be treatment effect modifiers: the baseline values of a number of student-level variables, plus variables indicating teachers' behaviour in the classroom, as evaluated by students at baseline.28 The generic method can be used in conjunction with a range of ML tools, and Chernozhukov et al. (2018b) provide two measures (Best BLP and Best GATES) to compare the performance of the different ML methods used for the estimation of the proxy predictors. We consider the following methods: elastic net, neural network, and random forest. Based on the results of the Best BLP and Best GATES analysis, reported in Table B.9 of the Appendix, we choose to further work with the neural network.29
28 These additional variables are described in Section A.5 of the Appendix. In Loyalka et al. (2019), the baseline value of the outcome variable is included as a control. Hence, the baseline characteristics described above are not included in all regressions in the original analysis. However, we consider these characteristics as potential drivers of heterogeneity; therefore, we include the baseline values of all available variables in our heterogeneity analysis.
29 Further details on the Best BLP and GATES measures and on the tuning parameters used in this analysis are discussed in Section A.5.
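The construction of the proxy predictor at the heart of the generic method can be sketched as follows. This is a minimal illustration on simulated data, using scikit-learn's MLPRegressor as the neural network; the variable names and tuning values here are our own assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n, k = 1000, 5
Z = rng.normal(size=(n, k))                      # baseline covariates
D = rng.binomial(1, 0.5, size=n)                 # randomized treatment
Y = Z[:, 0] * D + Z[:, 1] + rng.normal(size=n)   # effect varies with Z[:, 0]

# Randomly split into an auxiliary sample (to fit the ML proxies)
# and a main sample (on which BLP/GATES/CLAN are then estimated)
aux = rng.random(n) < 0.5
Za, Ya, Da = Z[aux], Y[aux], D[aux]
Zm = Z[~aux]

# Fit separate outcome models for treated and control units on the auxiliary sample
nn = dict(hidden_layer_sizes=(2,), alpha=0.01, max_iter=5000, random_state=0)
m1 = MLPRegressor(**nn).fit(Za[Da == 1], Ya[Da == 1])
m0 = MLPRegressor(**nn).fit(Za[Da == 0], Ya[Da == 0])

# Proxy predictor of the CATE, evaluated on the main sample
S = m1.predict(Zm) - m0.predict(Zm)
```

In the actual analysis this step is repeated over 100 random splits of the data, with inference carried out on the main sample of each split.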

Figure 1: Teacher Training -Generic Method: GATES
Notes: The estimates are obtained using neural network to produce the proxy predictor S(Z). The point estimates and 90% confidence intervals correspond to the medians over 100 splits.
We first analyze whether overall heterogeneity in treatment effects can be detected. We present results for the best linear predictor (BLP) of the CATE in Table 6. In line with the original paper, the estimated ATE, given by the coefficient β1, is small (the estimated impact of the PD is 0.002 standard deviations) and not significantly different from zero. The estimated β2 is instead large and significantly different from zero, which indicates that there is heterogeneity in treatment effects. Next, we estimate the group average treatment effects (GATES). We split the sample into five groups, based on the quintiles of the ML proxy predictor S(Z). This analysis reveals further insights into the extent of heterogeneity. Table B.10 of the Appendix reports the GATE in the top and bottom quintile and shows that the GATE in the top quintile is positive, whereas for the bottom quintile the estimated GATE is negative. Both estimates are statistically significant at the 10% level. The difference between the GATE for the top and the bottom quintile is significant, which confirms the presence of heterogeneity in treatment effects. Additionally, Figure 1 reports the GATES estimate and the 90% confidence interval for the five quintiles, as well as for the whole sample (the ATE is represented as a blue dashed line, and the confidence interval as two red dashed lines). Notice that for the three middle quintiles the effect of the teacher training intervention is not significantly different from zero. We then turn to analyzing the possible sources of heterogeneity, by implementing the Classification Analysis (CLAN). We further analyze the top and bottom quintiles, for which the estimated effect of the PD intervention is positive and negative, respectively. In particular, we compare the student and teacher characteristics in the two groups. As a large number of covariates is available, we focus on the ten covariates for which the correlation with the proxy predictor S(Z) is highest, reported in Table 7. Table B.11 in the Appendix shows the CLAN analysis for the remaining covariates.30 We start by analyzing the characteristics of the teachers whose students belong to the least and most affected groups. Interestingly, the variable indicating whether the teacher has a college degree or not is the one most correlated with the proxy predictor, and it was the only one among the variables tested that was found to be a treatment effect modifier across all treatment arms in the original paper. The students in the top quintile are more likely to be taught by a teacher who does not have a college degree, compared to the students in the bottom quintile. This is consistent with the results from Loyalka et al. (2019), who found that the intervention has a negative effect on students whose teachers have a college degree, but a positive effect on students whose teachers are less qualified. Hence, the PD may help teachers who are less qualified, but, for more qualified teachers, the benefits of the intervention on their students do not outweigh the negative effect of the teachers being absent from the classroom in order to participate in the intervention.
30 Table B.12 reports the correlation of each covariate with S(Z).
Notes to Table 7: The table shows the average value of the teacher and student characteristics for the most and least affected groups. The estimates are obtained using a neural network to produce the proxy predictor S(Z). 90% confidence intervals are reported in parentheses. The variables Student math score at baseline and Student baseline math anxiety are normalized. The values reported correspond to the medians over 100 splits.
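Given a proxy predictor S(Z) from the auxiliary sample and a known propensity score p (here constant, since treatment is randomized), the BLP heterogeneity coefficient and the GATES can be illustrated as follows. This is a stylized sketch on simulated inputs; the Horvitz-Thompson transformation used below for the GATES is one simple way to recover group effects, not necessarily the exact estimator of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 0.5                           # sample size and known propensity
D = rng.binomial(1, p, size=n)
S = rng.normal(size=n)                     # proxy CATE predictor from the auxiliary sample
Y = S * D + rng.normal(size=n)             # simulated outcome: true CATE equals S

# BLP: weighted regression of Y on [1, S, (D - p), (D - p)(S - mean(S))]
w = np.sqrt(1.0 / (p * (1 - p)))           # weights are constant here, kept for clarity
X = np.column_stack([np.ones(n), S, D - p, (D - p) * (S - S.mean())])
beta, *_ = np.linalg.lstsq(w * X, w * Y, rcond=None)
ate_hat, het_hat = beta[2], beta[3]        # beta_1 (ATE) and beta_2 (heterogeneity)

# GATES: mean of the transformed outcome H within quintiles of S,
# where E[H | group] equals the group average treatment effect
H = (D - p) / (p * (1 - p)) * Y
groups = np.digitize(S, np.quantile(S, [0.2, 0.4, 0.6, 0.8]))
gates = [H[groups == g].mean() for g in range(5)]
```

With the simulated data above, het_hat is close to one (the proxy tracks the true CATE one-for-one) and the GATES increase from the bottom to the top quintile.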
Whether or not the teacher majored in math is found to be a potential driver of heterogeneity with the generic method (the results are reported in Table B.11), whereas in the original paper it was not found to be significant when considering the effect of the PD plus evaluation, which we focus on.31 The direction of the effect is consistent with what was found in the original analysis: the students in the top quintile are more likely to have been taught by a teacher who does not have a major in math, compared to the students in the bottom quintile. It is also interesting to note that the number of hours of training that the teacher received prior to the intervention, which is not found to be a determinant of heterogeneity in the original paper, is higher in the most affected group compared to the least affected group.32 This may reflect the fact that teachers who have had more training in the past may be able to better implement the suggestions from the PD intervention. Table 7 shows that teacher rank, experience and age are higher in the most affected group compared to the least affected group. This is consistent with the existence of a similar mechanism: teachers who have more experience may be able to better implement the suggestions from the PD intervention.
31 When considering the PD plus follow-up, the authors find a significant negative effect on the scores of students whose teachers majored in math relative to the scores of those whose teachers did not.
32 The variable indicating teacher training hours prior to the intervention is a categorical variable, based on the terciles of the continuous variable. As the continuous variable is not included in the replication data set of the original paper, for our analysis we use this categorical variable, which takes values 1 to 3, where 3 is the top tercile in the number of training hours.
As the PD is mainly theoretical, having had other types of training, or having more experience, may be helpful for an effective implementation of the practices learned during the PD.
We then examine whether any of the student characteristics are potential drivers of heterogeneity. In contrast to the findings in Loyalka et al. (2019), who did not find heterogeneity in terms of student features, we find that students in the most affected group differ in terms of several characteristics compared to students in the least affected group. Among the most correlated with the heterogeneity score (listed in Table 7) are student age and gender: students in the most affected group are on average about half a year older than students in the least affected group, and the most affected group includes a larger share of male students. Additionally, students in the most affected group, on average, have a lower baseline math score, and tend to be more anxious about math. Thus, teacher PD could be more beneficial for weaker students, and for students who are more anxious about the subject. Finally, class size appears to be a possible determinant of heterogeneity: students who benefit more from the PD tend to be in smaller classes. This result suggests that in smaller classes it may be easier for teachers to implement some of the practices introduced during the PD training. For instance, Loyalka et al. (2019) mention having students work together in small groups as one of the techniques that were suggested in the PD; this technique is likely to be easier to implement in smaller classes.
In conclusion, our analysis confirms the presence of heterogeneous effects of the teacher PD intervention, and uncovers a rich set of potential determinants of heterogeneity. With the GATES analysis, we are able to show that the achievement of students belonging to the bottom quintile is negatively affected by the intervention, while the achievement of students in the top quintile is positively affected. This confirms what was suggested by Loyalka et al. (2019): there is a group of students who benefit from the intervention, and a group who do not. In addition, the GATES analysis shows that the effect is not significantly different from zero for the students belonging to the middle quintiles. With the CLAN analysis, we obtain a clearer picture than the original HTE analysis of the characteristics of the groups who do and do not benefit from the intervention. In line with Loyalka et al. (2019), we find that teacher characteristics such as having a college degree or a major in math are potential determinants of heterogeneity. However, our study uncovers additional differences (not identified in the original paper) between the least and the most affected groups, in terms of both teacher and student characteristics: teacher's rank, experience, age and number of training hours, as well as student's gender, age, baseline math score, baseline math anxiety and class size.

Conclusion
Our main message is that appropriately combining predictive methods with causal questions adds value to traditional methods and should be more often explored in applied research. We argue that in each revisited study the researcher would have benefited from employing causal ML methods and would have gained additional insights not provided by standard causal inference tools.
When the researcher works with an observational study and is interested in the ATE, causal machine learning methods can improve the credibility of causal analysis by making the unconfoundedness assumption more plausible: causal ML methods control for potential confounders in a more flexible way, implement a systematic model selection, and provide robust approaches for sensitivity analysis.33 If the researcher is interested in HTE, causal machine learning methods can ensure that relevant heterogeneity and its determinants are not missed, or falsely discovered due to multiple hypothesis testing issues. Also, causal ML methods can be used to uncover heterogeneity ex post, without being bound to explore HTE only for the specific subgroups indicated in the pre-analysis plan.
A.1 The Effect of Corporate Taxes on Investment and Entrepreneurship
Details on the Original Analysis. The main results in the original paper are obtained by estimating regressions of the form

Y_c = α + β Tax_c + X_c′γ + ε_c,

where c is an index for country. Four different outcome variables Y_c are examined: investment as a percentage of GDP, FDI as a percentage of GDP, business density per 100 people, and the average entry rate (measured as a percentage). Three separate measures of corporate taxes are considered. The first is the statutory corporate tax rate, i.e. the marginal tax rate on income that a corporation has to pay in the highest tax bracket. The second is the actual first-year corporate income tax liability of a new company, relative to pre-tax earnings. The third is the tax rate that takes into account actual depreciation schedules going five years forward.
The term X_c denotes the control variables, aimed at capturing the effect of potential confounding factors. This is an observational study, in which tax rates are not randomly assigned across countries. It is likely that there are factors correlated with both the treatment (corporate tax rates) and the outcomes (measures of entrepreneurship and investment). To deal with this issue, the effect of corporate taxes on the outcomes is estimated by adding several control variables to the regressions. The first set of control variables consists of measures of other taxes: the sum of other taxes payable in the first year of operation, the VAT, the sales tax, and the highest national rate on personal income tax. The second set of covariates includes the logarithm of the number of tax payments made (used as a measure of the burden of tax administration), an index of tax evasion, and the number of procedures needed to start a business. The third set of controls consists of institutional variables: a property rights index, an indicator of the rigidity of employment laws, a measure of a country's openness to trade, and the log of per capita GDP. The fourth set of covariates consists of measures of inflation: average inflation in the previous ten years, and seigniorage, which captures government reliance on printing money.
Details on the DML Analysis. The results are based on 100 splits and 2 folds. The point estimates are calculated as the median across splits, and the standard errors are adjusted for the variability across sample splits using the median method (see Chernozhukov et al., 2018a). We use two hybrid ML methods in our analysis. Ensemble is a weighted average of the estimates from lasso, boosting, random forest and neural networks, with the weights chosen to minimize the average out-of-sample mean squared prediction error. Best chooses, among all the candidate methods, the one with the best average out-of-sample prediction performance for estimating the nuisance functions.
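The cross-fitting and median aggregation described above can be sketched as follows for the partially linear model. This is a minimal illustration using scikit-learn random forests for both nuisance functions; the function name dml_plr and the tuning values are our own assumptions, not the exact implementation used here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plr(Y, D, X, n_splits=100, n_folds=2, seed=0):
    """Partialling-out DML estimate of theta in Y = theta*D + g(X) + e,
    reported as the median over repeated sample splits."""
    thetas = []
    for s in range(n_splits):
        kf = KFold(n_folds, shuffle=True, random_state=seed + s)
        y_res = np.zeros(len(Y))
        d_res = np.zeros(len(D))
        for train, test in kf.split(X):
            g = RandomForestRegressor(n_estimators=100, random_state=0)
            m = RandomForestRegressor(n_estimators=100, random_state=0)
            g.fit(X[train], Y[train])       # outcome nuisance function
            m.fit(X[train], D[train])       # treatment nuisance function
            y_res[test] = Y[test] - g.predict(X[test])   # out-of-fold outcome residual
            d_res[test] = D[test] - m.predict(X[test])   # out-of-fold treatment residual
        thetas.append(np.sum(d_res * y_res) / np.sum(d_res ** 2))
    return np.median(thetas)
```

For example, dml_plr(Y, D, X, n_splits=100, n_folds=2) mirrors the 100 splits and 2 folds used in our analysis; the standard errors would additionally be adjusted with the median method.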
The lasso estimates are based on ℓ1-penalized regressions with the penalty parameter obtained through 10-fold cross-validation. As controls for the lasso, we consider the set of all raw covariates as well as first-order interactions. For the rest of the ML methods, we consider the set of raw covariates as controls. The regression tree method fits a CART (classification and regression tree) with a penalty parameter (which restricts the tree from overfitting and makes sure that only splits that are considered "worthy" are implemented) obtained with 10-fold cross-validation. The random forest estimates are obtained using 1000 trees, while the boosting estimates are obtained with 1000 boosted regression trees. For the boosting, the minimum number of observations in the trees' terminal nodes is set to 1 and the bag.fraction parameter is set to 0.5, except for Panel D of Table 1, where it is increased to 0.8. For the neural networks, we use 2 neurons and a decay parameter of 0.01; the activation function is set to the linear function.34 For the analysis of nonlinear terms with lasso, we examine the estimated nuisance functions for the outcome average entry rate and the treatment variable first-year effective tax rate. In our analysis, for the estimation of the two nuisance functions, the lasso selects among the simple covariates and their two-way interactions.35 It is interesting to note that a large number of interaction terms is selected. Figure B.1 depicts the seven largest interaction terms and their coefficients in the treatment nuisance function m(·) and in the outcome nuisance function ĝ(·).36 The lasso coefficients are calculated as the median coefficients across splits. Among these, some appear in both nuisance functions (the coefficients of the common terms are depicted in purple in Figure B.1). A particular issue that arises with the lasso when the interest is in analyzing the interaction terms is worth mentioning here. Since the lasso implements regularization by shrinking the smallest coefficients to zero, it is possible that interaction terms are included in the regression while the coefficients of the raw covariates forming the interactions are shrunk to zero. It is thus important to check whether the raw covariates forming these interactions also appear in the regression. If the coefficients on the raw covariates are shrunk, the coefficients of the 'pure' interaction terms might not be properly captured, and the selected interaction terms might actually reflect the effect of the raw covariates, diminishing the importance of the uncovered nonlinearities. Thus, when analyzing the relevance of the interaction terms, we are careful to only report the coefficients of the interactions for which both main effects are included in the lasso estimation. The lasso coefficients of all the raw covariates are reported in Table B.1 of the Appendix.
34 In general, the activation function can be set to the linear function for regression problems (when the outcome is continuous) and to the logistic function for classification problems (when the outcome is categorical).
35 For the lasso estimation, depending on the application, other nonlinear terms could be added, such as the squares of the covariates, or three-way interactions.

A.2 The Effect of Plough Agriculture on Gender Roles
Details on the Original Analysis. The authors consider several empirical strategies and data sets. They start with OLS regressions performed using country-level and micro-level data. Then, to tackle possible endogeneity issues, the paper follows two approaches: first, several potential confounders are included in the regressions; second, an instrumental variable strategy is used.37 Our focus is on the country-level regressions. The baseline OLS country-level results in the original analysis (reported in Table 4 of the original paper) are obtained by estimating the following regression:

y_c = α + β plough_use_c + X^H_c′Γ_H + X^C_c′Γ_C + ε_c,

where c stands for country. In the paper, three outcome variables are examined as measures of gender roles: female labour force participation, attitudes about women's work, and attitudes about women as leaders. The first outcome variable is an indicator variable that equals one if the individual is in the labor force in 2000; the second is the share of firms with a woman among their principal owners in the period 2003-2010; finally, the third is the proportion of seats held by women in the national parliament in 2000. The treatment variable, plough_use_c, is calculated as the estimated proportion of individuals living in a country with ancestors that used the plough in pre-industrial agriculture. The vector X^H_c includes historical ethnographic variables at the country level. These controls capture the historical characteristics of ethnicities living in a country, and they are meant to account for differences between ethnicities that historically adopted the plough and those that did not. They include: ancestral suitability for agriculture, fraction of ancestral land that was tropical or subtropical, ancestral domestication of large animals, ancestral settlement patterns, and ancestral political complexity. The vector X^C_c denotes contemporary country-level controls: the natural log of real per capita GDP, and its square. These are included as the level of economic development is believed to have an impact on female labour force participation, and the square of per capita GDP is intended to capture the observed U-shaped relation between the two variables. Continent fixed effects are also added in some specifications.
36 It is important to note here that we do not make inference using the lasso coefficients, but just analyze the magnitude of the coefficients, as a measure of the covariates' importance for predicting the outcome and the treatment variables.
37 The instrumental variable strategy is summarized in Section 2.2.2.
The extended set of controls includes additional historical and contemporary controls. Just as with the baseline controls, the additional historical controls are measures of the characteristics of the ancestors of the current population living in a country. These are: the intensity of agriculture; the proportion of subsistence provided by hunting and by the herding of large animals; the fraction of a country's ancestors without land inheritance rules, with patrilocal post-marital residence rules, and with matrilocal post-marital residence rules; the fraction of a country's ancestors with a nuclear and with an extended family structure; and the average year the ethnicities were sampled in the Ethnographic Atlas. The contemporary controls are: years of civil and interstate conflicts; terrain ruggedness; whether a country was under a communist regime after WWII; the fraction of a country's population of European descent; oil production per capita; the agricultural, manufacturing and services shares of GDP; and the fractions of a country's population who are Catholic, Protestant, other Christian, Muslim, and Hindu. The original paper provides the rationale for including each of these controls, and details on how the variables are constructed.
The geo-climatic characteristics included in the IV analysis are: terrain slope, soil depth, average temperature, average precipitation. In the original paper, the geo-climatic characteristics are added linearly, in quadratic forms, and as linear interactions.
Details on the DML Analysis. As in the previous example, the results are obtained with 100 splits and 2-fold cross-fitting. We report median estimates of the coefficients across the splits, and standard errors adjusted for the variability across sample splits using the median method. The values of the tuning parameters are the same as in the first example.

A.3 The Effect of Skill-Biased Tariff on Growth
Details on the Original Analysis. For the country-level results, Nunn and Trefler (2010) estimate the following regression equation:

ln(y_c1/y_c0) = α + β SBτ_c0 + X_c0′Γ + ε_c,

where ln(y_c1/y_c0) is the log annual per capita GDP growth in country c between the beginning and the end of the time period considered, SBτ_c0 is a measure of the initial skill bias of tariffs, and X_c0 represents the controls. Three measures of the skill bias of tariffs are used: the first is the correlation between industry tariffs and industry skill-intensity, while the second and third are based on the difference between the log average tariffs in skill-intensive industries and the log average tariffs in unskilled-intensive industries (the two measures differ in the choice of the cut-off value for industry skill-intensity, with the second using a lower cut-off than the third). The controls include: the log of the initial average level of tariffs in the country, three country characteristics measured at the initial period (the log of GDP per capita, the log of human capital, and the log of the investment-to-GDP ratio), cohort fixed effects (to account for the fact that countries have different initial time periods), region fixed effects (accounting for 10 different regions), and two measures of initial production structure (the log of output in skill-intensive and in unskilled-intensive industries separately). Additionally, Nunn and Trefler (2010) estimate the following regression equation, using industry-level data:

ln(q_ic1/q_ic0) = α_i + β_1 ln q_ic0 + β_2 τ_ic0 + β_3 ln τ_c0 + β_4 SBτ_c0 + X_c0′Γ + ε_ic,

where ln(q_ic1/q_ic0) is the average annual log change in industry output in industry i and country c; ln q_ic0 is the log of industry output in the initial period; τ_ic0 is the log initial-period tariff; ln τ_c0 is the log average tariff; SBτ_c0 is one of the three measures of the skill bias of tariffs; and α_i are industry fixed effects. The vector X_c0 indicates the controls, which are the same as in the country-level regressions.
The original results show a strong, positive correlation between skill-biased tariffs and long-term per capita income growth at the country level (Table 4 in Nunn and Trefler, 2010). The correlation is also strong between the skill bias of tariffs and industry output growth, both with and without including the initial industry tariff in the regression (Tables 5 and 6 in Nunn and Trefler, 2010, respectively). The fact that the size of the coefficient on skill-biased tariffs remains large when adding the initial industry tariff suggests that the mechanism highlighted in the model, i.e. skill-biased tariffs shifting resources towards skill-intensive industries, cannot fully account for the correlation between the treatment variable and long-term growth. Nunn and Trefler (2010) further show, with country-level regressions, that the model mechanism can explain up to one quarter of the total correlation between the skill bias of tariffs and long-term growth (Table 7 in the original paper). The paper then investigates, in its Sections V, VI and VII, alternative mechanisms that can explain the independent effect of skill-biased tariffs on output growth.
Details on the DML Analysis. As in the previous examples, the results are obtained with 100 splits and 2-fold cross-fitting. We report median estimates of the coefficients across splits, and standard errors are adjusted for the variability across sample splits using the median method. The tuning choices are the same as in the previous two examples, except for the neural network in the country-level regressions, where the estimates are obtained using 3 neurons and a decay parameter of 0.001.

A.4 The Effect of Fox News on the Republican Vote Share
Details on the Original Analysis. To produce the main results (see Table IV in the original paper), the authors estimate a regression of the form

ΔRepShare_k,j = α + β d^FOX_k,j,2000 + X_k,2000′γ_1 + X_k,00−90′γ_2 + C_k,2000′γ_3 + θ_j + ε_k,j,

where k denotes a town in congressional district j. The dependent variable is the change in the Republican vote share between the 1996 and the 2000 presidential elections. The treatment variable d^FOX_k,j,2000 is an indicator variable taking the value of 1 for towns where Fox News was available by the year 2000, and 0 otherwise. The regression includes demographic controls at the town level: total population, the employment rate, the shares of African Americans and of Hispanics, the share of males, the share of the population with some college education, the share of college graduates, the share of high school graduates, the share of the town that is urban, the marriage rate, the unemployment rate, and average income. These controls are added both as levels in 2000 (X_k,2000) and as changes between 1990 and 2000 (X_k,00−90), and aim at capturing possible confounders that could be correlated with both the availability of Fox News and voting. In addition to the demographic controls, the regression includes a set of cable system features, denoted by C_k,2000, which are potentially correlated with the treatment variable. These are deciles in the number of channels provided and in the number of potential subscribers. Finally, fixed effects (congressional district fixed effects or county fixed effects), denoted by θ_j, are added to capture trends in voting that might be common to a geographical area and also correlated with Fox News availability. In the original analysis, standard errors are clustered at the cable company level.
The paper also tests whether Fox News increased voter turnout and the Republican vote share in the Senate election.
The results from the heterogeneity analysis of DellaVigna and Kaplan (2007) show a negative but insignificant effect for swing districts. Additionally, the authors find that the effect of Fox News on the Republican vote share is significantly smaller in towns where the number of cable channels is higher, suggesting a negative impact of stronger competition on the effect of Fox News. Moreover, the effect is found to be significantly larger in more urban areas and smaller in more Republican districts. Regarding the latter two findings, the authors point out that in rural areas and in Republican districts the Republican party tends to have a larger vote base to begin with, thus diminishing the share of voters that could potentially be convinced by Fox News. Out of the four effects, only the differential effect for the urban population is significant in both main specifications (county and district fixed effects). The interaction of the treatment variable with the Republican district variable is only significant when including county fixed effects, but not when including district fixed effects, and the opposite is true for the interaction of the treatment with the number of cable channels. The authors also note that they find a smaller effect in the South, but this result is not reported in their paper, and we do not focus on it in our analysis.
Details on the Analysis with the Causal Random Forest. There are a number of parameters to be set in the causal random forest algorithm, such as the number of trees, the size of the subsample, and the minimum number of control and treatment units in each leaf. The number of trees is typically chosen as a trade-off between computation time and the test error rate. A larger number of trees reduces the Monte Carlo error due to subsampling, which means that the treatment effect predictions will vary less across different forests. A higher minimum number of treatment and control units will lead to bigger leaves and a shallower tree, which will predict less heterogeneity. A smaller number will increase the variance, as the treatment effect will be estimated with too few observations in a given leaf. Setting a smaller subsample size will decrease the dependence across trees, but will increase the variance of each estimate in a tree. The sizes of the training and estimation samples are typically fixed at 50% of the drawn subsample. If there are reasons to allocate more observations to one or the other sample, these proportions can be changed. In the algorithm, there is also a standard parameter for the number of covariates considered for a split, before building a tree, within a forest.38 This is often set to K/3 in the literature, where K is the total number of covariates. In our analysis, the tuning parameter values are optimized via cross-validation, except the number of trees, which is set to 2000. We performed sensitivity analysis with different values for the number of trees (1000 and 5000); the results are available upon request.
In the cluster-robust causal forest (Athey and Wager, 2019), when constructing the subsample on which the forest is trained, we do not directly draw observations, but clusters. In addition, in the final step, when constructing the predicted out-of-bag treatment effects, an observation is considered out-of-bag if its cluster was not drawn in the subsample.
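The cluster-level subsampling and the out-of-bag rule described above can be sketched in a few lines (a toy illustration with hypothetical cluster sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
n_clusters = 50
cluster_id = np.repeat(np.arange(n_clusters), 20)     # 20 observations per cluster

# To train one tree, draw a subsample of clusters, not of observations
drawn = rng.choice(n_clusters, size=n_clusters // 2, replace=False)
in_sample = np.isin(cluster_id, drawn)

# An observation is out-of-bag for this tree iff its whole cluster was left out
oob = ~in_sample
assert not np.any(np.isin(cluster_id[oob], drawn))
```

Each observation's out-of-bag treatment effect prediction is then averaged only over the trees for which its cluster was not drawn.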
The variable importance measure reported in Tables B.7 and B.8 takes into account the proportion of splits on a particular variable over all trees, weighted by depth. It is useful for describing which covariates most influence the final estimates of the causal forest, since the interpretability of a single tree is lost in this case. Recall from the main text that in the causal forest splits are performed if they maximize a criterion function that rewards splits that increase the variance of the treatment effect across leaves, while penalizing splits that increase the variance within a leaf. Hence, higher values of this measure indicate greater importance in terms of heterogeneity of treatment effects.
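A depth-weighted split-frequency measure of this kind can be sketched as below. This is an illustrative scheme, not the exact formula used by the causal forest software; the decay rate and depth cutoff are assumptions.

```python
import numpy as np

def split_importance(splits, n_covariates, decay=2.0, max_depth=4):
    """Depth-weighted split frequency (an illustration, not the exact formula).

    `splits` is a list of (variable_index, depth) pairs pooled over all trees.
    """
    counts = np.zeros((max_depth, n_covariates))
    for var, depth in splits:
        if depth < max_depth:
            counts[depth, var] += 1.0
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0.0] = 1.0               # avoid division by zero at unused depths
    freq = counts / totals                    # share of splits on each variable, per depth
    weights = decay ** -np.arange(max_depth)  # shallower splits receive more weight
    return (weights[:, None] * freq).sum(axis=0) / weights.sum()
```

A variable that is split on near the root of many trees thus receives a higher importance score than one that only appears deep in the trees.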
The estimated specification is

Y_{i,j} = \beta D_j + X_{i,j}'\gamma + \tau_k + \varepsilon_{i,j},

where Y_{i,j} is the outcome, measured at midline or endline, for student i in school j; D_j is a dummy variable indicating the treatment assignment; the vector X_{i,j} includes the control variables, measured at baseline; and \tau_k indicates the block fixed effects. 39 The main outcome of interest, student achievement, is measured with a 35-minute mathematics test at endline. The full set of control variables includes student characteristics (age, gender, parental educational attainment, household wealth), class size, and teacher characteristics (gender, age, experience, education level, rank, a teacher certification dummy, and a dummy indicating whether the teacher majored in math).
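A specification of this form can be estimated by OLS with block dummies absorbing the fixed effects. The sketch below uses simulated data with hypothetical dimensions and coefficient values, and omits the clustering of standard errors used in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_blocks = 600, 6
block = rng.integers(0, n_blocks, size=n)   # randomisation block of each student
D = rng.integers(0, 2, size=n)              # treatment dummy (school-level in the paper)
X = rng.normal(size=(n, 3))                 # baseline controls
tau = np.linspace(-0.5, 0.5, n_blocks)      # block fixed effects
Y = 0.2 * D + X @ np.array([0.3, -0.1, 0.05]) + tau[block] + rng.normal(size=n)

# design matrix: treatment, controls, and block dummies absorbing the intercept
Z = np.column_stack([D, X, np.eye(n_blocks)[block]])
beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
treatment_effect = beta[0]   # coefficient on the treatment dummy
```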
The findings from the heterogeneity analysis suggest that the program has a small positive effect on the achievement of students taught by less qualified teachers and a negative effect on students whose teachers are more qualified. In addition, some evidence of heterogeneity is found with respect to whether the teacher majored in math, with a negative effect on achievement for students whose teachers did major in math (this effect is found only when comparing the PD plus follow-up arm with the control group).
Details on the Generic ML Analysis. In addition to the full set of controls included in the original paper, we add the following variables to our analysis: the baseline values of a number of student-level variables (math self-concept, math anxiety, intrinsic motivation for math, instrumental motivation for math, and time spent each week studying math), plus a number of variables capturing teachers' behaviour in the classroom, as evaluated by students at baseline (instructional practices, teacher care, classroom management, and teacher communication). The generic ML method takes into account two sources of uncertainty: estimation uncertainty, as the final estimates of interest are obtained conditional on the auxiliary sample, and splitting uncertainty, as the data is randomly split into many auxiliary and main samples. The point estimates are obtained as the median estimates over the different splits of the data. The confidence intervals are constructed by taking the medians of the lower and upper bounds over the random splits; their nominal level is adjusted to 90% to account for the splitting uncertainty. In a similar way, the p-values are computed as the median of many random conditional p-values, with the nominal level again adjusted for splitting uncertainty.
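The aggregation over splits can be sketched as below. The doubling of the median p-value is one common adjustment for splitting uncertainty in this literature; the input values are hypothetical.

```python
import numpy as np

def aggregate_over_splits(estimates, lowers, uppers, pvalues):
    """Combine per-split results: medians over splits, with the p-value
    doubled as an adjustment for splitting uncertainty (illustrative)."""
    point = float(np.median(estimates))
    ci = (float(np.median(lowers)), float(np.median(uppers)))
    p_adjusted = min(1.0, 2.0 * float(np.median(pvalues)))
    return point, ci, p_adjusted
```

For example, per-split estimates of (1.0, 2.0, 3.0) with p-values (0.02, 0.03, 0.10) yield a point estimate of 2.0 and an adjusted p-value of 0.06.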
The Best BLP and Best GATES measures are based on maximizing the correlation between the proxy predictor of the conditional average treatment effect, S(Z), and the true conditional treatment effect, s_0(Z) (see Chernozhukov et al., 2018b). Table B.9 shows that this correlation is the largest for the Neural Network. Therefore, we carry out the HTE analysis using the Neural Network.
The values of the tuning parameters were optimised via cross-validation for the Elastic Net and the Neural Network. For the random forest, they are set to default values to save on computation time: the number of trees is set to 2000 and the number of covariates considered for a split is set to K/3, which gives a value of 8.

Notes: The table shows the lasso coefficients of the raw covariates, obtained by estimating the nuisance functions g(·) (column 1) and m(·) (column 2). The lasso coefficients are calculated as the median over splits.

Notes: The table reports the effect of Fox News on the Republican vote share for towns with values below (column 1) and above (column 2) the median of each variable. Column 3 presents the p-value for the null of no difference between the estimates in columns 1 and 2. Standard errors are reported in parentheses. The estimates are obtained from the causal random forest that includes district dummy variables. As we are not interested in exploring heterogeneity along the congressional districts, the HTE results for the district dummy variables are omitted from the table.

Notes: The table reports the effect of Fox News on the Republican vote share for towns with values below (column 1) and above (column 2) the median of each variable. Column 3 presents the p-value for the null of no difference between the estimates in columns 1 and 2. Standard errors are reported in parentheses. The estimates are obtained from the cluster-robust causal forest.

Notes: The estimates are obtained using a neural network to produce the proxy predictor S(Z). The values reported correspond to the medians over 100 splits.

Notes: The figure plots the seven largest lasso coefficients of the interaction terms, obtained by estimating the nuisance functions m(·) and g(·). Colons indicate interactions of variables. The treatment variable D is the first-year effective corporate tax rate. The dependent variable Y is the average entry rate. The lasso coefficients are calculated as the median over splits.