Identifying heterogeneity using recursive partitioning: evidence from SMS nudges encouraging voluntary retirement savings in Mexico

Abstract Individuals regularly struggle to save for retirement. Using a large-scale field experiment (N=97,149) in Mexico, we test the effectiveness of several behavioral interventions geared toward improving voluntary retirement savings contributions, relative to existing policy and to each other. We find that an intervention framing savings as a way to secure one’s family’s future significantly improves contribution rates. We leverage recursive partitioning techniques and identify that the overall positive treatment effect masks subpopulations where the treatment is even more effective and other groups where the treatment has a significant negative effect, decreasing contribution rates. Accounting for this variation is important for theoretical and policy development as well as firm profitability. Our work also provides a methodological framework for how to better design, scale, and deploy behavioral interventions to maximize their effectiveness.

• Appendix E provides the results comparing the heterogeneous treatment effects relative to the average treatment effects.
• Appendix F provides the plots from the first stage of the Causal Forest exercise, as well as the estimates of the treatment effects produced via Causal Forest for all behavioral interventions.
• Appendix G contains details of the simulation study used for the Causal Forest approach.
• Appendix H shows the results for identifying heterogeneous treatment effects using OLS.
• Appendix I presents a detailed analysis of the persistence results post-intervention, using different time intervals and age specifications.
Note: This table reports the estimated relationship of contribution amount as a function of our treatment interventions during our experimental period (October 3 to December 31) and in the two month period following our experimental period (January 1-February 28) after all individuals went back to receiving just the standard account statement. We compare treatment interventions relative to the standard account statement. We use the logarithm of one plus the total amount contributed across an individual's contributions. Columns 3 and 4 estimate the same variable but during the post-experimental period. Columns 2 and 4 present results inclusive of a control for prior contribution status. All regressions include a constant. Robust standard errors are used and shown in parentheses. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test.
Note: This table reports the estimated relationship of contribution likelihood and contribution amount as a function of our treatment interventions during our experimental period (October 3 to December 31). We compare treatment interventions relative to the new statement. Column (1) estimates the relationship between the treatment interventions versus the new statement on contribution likelihood. Column (2) estimates the relationship between the treatment interventions versus the new statement on total amount contributed during the experimental period using an inverse hyperbolic sine transformation. Positive coefficients represent the behavioral intervention increasing contributions relative to the New Statement, while negative coefficients represent the behavioral intervention decreasing contributions. Results comparing the New Statement versus the Standard Account Statement are presented in Table 2. Robust standard errors are used and shown in parentheses. All regressions include a constant and controls for prior contribution status. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test.
Note: This table reports the estimated relationship of contribution likelihood and contribution amount as a function of our treatment interventions during our experimental period (October 3 to December 31). We compare treatment interventions relative to the Family Security SMS. Column (1) estimates the relationship between the treatment interventions versus the Family Security SMS on contribution likelihood. Column (2) estimates the relationship between the treatment interventions versus the Family Security SMS on total amount contributed during the experimental period using an inverse hyperbolic sine transformation. Positive coefficients represent the behavioral intervention increasing contributions relative to the Family Security SMS, while negative coefficients represent the behavioral intervention decreasing contributions. Robust standard errors are used and shown in parentheses. All regressions include a constant and controls for prior contribution status. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test.

Figure S4: Percent Change in Contribution Likelihood Relative to Family SMS
Note: Significance is indicated by *, **, and *** at 10%, 5%, and 1% for a two-tailed test, respectively and is based on a comparison between each experimental group relative to those who received the Family SMS intervention.

Approaches: Causal Tree and Causal Forest
In this section, we discuss how researchers can use recursive partitioning to detect heterogeneous treatment effects. One advantage of these approaches is that they are relatively simple to implement, with readily available packages that can be installed in R. In our experience, Causal Tree, once installed, took roughly five minutes to run on an Apple MacBook Pro with 16 GB of RAM. Causal Forest took longer, producing results within 30-60 minutes. Models with larger datasets, or more regressors, would likely require more memory or longer computational time to run.
First, researchers will need to install specific packages, if they have not done so already.
For Causal Tree, we first need the "devtools" package, which is used to install the causalTree package from GitHub (https://github.com/susanathey/causalTree). The code is as follows:
    install.packages("devtools")
    library(devtools)
    install_github("susanathey/causalTree")
It should be noted that other packages may also need to be downloaded for "devtools" to run properly. In our case, running the "devtools" package produced the following error message: "There is no package called 'shiny'." Shiny is an R package that makes it easier to build interactive web apps from R. To rectify this issue, we chose Tools -> Install packages, and then downloaded the "shiny" package in RStudio. Researchers may have different dependencies and so should download additional packages accordingly. If you are using Windows, install the correct version of RTools for your version of R. For Mac, this may require installing a gcc compiler such as Apple's Command Line Tools. Once all packages are downloaded and no errors are indicated, the "devtools" package can be re-run.
Once this is complete, the data can be loaded into R. We then followed the example code provided in the documentation that can be found on this website: https://github.com/susanathey/causalTree/blob/mast. Our approach leveraged the Honesty approach, so we followed the code provided in section 5.1, "Honest Estimation," of the documentation. In this approach, 50% of the data is used for the training stage while the other 50% is used as the holdout sample. The data used for the 50% holdout (test) stage should be saved in a separate location as per the documentation provided by Athey et al. (2016). We specifically used the following:
    honestTree <- honest.causalTree(y ~ x1 + x2 + x3 + x4,
                                    data = traindata,
                                    treatment = traindata$treatment,
                                    est_data = estdata,
                                    est_treatment = estdata$treatment,
                                    split.Rule = "CT", split.Honest = T,
                                    HonestSampleSize = nrow(estdata),
                                    split.Bucket = T, cv.option = "fit")
In our setting, contribution likelihood was defined as our main outcome measure, "Y," while Age, Gender, and prior contribution status were defined as our x1, x2, and x3. If researchers have more X variables available for exploring heterogeneous treatment effects, then these additional variables can be added to the formula after x4 as follows: "honestTree <- honest.causalTree(y ~ x1 + x2 + x3 + x4 + x5 + x6 + ... + xN, ...". It should also be noted that while we used the "CT" criterion for cross-validation, other evaluation functions are available for computing the cross-validation error. Details can be found on GitHub and in the documentation provided by Athey et al. (2016).
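The 50/50 split that feeds the honest estimation step can be sketched in base R as follows. This is only an illustration: the data frame df, its simulated columns, and the seed are hypothetical stand-ins for the experimental data, and the names traindata and estdata simply follow the call shown above.

```r
# Hypothetical data frame standing in for the experimental data.
set.seed(1)                             # arbitrary seed, for reproducibility only
n  <- 1000
df <- data.frame(
  y         = rbinom(n, 1, 0.1),        # outcome: contributed (1) or not (0)
  treatment = rbinom(n, 1, 0.5),        # treatment indicator
  x1        = sample(20:65, n, TRUE),   # e.g. age
  x2        = rbinom(n, 1, 0.5)         # e.g. gender indicator
)

# Randomly assign half of the rows to the training stage.
train_idx <- sample(seq_len(n), size = floor(n / 2))
traindata <- df[train_idx, ]    # used to grow the tree
estdata   <- df[-train_idx, ]   # held out for honest estimation
```

Because the two samples share no rows, effects estimated on estdata are not influenced by the observations that determined the splits.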
Following the example code on the GitHub website, the resulting tree can be pruned and the dendrogram plotted using the rpart.plot function. Researchers can also set the minimum number of treated observations and of control observations required in a leaf to be split by specifying min = n. In our case, following the minimum number of observations specified by Athey and Imbens (2016), we specified the minimum as 2. However, we also tested whether our results changed using different specifications for the min, i.e., 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, and 20. We observed no significant changes in the Causal Tree output regardless of the min specification used. We note that researchers can also specify the minimum and maximum number of buckets by adding the line "bucketmin = n, bucketMax = n" following the split.Bucket line entry. Given that we were not interested in constraining the number of buckets, we opted not to specify a minimum or maximum bucket number. A visual representation of the cutpoints identified in the first, 'training' stage of Causal Tree is portrayed via a dendrogram. In our setting, Figure 4 represents the dendrogram for the first stage of the Causal Tree approach.
Using the 50% holdout sample, we can now move to the estimation and 'test' stage to assess heterogeneous treatment effects. To obtain robust standard errors, researchers should download the 'lmtest' and 'sandwich' packages prior to estimation. Both packages can be downloaded by using Tools -> Install packages and typing in 'lmtest' and 'sandwich,' respectively. Once this is done, the researcher can run a regression of the outcome variable on a series of dummy variables corresponding to the different leaves of the resulting tree on the prediction data.
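A minimal sketch of such a leaf-level regression with heteroskedasticity-robust standard errors is given below. It uses simulated data: the leaf factor, the outcome process, and the choice to interact leaf dummies with the treatment indicator are illustrative assumptions rather than the exact specification used in the paper. Only lm(), coeftest() from 'lmtest', and vcovHC() from 'sandwich' are relied upon.

```r
library(lmtest)    # coeftest()
library(sandwich)  # vcovHC()

set.seed(2)        # arbitrary seed
n <- 2000

# Hypothetical holdout sample; `leaf` stands in for the leaf each
# observation falls into under the tree grown on the training sample.
estdata <- data.frame(
  treatment = rbinom(n, 1, 0.5),
  leaf      = factor(sample(c("leaf1", "leaf2", "leaf3"), n, TRUE))
)
# Made-up outcome: the treatment only matters in leaf2.
p <- 0.10 + 0.05 * estdata$treatment * (estdata$leaf == "leaf2")
estdata$y <- rbinom(n, 1, p)

# One dummy per leaf plus a leaf-specific treatment coefficient (assumed spec).
fit <- lm(y ~ 0 + leaf + leaf:treatment, data = estdata)

# Robust (HC1) standard errors for every coefficient.
robust <- coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))
print(robust)
```

In this specification, each leaf:treatment coefficient is the estimated treatment effect within that leaf, and its robust standard error can be read directly from the printed table.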
One thing researchers should keep in mind with the Causal Tree procedure is that estimates are based only on a single tree. As a result, prediction error can be high. The exact tree structure obtained from Causal Tree can also be sensitive to the sub-sample of the data used. If we run the procedure using different 50% sub-samples, it is quite possible that some smaller partitions may disappear or be replaced with other small partitions. One remedy is to use the Causal Forest procedure, which constructs predictions by averaging over many trees grown on sub-samples of the data. Causal Forest procedures have the advantage of lower prediction variance and greater stability of the estimates (Athey et al., 2019; Breiman, 2001; Wager and Athey, 2018). In addition, as is the case with many statistical tests and machine-learning approaches, researchers should use judgment when drawing conclusions about heterogeneity in the data. There may be instances where splits or partitions do not align with behavioral patterns, or where the partitions rest on such a small sample of the data that meaningful insights are difficult to glean.
Researchers should combine insights gleaned from recursive partitioning techniques with intuition and other empirical evidence before scaling interventions or making broad predictions.
To run Causal Forest, we used the R package grf. This package is available on the CRAN repository and can be installed in R using the same approach as the previous installations. The package description and documentation are available at the website https://cran.r-project.org/web/packages/grf/index.html. The Causal Forest is easier to implement than Causal Tree, because estimation is self-contained in a single command (causal_forest). Honesty (sample splitting) is directly incorporated into the causal_forest function, since the Causal Forest is run on many different random sub-samples of the data. Thus, Causal Forest can be run on the entire dataset without a need for splitting the sample before estimation. The syntax for causal_forest() is as follows. Its first argument, X, is a matrix of independent variables that are used to estimate treatment effect heterogeneity. In our case, we use age, gender, and prior contribution. The next argument, Y, is the outcome variable (a contribution indicator in our case). The third argument, W, is an indicator for treatment, in our case, whether an individual receives the Family Security SMS. The function also allows one to choose a number of optional tuning parameters, as well as the number of CPU threads the algorithm is run on. The tuning parameters can be set automatically through cross-validation by running a function called tune_causal_forest prior to estimation. Our simulation study, presented in Section G of the Supplement, indicated that the results we would expect to see were robust to the tuning parameters, and so we left them at the default values, except the number of trees, which we set to 25,000 to improve statistical efficiency. Again, however, 25,000 trees was likely conservative, as our simulation study indicated the results were robust to the number of included trees.
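As a rough illustration of this workflow, the sketch below fits a causal forest on simulated data and then computes a subgroup average treatment effect with grf's average_treatment_effect function. The covariates, the outcome process, the seed, and the small number of trees are all made-up assumptions chosen so the example runs quickly; they are not the paper's data or settings.

```r
library(grf)

set.seed(3)   # arbitrary seed
n <- 2000

# Hypothetical covariates mirroring the moderators named in the text.
age    <- sample(20:65, n, replace = TRUE)
female <- rbinom(n, 1, 0.5)
prior  <- rbinom(n, 1, 0.05)
X <- cbind(age = age, female = female, prior = prior)

W <- rbinom(n, 1, 0.5)                            # treatment indicator
Y <- rbinom(n, 1, 0.05 + 0.02 * W * (age > 28))   # made-up outcome process

# X, Y, W in that order; tuning parameters left at their defaults.
cf <- causal_forest(X, Y, W, num.trees = 2000)

# Average treatment effect for one subgroup: women aged 28 and under.
subgroup <- female == 1 & age <= 28
cate <- average_treatment_effect(cf, target.sample = "all", subset = subgroup)
print(cate)   # named vector with elements "estimate" and "std.err"

# Approximate 95% confidence bounds from the reported standard error.
ci <- cate["estimate"] + c(-1, 1) * qnorm(0.975) * cate["std.err"]
```

The logical vector passed to subset plays the role of the TRUE/FALSE selector for the sub-sample of interest, so other subgroups can be examined by changing only that one line.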
The causal_forest function returns an object containing estimation results that can be used to compute conditional average treatment effects. We note that this object can be relatively large in terms of disk space (in our case, it was 2-3 GB). One can compute conditional average treatment effects (CATEs) on subsamples of interest using the function average_treatment_effect included in the grf package, which computes a doubly robust estimate of an average treatment effect. This function takes three arguments. The first is the object returned by the causal_forest function. The second specifies the target sample; in our case, we specified "all" for this option since we wanted to compute average treatment effects for both treatment and control observations. The third indicates a particular sub-sample of the data for which to compute the average treatment effect. This is a vector of TRUE and FALSE values, where TRUE indicates an observation to be included in the CATE calculation. For example, if we wish to compute the CATE for women aged 28 and under, we would pass this third argument a vector that is TRUE for all observations in the data where the individual is female and has an age less than or equal to 28. The function returns both the CATE for the indicated sub-sample and a standard error on the CATE, which can be used to compute confidence bounds.
Note: This table reports the estimated relationship of contribution likelihood as a function of the age and gender cutpoints implied from the first stage of Causal Forest using the 50% holdout sample. Statistical significance is indicated via *, **, and *** representing 10%, 5%, and 1% for a two-tailed test, respectively. All regressions include a constant.
Note: This table reports the estimated relationship of contribution likelihood during our experimental period, comparing the various behavioral interventions relative to sending the redesigned, new account statement alone. We first present the results of the overall main effects, collapsing across all ages. We then present the regression results split by i) those who are 28 years of age and below, and ii) those who are 29 years of age and above. Positive coefficients represent the behavioral intervention performing better relative to sending the new account statement by itself, while negative coefficients represent the behavioral intervention performing worse. The results comparing the New Statement to the Standard Account Statement are not presented here, as they appear in Table 3. Robust standard errors are shown in parentheses. All regressions include controls for prior contribution status. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test.
Note: This table reports the estimated relationship of contribution likelihood and contribution amount (IHS transformed) during our experimental period, comparing the various behavioral interventions relative to the Family Security SMS intervention. We first present the results of the overall main effects, collapsing across all ages. We then present the regression results split by the two overall age buckets implied by the machine-learning approaches: those who are 28 and younger and those who are 29 and older. Negative coefficients represent the SMS behavioral intervention decreasing contributions relative to the Family SMS, while positive coefficients represent the behavioral intervention increasing contributions relative to the Family SMS. All regressions include controls for prior contribution status. Robust standard errors are used and shown in parentheses. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test. All regressions include a constant.
Note: This table presents the conditional average treatment effects of October SMS Treatment on the probability of making a contribution from October 3-December 31, bracketed by age. Coefficients represent comparisons between the heterogeneous treatment effect and the average treatment effect using the coefficients presented in column 2 of Table 2. Robust standard errors are included in parentheses. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test.
Note: This table presents the comparisons between the heterogeneous treatment effect and the average treatment effect using the coefficients presented in Column 2 of Table 2. Robust standard errors are included in parentheses. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test.
Note: This table depicts the estimated treatment effects and confidence intervals derived from Generalized Random Forest estimation, comparing contribution likelihood for the New Statement intervention relative to the Standard Account Statement during the experimental period. The sample is restricted to those individuals who are in each respective condition, with the sample size totaling to 27,777 individuals. The estimates were derived by using a supplied procedure that estimates the conditional average treatment effect for subsets of the data. Negative coefficients represent the New Statement decreasing contribution likelihood relative to the Standard Account Statement, while positive coefficients represent the New Statement increasing contribution likelihood. All regressions include a constant and controls for prior contribution status. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test.
Note: This table depicts the estimated treatment effects and confidence intervals derived from Generalized Random Forest estimation, comparing contribution likelihood for the Basic Alert SMS intervention relative to the Standard Account Statement during the experimental period. The sample is restricted to those individuals who are in each respective condition, with the sample size totaling to 27,761 individuals. The estimates were derived by using a supplied procedure that estimates the conditional average treatment effect for subsets of the data. Negative coefficients represent the Basic Alert SMS decreasing contribution likelihood relative to the Standard Account Statement, while positive coefficients represent the Basic Alert SMS increasing contribution likelihood. All regressions include a constant and controls for prior contribution status. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test.
Note: This table depicts the estimated treatment effects and confidence intervals derived from Generalized Random Forest estimation, comparing contribution likelihood for the Fresh Start SMS intervention relative to the Standard Account Statement during the experimental period. The sample is restricted to those individuals who are in each respective condition, with the sample size totaling to 27,784 individuals. The estimates were derived by using a supplied procedure that estimates the conditional average treatment effect for subsets of the data. Negative coefficients represent the Fresh Start SMS decreasing contribution likelihood relative to the Standard Account Statement, while positive coefficients represent the Fresh Start SMS increasing contribution likelihood. All regressions include a constant and controls for prior contribution status. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test.
Note: This table depicts the estimated treatment effects and confidence intervals derived from Generalized Random Forest estimation, comparing contribution likelihood for the Small Amounts SMS intervention relative to the Standard Account Statement during the experimental period. The sample is restricted to those individuals who are in each respective condition, with the sample size totaling to 27,779 individuals. The estimates were derived by using a supplied procedure that estimates the conditional average treatment effect for subsets of the data. Negative coefficients represent the Small Amounts SMS decreasing contribution likelihood relative to the Standard Account Statement, while positive coefficients represent the Small Amounts SMS increasing contribution likelihood. All regressions include a constant and controls for prior contribution status. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test.
Note: This table depicts the estimated treatment effects and confidence intervals derived from Generalized Random Forest estimation, comparing contribution likelihood for the Individual Goals SMS intervention relative to the Standard Account Statement during the experimental period. The sample is restricted to those individuals who are in each respective condition, with the sample size totaling to 27,803 individuals. The estimates were derived by using a supplied procedure that estimates the conditional average treatment effect for subsets of the data. Negative coefficients represent the Individual Goals SMS decreasing contribution likelihood relative to the Standard Account Statement, while positive coefficients represent the Individual Goals SMS increasing contribution likelihood. All regressions include a constant and controls for prior contribution status. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test.
Note: This table depicts the estimated treatment effects and confidence intervals derived from Generalized Random Forest estimation, comparing contribution likelihood for the Family Security SMS intervention relative to the Standard Account Statement during the experimental period. The sample is restricted to those individuals who are in each respective condition, with the sample size totaling to 27,755 individuals. The estimates were derived by using a supplied procedure that estimates the conditional average treatment effect for subsets of the data. Negative coefficients represent the Family Security SMS decreasing contribution likelihood relative to the Standard Account Statement, while positive coefficients represent the Family Security SMS increasing contribution likelihood. All regressions include a constant and controls for prior contribution status. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test.

Data Experiment & Analyses
We estimate a Causal Forest using the Generalized Random Forest approach of Wager and Athey (2018) on the family treatment relative to the control. We once again use the proportion making a savings contribution as our main outcome variable of interest. Figure 5 presents the plots of the resulting conditional average treatment effects by age and gender. We use gender, age, and prior contribution status as our potential moderators of interest.[9] We can directly apply Causal Forest to the entire dataset without explicitly specifying training and estimation data, because the underlying trees used to compute treatment effects are honest: for every observation in the data, the treatment effect is estimated using trees grown on other random subsamples of the data that do not contain that observation.[10] It can be seen in Figure 5 that for women, there is an increase in the effectiveness of the Family Security treatment (from negative to positive) around age 28, and the positive effects diminish just after age 40. For men, we see a similar increase around age 28 and a decrease after age 40, although the predicted effects are not negative for younger men. Overall, the results suggest that there are significant heterogeneous treatment responses. Moreover, the results not only confirm the validity of the Causal Tree results but can also help determine whether there are heterogeneous outcomes across all treatment conditions as a function of these age brackets and gender or whether this is specific to the Family Security SMS in particular.
For comparison, we present the estimated treatment effects from the Causal Forest approach for all treatments and for all values of age and gender. These can be found in Appendix F, Tables S11 through S16.
[9] Although we include prior contribution status as an independent variable in the Causal Forest, we found that the forest does not split on it. Thus, we only present plots of CATEs for age and gender.
[10] The confidence intervals shown in the plots are produced by the estimation procedure, and are based on the infinitesimal jackknife estimator. We use the supplied R package called 'grf' and use the default values and 5,000 trees for our tuning parameters. We also present the results of several simulation studies to show that, for a dataset of our size with a similar data-generating process, the estimated heterogeneous treatment effects are robust to the number of trees and the tuning parameter values.
We evaluate the sensitivity of estimated treatment effects from the Causal Forest estimator to tuning parameters using a simulation exercise. We construct a simulated dataset which mimics the dataset used for estimation in the following way: First, we take a population of N = 25,000 simulated individuals and split individuals into treatment and control groups with probability 0.5. Next, we randomly assign individuals to a) either gender with probability 0.5, b) an age between 20 and 65, putting equal weight on each year, and c) to be a prior contributor with probability 0.05. We assume the individual's outcome is determined by the equation: [...] across individuals, and the effects of the treatment by age are δ1 = −0.01, δ2 = 0.01, and δ3 = 0. age_i represents individual i's age, and prior_i is an indicator variable with 1 indicating that individual i has made a prior contribution. After generating the artificial data, we compute estimated treatment effects using Causal Forest for each individual using (1) the default settings with 5,000 trees, (2) the automatic tuning procedure with 5,000 trees, and (3) the default settings with 25,000 trees.
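The covariate-generation step of this simulation can be sketched in base R as follows. The column names and the seed are hypothetical, and the sketch deliberately stops before generating outcomes: it does not assume a functional form for the outcome equation or the age ranges to which each δ applies.

```r
set.seed(4)    # arbitrary seed
N <- 25000     # simulated population size stated in the text

sim <- data.frame(
  treat  = rbinom(N, 1, 0.5),                  # treatment vs. control, p = 0.5
  female = rbinom(N, 1, 0.5),                  # either gender, p = 0.5
  age    = sample(20:65, N, replace = TRUE),   # equal weight on each year
  prior  = rbinom(N, 1, 0.05)                  # prior contributor, p = 0.05
)

# Age-specific treatment effects reported in the text (delta_1..delta_3).
delta <- c(-0.01, 0.01, 0)

# Quick sanity checks on the simulated shares.
colMeans(sim[, c("treat", "female", "prior")])
```

Once an outcome column is generated from the paper's outcome equation, the same data frame can be passed to a causal forest under each of the three tuning configurations listed above.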
In Table S17, we show the conditional average treatment effects for each estimation run for the age bins corresponding to where the parameters change. The results show that the parameter estimates are close to the truth in all cases, and the true parameters are well within the 95% confidence bounds implied by the estimated standard errors. We do not perform a full Monte Carlo study that includes coverage rates, as the computational burden of running the estimation even once is high.
Note: This table reports the estimated relationship of contribution likelihood as a function of the Family Security treatment intervention, and various treatment by demographic interaction terms during our experimental period (October 3 to December 31). We compare the main effect of the treatment as well as the interaction terms relative to our baseline condition: the standard, control account statement. The sample is restricted to individuals in the control or family treatment conditions, and the total sample size is 27,755 individuals. Robust standard errors are used and shown in parentheses. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test. All regressions include a constant as well as controls for prior contribution status and the main effects of the variables used for interactions (a gender indicator, age trends/age bucket indicators).
Note: This table reports the estimated relationship of contribution likelihood as a function of the main effect of the Family Security SMS intervention, 5-year age intervals, and various treatment by 5-year age interval interaction terms. We estimate these effects during our experimental period (October 3 to December 31). We compare the main effect of the treatment as well as the interaction terms relative to our baseline condition: the standard, control account statement. Column 1 restricts the sample to men, while column 2 restricts the sample to women. Robust standard errors are used and shown in parentheses. *, **, and *** indicate significance at 10%, 5%, and 1% for a two-tailed test. All regressions include a constant as well as controls for prior contribution status.
Note: This table depicts the estimated treatment effects using OLS. We compare all interventions to the control group on the likelihood to make a contribution during the experimental period. Negative coefficients represent the behavioral intervention performing worse relative to the Control condition, while positive coefficients represent the behavioral intervention performing better. All regressions include controls for prior contribution status. Robust standard errors are used and shown in parentheses. *, **, and *** indicate significance at 10%, 5%, and 1% for a two-tailed test.
Columns (1) and (2) estimate the relationship between our treatment interventions versus a standard, control statement on contribution likelihood in the first two months of our experimental period.

I Measuring Persistence Post-Intervention
Columns (3) and (4) estimate the relationship between our treatment interventions versus a standard, control statement on contribution amount in the two-month period following our experimental period, after all individuals were sent the standard account statement without any follow-up SMS interventions. We compare treatment interventions relative to the group who were given the standard, control account statement during the main experimental period. In specifications (2) and (4), we present our results inclusive of a control for prior contribution status indicating whether a contribution was made in the year prior to the intervention for each period of interest. Robust standard errors are used and shown in parentheses. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test. All regressions include a constant.
Note: This table reports the estimated relationship of contribution likelihood in the period post intervention (January 1 to February 28), comparing the various behavioral interventions relative to the Standard Account Statement. We first present the results of the overall main effects, collapsing across all ages. We then present the regression results split by our three age buckets implied by the machine-learning approaches: those who are 28 and younger, those between 29 and 41, and those who are 42 and older. Note: After the conclusion of the experimental period, individuals were sent the standard account statement and no additional follow-up texts regardless of treatment. Robust standard errors are used and shown in parentheses. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test. All regressions include a constant and controls for prior contribution status.
Note: This table presents the estimates of the relationship between our experimental groups versus the Family Security SMS on contribution likelihood in the two month period following our experimental period, after all individuals went back to the standard account statement. Robust standard errors are used and shown in parentheses. All regressions include a constant and controls for prior contribution status. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test.
Note: This table presents the estimates of the relationship between our experimental groups versus the Family Security SMS on total contribution amount (using an inverse hyperbolic sine transformation) in the two month period following our experimental period, after all individuals went back to the standard account statement. Robust standard errors are used and shown in parentheses. All regressions include a constant and controls for prior contribution status. Statistical significance is indicated by *, **, and *** for 10%, 5%, and 1% p-value, respectively, using a two-tailed test.