Brennan S Thompson, Matthew D Webb, A simple, graphical approach to comparing multiple treatments, The Econometrics Journal, Volume 22, Issue 2, May 2019, Pages 188–205, https://doi.org/10.1093/ectj/utz006
Summary
We consider a graphical approach to comparing multiple treatments that allows users to easily infer differences between any treatment effect and zero, and between any pair of treatment effects. This approach makes use of a flexible, resampling-based procedure that asymptotically controls the familywise error rate (the probability of making one or more spurious inferences). We demonstrate the usefulness of this approach with three empirical examples.
1. INTRODUCTION
When an experiment involves more than one treatment (e.g., several different drugs designed to treat a particular disease), there is often interest in comparing each treatment not only to a control, but also to the other treatment(s). With k treatments under consideration, this involves testing a total of |${k+1 \atopwithdelims ()2}$| hypotheses. For example, with k = 2 treatments, there are 3 hypotheses of interest: (i) that the effect of the first treatment is equal to zero; (ii) that the effect of the second treatment is equal to zero; and (iii) that the effects of the first and second treatments are equal to each other. With k = 3 treatments, there are 6 hypotheses of interest, and so on.
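The count of hypotheses is just a binomial coefficient (k comparisons with the control plus |${k \atopwithdelims ()2}$| comparisons among the treatments themselves), which a one-liner can tabulate:

```python
from math import comb

# k treatments yield C(k+1, 2) hypotheses: k comparisons with the
# control plus C(k, 2) comparisons among the treatments themselves.
counts = {k: comb(k + 1, 2) for k in (2, 3, 4)}
print(counts)  # {2: 3, 3: 6, 4: 10}
```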
Of course, when testing more than one hypothesis at a given nominal level, the probability of rejecting at least one true hypothesis, i.e., the familywise error rate (FWER), is typically well in excess of that given nominal level.1 In recognition of this issue, a wide variety of multiple-testing procedures, ranging from the simple Bonferroni correction to resampling-based stepwise procedures (Romano and Wolf, 2005a, 2005b), have been developed to control the FWER and other generalized error rates such as the false discovery rate (the expected proportion of true hypotheses rejected; Benjamini and Hochberg, 1995). While such procedures are often used when multiple treatments are examined in biostatistics (Dunnett, 1955; Dunnett and Tamhane, 1991), the econometrics literature has, with the exception of the forthcoming paper by List, Shaikh and Xu (2019), hereafter LSX, ignored the problem of multiple testing whenever multiple treatments are considered.2 Nonetheless, as LSX note, this issue is pervasive in many areas of economics, as multiple treatments are considered 'in nearly every experiment that is published today' (p. 3).
In this paper, we consider a graphical approach to comparing multiple treatments that uses the resampling-based procedure of Bennett and Thompson (2016), hereafter BT, to asymptotically control the FWER. Our main contribution here is simply to demonstrate how this (very general) procedure can be adapted to a regression framework to compare treatment effects while controlling for other sources of heterogeneity. We are hopeful that the approach we consider here will be both illuminating and easy to implement for practitioners.
The advantage of the graphical approach we adapt is that it allows users to easily visualize both statistical and practical significance in the (signed) differences between each treatment effect and zero, and between each pair of treatment effects. That is, unlike standard multiple-testing procedures, such as that utilized by LSX, it offers users more than a 'Yes–No' decision on all of the hypotheses of interest. Ultimately, the user is provided with a single, easy to interpret figure that clearly suggests an ordering of the treatment effects rather than a cumbersome (k + 1)-by-(k + 1) table of test statistics (or p-values). In our own practice, we have found that such a figure is particularly useful in summarizing the results of an analysis for a live audience.
The following section describes our approach in more detail and provides a very simple illustration using data from a field experiment in which k = 2 types of performance pay for teachers are compared. Two additional empirical examples are provided in Section 3. The first of these examples is of interest because it involves a large number of treatments (k = 36). In the second, we consider a case where treatment effects are estimated using an instrumental variables approach. In both examples, controlling for multiple comparisons meaningfully changes the statistical inferences. Section 4 concludes.
2. METHODOLOGY
2.1. Setup
2.2. The Overlap Procedure
The procedure of BT is designed to facilitate all pairwise comparisons within a set of parameters that have been |$\sqrt{n}$|-consistently estimated; their framework is quite general and they do not explicitly consider any regression models (indeed, in their simulations, the parameters of interest are simply the means of a collection of random variables with varying degrees of correlation).
These uncertainty intervals are used to make inferences about the ordering of the parameters of interest as follows. We infer that βs > βt if the uncertainty interval for βs lies entirely above the uncertainty interval for βt (i.e., if Ln,s > Un,t). If the uncertainty intervals for βs and βt overlap one another (i.e., if Cn,s ∩ Cn,t ≠ ∅), we can make no such inference. For this reason, BT refer to their procedure as the overlap procedure; it allows users to easily make comparisons between a pair of parameters by visually checking to see whether or not their uncertainty intervals overlap.
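In code, the overlap rule reduces to comparing interval endpoints. The following minimal sketch (the function name and labels are illustrative, not from BT) makes an ordering inference only when one interval lies entirely above the other:

```python
def infer_order(interval_s, interval_t):
    """Compare two uncertainty intervals (L, U); return which parameter
    is inferred larger, or None when the intervals overlap."""
    Ls, Us = interval_s
    Lt, Ut = interval_t
    if Ls > Ut:
        return "s > t"   # interval for beta_s lies entirely above beta_t's
    if Lt > Us:
        return "t > s"
    return None          # intervals overlap: no inference can be made

# Non-overlapping intervals permit an ordering; overlapping ones do not.
print(infer_order((0.33, 0.50), (0.05, 0.22)))  # s > t
print(infer_order((0.20, 0.37), (0.05, 0.22)))  # None
```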
It must be emphasized that using confidence intervals to make inferences in this manner would be completely inappropriate. Specifically, when k = 1, such inferences would be overly conservative (cf. Online Supplement Section S1); as k grows, the FWER would quickly become larger than one minus the nominal confidence level.
(a) |$\sqrt{n}(\hat{\beta }_{n} - \beta )$| and |$\sqrt{n}(\hat{\beta }^*_{n}-\hat{\beta }_{n})$| both have the same (continuous and strictly increasing) (k + 1)-variate limiting distribution; (b) |$\sqrt{n}\times \text{se}\left(\hat{\beta }_{n,s}\right)$| and |$\sqrt{n}\times \text{se}\left(\hat{\beta }^*_{n,s}\right)$| both converge in probability to the same (positive) constant, for each s ∈ K.
These assumptions are very mild, and hold if, say, our (OLS, 2SLS, etc.) estimates are asymptotically normal and our resampling procedure is appropriately chosen (i.e., is compatible with the data generating process).
Our main result follows directly from Theorem 3.1 in BT:
Let Assumption 2.1 hold. Then the overlap procedure (a) bounds the FWER from above by α asymptotically, (b) is consistent, in the sense that any true differences between parameter pairs are inferred with probability one asymptotically, and (c) infers a correct ordering of the parameters (when they are unequal) with probability one asymptotically.
Simulation evidence presented both in BT and in our Online Supplement Section S2 suggests that the overlap procedure provides satisfactory control of the FWER and has good (average) power properties in finite samples.
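The multiplier that scales the uncertainty intervals is calibrated from the bootstrap distribution. As a rough illustration only (the function below is hypothetical and BT's exact algorithm differs in its details), one can take γ to be the 1 − α quantile of the largest studentized pairwise bootstrap deviation, so that intervals of the form |$\hat{\beta }_{n,s} \pm \gamma \, \text{se}(\hat{\beta }_{n,s})$| separate only when the estimated difference exceeds γ times the sum of the two standard errors:

```python
import numpy as np

def calibrate_gamma(beta_hat, beta_boot, se_boot, alpha=0.05):
    """Illustrative calibration of the interval half-length multiplier.

    beta_hat  : (k+1,)   point estimates
    beta_boot : (B, k+1) bootstrap replications of the estimates
    se_boot   : (B, k+1) bootstrap standard errors

    For each bootstrap draw, compute the largest pairwise deviation
    |(b*_s - b*_t) - (b_s - b_t)| / (se*_s + se*_t) over all pairs,
    then return the 1 - alpha quantile of these maxima.
    """
    B, p = beta_boot.shape
    maxima = np.empty(B)
    for b in range(B):
        stats = [abs((beta_boot[b, s] - beta_boot[b, t])
                     - (beta_hat[s] - beta_hat[t]))
                 / (se_boot[b, s] + se_boot[b, t])
                 for s in range(p) for t in range(s + 1, p)]
        maxima[b] = max(stats)
    return float(np.quantile(maxima, 1 - alpha))
```

With this choice, intervals beta_hat[s] ± gamma * se[s] fail to overlap only when the estimated difference exceeds gamma * (se_s + se_t), which holds simultaneously across all true differences with probability approximately 1 − α.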
We conclude this section by showing how the overlap procedure can be used to make inferences about the treatment effects, δ1, …, δk, more directly.
Denoting the lower and upper endpoints of |$\tilde{C}_{n,s}$| by |$\tilde{L}_{n,s}$| and |$\tilde{U}_{n,s}$|, respectively, we can then infer that δs > 0 if |$\tilde{L}_{n,s}\gt \tilde{U}_{n,0}$|, that δs < 0 if |$\tilde{U}_{n,s}\lt \tilde{L}_{n,0}$|, and that δs > δt if |$\tilde{L}_{n,s}\gt \tilde{U}_{n,t}$|.
2.3. Stepwise Refinement
BT also propose an iterative stepwise refinement for the overlap procedure that (weakly) increases its power without sacrificing asymptotic control of the FWER. The idea behind this refinement is to iterate the overlap procedure while eliminating any pairwise parameter comparisons that are 'resolved' at a previous step. In this sense, it is analogous to Holm's (1979) stepwise refinement of the Bonferroni correction, which eliminates from consideration any null hypotheses that are rejected at a previous step.
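A stylized sketch of the refinement follows. The names and the recalculation rule are hypothetical (BT's actual procedure recomputes γ by bootstrap over only the unresolved comparisons); the point is just the loop structure: resolve what you can, shrink the comparison set, and recompute.

```python
def stepwise_overlap(estimates, ses, gamma_fn, max_iter=10):
    """Stylized stepwise refinement: recompute gamma over the still-
    unresolved pairs, drop any pair whose intervals now separate, and
    repeat until no further pair is resolved."""
    p = len(estimates)
    unresolved = {(s, t) for s in range(p) for t in range(s + 1, p)}
    resolved = set()
    for _ in range(max_iter):
        if not unresolved:
            break
        gamma = gamma_fn(unresolved)  # critical value for remaining pairs
        newly = {(s, t) for (s, t) in unresolved
                 if abs(estimates[s] - estimates[t])
                 > gamma * (ses[s] + ses[t])}
        if not newly:
            break                     # nothing new resolved: stop
        resolved |= newly
        unresolved -= newly           # fewer comparisons => smaller gamma
    return resolved
```

Because γ (weakly) shrinks as pairs are eliminated, a comparison that narrowly fails at the first step can succeed at a later one, exactly as a hypothesis that survives Bonferroni can be rejected by Holm's procedure.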
2.4. A Small-Scale Empirical Example
In order to provide an extremely simple illustration of our proposed approach, we utilize data from Muralidharan and Sundararaman (2011), hereafter MS (Section 3 presents the results of two additional empirical examples). This paper describes the results of a field experiment designed to examine the effects of offering teachers performance pay conditional upon students’ academic performance.7 Specifically, MS analyse outcomes from three separate groups of schools: a control group, a group in which teachers were paid based on the scores of their own students, and a group in which teachers were paid based on the performance of all students at their school. In other words, there are k = 2 treatments.
Table 1.

| | Model (2.7) | Model (2.8) |
| --- | --- | --- |
| β0 | 0.132 (0.168) | 0.132 (0.168) |
| δ1 | 0.154 (0.057) | |
| δ2 | 0.283 (0.058) | |
| δ2 − δ1 | 0.129 (0.068) | |
| β1 | | 0.286 (0.172) |
| β2 | | 0.415 (0.168) |
| β1 − β0 | | 0.154 (0.057) |
| β2 − β0 | | 0.283 (0.058) |
| β2 − β1 | | 0.129 (0.068) |
Note. Clustered standard errors are in brackets.
Next, MS test the following three hypotheses:
MS1: δ1 = 0;
MS2: δ2 = 0;
MS3: δ2 = δ1.
T-statistics corresponding to tests of these three hypotheses are 2.702, 4.879, and 1.897, respectively. Thus, MS conclude that both treatment effects are statistically different from zero, and from one another (even MS3 could be rejected in favour of a two-sided alternative at a nominal level of slightly less than 6% if the tests were conducted separately, i.e., without controlling the FWER).
Given a nominal FWER of α = 0.05 and 9,999 replications of the wild cluster bootstrap (Cameron et al., 2008), we obtain a value of 0.497 for γ (this is the value obtained after the first iteration; no further refinement was possible).8 The resulting uncertainty intervals are shown in Figure 1(a), and can be interpreted as follows:
Since the uncertainty intervals for β1 and β0 overlap, we cannot infer anything about their ordering (or, equivalently, anything about the sign of δ1).
Since the uncertainty interval for β2 lies entirely above the uncertainty interval for β0, we infer that β2 > β0 (or, equivalently, that δ2 > 0).
Since the uncertainty intervals for β2 and β1 overlap, we cannot infer anything about their ordering (or, equivalently, anything about the ordering of δ2 and δ1).
Thus, while our results are consistent with rejecting MS2, they are not consistent with rejecting either MS1 or MS3.9
Figure 1(b) displays the same uncertainty intervals centred around the treatment effects, δ1 ≡ β1 − β0 and δ2 ≡ β2 − β0. That is, we subtract |$\hat{\beta }_{n,0}=0.132$| from the endpoints of the uncertainty intervals for β1 and β2 (leaving their lengths unchanged). Moreover, we include a dotted horizontal line at |$\tilde{U}_{n,0}$| (if the vertical axis extended far enough below zero, we would include another dotted horizontal line at |$\tilde{L}_{n,0}$|). Given that |$\tilde{L}_{n,2}$| lies above this dotted horizontal line, for example, one can quickly infer that δ2 > 0. We find that such a figure makes it much easier to distinguish the comparisons of each treatment to a control while simultaneously comparing each treatment to all the others.
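These inferences can be checked arithmetically from the Table 1 estimates, assuming uncertainty intervals of the form |$\hat{\beta }_{n,s} \pm \gamma \,\text{se}(\hat{\beta }_{n,s})$| (an interval form consistent with the numbers reported here, though the exact construction is BT's). With γ = 0.497, all three inferences above are reproduced:

```python
gamma = 0.497  # critical value from the wild cluster bootstrap, alpha = 0.05
estimates = {"beta0": (0.132, 0.168),   # (estimate, clustered SE), Table 1
             "beta1": (0.286, 0.172),
             "beta2": (0.415, 0.168)}

def interval(name):
    b, se = estimates[name]
    return (b - gamma * se, b + gamma * se)

def separated(hi, lo):
    """True if the interval for `hi` lies entirely above that for `lo`."""
    return interval(hi)[0] > interval(lo)[1]

print(separated("beta1", "beta0"))  # False: no inference about delta_1
print(separated("beta2", "beta0"))  # True: infer beta2 > beta0, i.e. delta_2 > 0
print(separated("beta2", "beta1"))  # False: no ordering of delta_2, delta_1
```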
2.5. Ignoring Treatment Effect Comparisons
We now consider narrowing our problem to focus solely on whether or not any of the treatment effects is different from zero. That is, we ignore all pairwise comparisons of the treatment effects (i.e., the |${k \atopwithdelims ()2}$| hypotheses in (2.3)), and focus on the so-called problem of 'multiple comparisons with a control' (Hsu, 1996) that was first explored by Dunnett (1955).
To illustrate this approach, we return to the performance pay example introduced in the previous section. Following the procedure described above, we obtain a value of 2.53 for λ (recall that the estimates of δ1 and δ2 and their standard errors have already been provided in Table 1). Figure 2 displays Dn, 1 and Dn, 2, the uncertainty intervals for δ1 and δ2, respectively.
Here, we are able to infer that both treatment effects are positive (i.e., δ1 > 0 and δ2 > 0), since the uncertainty interval for each lies entirely above zero (recall that in the previous section, we were not able to infer anything about the sign of δ1). This example nicely illustrates the trade-off inherent in any multiple-testing procedure: by controlling for a greater number of comparisons (in this case, the comparisons between the treatment effects), one may have to sacrifice some power.
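The same kind of arithmetic check works here, assuming intervals for the treatment effects of the form |$\hat{\delta }_{n,s} \pm \lambda \,\text{se}(\hat{\delta }_{n,s})$| (again an assumed interval form, consistent with the reported results). With λ = 2.53 and the Table 1 values, both intervals clear zero:

```python
lam = 2.53  # critical value for multiple comparisons with the control
effects = {"delta1": (0.154, 0.057), "delta2": (0.283, 0.058)}

# An effect is inferred positive when its whole interval lies above zero.
positive = {name: d - lam * se > 0 for name, (d, se) in effects.items()}
print(positive)  # {'delta1': True, 'delta2': True}
```

Note that δ1 clears zero only barely (0.154 − 2.53 × 0.057 ≈ 0.010), which is why the larger γ of the full overlap procedure could not resolve it.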
2.6. A Modification for Multiple Comparisons with the Best
Thus far, we have primarily been concerned with controlling the FWER across all pairwise parameter comparisons. This approach allows for a (potentially complete) ranking of all the treatments under consideration. For example, assuming that a larger value of the outcome variable is 'better', one could infer that treatment s ∈ {1, …, k} is the 'best', i.e., βs > βt for all t ∈ K∖{s}, if Ln, s > Un, t for all t ∈ K∖{s}.11 Similarly, one may be able to identify a 'second best' treatment, a 'third best' treatment, and so on.
While such a complete ranking may occasionally be of value, interest often centres on identifying only the (first) best treatment. That is, we may only want to know whether or not the treatment effect that is estimated to be the largest is actually statistically distinguishable from the other treatment effect(s) and from zero. Such a problem is the focus of so-called 'multiple comparisons with the best' procedures (Hsu 1981, 1984; Horrace and Schmidt, 2000).
Here, we follow BT in developing a modification of the overlap procedure to focus on this problem.12 The basic idea behind this modification is that, by eliminating 'irrelevant' pairwise comparisons (i.e., those in which neither of the parameters is estimated to be largest), the power of the procedure is substantially increased (effectively, the number of comparisons is reduced from |${k+1 \atopwithdelims ()2}$| to k).
Simulation evidence presented both in BT and in our Online Supplement Section S2 suggests that the choice of γ resulting from this modification may be substantially smaller than that from the unmodified overlap procedure, yielding greatly increased power.
Before moving on, we illustrate the modified overlap procedure using the performance pay example introduced in Section 2.4. That is, we seek to determine only whether or not the treatment effect for the individual incentive (the treatment effect estimated to be the largest) is statistically distinguishable from the treatment effect for the group incentive and from zero.
Here, we obtain a value of 0.316 for γ, which is less than two-thirds as large as the value we obtained using the unmodified procedure (0.497). Figure 3 displays the lower half of |$\tilde{C}_{n,[1]}=\tilde{C}_{n,2}$| and the upper halves of |$\tilde{C}_{n,[2]}=\tilde{C}_{n,1}$| and |$\tilde{C}_{n,[3]}=\tilde{C}_{n,0}$|. We explicitly include the upper half of |$\tilde{C}_{n,0}$| here (rather than a dotted horizontal line corresponding to its upper endpoint, as in Figure 1b) since the modified overlap procedure cannot be used to make inferences about the sign of any treatment effect that is not estimated to be largest (i.e., we cannot compare the group incentive to the control here).
Our inferences here are much more in line with MS. Specifically, we infer that δ2 > 0 and that δ2 > δ1, decisions that are consistent with rejecting MS2 and MS3. However, since the modified overlap procedure focuses solely on comparisons with the 'best', it does not allow us to infer anything about the significance of δ1. That is, we cannot say anything about MS1. The increased power of the modified overlap procedure comes entirely at the cost of remaining silent on such comparisons.
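The same back-of-the-envelope arithmetic (assuming intervals of the form estimate ± γ × standard error) confirms both separations under the modified procedure's smaller critical value:

```python
gamma_mod = 0.316  # modified-procedure value, versus 0.497 unmodified
beta2 = (0.415, 0.168)  # the "best": individual incentive, from Table 1
others = {"beta1": (0.286, 0.172), "beta0": (0.132, 0.168)}

L_best = beta2[0] - gamma_mod * beta2[1]  # lower endpoint for the best
sep = {name: L_best > b + gamma_mod * se for name, (b, se) in others.items()}
print(sep)  # {'beta1': True, 'beta0': True}
```

With γ = 0.497 instead, the comparison with β1 fails (0.286 + 0.497 × 0.172 ≈ 0.371 exceeds 0.415 − 0.497 × 0.168 ≈ 0.331), which is exactly the power gain the modification buys.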
Ultimately, one must decide which procedure to use based on which comparisons are actually of interest: if identifying 'second best', 'third best', etc. (or even having the ability to infer whether or not any of the treatment effects that are not estimated to be largest are different from zero) is of no concern, the modified overlap procedure can be recommended on the grounds of potentially much higher power. Of course, this choice should be made a priori so as to avoid the temptation to 'cherry pick' results. With this caveat in mind, we use only the unmodified procedure in the empirical example in Section 3.1, and only the modified procedure in the empirical example in Section 3.2.
3. ADDITIONAL EMPIRICAL EXAMPLES
3.1. Matching Grants in Charitable Giving
Karlan and List (2007), hereafter KL, conducted a large-scale field experiment to examine the effect of matching grants on charitable giving.13 Matching grants are schemes in which an individual's donation to a charity is amplified by a third party (the 'matching donor'). For example, with a 2:1 matching ratio, the matching donor donates $2 for every $1 donated by the individual.
The experiment involved sending letters to 50,083 previous donors of a politically oriented charity asking them to donate again. Approximately one-third of these donors were randomly assigned to a control group, and received letters that made no mention of a matching grant. The remaining ('treated') donors received letters that varied along three dimensions: the matching ratio (either 1:1, 2:1, or 3:1), the maximum size of the matching grant (either $25,000, $50,000, $100,000, or none), and the donation amount used to illustrate how the matching grant worked (either 1, 1.25, or 1.50 times the donor's maximum previous donation). That is, there are k = 3 × 4 × 3 = 36 different treatments (the experiment was designed so that 'treated' donors received one of these treatments with probability 1/36).
Although KL consider two outcomes, response (a binary variable) and amount given, we focus here solely on the latter.14 Moreover, our model differs from that of KL in two important ways. First, KL utilize a more restrictive (but also more parsimonious) model in which the different treatments interact, while our model—which conforms to the specification in (2.1)—includes a distinct treatment effect for each of the k = 36 treatments. Second, unlike KL, we include the following individual-level explanatory variables in our model: the number of months since the last donation, the highest previous donation, the number of previous donations, the number of years since the initial donation, an indicator for having previously donated in the same year, an indicator for being female, and an indicator for being a couple. Because data on some of these explanatory variables are missing for some individuals, we are left with n = 48,934 observations.
We estimate our model using OLS and obtain heteroskedasticity-consistent standard errors (specifically, the HC0 variant of MacKinnon and White, 1985). Given a nominal FWER of α = 0.05 and 999 replications of the wild bootstrap, we obtain a value of 2.406 for γ using the unmodified overlap procedure (this is the value obtained after the first iteration; no further refinement was possible).
Figure 4 displays our uncertainty intervals centred around the treatment effects (the dotted horizontal lines correspond to the endpoints of |$\tilde{C}_{n,0}$|). From this figure, it is immediately obvious that we cannot infer anything about the ordering of the treatment effects or any of their signs.
As a point of comparison, we also compute T-statistics for each of the |${36 + 1 \atopwithdelims ()2}=666$| relevant pairwise parameter comparisons. A histogram of these T-statistics is shown in Figure 5 (a complete listing of these T-statistics is given in Online Supplement Section S4). It is interesting to note that 17 of these T-statistics fall outside of the interval [ − 1.960, 1.960].15 In other words, had we separately tested the equality of each pair of parameters at the 5% nominal level (i.e., without any consideration of the FWER), we would have rejected 17 out of 666 hypotheses.
3.2. Student Achievement Programmes
Angrist, Lang, and Oreopoulos (2009), hereafter ALO, conducted a field experiment at a large university in Canada in order to examine programmes aimed at improving students’ academic performance.16 The experiment involved sorting students into a control group and k = 3 treatment groups. Students in the first treatment group were offered support services (supplemental instruction and peer advising), while students in the second treatment group were offered financial incentives (cash awards depending on their performance). Students in the third treatment group were offered both support services and financial incentives.
There are n = 1,542 observations, and the model is estimated using 2SLS. Standard errors are clustered by student. The first column of Table 8 in ALO provides detailed results.
Our focus here is solely on determining whether or not there is a single 'best' treatment (i.e., we use the modified overlap procedure). In doing so, we first rewrite the above model in the form of model (2.4), where β0 is multiplied by an indicator variable for membership in the control group (in the first stage of obtaining 2SLS estimates, the indicator variable for membership in the control group and the indicator variables for each of the treatments are regressed on the instruments). For simplicity, however, we centre our uncertainty intervals around the treatment effects (and zero).
Given a nominal FWER of α = 0.05, we obtain a value of 0.504 for γ using 999 replications of the wild cluster bootstrap (which we modified for 2SLS following the approach of Davidson and MacKinnon, 2010). Figure 6 displays the lower half of |$\tilde{C}_{n,[1]}=\tilde{C}_{n,3}$| and the upper halves of |$\tilde{C}_{n,[2]}=\tilde{C}_{n,2}$|, |$\tilde{C}_{n,[3]}=\tilde{C}_{n,1}$|, and |$\tilde{C}_{n,[4]}=\tilde{C}_{n,0}$|.
Since |$\tilde{L}_{n,[1]}\gt \tilde{U}_{n,[t]}$| for t ∈ {2, 3, 4}, we can infer that the combined treatment is the 'best'. Note, however, that the modified overlap procedure does not allow us to compare the other treatments to one another (or to the control). ALO, on the other hand, simply test that each of the treatment effects is zero, and conclude that only δ3 is positive (each test is conducted at the 5% nominal level, without any consideration of the FWER).18
4. CONCLUSION
In this paper, we have shown how multiple treatments can be compared using a simple, graphical procedure which (asymptotically) controls the FWER. Our proposed approach complements the growing literature within econometrics that focuses on testing for heterogeneous treatment effects (i.e., situations where different types of individuals may respond differently to the same treatment). A natural extension of our approach would be to incorporate such heterogeneous treatment effects. We leave this to future work.
ACKNOWLEDGEMENTS
The authors would like to thank Co-editor Victor Chernozhukov, as well as Jobu Babin, Chris Bennett, Otavio Camargo-Bartalotti, Steve Lehrer, James MacKinnon, Vincent Pohl, audience members at several seminars and conferences, and, finally, an anonymous referee for helpful comments. We also thank the authors of the empirical papers revisited here for making their data publicly available. This research was supported in part by a grant from the Social Sciences and Humanities Research Council.
Footnotes
In general, the FWER is bounded from above by mα0, where m is the number of hypotheses under consideration (here, |$m={k+1 \atopwithdelims ()2}$|), and α0 is the nominal level at which each hypothesis is tested. More specifically, the FWER is equal to |$\alpha _m \equiv 1 - (1 - \alpha _0)^m \lt m\alpha _0$| if the tests are mutually independent. However, if the tests are mutually dependent, as is the case here, the FWER may be greater or less than αm.
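For concreteness, with k = 3 treatments (m = 6 hypotheses) and each test run at α0 = 0.05, the independent-tests FWER already exceeds a quarter:

```python
def fwer_independent(alpha0, m):
    # FWER when m mutually independent tests are each run at level alpha0
    return 1 - (1 - alpha0) ** m

m = 6  # C(3 + 1, 2) hypotheses when k = 3
alpha_m = fwer_independent(0.05, m)
print(round(alpha_m, 3))   # 0.265
print(alpha_m < m * 0.05)  # True: alpha_m is below the Bonferroni bound 0.30
```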
Recently, some researchers have used multiple-testing procedures when examining heterogeneous treatment effects, in which different types of individuals (say, men and women) may respond differently to the same treatment; see Anderson (2008), Fink et al. (2014), Lee and Shaikh (2014), Lehrer, Pohl, and Song (2018), and Gu and Shen (2018). Young (2019), on the other hand, jointly tests the (single) hypothesis that all of the treatment effects, which may differ not only across different treatments but also across different types of individuals, are zero.
In cases where selection issues are a concern, one might, for example, treat participation in a treatment as endogenous and use assignment to that treatment as an instrument (see Section 3.2 for an example). The crucial assumption we make is that the parameters of interest can be |$\sqrt{n}$|-consistently estimated, whether using OLS, 2SLS, or some other method.
In Online Supplement Section S1, we consider the k = 1 case. Although there is only a single hypothesis of interest (δ1 = 0) in this case, it does provide some important insight into our proposed approach. In Section 2.5, we simplify our problem by ignoring the hypotheses in (2.3). Interestingly, in the work on testing for heterogeneous treatment effects cited in footnote 3 above, each treatment effect is compared only to zero (and not to any of the other treatment effects).
In Online Supplement Section S1, we show that, when k = 1 and the limiting distribution of |$\sqrt{n}(\hat{\beta }_{n} - \beta )$| is known, we can easily choose γ without resorting to the bootstrap.
In Section 2.5, where we focus only on comparisons between the treatment effects and zero (i.e., where we ignore the comparisons between the different treatment effects), we utilize uncertainty intervals for the treatment effects themselves.
The data used in this example are available from: https://www.journals.uchicago.edu/doi/suppl/10.1086/659655
Online Supplement Section S3 provides computational details for this example. Moreover, code for the procedure can be found at: https://sites.google.com/site/matthewdwebb/code
We also applied the overlap procedure at a nominal FWER of α = 0.06 (obtaining a value of 0.476 for γ), and found that the uncertainty intervals for β2 and β1 were still overlapping (recall that the absolute value of the T-statistic for the test of MS3 was 1.897, which corresponds to a non-multiplicity-adjusted p-value of just under 0.06). In fact, the smallest nominal FWER at which the uncertainty intervals for β2 and β1 are non-overlapping is α = 0.149 (see BT, Section 3.3, for a discussion of multiplicity-adjusted p-values).
Notice that, for s ∈ {1, …, k}, comparing δs to zero is equivalent to comparing βs ≡ β0 + δs to β0. Hence, if we were to proceed from model (2.4), we would need to construct uncertainty intervals for each βs, s ∈ K, and then determine whether or not any of the uncertainty intervals for β1, …, βk overlap the uncertainty interval for β0. The uncertainty intervals that we construct here for δ1, …, δk can be viewed as 'absorbing' the uncertainty around β0.
Note that βs > βt for all t ∈ K∖{s} is equivalent to δs > 0 and δs > δt for all t ∈ {1, …, k}∖{s}. That is, a treatment is declared the 'best' if its treatment effect is both positive and larger than all of the k − 1 other treatment effects. Of course, such a ranking is not possible when using the procedure discussed in Section 2.5.
In fact, BT introduce a generalization of the 'multiple comparisons with the best' approach that allows for comparisons within the 'r best' (r being some integer smaller than the total number of parameters under consideration). Such an approach may be of use when the number of parameters under consideration is very large (perhaps in the hundreds or thousands), and one is willing to abandon pursuit of a complete ranking in return for the ability to resolve more comparisons within the top r.
Data for this paper are available for download from: http://www.aeaweb.org/aer/data/dec07/20060421_data.zip
In their recent multiple-testing-based analysis of the same data, LSX use four different approaches: one in which just different outcomes are considered, one in which just different treatments are considered, one in which just different 'types' of donors are considered, and one in which all the different outcomes, treatments, and 'types' are simultaneously considered. Note that LSX group the 36 different treatments that we consider into just 3 treatments, which vary only on the basis of the matching ratio.
It is important to emphasize that these test statistics will, in general, be correlated. Thus, even if all of the treatment effects were equal to zero, we would expect that less than 5% of the T-statistics (obtained from a single sample) would fall outside of the interval [−1.960, 1.960]. However, the probability that at least one of the T-statistics would fall outside of this interval is well in excess of 0.05.
This dataset is publicly available from: https://www.aeaweb.org/aej-applied/data/2007-0062_data.zip
ALO also examine 'intention-to-treat' effects, where the treatment effects are the coefficients on indicator variables for assignment to the treatments. These effects are estimated for both men and women (both separately and together), while the treatment effects we focus on here are estimated only for women.
ALO do informally compare estimates of different treatment effects in other parts of their paper. For example, on p. 14 they state that 'the [intention-to-treat] estimates for women suggest the combination of services and fellowships ...had a larger impact than fellowships alone'.
SUPPORTING INFORMATION
Additional Supporting Information may be found in the online version of this article at the publisher's website:
Online Appendix
Replication Package
Notes
Co-editor Victor Chernozhukov handled this manuscript.