James J Heckman, Ganesh Karapakula, Using a satisficing model of experimenter decision-making to guide finite-sample inference for compromised experiments, The Econometrics Journal, Volume 24, Issue 2, May 2021, Pages C1–C39, https://doi.org/10.1093/ectj/utab009
Summary
This paper presents a simple decision-theoretic economic approach for analysing social experiments with compromised random assignment protocols that are only partially documented. We model administratively constrained experimenters who satisfice in seeking covariate balance. We develop design-based small-sample hypothesis tests that use worst-case (least favourable) randomization null distributions. Our approach accommodates a variety of compromised experiments, including imperfectly documented rerandomization designs. To make our analysis concrete, we focus much of our discussion on the influential Perry Preschool Project. We reexamine previous estimates of programme effectiveness using our methods. The choice of how to model reassignment vitally affects inference.
1. INTRODUCTION
This paper develops a finite-sample, design-based approach for analysing data from compromised social experiments using a satisficing model of experimenter behaviour. Compromises can take many forms, including exchanges or transfers of subjects across the experimental groups based on post-randomization considerations that are not fully documented. For specificity, we motivate our approach drawing on the Perry Preschool Project, an experimental high-quality preschool programme targeted toward disadvantaged African American children in the 1960s.1
Previous studies of the Perry programme report substantial treatment effects on numerous outcomes.2 These studies have greatly influenced discussions about the benefits of early childhood programmes.3 However, critics of the Perry programme question the validity of these conclusions. They point to the small sample size of the experiment—just over a hundred observations. They also mention incomplete knowledge of, and compromises in, the randomization protocol used to form control and treatment groups. Problems with attrition and nonresponse are also cited. Previous research (Heckman et al., 2010a; Heckman et al., 2020) addresses some of these concerns. We offer an alternative approach that models experimenter decision-making in conducting the experiment. We compare our approach with that of Heckman et al. (2020) in Section 4.4.
The Perry randomization protocol was a multi-stage process. Its main compromised feature is shared by many randomized controlled trials: undocumented rerandomization. This involves reassignment of treatment status after initial random assignment in order to improve balance between experimental groups with respect to baseline covariates, but without a pre-specified, fully documented reassignment plan.
This practice occurs often. Bruhn and McKenzie (2009) survey 25 leading researchers using randomized experiments and report a typical response:
“[Experimenters] regressed variables like education on assignment to treatment,
and then re-did the assignment if these coefficients were too big.”
Some 52% admit to “subjectively deciding whether to redraw” and 15% admit to “using a statistical rule to decide whether to redraw” the treatment assignment vector in at least one of the experiments they conducted.4 The authors conclude that
“this reveals common use of methods to improve baseline balance,
including several rerandomization methods not discussed in print.”
The approach developed in this paper applies to experiments conducted in such a subjective and incompletely documented manner. If rerandomization criteria are specified and adhered to before carrying out final treatment assignment, there exist simpler methods for conducting valid inference.5 We supplement the literature by considering the case where the reassignment rule is only partially documented. We build on and complement the analysis of Heckman et al. (2020) with an explicit model of experimenter behaviour.
We model experimenters as decision-makers who satisfice in seeking to achieve covariate balance with a “suitable” metric. Implicit decision rules underlie all covariate balancing procedures. The decision-makers forming the experimental groups do not necessarily have a precise rule in mind but satisfice in the sense of Simon (1955). Even if experimenters have a specific rule in mind, it may not be carefully documented.
This paper proceeds in the following way. Section 2 illustrates the class of problems addressed in this paper by reexamining the reassignment protocols of an influential compromised small-sample social experiment. Section 3 presents a satisficing model of experimenter behaviour consistent with the available information on it from published and informal accounts. We partially identify the set of randomization protocols consistent with our model. We consider the generality of our approach by discussing the class of experiments to which our model applies. In Section 4, we first discuss hypotheses of interest and conventional testing procedures used in the literature. We then construct worst-case randomization tests using stochastic approximations of least favourable randomization null distributions. We also compare our approach with an existing method for inference with imperfect randomization. Section 5 presents our test statistics and uses our methodology to reexamine the inference reported by Heckman et al. (2020). Section 6 concludes.
2. THE MOTIVATING PROBLEM
To give specificity to our analysis we draw on the Perry Preschool Project, a prototypical social experiment that was conducted in the early 1960s. The original sample for the experiment consisted of 128 children. Five of these children were dropped from the study due to extraneous reasons.6 Starting at age 3, treatment in the following two years included preschool for 2.5 hours per day on weekdays during the academic year. The programme also offered 1.5-hour weekly home visits by the Perry teachers to promote parental engagement with the child.7 For more details on the background and eligibility criteria of the Perry programme, see Heckman et al. (2010a) and Appendix A.
2.1. Randomization protocol
Understanding the randomization protocol is essential for constructing valid frequentist inference for any experiment. As Bruhn and McKenzie (2009) emphasize, many experimental studies in economics do not report the complete set of rules (e.g., balancing criteria) used to form experimental samples. They conduct hypothesis tests that ignore the randomization protocols actually used. In analysing the Perry data, this issue is salient. Reports vary about the procedure used and the exact rules followed in creating experimental samples. We discuss the various descriptions of the randomization protocols. While the core descriptions of the procedure followed are broadly consistent across texts, some of the details provided are vague and inconsistent, even those by the same authors. We account for this ambiguity in designing and interpreting our hypothesis tests. While the details are Perry-specific, the general principles involved are not.
Before the initiation of the randomization procedure by the Perry staff in each of the last four Perry cohorts, any younger siblings of participants enrolled in previous waves are separated from children of freshly recruited families, whom we term “singletons” (Schweinhart et al., 1985; Schweinhart, 2013). As Schweinhart et al. (1985) explain,
“[A]ny siblings [are] assigned to the same group [either treatment or control]
as their older siblings in order to maintain the independence of the groups.”
By construction this does not apply to the very first cohort.
The singletons from new families are then randomized into the two experimental groups as follows. Weikart et al. (1978) detail the second step of the randomization protocol:
“First, all [singletons] are rank-ordered according to Stanford–Binet [IQ]
scores. Next, they are sorted (odd / even) into two groups.”
Singletons are then divided into two groups, one comprising those with even IQ ranks and another with odd IQ ranks. The latter group has one additional person if the singletons are odd in number; otherwise, the sizes of the two groups are equal.
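The odd/even split in the second step can be sketched as follows. This is a minimal illustration, not the Perry staff's documented procedure: the direction of the ranking (highest IQ first) and the tie-breaking rule (by identifier) are assumptions, since the sources do not document them, and the function name and identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def odd_even_split(iq_by_id: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Rank singletons by IQ and split them by odd/even rank (ranks start at 1)."""
    # Assumption: rank from highest to lowest IQ, breaking ties by identifier.
    ranked = sorted(iq_by_id, key=lambda pid: (-iq_by_id[pid], pid))
    odd_group = ranked[0::2]   # ranks 1, 3, 5, ...
    even_group = ranked[1::2]  # ranks 2, 4, 6, ...
    return odd_group, even_group


# Hypothetical scores for five singletons.
scores = {"a": 88, "b": 79, "c": 85, "d": 91, "e": 70}
odd_group, even_group = odd_even_split(scores)
```

With an odd number of singletons, the odd-ranked group receives the extra member, as described above.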
In the third step, children are exchanged between the two groups to balance the vector of means of an index of socioeconomic status (SES), the proportions of boys and girls, and the proportion of children with working mothers, in addition to mean IQ (Weikart et al., 1964; Schweinhart et al., 1993). The exact balancing criteria and the number of exchanges are not specified, and the exchanges are not necessarily restricted to those between consecutively ranked IQ pairs,8 as is sometimes assumed, e.g., in Heckman et al. (2020). After the first three steps, there are two undesignated groups that differ in number by at most one, and the two groups are balanced with respect to mean IQ, mean SES, percentage of boys, and the proportion of children with working mothers, in a manner acceptable to the staff, using balancing rules that are undocumented.
All sources agree that in the fourth step a toss of a fair coin decides assignment of the two groups to treatment and control conditions. The fifth step concerns children with working mothers who are placed in the treatment group after the fourth step. In the fifth step, some of these children are transferred to the control group.9 Although there is no consistent account of the number of transfers, the sources describe the fifth step as involving one-way transfers of some children of working mothers from the treatment group to the control group.10 Weikart et al. (1978) provide reasons for the transfers: “no funds were available [to provide all working mothers with logistical support, and] special arrangements could not always be made.” We interpret this statement as implying that special arrangements could be made for at least some working mothers to enable their children to attend preschool and participate in home visits if placed in the treatment group. The constraints facing programme administrators in doing so likely vary across cohorts. We assume that the Perry staff are impartial as to which working mothers get special arrangements.
Table 1 summarizes the randomization protocol. The main sources of ambiguity are in boldface: (a) the undocumented balancing criteria and rules used to satisfactorily balance the two undesignated groups with respect to the mean levels of baseline variables in the third step; and (b) the nature of constraints on the provision of special home visitation arrangements for children of working mothers in the fifth step.
(1) Recruit participants and separate any younger siblings of participants enrolled in previous waves from singletons (children of freshly recruited families) |
↓ |
(2) Rank singletons by IQ and split into two groups based on whether the rank is even or odd |
↓ |
(3) Exchange singletons between the two groups to satisfactorily balance the mean levels of a vector of IQ, SES, gender, and mother’s working status |
↓ |
(4) Toss a fair coin to determine which of the two groups becomes the initial treatment group |
↓ |
(5) Transfer some children of working mothers from the treatment group to the control group impartially if special arrangements for home visits can be made for only a limited number |
↓ |
(6) Assign any eligible younger siblings to the same group as their enrolled older siblings |
3. MODELLING AND PARTIALLY IDENTIFYING THE RANDOMIZATION PROTOCOL
Since no precise description of the full Perry randomization protocol exists, we do not know who was exchanged in the third step and who was transferred in the fifth step, making a standard bounding analysis intractable. To address this problem, we assume that experimenters satisfice11 in seeking “balance” in the baseline covariate means of treatment and control groups, while facing capacity constraints on special home visits for children of working mothers.
Using this model, we bound the level of covariate balance deemed acceptable by the experimenters at the end of the first three stages of the protocol. We also bound the number of possible transfers at the fifth stage of the assignment procedure. Our model and the identified bounds are used to construct worst-case randomization tests using least favourable null distributions for treatment effects. While the details differ, the approach readily generalizes to the class of compromised rerandomization designs discussed by Bruhn and McKenzie (2009).
3.1. Formalizing the randomization protocol
We first model the Perry randomization protocol and later discuss its generalizability. Let |$\mathcal {S}_c$| be the set of unique identifiers of participants in cohort12 |$c \in \lbrace 0, 1, 2, 3, 4\rbrace$| with no elder siblings already enrolled in the Perry Preschool Project. The cardinality of the set of singletons is |$|\mathcal {S}_c|$|.13 The participants in the set |$\mathcal {S}_c$| are ranked according to their IQs by the Perry staff, using an undocumented method to break any ties. The participants with odd and even ranks are then split into two undesignated groups, with |$\lceil |\mathcal {S}_c|/2 \rceil$| and |$\lfloor |\mathcal {S}_c|/2 \rfloor$| members, respectively.14 Staff exchange participants between the two groups until the mean levels of four variables (Stanford–Binet IQ, index of SES, gender, and mother’s working status) are balanced to their satisfaction.15 The exact metric the staff used to determine satisfactory covariate balance is not documented.
We assume that they use Hotelling’s two-sample t-squared statistic |$\tau ^2_c$|, which is closely related to the Mahalanobis distance metric often used in matching.16 However, for each cohort’s initial groups (partially identified in Section 3.2), the Hotelling statistic and raw mean differences do not correspond to their possible minimum values and are sometimes far away from them.17 Thus, it appears in terms of this model that programme officials were satisficing rather than optimizing (minimizing covariate imbalance) in constructing the two groups.
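For reference, the textbook form of Hotelling’s two-sample statistic is sketched below. The symbols |$n_A$|, |$n_B$|, |$\bar{x}_A$|, |$\bar{x}_B$|, |$S_A$|, |$S_B$|, and |$\hat{\Sigma}$| are notation introduced here for the two undesignated groups (sizes, covariate mean vectors, sample covariance matrices, and pooled covariance matrix); the exact scaling used for |$\tau ^2_c$| in our model may differ.

```latex
\tau^2 = \frac{n_A n_B}{n_A + n_B}\,
         (\bar{x}_A - \bar{x}_B)^\top \hat{\Sigma}^{-1} (\bar{x}_A - \bar{x}_B),
\qquad
\hat{\Sigma} = \frac{(n_A - 1)\, S_A + (n_B - 1)\, S_B}{n_A + n_B - 2}.
```

The quadratic form is the squared Mahalanobis distance between the two mean vectors under the pooled covariance, which is the sense in which the two metrics are closely related.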
Define |$\eta_c$| as a parameter indicating the maximum number of children of working mothers in cohort |$c \in \lbrace 2, 3, 4\rbrace$| for whom special arrangements could be made to enable special home visits.20 We define |$\eta_{0,1}$| to be the parameter indicating the maximum number of children of working mothers in the pooled cohorts 0 and 1 for whom special home visitation arrangements could be made, averting their transfer to the control group if placed in the initial treatment group.21
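The role of the capacity parameter in the fifth step can be illustrated with a short sketch. It assumes that “impartial” means every subset of eligible children is equally likely to keep a slot; the function name, identifiers, and data below are hypothetical. The number of children of working mothers who remain treated is the minimum of the capacity and the number initially assigned to treatment.

```python
import random
from typing import List, Set, Tuple


def transfer_step(treatment: List[str], control: List[str],
                  working_mother: Set[str], capacity: int,
                  rng: random.Random) -> Tuple[List[str], List[str]]:
    """Transfer children of working mothers to control when capacity binds."""
    eligible = [pid for pid in treatment if pid in working_mother]
    n_keep = min(capacity, len(eligible))  # min(eta_c, m_c)
    # Assumption: impartiality = uniform choice of who keeps a slot.
    kept = set(rng.sample(eligible, n_keep))
    moved = [pid for pid in eligible if pid not in kept]
    new_treatment = [pid for pid in treatment if pid not in moved]
    new_control = control + moved  # one-way transfers only
    return new_treatment, new_control


# Hypothetical cohort: three children of working mothers start in treatment.
initial_treatment = ["t1", "t2", "t3", "t4", "t5"]
initial_control = ["c1", "c2", "c3", "c4", "c5"]
working = {"t1", "t2", "t3", "c1"}
treat, ctrl = transfer_step(initial_treatment, initial_control,
                            working, capacity=2, rng=random.Random(0))
```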
3.2. Partially identifying satisficing thresholds and capacity constraints
Using the Perry data, we now demonstrate how we can partially identify the satisficing thresholds |$\delta_c$| and the special home visitation capacity constraints |$\eta_c$| using the last three cohorts as examples. We then present a general framework for partially identifying these parameters.
Wave 2 | Di = 0 | Di = 1 | Total |
Mi = 0 | 9 | 7 | 16 |
Mi = 1 | 3 | 3 | 6 |
Total | 12 | 10 | 22 |
Example 1 discusses the steps for bounding the parameters |$\delta_2$| and |$\eta_2$| in wave 2. Shown is a contingency table of mother’s working status |$M_i$| and final treatment status |$D_i$| for participants |$i \in \mathcal {S}_2$| in cohort 2 with no elder siblings already enrolled in the Perry study. There are 22 such participants in total. Since there are an even number of participants, each of the initial two undesignated groups (as well as the initial treatment and control groups in the next stage) would have been |$\lceil |\mathcal {S}_2|/2 \rceil = \lfloor |\mathcal {S}_2|/2 \rfloor = 11$| in size. However, we observe only 10 members in the final treatment group but 12 members in the final control group. This implies that there must have been one transfer from the initial treatment group to the control group. Thus, one of the 3 children of working mothers in the final control group was in the initial treatment group. However, we do not know exactly which one of these children was transferred, so there are 3 possibilities for the initial treatment group. Let |$\tau ^2_{2,1}, \tau ^2_{2,2}, \tau ^2_{2,3}$| be the Hotelling two-sample statistics for these 3 possibilities. One of these Hotelling statistics was the actual level of covariate imbalance between the initial treatment and control groups, and this level of imbalance is assumed to be within the satisficing threshold |$\delta_2$| of the Perry staff (by construction). Thus, |$\delta _2 \ge \min \lbrace \tau ^2_{2,1}, \tau ^2_{2,2}, \tau ^2_{2,3}\rbrace$|. In addition, |$m_2 = 4$|, since there must have been 4 children of working mothers in the initial treatment group, consisting of the 3 participants who remain in the final treatment group and the 1 participant who was transferred to the control group. Since 3 of the initial 4 participants remained in the final treatment group, |$\min(\eta_2, m_2) = \min(\eta_2, 4) = 3$|, implying that |$\eta_2 = 3$|, the only solution that satisfies the equality. We next present two other examples.
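The last step of Example 1 amounts to solving a small equation by enumeration. The sketch below searches over capacities from 0 up to the total number of children of working mothers in the cohort; that search range and the function name are assumptions made for illustration.

```python
from typing import List


def consistent_capacities(stayed: int, initial: int, total_working: int) -> List[int]:
    """Capacities eta in {0, ..., total_working} with min(eta, initial) == stayed."""
    return [eta for eta in range(total_working + 1)
            if min(eta, initial) == stayed]


# Wave 2: 3 children of working mothers stayed in treatment, 4 were
# initially assigned to it, and the cohort has 6 such children in total.
eta_wave2 = consistent_capacities(stayed=3, initial=4, total_working=6)  # [3]
```

Because one child was transferred, the capacity must have been binding, which is why a single value survives.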
Wave 3 | Di = 0 | Di = 1 | Total |
Mi = 0 | 7 | 9 | 16 |
Mi = 1 | 5 | 0 | 5 |
Total | 12 | 9 | 21 |
In Example 2 we show a contingency table of |$M_i$| and |$D_i$| for the 21 participants |$i \in \mathcal {S}_3$| in cohort 3. The sizes of the larger and smaller undesignated groups would have been |$\lceil |\mathcal {S}_3|/2 \rceil = 11$| and |$\lfloor |\mathcal {S}_3|/2 \rfloor = 10$|, respectively. However, either of these two groups could have been the initial treatment group. Since there are 12 members in the final control group and 9 in the final treatment group, there are 2 possible cases: if the initial treatment group had 10 members, there would have been 10 − 9 = 1 transfer; but if it had 11 members, there would have been 11 − 9 = 2 transfers. Since the number of transfers involving children of working mothers is either 1 or 2, the number of possibilities for the initial treatment group is |$\binom{5}{1} + \binom{5}{2} = 5 + 10 = 15$|, as all the 5 children of working mothers in this cohort are in the control group. Let |$\tau ^2_{3,1}, \dots , \tau ^2_{3,15}$| be the Hotelling statistics for those 15 possibilities. Then, |$\delta _3 \ge \min \lbrace \tau ^2_{3,1}, \dots , \tau ^2_{3,15}\rbrace$|. In addition, |$m_3 \in \lbrace 1, 2\rbrace$|, since |$m_3$| is the sum of the number of transfers (either 1 or 2) and the number of remaining children in the final treatment group (0 in this cohort). As no working mother remained in the treatment group, |$\min(\eta_3, m_3) = 0$|, implying that |$\eta_3 = 0$|, which is the only number consistent with this equality. Thus, the Perry staff were unable to provide special home visitation accommodations for any of the participants in this cohort.
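The count of 15 in Example 2 can be checked by listing the candidate sets of transferred children directly. The identifiers below are hypothetical placeholders for the 5 children of working mothers in the final control group; each candidate set, added back to the final treatment group, yields one candidate initial treatment group.

```python
from itertools import combinations
from math import comb

# Hypothetical identifiers for the 5 children of working mothers in control.
working_in_control = ["w1", "w2", "w3", "w4", "w5"]

# Either 1 or 2 of them were transferred out of the initial treatment group.
candidate_transfers = [
    set(group)
    for n_transfers in (1, 2)
    for group in combinations(working_in_control, n_transfers)
]
n_candidates = len(candidate_transfers)  # 15
```

The lower bound on the satisficing threshold is then the smallest Hotelling statistic across the partitions implied by these candidates.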
Wave 4 | Di = 0 | Di = 1 | Total |
Mi = 0 | 5 | 10 | 15 |
Mi = 1 | 4 | 0 | 4 |
Total | 9 | 10 | 19 |
In Example 3 we show a contingency table of |$M_i$| and |$D_i$| for the 19 participants |$i \in \mathcal {S}_4$| in cohort 4. The sizes of the larger and smaller undesignated groups would have been |$\lceil |\mathcal {S}_4|/2 \rceil = 10$| and |$\lfloor |\mathcal {S}_4|/2 \rfloor = 9$|. These coincide with the final sizes of the treatment and control groups, respectively. Accordingly, we can conclude that the observed final treatment group was indeed the initial treatment group for this cohort. Otherwise, the control group would have had at least 10 members. Let |$\tau ^2_{4,1}$| be the Hotelling statistic for the observed partition of |$\mathcal {S}_4$| based on the final treatment status. Then, |$\delta _4 \ge \tau ^2_{4,1}$|. In addition, note that there are no children of working mothers in the final treatment group, which was also the initial treatment group, and so |$m_4 = 0$|. Since |$\min(\eta_4, m_4) = \min(\eta_4, 0) = 0$| and there are 4 members with working mothers in total, it follows that the capacity constraint could be any of the numbers from 0 through 4, i.e., |$\eta_4 \in \lbrace 0, 1, 2, 3, 4\rbrace$|, because any of these values satisfies the equality. Thus, the observed data for cohort 4 are not helpful in bounding |$\eta_4$|.
Partial identification of the satisficing thresholds and capacity constraints in general
We now present a general characterization of how to partially identify the satisficing thresholds and capacity constraints on special home visits.
Wave c | Di = 0 | Di = 1 | Total |
Mi = 0 | |$\omega_{0,0}$| | |$\omega_{0,1}$| | |$\omega_{0,*}$| |
Mi = 1 | |$\omega_{1,0}$| | |$\omega_{1,1}$| | |$\omega_{1,*}$| |
Total | |$\omega_{*,0}$| | |$\omega_{*,1}$| | |$|\mathcal {S}_c|$| |
In the above contingency table, there are |$\omega_{m,d}$| participants with |$(M_i, D_i) = (m, d) \in \lbrace 0, 1\rbrace^2$| among the set of participants |$\mathcal {S}_c$| in cohort |$c$|.25 The total number of children with nonworking mothers is |$\omega_{0,*} = \omega_{0,0} + \omega_{0,1}$| and that of working mothers is |$\omega_{1,*} = \omega_{1,0} + \omega_{1,1}$|. The total number of participants in the final control group is |$\omega_{*,0} = \omega_{0,0} + \omega_{1,0}$| and that in the final treatment group is |$\omega_{*,1} = \omega_{0,1} + \omega_{1,1}$|. The partial identification of the satisficing thresholds and capacity constraints would vary depending on whether |$|\mathcal {S}_c|$| is even or odd and also depending on whether |$\omega _{*,1} = \lceil |\mathcal {S}_c|/2 \rceil$| or |$\omega _{*,1} \lt \lceil |\mathcal {S}_c|/2 \rceil$|. We discuss each of these cases separately.
First, consider the case where |$|\mathcal {S}_c|$| is even or odd and |$\omega _{*,1} = \lceil |\mathcal {S}_c|/2 \rceil$|. In this case, since the size of the final treatment group remains the same as that of the initial treatment group, there must have been no transfers of children with working mothers from the treatment group to the control group. Since the final treatment group is the same as the initial one, we can bound the satisficing threshold as follows: |$\delta _c \ge \tau ^2_{c,1}$|, where |$\tau ^2_{c,1}$| is the Hotelling statistic for the partition of |$\mathcal {S}_c$| based on the final treatment status. In addition, since there are no transfers, the number |$m_c$| of children of working mothers in the initial treatment group equals |$\omega_{1,1}$|. Since |$\min(\eta_c,\omega_{1,1}) = \omega_{1,1}$|, it follows that |$\eta_c \in \{\omega_{1,1},\dots,\omega_{1,*}\}$|, i.e., the number of slots available for special home visits must be at least the number |$\omega_{1,1}$| observed in the data.
Second, consider the case where |$|\mathcal {S}_c|$| is even and |$\omega _{*,1} \lt \lceil |\mathcal {S}_c|/2 \rceil$|. As in Example 1, in this case it is clear that the number of transfers in the final stage must have been |$\chi _c = \lceil |\mathcal {S}_c|/2 \rceil - \omega _{*,1}$|, which is a positive number. The |$\chi_c$| transferred children must be among the |$\omega_{1,0}$| members with working mothers in the final control group. Thus, there are |$\binom{\omega _{1,0}}{\chi _c}$| possibilities for the initial treatment group. Let |$\vartheta ^\delta _c$| be the set containing the Hotelling statistics for those possibilities. Then, |$\delta _c \ge \min \vartheta ^\delta _c$|. In addition, there must have been |$m_c = \omega_{1,1} + \chi_c$| children with working mothers in the initial treatment group. It remains to determine which values of |$\eta_c$| are consistent with the equality |$\min(\eta_c,\omega_{1,1} + \chi_c) = \omega_{1,1}$|. Since |$\chi_c > 0$|, it follows that |$\eta_c = \omega_{1,1}$|.
Third, consider the case where |$|\mathcal {S}_c|$| is odd and |$\omega _{*,1} \lt \lceil |\mathcal {S}_c|/2 \rceil$|. As in Example 2, in this case there are two possibilities for the number |$\chi_c$| of transfers in the final stage. Specifically, |$\chi _c \in \lbrace \lfloor |\mathcal {S}_c|/2 \rfloor - \omega _{*,1}, \lceil |\mathcal {S}_c|/2 \rceil - \omega _{*,1}\rbrace$|. These |$\chi_c$| transferred children must be among the |$\omega_{1,0}$| members with working mothers in the final control group. Thus, there are |$\binom{\omega _{1,0}}{\lfloor |\mathcal {S}_c|/2 \rfloor - \omega _{*,1}} + \binom{\omega _{1,0}}{\lceil |\mathcal {S}_c|/2 \rceil - \omega _{*,1}}$| possibilities for the initial treatment group. Let |$\vartheta ^\delta _c$| be the set containing the Hotelling statistics for those possibilities. Then, |$\delta _c \ge \min \vartheta ^\delta _c$|. The number |$m_c$| of children with working mothers initially assigned treatment is either equal to |$\omega _{1,1} + \lfloor |\mathcal {S}_c|/2 \rfloor - \omega _{*,1}$| or equal to |$\omega _{1,1} + \lceil |\mathcal {S}_c|/2 \rceil - \omega _{*,1}$|. Let |$\vartheta ^\eta _c$| be the set of values of |$\eta_c$| consistent with the equality |$\min(\eta_c, m_c) = \omega_{1,1}$|. If |$m_c = \omega _{1,1} + \lceil |\mathcal {S}_c|/2 \rceil - \omega _{*,1}$|, then |$\eta_c = \omega_{1,1}$|, since |$\lceil |\mathcal {S}_c|/2 \rceil \gt \omega _{*,1}$|. However, if |$m_c = \omega _{1,1} + \lfloor |\mathcal {S}_c|/2 \rfloor - \omega _{*,1}$|, there are two sub-cases: if |$\lfloor |\mathcal {S}_c|/2 \rfloor \gt \omega _{*,1}$|, then |$\eta_c = \omega_{1,1}$|; but if |$\lfloor |\mathcal {S}_c|/2 \rfloor = \omega _{*,1}$|, then |$\eta_c \in \{\omega_{1,1},\dots,\omega_{1,*}\}$|.
Therefore, the special home visiting slots can be partially identified as follows: |$\eta _c \in \vartheta ^\eta _c$|, where |$\vartheta ^\eta _c = \lbrace \omega _{1,1},\dots ,\omega _{1,*}\rbrace$| if |$\lfloor |\mathcal {S}_c|/2 \rfloor = \omega _{*,1}$|, and |$\vartheta ^\eta _c = \lbrace \omega _{1,1}\rbrace$| if |$\lfloor |\mathcal {S}_c|/2 \rfloor \gt \omega _{*,1}$|.
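The three cases above can be collected into one routine. The following is a minimal sketch under the stated model, not a definitive implementation: the function name and return format are hypothetical, and it takes the four cell counts of a cohort’s contingency table as inputs. It returns the admissible numbers of transfers, the number of candidate initial treatment groups for each, and the set of capacities consistent with the data.

```python
from math import comb
from typing import Dict, List


def partially_identify(w00: int, w01: int, w10: int, w11: int) -> Dict[str, List[int]]:
    """Bound transfers and capacity from counts w[M][D] of a cohort's singletons."""
    n = w00 + w01 + w10 + w11
    n_treated = w01 + w11          # final treatment group size
    n_working = w10 + w11          # children of working mothers
    # The initial treatment group had floor(n/2) or ceil(n/2) members.
    initial_sizes = sorted({n // 2, (n + 1) // 2})
    # Transfers are one-way and drawn from working-mother children now in control.
    transfers = [s - n_treated for s in initial_sizes
                 if 0 <= s - n_treated <= w10]
    candidate_counts = [comb(w10, chi) for chi in transfers]
    capacities = sorted({
        eta
        for chi in transfers
        for eta in range(n_working + 1)
        if min(eta, w11 + chi) == w11   # min(eta_c, m_c) = omega_{1,1}
    })
    return {"transfers": transfers,
            "candidate_counts": candidate_counts,
            "capacities": capacities}


wave2 = partially_identify(w00=9, w01=7, w10=3, w11=3)
wave3 = partially_identify(w00=7, w01=9, w10=5, w11=0)
wave4 = partially_identify(w00=5, w01=10, w10=4, w11=0)
```

Applied to the three tables in Examples 1 to 3, the routine reproduces the bounds derived by hand: one transfer and capacity 3 in wave 2, one or two transfers and capacity 0 in wave 3, and no transfers with an uninformative capacity set in wave 4.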
This general characterization of the partial identification of satisficing thresholds |$\delta_c$| applies to all cohorts c ∈ {0, 1, 2, 3, 4} but that of the special home visiting capacity constraints |$\eta_c$| applies only to cohorts c ∈ {2, 3, 4}. However, similar reasoning can be used to partially identify the capacity constraint |$\eta_{0,1}$| for pooled cohorts 0 and 1.26
3.3. Applicability of our approach to other compromised experiments
Our approach can be applied to many of the studies that Bruhn and McKenzie (2009) criticize, especially experiments using undocumented rerandomization. All of these experiments have the feature that some criterion determines “satisfactory balance.” For example, Bruhn and McKenzie (2009) quote a survey response that says, “[experimenters] regressed variables like education on assignment to treatment, and then re-did the assignment if these coefficients were too big.” With appropriate modifications, our model of satisficing thresholds directly applies to experiments conducted in such a subjective and incompletely documented manner. Suitable adjustments include replacing Hotelling’s statistic in our model with studentized regression coefficients (selected by pretesting or otherwise) or other metrics actually used to measure covariate imbalance between the treatment and control groups. Our methods for partially identifying the underlying randomization rules can be used when the subjective satisficing thresholds are not documented. Even though we only use one balancing criterion (Hotelling’s statistic) for dimensionality reduction in our definition of |$\mathbb {U}_c(\cdot )$|, it can be trivially modified to accommodate multiple balancing criteria. In addition, if the experiment has strata instead of cohorts, the index |$c$| in our model would denote strata rather than cohorts.
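The redraw-until-acceptable practice quoted above can be sketched as a simple loop. This is an illustration under assumptions, not a description of any particular study: the imbalance metric is the absolute difference in means of a single covariate (a stand-in for Hotelling’s statistic or a studentized regression coefficient), the threshold plays the role of the satisficing threshold, and the function name and data are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def rerandomize(covariate: Sequence[float], threshold: float,
                rng: random.Random, max_draws: int = 10_000) -> Tuple[List[int], int]:
    """Redraw a half/half assignment until |mean difference| <= threshold."""
    n = len(covariate)
    labels = [1] * (n // 2) + [0] * (n - n // 2)
    for draw in range(1, max_draws + 1):
        rng.shuffle(labels)
        treated = [x for x, d in zip(covariate, labels) if d == 1]
        control = [x for x, d in zip(covariate, labels) if d == 0]
        if abs(mean(treated) - mean(control)) <= threshold:
            return list(labels), draw   # accepted assignment, draws used
    raise RuntimeError("no acceptable assignment found")


# Hypothetical years of education for ten participants.
education = [8, 9, 10, 11, 12, 12, 13, 14, 15, 16]
assignment, n_draws = rerandomize(education, threshold=1.0, rng=random.Random(1))
```

Only assignments that pass the acceptance rule can ever be realized, so valid randomization inference has to be based on that restricted set. When the threshold is undocumented, it has to be bounded from the data, as in Section 3.2.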
If an experiment does not have transfers after forming the intermediate treatment and control groups, then there are no capacity constraints, i.e., the parameters |$\eta_c$| play no role. However, in some social experiments, post-randomization transfer of some participants from the control to the treatment group can occur if additional funding for the intervention becomes available. For example, wait-list control groups are used in some clinical studies. While this is the reverse of what occurred in the Perry experiment, our model (with appropriate modifications) can be readily applied. Overall, our approach can be adapted to analyse a variety of compromised experiments across multiple disciplines.
4. HYPOTHESES OF INTEREST AND INFERENCE
The conventional way to analyse randomized experiments is to posit a null hypothesis that the average effect of treatment is zero and to proceed in testing it with large-sample methods using asymptotic or bootstrap distributions. Given the relatively small size of many experimental samples, reliance on large-sample methods can be problematic.27
In some settings, permutation tests can be used to test the null hypothesis that the outcomes in the control group have the same distribution as those in the treatment group without relying on large-sample theory. Permutation tests exploit the property that treatment and control labels within the same strata are exchangeable under the null hypothesis of a common outcome distribution. If randomization of the treatment status did not involve explicit stratification on baseline covariates, permutation tests need to make restrictive assumptions on the strata within which treatment and control labels are exchangeable. This approach is used by Heckman et al. (2010a).28 They assume that conditioning on covariates solves the problem of post-random assignment reallocation but without any explicit model for why it is effective in doing so.29
This paper uses knowledge of the randomization protocol to draw inferences about treatment effects. Once a precise null hypothesis is specified, we can determine the distribution of estimates generated by the randomization scheme and assess the statistical significance of the observed treatment effects.
In this section, we first formulate our hypotheses of interest. We then discuss conventional inferential procedures. Next, we introduce worst-case (least favourable) randomization tests and discuss how to conduct them using stochastic approximations. Finally, we compare our methods with alternative approaches for inference with imperfect randomization.
4.1. Hypotheses of interest
Hypothesis |$\mathcal {H}_\mathcal {N}$| nests the sharp null hypothesis |$\mathcal {H}_\mathcal {F}$|. In general, many configurations of the individual treatment effects are consistent with |$\mathcal {H}_\mathcal {N}$|. Thus, to test |$\mathcal {H}_\mathcal {N}$| using only limited knowledge of the randomization protocol, we would need to test every sharp null hypothesis that, like |$\mathcal {H}_\mathcal {F}$|, implies |$\mathcal {H}_\mathcal {N}$|.34 However, a nonrejection of |$\mathcal {H}_\mathcal {F}$| implies nonrejection of |$\mathcal {H}_\mathcal {N}$|, and so testing other sharp null hypotheses may not be necessary if we are unable to reject |$\mathcal {H}_\mathcal {F}$|. Of course, a rejection of |$\mathcal {H}_\mathcal {F}$| would not imply a rejection of |$\mathcal {H}_\mathcal {N}$|. The latter is a very conservative criterion. We next discuss conventional hypothesis testing procedures.
4.2. Conventional hypothesis testing procedures
For tests of population-level parameters such as |$\mathcal {H}_\mathcal {C}$| in equation (4.1), the most commonly reported measure of statistical significance is the asymptotic p-value. For completely randomized experiments, it can be interpreted as the p-value based on a large-sample approximation of the distribution of an estimator, say difference-in-means, over all possible randomizations under the null hypothesis |$\mathcal {H}_\mathcal {N}$| (Neyman, 1923). Li et al. (2018) derive an asymptotic theory of the difference-in-means estimator in experiments involving rerandomization with a pre-specified balancing rule using the Mahalanobis distance, for which the asymptotic distribution of the estimator is a linear combination of normal and truncated normal variables. Resampling methods are also widely used to quantify statistical uncertainty. For example, the bootstrap standard error is reported in many research papers with an associated bootstrap p-value.
Permutation tests are often used when researchers are interested in testing whether treatment and control groups have a common outcome distribution without relying on large-sample theory. Such tests rely on the property that the treatment and control labels are exchangeable within each stratum of the experiment under the null hypothesis of a common distribution. In their permutation tests, Heckman et al. (2010a) use strata defined by wave, gender, and an indicator for above-median socioeconomic status, assuming that experimental labels within each stratum are exchangeable. To compare their permutation procedures with the methods developed in this paper, we use a simplified version of their permutation tests using block permutations within cohorts of eldest participant-siblings (whose treatment statuses determine that of their younger participant-siblings).
In the Perry context, Heckman et al. (2020) develop an extension of permutation tests to account for imperfect randomization. In this paper, we offer an alternative design-based approach to conduct inference for a broader class of compromised experiments. We first present our approach and then compare it with theirs.
4.3. Worst-case randomization tests
Proposition 4.1. Under the model of the randomization protocol in Section 3, the hypothesis test that rejects the sharp null hypothesis whenever |$p_w(D) \le \alpha$| controls the Type I error rate at level |$\alpha$| for any |$\alpha \in (0, 1)$|.
Proof. Let |$p_{\gamma ^*}(D) \equiv \mathbb {P}_{\Lambda _{\gamma ^*}}\lbrace T(\tilde{D}_{\gamma ^*}) \ge T(D)\rbrace$| for all |$\gamma ^* \in \Xi$|, let |$p_w(D) \equiv \sup _{\gamma ^* \in \Xi } p_{\gamma ^*}(D)$| represent the worst-case p-value, and let |$\psi _\alpha (D) \equiv \mathbb {I}\lbrace p_w(D) \le \alpha \rbrace$| be the test for a given |$D$|, a realization of the random treatment status vector defined on the probability space |$\Lambda _\gamma$|, where |$\gamma$| is the true value of the model parameter. Since |$p_\gamma (D) \le p_w(D)$| by construction, it follows that |$\mathbb {E}_{\Lambda _\gamma }[\psi _\alpha (D)] = \mathbb {E}_{\Lambda _\gamma }[\mathbb {I}\lbrace p_w(D) \le \alpha \rbrace ] \le \mathbb {E}_{\Lambda _\gamma }[\mathbb {I}\lbrace p_\gamma (D) \le \alpha \rbrace ] = \mathbb {P}_{\Lambda _\gamma }\lbrace p_\gamma (D) \le \alpha \rbrace \le \alpha$| under |$\mathcal {H}_\mathcal {F}$| for any |$\alpha \in (0, 1)$|.
This proof is an extension of the simple standard argument used to show the finite-sample validity of randomization tests (see Lehmann and Romano, 2005). The above proposition can be equivalently stated in terms of a critical value for the test statistic, as in Heckman et al. (2020).
Although it would be ideal to compute the exact value of |$p_w(D)$|, doing so is computationally infeasible. As is common practice in computing permutation and randomization p-values (see Lehmann and Romano, 2005), we resort to stochastic approximations. Even so, there are two challenges in estimating the worst-case p-value. First, approximating the probability |$\mathbb {P}_{\Lambda _{\gamma ^*}}\lbrace T(\tilde{D}_{\gamma ^*}) \ge T(D)\rbrace$| for a given value |$\gamma ^* \in \Xi$| is computationally demanding. Second, estimating |$p_w(D)$| based on such tail probability estimates for a finite number of points on |$\Xi$| is also challenging. We tackle these two challenges sequentially and discuss how we handle some forms of stochastic approximation errors.
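To fix ideas, the two approximation steps can be sketched with a toy randomization model. The sketch assumes, purely for illustration, that |$\gamma ^*$| is a scalar satisficing threshold on the absolute difference in means of a single covariate and that the experimenter redraws until this threshold is met. It is a stand-in for, not a reproduction of, the randomization model of Section 3, and the function names are hypothetical. Under the sharp null hypothesis the outcomes are fixed, so only the assignment vector is simulated.

```python
import random


def diff_in_means(values, assignment):
    """Mean of `values` among treated units minus mean among control units."""
    treated = [v for v, d in zip(values, assignment) if d == 1]
    control = [v for v, d in zip(values, assignment) if d == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def draw_assignment(gamma, covariate, n_treated, rng):
    """Toy draw from the randomization distribution indexed by gamma:
    reshuffle labels until covariate imbalance is at most gamma.
    (Assumes at least one assignment satisfies the threshold.)"""
    labels = [1] * n_treated + [0] * (len(covariate) - n_treated)
    while True:
        rng.shuffle(labels)
        if abs(diff_in_means(covariate, labels)) <= gamma:
            return list(labels)


def worst_case_pvalue(outcomes, covariate, observed, gammas, n_draws, rng):
    """Monte Carlo tail probability for each gamma on a finite grid,
    together with the maximum over the grid."""
    t_obs = diff_in_means(outcomes, observed)
    n_treated = sum(observed)
    p_by_gamma = {}
    for gamma in gammas:
        hits = 0
        for _ in range(n_draws):
            simulated = draw_assignment(gamma, covariate, n_treated, rng)
            hits += diff_in_means(outcomes, simulated) >= t_obs
        p_by_gamma[gamma] = hits / n_draws
    return max(p_by_gamma.values()), p_by_gamma
```

The maximum over a finite grid can only understate the supremum over |$\Xi$|, and each tail probability carries simulation error, which is why the grid maximum alone is not a complete answer to the second challenge.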
4.3.1. Approximating tail probabilities of randomization distributions
4.3.2. Estimating and bounding the worst-case tail probability
In the previous discussion, the test statistic T( · ) used to compute the worst-case tail probability is left general. There is reason to suspect that the choice of the test statistic matters, as shown for permutation tests by Chung and Romano (2013; 2016). Wu and Ding (2020) show that using studentized test statistics in certain randomization tests can control type I error asymptotically under certain weak null hypotheses while preserving finite-sample validity under sharp null hypotheses. Their theory ignores covariates and is limited to completely randomized factorial experiments and stratified or clustered experiments. However, they conjecture that “the strategy [of using studentized test statistics to make randomization tests asymptotically robust under weak null hypotheses while retaining their finite-sample validity under sharp null hypotheses may also be] applicable for experiments with general treatment assignment mechanisms” (Wu and Ding, 2020). While we do not attempt to prove or disprove their conjecture in the Perry experimental setting, we take it seriously given their results for certain randomization tests along with Chung and Romano’s (2013; 2016) results for permutation tests. Thus, we provide worst-case p-values using both the nonstudentized and studentized test statistics.
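The distinction between the two kinds of test statistic can be illustrated with the simple difference in means. The sketch below assumes that studentization divides by the usual unpooled standard error, the square root of |$s_1^2/n_1 + s_0^2/n_0$|. The inferences reported in this paper are instead based on an AIPW statistic, so the functions here are hypothetical and meant only to show what studentization changes.

```python
import statistics


def split_by_arm(outcomes, assignment):
    """Return (treated outcomes, control outcomes)."""
    treated = [y for y, d in zip(outcomes, assignment) if d == 1]
    control = [y for y, d in zip(outcomes, assignment) if d == 0]
    return treated, control


def diff_in_means(outcomes, assignment):
    """Nonstudentized statistic: treated mean minus control mean."""
    treated, control = split_by_arm(outcomes, assignment)
    return statistics.mean(treated) - statistics.mean(control)


def studentized_diff(outcomes, assignment):
    """Difference in means divided by its unpooled standard error.
    Requires at least two units per arm and nonzero sample variance."""
    treated, control = split_by_arm(outcomes, assignment)
    se = (statistics.variance(treated) / len(treated)
          + statistics.variance(control) / len(control)) ** 0.5
    return (statistics.mean(treated) - statistics.mean(control)) / se
```

Either function can serve as |$T(\cdot )$| in a randomization test. The studentized version must recompute the standard error for every simulated assignment, not just for the observed one.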
4.3.3. Multiple testing
Since |$\mathbb {P}_{\Lambda _\gamma }\lbrace p_w(D) \le \alpha \rbrace \le \alpha$| under |$\mathcal {H}_\mathcal {F}$| for any |$\alpha \in (0, 1)$| by Proposition 4.1, Holm (1979) tests of multiple hypotheses based on the worst-case p-values also have finite-sample validity. Multiplicity-adjusted p-values can be computed as follows. Let |$\rho _{(1)}, \ldots , \rho _{(K)}$| be the single worst-case p-values of the |$K$| hypotheses, arranged in ascending order. Then the Holm stepdown p-values adjusted for multiple testing are given by |$\varrho _{(k)} = \max _{j \le k} \min (1,(K - j + 1)\, \rho _{(j)})$| for |$k \in \lbrace 1, \ldots , K\rbrace$|. These adjusted p-values can be even more conservative because they assume the least favourable dependence structure between the single worst-case p-values (Romano et al., 2010), making this the “worst-case” of the “worst-case.” Slightly less conservative multiple hypothesis tests are available in the literature (see Romano and Wolf, 2005; Romano and Shaikh, 2010). Since it is unclear how much improvement in power they provide relative to Holm tests in our context, we do not discuss the more computationally involved stepdown procedures in this paper.
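The Holm adjustment formula above translates directly into code. The following is a minimal sketch that takes the single p-values in any order and returns the adjusted values in the same order. The function name and the input p-values in the usage line are hypothetical.

```python
def holm_adjusted(pvalues):
    """Holm stepdown adjusted p-values, returned in the input order.

    Implements adjusted_(k) = max over j <= k of min(1, (K - j + 1) * p_(j)),
    where p_(1) <= ... <= p_(K) are the sorted single p-values.
    """
    K = len(pvalues)
    order = sorted(range(K), key=lambda i: pvalues[i])
    adjusted = [0.0] * K
    running_max = 0.0
    for rank, i in enumerate(order, start=1):
        candidate = min(1.0, (K - rank + 1) * pvalues[i])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted


# Hypothetical single p-values for three outcomes.
example = holm_adjusted([0.01, 0.04, 0.03])
```

The running maximum ensures that a hypothesis with a larger single p-value never receives a smaller adjusted p-value than one ranked before it.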
4.4. Comparing methods for inference with imperfect randomization
Our approach complements that of Heckman et al. (2020), who improve on the methodology of Heckman et al. (2010a) by (i) exploiting a symmetry generated by the Perry randomization protocol, (ii) using finite-sample inference that accounts for imperfect randomization, and (iii) making transfers in the fifth step of the randomization protocol depend on a binary variable47 indicating whether the mother is available for home visits, assuming programme infrastructure is available to support it. We also exploit the symmetry: Qc represents the result of a fair coin flip to determine which of the two initially undesignated groups becomes the intended treatment group. However, we model other features of the protocol differently.
Heckman et al. (2020) model the reassignment of children of some working women by introducing a partially observed binary variable Ui that equals 1 if the mother of participant i was unavailable for home visits and 0 otherwise. It is known only for children of nonworking mothers, for whom Ui = 0, and for the children of working mothers in the final treatment group, who also have Ui = 0. For children of working mothers in the control group, Ui is not known and could be either 0 or 1. To deal with this difficulty, Heckman et al. (2020) construct two permutation tests. The first test sets Ui to 0 for all children of working mothers in the final control group and conducts a generalized permutation test accordingly. The second test: (i) samples a vector of Ui from the space of possibilities for Ui; (ii) conducts a generalized permutation test given the sampled vector of Ui and obtains the corresponding permutation p-value; and (iii) repeats steps (i) and (ii) until the space of possibilities is exhausted. It then takes the maximum p-value among the computed p-values. Our worst-case inferential methods are similar in spirit. However, there are three key differences between our approach and theirs.
First, Heckman et al. (2020) interpret Ui as a fixed trait of mothers regardless of the (random) circumstances facing programme administrators. However, whether or not a working mother and her child are visited at home (through special arrangements, e.g., on a weekend) depends, at least in part, on the availability and capacity constraints of the Perry staff. While Ui = 0 for nonworking mothers in both papers, we do not view Ui as a fixed binary trait of working mothers. Consistent with our review of the randomization protocol, we assume that children of working mothers are able to participate in the programme if special arrangements, such as weekend home visits, are made for them. In our model, there are capacity constraints for making special arrangements, so only a limited number of slots are available.48 In their model, if Ui = 1 for a working mother, her child would always be placed in the control group, because she would not accept any special accommodations even if provided by the Perry staff. Unlike the Vi,c variable that determines post-randomization transfers in our model, the Ui characteristic in their model is allowed to be related to potential outcomes, but this is a consequence of its interpretation as a fixed trait of mothers independent of the capacities of programme administrators.
Second, their procedure assumes that “some participants were exchanged between the treatment and control groups in order to balance gender and socioeconomic status score while keeping Stanford–Binet IQ score roughly constant.”49 However, as shown in Appendix B, Perry data from wave 4 reveal that the exchanges were not necessarily between consecutively ranked IQ pairs. Our approach accommodates this feature while also making more explicit the notion of balance.
Third, on a more minor note, we incorporate the five children (out of the original 128) who dropped out of the study for extraneous reasons, since those five children were also a part of the initial randomization protocol. Our approach can also be applied more readily than that of Heckman et al. (2020) to a variety of compromised experiments, including many discussed by Bruhn and McKenzie (2009). We next demonstrate that there are important differences between inferences obtained from our procedure and theirs.
5. REANALYSIS OF HECKMAN ET AL. (2020)
This section uses the methods developed in this paper to reconsider the conclusions reached by Heckman et al. (2020) on the Perry participants through to age 40. We first list our estimators of treatment effects. Using the corresponding test statistics, we then apply our worst-case inferential methods to reanalyse the results in Heckman et al. (2020).
5.1. Estimators and test statistics for hypothesis testing
We could use a local average treatment effect (LATE) estimator, and other standard estimation methods dealing with imperfect compliance, if we knew each observation’s initial treatment status. However, in the Perry example, we do not know which members were transferred from the initial treatment group to the control group in the last step of the randomization protocol. Given this problem, we do not present LATE estimates.57
5.2. Empirical analysis
We first conduct a head-to-head comparison of Heckman et al.’s (2020) methods and ours using the same outcomes they analyse. Additionally, to compare the impact of using mean differences versus AIPW test statistics in the conventional inferential approaches and our design-based worst-case inference, we extend the outcomes they study and analyse data on violent crime.
Tables 2 and 3 report our reanalyses of Heckman et al. (2020), analysing each outcome one at a time using the doubly robust attrition-adjusted AIPW estimator. Tables 4 and 5 provide stepdown p-values for the outcomes based on multiple testing. Extended versions of these tables are presented in Online Appendices S3 to S9 using alternative test statistics for inference.58
Variable | Age | Untreated mean | Treated mean | AIPW estimate | Asymptotic p-value | Bootstrap p-value | Permutation p-value | Worst-case max. p | Worst-case de Haan p |
---|---|---|---|---|---|---|---|---|---|
Stanford–Binet IQ | 4 | 83.077 | 94.909 | 8.988 | |$\mathbf { 0.0000 }$| | |$\mathbf { 0.0000 }$| | |$\mathbf { 0.0004 }$| | |$\mathbf { 0.0052 }$| | |$\mathbf { 0.0060 }$| |
Stanford–Binet IQ | 5 | 84.793 | 95.400 | 9.167 | |$\mathbf { 0.0000 }$| | |$\mathbf { 0.0002 }$| | |$\mathbf { 0.0004 }$| | |$\mathbf { 0.0050 }$| | |$\mathbf { 0.0059 }$| |
Stanford–Binet IQ | 6 | 85.821 | 91.485 | 3.056 | |$\mathbf { 0.0557 }$| | |$\mathbf { 0.0512 }$| | |$\mathbf { 0.0712 }$| | |$\mathbf { 0.0838 }$| | 0.2362 |
Stanford–Binet IQ | 7 | 87.711 | 91.121 | 1.576 | 0.2040 | 0.2143 | 0.2104 | 0.1905 | 0.3701 |
Stanford–Binet IQ | 8 | 89.054 | 88.333 | −3.829 | |$\mathbf { 0.0512 }$| | |$\mathbf { 0.0719 }$| | |$\mathbf { 0.0556 }$| | 0.1519 | 0.1894 |
Stanford–Binet IQ | 9 | 89.026 | 88.394 | −4.167 | |$\mathbf { 0.0398 }$| | |$\mathbf { 0.0577 }$| | |$\mathbf { 0.0472 }$| | 0.1147 | 0.3800 |
Stanford–Binet IQ | 10 | 86.026 | 83.697 | −4.722 | |$\mathbf { 0.0225 }$| | |$\mathbf { 0.0412 }$| | |$\mathbf { 0.0292 }$| | |$\mathbf { 0.0678 }$| | 0.1012 |
CAT reading score | 14 | 9.000 | 13.926 | 1.815 | 0.2957 | 0.3221 | 0.3112 | 0.3253 | 0.4273 |
CAT arithmetic score | 14 | 8.107 | 16.000 | 3.095 | 0.2410 | 0.2629 | 0.2608 | 0.2722 | 0.3434 |
CAT language score | 14 | 6.536 | 14.333 | 5.029 | |$\mathbf { 0.0815 }$| | |$\mathbf { 0.0995 }$| | 0.1076 | 0.1764 | 0.3482 |
CAT mechanics score | 14 | 6.964 | 15.556 | 5.979 | |$\mathbf { 0.0538 }$| | |$\mathbf { 0.0638 }$| | |$\mathbf { 0.0712 }$| | 0.1234 | 0.2364 |
CAT spelling score | 14 | 11.536 | 18.519 | 3.171 | 0.2652 | 0.2865 | 0.2600 | 0.2741 | 0.4587 |
High school graduate | 19 | 0.513 | 0.485 | 0.015 | 0.4550 | 0.4540 | 0.4868 | 0.5651 | 0.7190 |
Vocational training | 40 | 0.333 | 0.394 | 0.071 | 0.2762 | 0.2886 | 0.2932 | 0.3619 | 0.4378 |
Highest grade completed | 19 | 11.282 | 11.364 | 0.087 | 0.3902 | 0.3901 | 0.4240 | 0.4583 | 0.6282 |
Grade point average | 19 | 1.794 | 1.814 | −0.035 | 0.4366 | 0.4336 | 0.4328 | 0.5267 | 0.6983 |
Total nonjuvenile arrests | 40 | 11.718 | 7.455 | −3.895 | |$\mathbf { 0.0461 }$| | |$\mathbf { 0.0368 }$| | |$\mathbf { 0.0668 }$| | |$\mathbf { 0.0951 }$| | 1.0000 |
Total crime cost | 40 | 775.901 | 424.679 | −313.263 | 0.1376 | 0.1361 | 0.1764 | 0.2024 | 0.3880 |
Total charges | 40 | 13.385 | 9.000 | −4.132 | |$\mathbf { 0.0678 }$| | |$\mathbf { 0.0579 }$| | |$\mathbf { 0.0920 }$| | 0.1216 | 0.2598 |
Nonvictimless charges | 40 | 3.077 | 1.485 | −1.444 | |$\mathbf { 0.0274 }$| | |$\mathbf { 0.0238 }$| | |$\mathbf { 0.0372 }$| | |$\mathbf { 0.0856 }$| | 0.2792 |
Currently employed | 19 | 0.410 | 0.545 | 0.147 | 0.1263 | 0.1315 | 0.1292 | 0.2999 | 0.4666 |
Unemployed last year | 19 | 0.128 | 0.242 | 0.102 | 0.1817 | 0.1827 | 0.2148 | 0.2861 | 0.6064 |
Jobless months (past 2 yrs) | 19 | 3.816 | 5.281 | 1.367 | 0.2572 | 0.2501 | 0.2928 | 0.3243 | 1.0000 |
Currently employed | 27 | 0.564 | 0.600 | 0.089 | 0.2156 | 0.2259 | 0.2452 | 0.3335 | 0.8446 |
Unemployed last year | 27 | 0.308 | 0.242 | −0.081 | 0.2238 | 0.2190 | 0.2388 | 0.3488 | 0.5882 |
Jobless months (past 2 yrs) | 27 | 8.795 | 5.133 | −3.868 | |$\mathbf { 0.0438 }$| | |$\mathbf { 0.0430 }$| | |$\mathbf { 0.0588 }$| | 0.1115 | 0.3548 |
Currently employed | 40 | 0.500 | 0.700 | 0.266 | |$\mathbf { 0.0089 }$| | |$\mathbf { 0.0075 }$| | |$\mathbf { 0.0204 }$| | |$\mathbf { 0.0484 }$| | |$\mathbf { 0.0971 }$| |
Unemployed last year | 40 | 0.462 | 0.364 | −0.143 | |$\mathbf { 0.0843 }$| | |$\mathbf { 0.0957 }$| | |$\mathbf { 0.0912 }$| | 0.1695 | 0.5219 |
Jobless months (past 2 yrs) | 40 | 10.750 | 7.233 | −4.758 | |$\mathbf { 0.0154 }$| | |$\mathbf { 0.0200 }$| | |$\mathbf { 0.0188 }$| | |$\mathbf { 0.0650 }$| | 0.1341 |
Note: This table reports p-values for single hypothesis tests of treatment effects on various outcomes of male participants at the given ages. The inferences are based on the studentized AIPW test statistic.
Variable | Age | Untreated mean | Treated mean | AIPW estimate | Asymptotic p-value | Bootstrap p-value | Permutation p-value | Worst-case max. p | Worst-case de Haan p |
---|---|---|---|---|---|---|---|---|---|
Stanford–Binet IQ | 4 | 83.692 | 96.360 | 13.425 | **0.0000** | **0.0000** | **0.0004** | **0.0053** | **0.0055** |
Stanford–Binet IQ | 5 | 81.650 | 94.316 | 14.157 | **0.0008** | **0.0006** | **0.0064** | **0.0258** | **0.0503** |
Stanford–Binet IQ | 6 | 87.160 | 90.913 | 5.271 | **0.0365** | **0.0281** | **0.0636** | **0.0799** | 0.4790 |
Stanford–Binet IQ | 7 | 86.000 | 92.520 | 7.347 | **0.0313** | **0.0154** | **0.0564** | **0.0952** | 0.1950 |
Stanford–Binet IQ | 8 | 83.600 | 87.840 | 4.669 | 0.1144 | **0.0896** | 0.1704 | 0.2080 | 0.2832 |
Stanford–Binet IQ | 9 | 83.043 | 86.739 | 4.809 | **0.0633** | **0.0679** | 0.1128 | 0.1578 | 0.2873 |
Stanford–Binet IQ | 10 | 81.789 | 86.750 | 6.480 | **0.0277** | **0.0323** | **0.0596** | 0.1840 | 0.4267 |
CAT reading score | 14 | 8.444 | 16.500 | 7.345 | **0.0130** | **0.0128** | **0.0268** | **0.0561** | 0.1125 |
CAT arithmetic score | 14 | 6.889 | 11.818 | 6.227 | **0.0102** | **0.0138** | **0.0284** | **0.0624** | **0.0731** |
CAT language score | 14 | 7.833 | 19.455 | 11.923 | **0.0009** | **0.0013** | **0.0044** | **0.0168** | **0.0524** |
CAT mechanics score | 14 | 8.833 | 20.636 | 12.425 | **0.0014** | **0.0015** | **0.0072** | **0.0211** | **0.0606** |
CAT spelling score | 14 | 10.722 | 29.500 | 18.270 | **0.0017** | **0.0042** | **0.0064** | **0.0176** | **0.0254** |
High school graduate | 19 | 0.231 | 0.840 | 0.570 | **0.0000** | **0.0000** | **0.0004** | **0.0054** | **0.0058** |
Vocational training | 40 | 0.077 | 0.240 | 0.183 | **0.0286** | **0.0494** | **0.0420** | 0.1056 | 0.2630 |
Highest grade completed | 19 | 10.750 | 11.760 | 1.202 | **0.0023** | **0.0106** | **0.0120** | **0.0345** | **0.0935** |
Grade point average | 19 | 1.527 | 2.415 | 0.958 | **0.0000** | **0.0155** | **0.0004** | **0.0119** | **0.0164** |
Total nonjuvenile arrests | 40 | 4.423 | 2.160 | −1.938 | **0.0657** | **0.0795** | **0.0880** | 0.1695 | 0.4424 |
Total crime cost | 40 | 293.497 | 22.165 | −246.242 | 0.1475 | 0.1227 | 0.2436 | 0.2524 | 0.5508 |
Total charges | 40 | 4.923 | 2.240 | −2.309 | **0.0439** | **0.0528** | **0.0580** | 0.1526 | 0.2963 |
Nonvictimless charges | 40 | 0.308 | 0.040 | −0.249 | **0.0365** | **0.0263** | **0.0612** | **0.0906** | 0.2574 |
Currently employed | 19 | 0.154 | 0.440 | 0.297 | **0.0054** | **0.0048** | **0.0152** | **0.0578** | 0.2187 |
Unemployed last year | 19 | 0.577 | 0.240 | −0.354 | **0.0029** | **0.0033** | **0.0104** | **0.0341** | 0.1878 |
Jobless months (past 2 yrs) | 19 | 10.421 | 5.217 | −4.197 | **0.0723** | 0.1386 | 0.1140 | 0.1886 | 0.4619 |
Currently employed | 27 | 0.545 | 0.800 | 0.215 | **0.0471** | **0.0604** | **0.0648** | 0.1048 | 0.2521 |
Unemployed last year | 27 | 0.542 | 0.250 | −0.269 | **0.0523** | **0.0457** | **0.0728** | 0.1721 | 1.0000 |
Jobless months (past 2 yrs) | 27 | 10.455 | 6.240 | −1.298 | 0.3328 | 0.3449 | 0.2916 | 0.4821 | 0.7544 |
Currently employed | 40 | 0.818 | 0.833 | −0.016 | 0.4536 | 0.4586 | 0.4912 | 0.6385 | 1.0000 |
Unemployed last year | 40 | 0.409 | 0.160 | −0.194 | **0.0807** | 0.1079 | 0.1324 | 0.1892 | 0.3070 |
Jobless months (past 2 yrs) | 40 | 5.045 | 4.000 | 0.057 | 0.4910 | 0.4927 | 0.4700 | 0.6326 | 1.0000 |
Note: This table reports p-values for single hypothesis tests of treatment effects on various outcomes of female participants at the given ages. The inferences are based on the studentized AIPW test statistic.
Variable | Age | Untreated mean | Treated mean | AIPW estimate | Asymptotic p-value | Bootstrap p-value | Permutation p-value | Worst-case max. p | Worst-case de Haan p |
---|---|---|---|---|---|---|---|---|---|
Stanford–Binet IQ | 4 | 83.077 | 94.909 | 8.988 | **0.0001** | **0.0002** | **0.0028** | **0.0348** | **0.0415** |
Stanford–Binet IQ | 5 | 84.793 | 95.400 | 9.167 | **0.0003** | **0.0012** | **0.0028** | **0.0348** | **0.0415** |
Stanford–Binet IQ | 6 | 85.821 | 91.485 | 3.056 | 0.1593 | 0.2058 | 0.1888 | 0.3391 | 0.7574 |
Stanford–Binet IQ | 7 | 87.711 | 91.121 | 1.576 | 0.2040 | 0.2143 | 0.2104 | 0.3440 | 0.7574 |
Stanford–Binet IQ | 8 | 89.054 | 88.333 | −3.829 | 0.1593 | 0.2058 | 0.1888 | 0.3440 | 0.7574 |
Stanford–Binet IQ | 9 | 89.026 | 88.394 | −4.167 | 0.1593 | 0.2058 | 0.1888 | 0.3440 | 0.7574 |
Stanford–Binet IQ | 10 | 86.026 | 83.697 | −4.722 | 0.1126 | 0.2058 | 0.1460 | 0.3391 | 0.5062 |
CAT reading score | 14 | 9.000 | 13.926 | 1.815 | 0.7229 | 0.7886 | 0.7800 | 0.8167 | 1.0000 |
CAT arithmetic score | 14 | 8.107 | 16.000 | 3.095 | 0.7229 | 0.7886 | 0.7800 | 0.8167 | 1.0000 |
CAT language score | 14 | 6.536 | 14.333 | 5.029 | 0.3260 | 0.3980 | 0.4304 | 0.7058 | 1.0000 |
CAT mechanics score | 14 | 6.964 | 15.556 | 5.979 | 0.2690 | 0.3190 | 0.3560 | 0.6171 | 1.0000 |
CAT spelling score | 14 | 11.536 | 18.519 | 3.171 | 0.7229 | 0.7886 | 0.7800 | 0.8167 | 1.0000 |
High school graduate | 19 | 0.513 | 0.485 | 0.015 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Vocational training | 40 | 0.333 | 0.394 | 0.071 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Highest grade completed | 19 | 11.282 | 11.364 | 0.087 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Grade point average | 19 | 1.794 | 1.814 | −0.035 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Total nonjuvenile arrests | 40 | 11.718 | 7.455 | −3.895 | 0.1384 | 0.1103 | 0.2004 | 0.3424 | 1.0000 |
Total crime cost | 40 | 775.901 | 424.679 | −313.263 | 0.1384 | 0.1361 | 0.2004 | 0.3424 | 1.0000 |
Total charges | 40 | 13.385 | 9.000 | −4.132 | 0.1384 | 0.1158 | 0.2004 | 0.3424 | 1.0000 |
Nonvictimless charges | 40 | 3.077 | 1.485 | −1.444 | 0.1096 | **0.0952** | 0.1488 | 0.3424 | 1.0000 |
Currently employed | 19 | 0.410 | 0.545 | 0.147 | 0.3790 | 0.3946 | 0.3876 | 0.8583 | 1.0000 |
Unemployed last year | 19 | 0.128 | 0.242 | 0.102 | 0.3790 | 0.3946 | 0.4296 | 0.8583 | 1.0000 |
Jobless months (past 2 yrs) | 19 | 3.816 | 5.281 | 1.367 | 0.3790 | 0.3946 | 0.4296 | 0.8583 | 1.0000 |
Currently employed | 27 | 0.564 | 0.600 | 0.089 | 0.4313 | 0.4380 | 0.4776 | 0.6670 | 1.0000 |
Unemployed last year | 27 | 0.308 | 0.242 | −0.081 | 0.4313 | 0.4380 | 0.4776 | 0.6670 | 1.0000 |
Jobless months (past 2 yrs) | 27 | 8.795 | 5.133 | −3.868 | 0.1313 | 0.1290 | 0.1764 | 0.3344 | 1.0000 |
Currently employed | 40 | 0.500 | 0.700 | 0.266 | **0.0268** | **0.0225** | **0.0564** | 0.1451 | 0.2912 |
Unemployed last year | 40 | 0.462 | 0.364 | −0.143 | **0.0843** | **0.0957** | **0.0912** | 0.1695 | 0.5219 |
Jobless months (past 2 yrs) | 40 | 10.750 | 7.233 | −4.758 | **0.0309** | **0.0399** | **0.0564** | 0.1451 | 0.2912 |
Note: This table reports Holm stepdown p-values for multiple hypothesis tests of treatment effects on various outcomes of male participants at the given ages. The inferences are based on the studentized AIPW test statistic. The blocks used for multiple testing are indicated above using divider lines.
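As a rough illustration of the Holm stepdown adjustment named in the note, the following Python sketch applies the textbook procedure to the three single-test asymptotic p-values reported for the age-40 employment outcomes of male participants (0.0089, 0.0843, 0.0154). Treating these three outcomes as one testing block is an assumption inferred from the reported values, and `holm_stepdown` is a hypothetical helper, not the authors' code.

```python
def holm_stepdown(p_values):
    """Return Holm step-down adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Assumed block: age-40 employment outcomes for males (asymptotic single-test p-values).
block = {
    "Currently employed": 0.0089,
    "Unemployed last year": 0.0843,
    "Jobless months (past 2 yrs)": 0.0154,
}
adjusted = holm_stepdown(list(block.values()))
for name, p in zip(block, adjusted):
    print(f"{name}: {p:.4f}")
```

Under these assumptions the adjusted values agree with the asymptotic column of the Holm table for the same three rows up to rounding of the inputs. The monotonicity step explains why two outcomes in a block can share the same adjusted p-value, as in the worst-case columns.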
. | . | Untreated . | Treated . | AIPW . | Asymptotic . | Bootstrap . | Permutation . | Worst-case . | Worst-case . |
---|---|---|---|---|---|---|---|---|---|
Variable . | Age . | mean . | mean . | estimate . | p-value . | p-value . | p-value . | max. p . | de Haan p . |
Stanford–Binet IQ | 4 | 83.077 | 94.909 | 8.988 | |$\mathbf { 0.0001 }$| | |$\mathbf { 0.0002 }$| | |$\mathbf { 0.0028 }$| | |$\mathbf { 0.0348 }$| | |$\mathbf { 0.0415 }$| |
Stanford–Binet IQ | 5 | 84.793 | 95.400 | 9.167 | |$\mathbf { 0.0003 }$| | |$\mathbf { 0.0012 }$| | |$\mathbf { 0.0028 }$| | |$\mathbf { 0.0348 }$| | |$\mathbf { 0.0415 }$| |
Stanford–Binet IQ | 6 | 85.821 | 91.485 | 3.056 | 0.1593 | 0.2058 | 0.1888 | 0.3391 | 0.7574 |
Stanford–Binet IQ | 7 | 87.711 | 91.121 | 1.576 | 0.2040 | 0.2143 | 0.2104 | 0.3440 | 0.7574 |
Stanford–Binet IQ | 8 | 89.054 | 88.333 | −3.829 | 0.1593 | 0.2058 | 0.1888 | 0.3440 | 0.7574 |
Stanford–Binet IQ | 9 | 89.026 | 88.394 | −4.167 | 0.1593 | 0.2058 | 0.1888 | 0.3440 | 0.7574 |
Stanford–Binet IQ | 10 | 86.026 | 83.697 | −4.722 | 0.1126 | 0.2058 | 0.1460 | 0.3391 | 0.5062 |
CAT reading score | 14 | 9.000 | 13.926 | 1.815 | 0.7229 | 0.7886 | 0.7800 | 0.8167 | 1.0000 |
CAT arithmetic score | 14 | 8.107 | 16.000 | 3.095 | 0.7229 | 0.7886 | 0.7800 | 0.8167 | 1.0000 |
CAT language score | 14 | 6.536 | 14.333 | 5.029 | 0.3260 | 0.3980 | 0.4304 | 0.7058 | 1.0000 |
CAT mechanics score | 14 | 6.964 | 15.556 | 5.979 | 0.2690 | 0.3190 | 0.3560 | 0.6171 | 1.0000 |
CAT spelling score | 14 | 11.536 | 18.519 | 3.171 | 0.7229 | 0.7886 | 0.7800 | 0.8167 | 1.0000 |
High school graduate | 19 | 0.513 | 0.485 | 0.015 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Vocational training | 40 | 0.333 | 0.394 | 0.071 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Highest grade completed | 19 | 11.282 | 11.364 | 0.087 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Grade point average | 19 | 1.794 | 1.814 | −0.035 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Total nonjuvenile arrests | 40 | 11.718 | 7.455 | −3.895 | 0.1384 | 0.1103 | 0.2004 | 0.3424 | 1.0000 |
Total crime cost | 40 | 775.901 | 424.679 | −313.263 | 0.1384 | 0.1361 | 0.2004 | 0.3424 | 1.0000 |
Total charges | 40 | 13.385 | 9.000 | −4.132 | 0.1384 | 0.1158 | 0.2004 | 0.3424 | 1.0000 |
Nonvictimless charges | 40 | 3.077 | 1.485 | −1.444 | 0.1096 | |$\mathbf { 0.0952 }$| | 0.1488 | 0.3424 | 1.0000 |
Currently employed | 19 | 0.410 | 0.545 | 0.147 | 0.3790 | 0.3946 | 0.3876 | 0.8583 | 1.0000 |
Unemployed last year | 19 | 0.128 | 0.242 | 0.102 | 0.3790 | 0.3946 | 0.4296 | 0.8583 | 1.0000 |
Jobless months (past 2 yrs) | 19 | 3.816 | 5.281 | 1.367 | 0.3790 | 0.3946 | 0.4296 | 0.8583 | 1.0000 |
Currently employed | 27 | 0.564 | 0.600 | 0.089 | 0.4313 | 0.4380 | 0.4776 | 0.6670 | 1.0000 |
Unemployed last year | 27 | 0.308 | 0.242 | −0.081 | 0.4313 | 0.4380 | 0.4776 | 0.6670 | 1.0000 |
Jobless months (past 2 yrs) | 27 | 8.795 | 5.133 | −3.868 | 0.1313 | 0.1290 | 0.1764 | 0.3344 | 1.0000 |
Currently employed | 40 | 0.500 | 0.700 | 0.266 | |$\mathbf { 0.0268 }$| | |$\mathbf { 0.0225 }$| | |$\mathbf { 0.0564 }$| | 0.1451 | 0.2912 |
Unemployed last year | 40 | 0.462 | 0.364 | −0.143 | |$\mathbf { 0.0843 }$| | |$\mathbf { 0.0957 }$| | |$\mathbf { 0.0912 }$| | 0.1695 | 0.5219 |
Jobless months (past 2 yrs) | 40 | 10.750 | 7.233 | −4.758 | |$\mathbf { 0.0309 }$| | |$\mathbf { 0.0399 }$| | |$\mathbf { 0.0564 }$| | 0.1451 | 0.2912 |
. | . | Untreated . | Treated . | AIPW . | Asymptotic . | Bootstrap . | Permutation . | Worst-case . | Worst-case . |
---|---|---|---|---|---|---|---|---|---|
Variable . | Age . | mean . | mean . | estimate . | p-value . | p-value . | p-value . | max. p . | de Haan p . |
Stanford–Binet IQ | 4 | 83.077 | 94.909 | 8.988 | |$\mathbf { 0.0001 }$| | |$\mathbf { 0.0002 }$| | |$\mathbf { 0.0028 }$| | |$\mathbf { 0.0348 }$| | |$\mathbf { 0.0415 }$| |
Stanford–Binet IQ | 5 | 84.793 | 95.400 | 9.167 | |$\mathbf { 0.0003 }$| | |$\mathbf { 0.0012 }$| | |$\mathbf { 0.0028 }$| | |$\mathbf { 0.0348 }$| | |$\mathbf { 0.0415 }$| |
Stanford–Binet IQ | 6 | 85.821 | 91.485 | 3.056 | 0.1593 | 0.2058 | 0.1888 | 0.3391 | 0.7574 |
Stanford–Binet IQ | 7 | 87.711 | 91.121 | 1.576 | 0.2040 | 0.2143 | 0.2104 | 0.3440 | 0.7574 |
Stanford–Binet IQ | 8 | 89.054 | 88.333 | −3.829 | 0.1593 | 0.2058 | 0.1888 | 0.3440 | 0.7574 |
Stanford–Binet IQ | 9 | 89.026 | 88.394 | −4.167 | 0.1593 | 0.2058 | 0.1888 | 0.3440 | 0.7574 |
Stanford–Binet IQ | 10 | 86.026 | 83.697 | −4.722 | 0.1126 | 0.2058 | 0.1460 | 0.3391 | 0.5062 |
CAT reading score | 14 | 9.000 | 13.926 | 1.815 | 0.7229 | 0.7886 | 0.7800 | 0.8167 | 1.0000 |
CAT arithmetic score | 14 | 8.107 | 16.000 | 3.095 | 0.7229 | 0.7886 | 0.7800 | 0.8167 | 1.0000 |
CAT language score | 14 | 6.536 | 14.333 | 5.029 | 0.3260 | 0.3980 | 0.4304 | 0.7058 | 1.0000 |
CAT mechanics score | 14 | 6.964 | 15.556 | 5.979 | 0.2690 | 0.3190 | 0.3560 | 0.6171 | 1.0000 |
CAT spelling score | 14 | 11.536 | 18.519 | 3.171 | 0.7229 | 0.7886 | 0.7800 | 0.8167 | 1.0000 |
High school graduate | 19 | 0.513 | 0.485 | 0.015 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Vocational training | 40 | 0.333 | 0.394 | 0.071 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Highest grade completed | 19 | 11.282 | 11.364 | 0.087 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Grade point average | 19 | 1.794 | 1.814 | −0.035 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Total nonjuvenile arrests | 40 | 11.718 | 7.455 | −3.895 | 0.1384 | 0.1103 | 0.2004 | 0.3424 | 1.0000 |
Total crime cost | 40 | 775.901 | 424.679 | −313.263 | 0.1384 | 0.1361 | 0.2004 | 0.3424 | 1.0000 |
Total charges | 40 | 13.385 | 9.000 | −4.132 | 0.1384 | 0.1158 | 0.2004 | 0.3424 | 1.0000 |
Nonvictimless charges | 40 | 3.077 | 1.485 | −1.444 | 0.1096 | |$\mathbf { 0.0952 }$| | 0.1488 | 0.3424 | 1.0000 |
Currently employed | 19 | 0.410 | 0.545 | 0.147 | 0.3790 | 0.3946 | 0.3876 | 0.8583 | 1.0000 |
Unemployed last year | 19 | 0.128 | 0.242 | 0.102 | 0.3790 | 0.3946 | 0.4296 | 0.8583 | 1.0000 |
Jobless months (past 2 yrs) | 19 | 3.816 | 5.281 | 1.367 | 0.3790 | 0.3946 | 0.4296 | 0.8583 | 1.0000 |
Currently employed | 27 | 0.564 | 0.600 | 0.089 | 0.4313 | 0.4380 | 0.4776 | 0.6670 | 1.0000 |
Unemployed last year | 27 | 0.308 | 0.242 | −0.081 | 0.4313 | 0.4380 | 0.4776 | 0.6670 | 1.0000 |
Jobless months (past 2 yrs) | 27 | 8.795 | 5.133 | −3.868 | 0.1313 | 0.1290 | 0.1764 | 0.3344 | 1.0000 |
Currently employed | 40 | 0.500 | 0.700 | 0.266 | $\mathbf{0.0268}$ | $\mathbf{0.0225}$ | $\mathbf{0.0564}$ | 0.1451 | 0.2912 |
Unemployed last year | 40 | 0.462 | 0.364 | −0.143 | $\mathbf{0.0843}$ | $\mathbf{0.0957}$ | $\mathbf{0.0912}$ | 0.1695 | 0.5219 |
Jobless months (past 2 yrs) | 40 | 10.750 | 7.233 | −4.758 | $\mathbf{0.0309}$ | $\mathbf{0.0399}$ | $\mathbf{0.0564}$ | 0.1451 | 0.2912 |
Note: This table reports Holm stepdown p-values for multiple hypothesis tests of treatment effects on various outcomes of male participants at the given ages. The inferences are based on the studentized AIPW test statistic. The blocks used for multiple testing are indicated above using divider lines.
Variable | Age | Untreated mean | Treated mean | AIPW estimate | Asymptotic p-value | Bootstrap p-value | Permutation p-value | Worst-case max. p | Worst-case de Haan p |
---|---|---|---|---|---|---|---|---|---|
Stanford–Binet IQ | 4 | 83.692 | 96.360 | 13.425 | $\mathbf{0.0000}$ | $\mathbf{0.0000}$ | $\mathbf{0.0028}$ | $\mathbf{0.0369}$ | $\mathbf{0.0387}$ |
Stanford–Binet IQ | 5 | 81.650 | 94.316 | 14.157 | $\mathbf{0.0046}$ | $\mathbf{0.0035}$ | $\mathbf{0.0384}$ | 0.1550 | 0.3020 |
Stanford–Binet IQ | 6 | 87.160 | 90.913 | 5.271 | 0.1387 | 0.1125 | 0.2820 | 0.3997 | 1.0000 |
Stanford–Binet IQ | 7 | 86.000 | 92.520 | 7.347 | 0.1387 | $\mathbf{0.0771}$ | 0.2820 | 0.3997 | 0.9749 |
Stanford–Binet IQ | 8 | 83.600 | 87.840 | 4.669 | 0.1387 | 0.1359 | 0.2820 | 0.4734 | 1.0000 |
Stanford–Binet IQ | 9 | 83.043 | 86.739 | 4.809 | 0.1387 | 0.1359 | 0.2820 | 0.4734 | 1.0000 |
Stanford–Binet IQ | 10 | 81.789 | 86.750 | 6.480 | 0.1387 | 0.1125 | 0.2820 | 0.4734 | 1.0000 |
CAT reading score | 14 | 8.444 | 16.500 | 7.345 | $\mathbf{0.0205}$ | $\mathbf{0.0255}$ | $\mathbf{0.0536}$ | 0.1122 | 0.2097 |
CAT arithmetic score | 14 | 6.889 | 11.818 | 6.227 | $\mathbf{0.0205}$ | $\mathbf{0.0255}$ | $\mathbf{0.0536}$ | 0.1122 | 0.2097 |
CAT language score | 14 | 7.833 | 19.455 | 11.923 | $\mathbf{0.0043}$ | $\mathbf{0.0064}$ | $\mathbf{0.0220}$ | $\mathbf{0.0842}$ | 0.2097 |
CAT mechanics score | 14 | 8.833 | 20.636 | 12.425 | $\mathbf{0.0056}$ | $\mathbf{0.0064}$ | $\mathbf{0.0256}$ | $\mathbf{0.0842}$ | 0.2097 |
CAT spelling score | 14 | 10.722 | 29.500 | 18.270 | $\mathbf{0.0056}$ | $\mathbf{0.0127}$ | $\mathbf{0.0256}$ | $\mathbf{0.0842}$ | 0.1272 |
High school graduate | 19 | 0.231 | 0.840 | 0.570 | $\mathbf{0.0000}$ | $\mathbf{0.0000}$ | $\mathbf{0.0016}$ | $\mathbf{0.0218}$ | $\mathbf{0.0232}$ |
Vocational training | 40 | 0.077 | 0.240 | 0.183 | $\mathbf{0.0286}$ | $\mathbf{0.0494}$ | $\mathbf{0.0420}$ | 0.1056 | 0.2630 |
Highest grade completed | 19 | 10.750 | 11.760 | 1.202 | $\mathbf{0.0046}$ | $\mathbf{0.0318}$ | $\mathbf{0.0240}$ | $\mathbf{0.0690}$ | 0.1871 |
Grade point average | 19 | 1.527 | 2.415 | 0.958 | $\mathbf{0.0000}$ | $\mathbf{0.0318}$ | $\mathbf{0.0016}$ | $\mathbf{0.0357}$ | $\mathbf{0.0493}$ |
Total nonjuvenile arrests | 40 | 4.423 | 2.160 | −1.938 | 0.1461 | 0.1589 | 0.2320 | 0.4578 | 1.0000 |
Total crime cost | 40 | 293.497 | 22.165 | −246.242 | 0.1475 | 0.1589 | 0.2436 | 0.4578 | 1.0000 |
Total charges | 40 | 4.923 | 2.240 | −2.309 | 0.1461 | 0.1585 | 0.2320 | 0.4578 | 1.0000 |
Nonvictimless charges | 40 | 0.308 | 0.040 | −0.249 | 0.1461 | 0.1051 | 0.2320 | 0.3625 | 1.0000 |
Currently employed | 19 | 0.154 | 0.440 | 0.297 | $\mathbf{0.0107}$ | $\mathbf{0.0099}$ | $\mathbf{0.0312}$ | 0.1155 | 0.5635 |
Unemployed last year | 19 | 0.577 | 0.240 | −0.354 | $\mathbf{0.0088}$ | $\mathbf{0.0099}$ | $\mathbf{0.0312}$ | 0.1022 | 0.5635 |
Jobless months (past 2 yrs) | 19 | 10.421 | 5.217 | −4.197 | $\mathbf{0.0723}$ | 0.1386 | 0.1140 | 0.1886 | 0.5635 |
Currently employed | 27 | 0.545 | 0.800 | 0.215 | 0.1412 | 0.1371 | 0.1944 | 0.3143 | 0.7562 |
Unemployed last year | 27 | 0.542 | 0.250 | −0.269 | 0.1412 | 0.1371 | 0.1944 | 0.3443 | 1.0000 |
Jobless months (past 2 yrs) | 27 | 10.455 | 6.240 | −1.298 | 0.3328 | 0.3449 | 0.2916 | 0.4821 | 1.0000 |
Currently employed | 40 | 0.818 | 0.833 | −0.016 | 0.9072 | 0.9173 | 0.9400 | 1.0000 | 1.0000 |
Unemployed last year | 40 | 0.409 | 0.160 | −0.194 | 0.2421 | 0.3237 | 0.3972 | 0.5675 | 0.9211 |
Jobless months (past 2 yrs) | 40 | 5.045 | 4.000 | 0.057 | 0.9072 | 0.9173 | 0.9400 | 1.0000 | 1.0000 |
Note: This table reports Holm stepdown p-values for multiple hypothesis tests of treatment effects on various outcomes of female participants at the given ages. The inferences are based on the studentized AIPW test statistic. The blocks used for multiple testing are indicated above using divider lines.
In Tables 6 and 7, we reproduce Heckman et al.’s (2020) results and provide a side-by-side comparison of their inferences with our own. The most stringent (max-$U$) single p-values they report for the effects on the California Achievement Test (CAT) reading, arithmetic, language, mechanics, and spelling scores at age 14 in the male sample, using the studentized DIM test statistic, are 0.036, 0.086, 0.012, 0.023, and 0.012, respectively, which are lower than the asymptotic p-values we report in Table 2. After adjusting for multiple testing, their adjusted max-$U$ p-values are no more than 0.086, based on which they conclude that these effects are statistically significant. In contrast, using our approach, the worst-case maximum (single) p-values using the studentized DIM test statistic are 0.144, 0.119, 0.069, 0.046, and 0.114, respectively. As shown in our Table 2, using the studentized AIPW test statistic, our worst-case maximum p-values are 0.325, 0.272, 0.176, 0.123, and 0.274, respectively,59 implying that the effects on the CAT scores for males are not statistically significant. Of course, the stepdown p-values for these outcomes shown in Table 4 are also insignificant. Our inference for the female sample is qualitatively similar to theirs. As shown in Table 3, most of the block related to CAT scores for females is statistically significant at the 10% level. However, the multiplicity-adjusted stepdown worst-case de Haan p-values in Table 5 are 0.13 or larger.
Variable | Age | Heckman et al. (2020) U = 0 p-value (unadj.) | Heckman et al. (2020) U = 0 p-value (adjusted) | Heckman et al. (2020) max-U p-value (unadj.) | Heckman et al. (2020) max-U p-value (adjusted) | Our worst-case max. p (unadjusted) | Our worst-case max. p (adjusted) | Our worst-case de Haan p (unadjusted) | Our worst-case de Haan p (adjusted) |
---|---|---|---|---|---|---|---|---|---|
Stanford–Binet IQ | 4 | 0.001 | 0.001 | 0.008 | 0.008 | $\mathbf{0.0035}$ | $\mathbf{0.0246}$ | $\mathbf{0.0051}$ | $\mathbf{0.0358}$ |
Stanford–Binet IQ | 5 | 0.022 | 0.691 | 0.077 | 0.800 | $\mathbf{0.0053}$ | $\mathbf{0.0319}$ | 0.1314 | 0.6571 |
Stanford–Binet IQ | 6 | 0.033 | 0.034 | 0.094 | 0.102 | $\mathbf{0.0289}$ | 0.1447 | $\mathbf{0.0975}$ | 0.5848 |
Stanford–Binet IQ | 7 | 0.103 | 0.172 | 0.247 | 0.374 | $\mathbf{0.0858}$ | 0.3433 | 0.2259 | 0.9034 |
Stanford–Binet IQ | 8 | 0.599 | 0.691 | 0.733 | 0.800 | 0.5501 | 1.0000 | 0.7234 | 1.0000 |
Stanford–Binet IQ | 9 | 0.450 | 0.548 | 0.631 | 0.680 | 0.5635 | 1.0000 | 0.9429 | 1.0000 |
Stanford–Binet IQ | 10 | 0.684 | 0.691 | 0.790 | 0.800 | 0.2529 | 0.7588 | 0.3615 | 1.0000 |
CAT reading score | 14 | 0.017 | 0.035 | 0.036 | 0.086 | 0.1444 | 0.3410 | 0.1975 | 0.6537 |
CAT arithmetic score | 14 | 0.032 | 0.035 | 0.086 | 0.086 | 0.1185 | 0.3410 | 0.3746 | 0.6537 |
CAT language score | 14 | 0.001 | 0.004 | 0.012 | 0.027 | $\mathbf{0.0686}$ | 0.2743 | 0.1592 | 0.6537 |
CAT mechanics score | 14 | 0.006 | 0.007 | 0.023 | 0.035 | $\mathbf{0.0464}$ | 0.2320 | 0.1307 | 0.6537 |
CAT spelling score | 14 | 0.003 | 0.035 | 0.012 | 0.086 | 0.1137 | 0.3410 | 0.2247 | 0.6537 |
High school graduate | 19 | 0.614 | 0.674 | 0.704 | 0.716 | 0.6373 | 1.0000 | 0.8990 | 1.0000 |
Vocational training | 40 | 0.341 | 0.567 | 0.547 | 0.608 | 0.3582 | 1.0000 | 0.4612 | 1.0000 |
Highest grade completed | 19 | 0.383 | 0.622 | 0.410 | 0.669 | 0.3526 | 1.0000 | 0.5578 | 1.0000 |
Grade point average | 19 | 0.457 | 0.674 | 0.567 | 0.716 | 0.5132 | 1.0000 | 0.8153 | 1.0000 |
Total nonjuvenile arrests | 40 | 0.036 | 0.038 | 0.100 | 0.115 | $\mathbf{0.0713}$ | 0.2752 | 0.3823 | 1.0000 |
Total crime cost | 40 | 0.037 | 0.049 | 0.042 | 0.143 | 0.1746 | 0.2752 | 0.3494 | 1.0000 |
Total charges | 40 | 0.049 | 0.049 | 0.143 | 0.143 | 0.1136 | 0.2752 | 0.2610 | 1.0000 |
Nonvictimless charges | 40 | 0.025 | 0.037 | 0.063 | 0.091 | $\mathbf{0.0688}$ | 0.2752 | 0.3433 | 1.0000 |
Currently employed | 19 | 0.050 | 0.164 | 0.224 | 0.290 | 0.2763 | 0.7413 | 0.7335 | 1.0000 |
Unemployed last year | 19 | 0.901 | 0.901 | 0.922 | 0.922 | 0.2471 | 0.7413 | 0.5214 | 1.0000 |
Jobless months (past 2 yrs) | 19 | 0.821 | 0.849 | 0.873 | 0.890 | 0.3161 | 0.7413 | 0.5369 | 1.0000 |
Currently employed | 27 | 0.268 | 0.295 | 0.485 | 0.512 | 0.3304 | 0.6608 | 0.6281 | 0.9799 |
Unemployed last year | 27 | 0.235 | 0.295 | 0.360 | 0.512 | 0.3393 | 0.6608 | 0.4793 | 0.9799 |
Jobless months (past 2 yrs) | 27 | 0.020 | 0.020 | 0.036 | 0.051 | $\mathbf{0.0866}$ | 0.2599 | 0.3266 | 0.9799 |
Currently employed | 40 | 0.103 | 0.116 | 0.130 | 0.146 | $\mathbf{0.0595}$ | 0.1784 | 0.1149 | 0.3446 |
Unemployed last year | 40 | 0.154 | 0.154 | 0.216 | 0.216 | 0.1694 | 0.1784 | 1.0000 | 1.0000 |
Jobless months (past 2 yrs) | 40 | 0.064 | 0.116 | 0.070 | 0.146 | $\mathbf{0.0779}$ | 0.1784 | 0.1220 | 0.3446 |
Note: This table compares inferences reported by Heckman et al. (2020) with the inferences obtained using our worst-case tests. The first two columns list the blocks of outcomes analysed by Heckman et al. (2020). The next four columns reproduce their zero-$U$ ($U = 0$) p-values and max-$U$ p-values before and after adjusting for multiplicity of hypotheses. Since all of their tests are based on the studentized DIM estimate, we report our inferences (using the studentized DIM test statistic) side by side for comparison. The last four columns report our worst-case maximum p-values and worst-case de Haan p-values before and after adjusting for multiplicity of hypotheses. The unadjusted p-values refer to single p-values that are unadjusted for multiplicity of hypotheses. The adjusted p-values refer to stepdown p-values after adjusting for multiple testing.
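To make the relation between the unadjusted and adjusted columns concrete, the following is a minimal sketch of the textbook Holm stepdown adjustment, applied within a block to the unadjusted worst-case maximum p-values listed in the table above for the Stanford–Binet IQ and CAT blocks. The function name `holm_stepdown` and the variable names are hypothetical, and the inputs are the rounded four-decimal table entries rather than the exact values used in the paper, so the output should only be expected to agree with the reported adjusted column up to rounding. The sketch assumes that each block is adjusted separately, as the table note indicates.

```python
from typing import List


def holm_stepdown(p_values: List[float]) -> List[float]:
    """Textbook Holm stepdown adjusted p-values, returned in input order."""
    m = len(p_values)
    # Indices of the hypotheses sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted p-values never decrease down the ranking.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Unadjusted worst-case maximum p-values copied from the table (rounded inputs).
# Stanford-Binet IQ block, ages 4 to 10.
iq_unadjusted = [0.0035, 0.0053, 0.0289, 0.0858, 0.5501, 0.5635, 0.2529]
iq_reported = [0.0246, 0.0319, 0.1447, 0.3433, 1.0000, 1.0000, 0.7588]

# CAT block at age 14: reading, arithmetic, language, mechanics, spelling.
cat_unadjusted = [0.1444, 0.1185, 0.0686, 0.0464, 0.1137]
cat_reported = [0.3410, 0.3410, 0.2743, 0.2320, 0.3410]

for label, raw, reported in [
    ("Stanford-Binet IQ", iq_unadjusted, iq_reported),
    ("CAT", cat_unadjusted, cat_reported),
]:
    print(label)
    for p, adj, rep in zip(raw, holm_stepdown(raw), reported):
        print(f"  unadjusted={p:.4f}  holm={adj:.4f}  reported={rep:.4f}")
```

In this sketch the smallest p-value in a block of size m is multiplied by m, the next by m − 1, and so on, with a running maximum so that the adjusted values are monotone. This is why the three largest CAT entries share one adjusted value in the table.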
. | . | Heckman et al.’s (2020)p-values . | Worst-casep-values using our method . | ||||||
---|---|---|---|---|---|---|---|---|---|
. | . | U= 0 . | U= 0 . | Max-U . | Max-U . | Worst-case . | Worst-case . | Worst-case . | Worst-case . |
. | . | p-value . | p-value . | p-value . | p-value . | max. p . | max. p . | de Haan p . | de Haan p . |
Variable . | Age . | (unadj.) . | (adjusted) . | (unadj.) . | (adjusted) . | (unadjusted) . | (adjusted) . | (unadjusted) . | (adjusted) . |
Stanford–Binet IQ | 4 | 0.001 | 0.001 | 0.008 | 0.008 | |$\mathbf { 0.0035 }$| | |$\mathbf { 0.0246 }$| | |$\mathbf { 0.0051 }$| | |$\mathbf { 0.0358 }$| |
Stanford–Binet IQ | 5 | 0.022 | 0.691 | 0.077 | 0.800 | |$\mathbf { 0.0053 }$| | |$\mathbf { 0.0319 }$| | 0.1314 | 0.6571 |
Stanford–Binet IQ | 6 | 0.033 | 0.034 | 0.094 | 0.102 | |$\mathbf { 0.0289 }$| | 0.1447 | |$\mathbf { 0.0975 }$| | 0.5848 |
Stanford–Binet IQ | 7 | 0.103 | 0.172 | 0.247 | 0.374 | |$\mathbf { 0.0858 }$| | 0.3433 | 0.2259 | 0.9034 |
Stanford–Binet IQ | 8 | 0.599 | 0.691 | 0.733 | 0.800 | 0.5501 | 1.0000 | 0.7234 | 1.0000 |
Stanford–Binet IQ | 9 | 0.450 | 0.548 | 0.631 | 0.680 | 0.5635 | 1.0000 | 0.9429 | 1.0000 |
Stanford–Binet IQ | 10 | 0.684 | 0.691 | 0.790 | 0.800 | 0.2529 | 0.7588 | 0.3615 | 1.0000 |
CAT reading score | 14 | 0.017 | 0.035 | 0.036 | 0.086 | 0.1444 | 0.3410 | 0.1975 | 0.6537 |
CAT arithmetic score | 14 | 0.032 | 0.035 | 0.086 | 0.086 | 0.1185 | 0.3410 | 0.3746 | 0.6537 |
CAT language score | 14 | 0.001 | 0.004 | 0.012 | 0.027 | |$\mathbf { 0.0686 }$| | 0.2743 | 0.1592 | 0.6537 |
CAT mechanics score | 14 | 0.006 | 0.007 | 0.023 | 0.035 | |$\mathbf { 0.0464 }$| | 0.2320 | 0.1307 | 0.6537 |
CAT spelling score | 14 | 0.003 | 0.035 | 0.012 | 0.086 | 0.1137 | 0.3410 | 0.2247 | 0.6537 |
High school graduate | 19 | 0.614 | 0.674 | 0.704 | 0.716 | 0.6373 | 1.0000 | 0.8990 | 1.0000 |
Vocational training | 40 | 0.341 | 0.567 | 0.547 | 0.608 | 0.3582 | 1.0000 | 0.4612 | 1.0000 |
Highest grade completed | 19 | 0.383 | 0.622 | 0.410 | 0.669 | 0.3526 | 1.0000 | 0.5578 | 1.0000 |
Grade point average | 19 | 0.457 | 0.674 | 0.567 | 0.716 | 0.5132 | 1.0000 | 0.8153 | 1.0000 |
Total nonjuvenile arrests | 40 | 0.036 | 0.038 | 0.100 | 0.115 | |$\mathbf { 0.0713 }$| | 0.2752 | 0.3823 | 1.0000 |
Total crime cost | 40 | 0.037 | 0.049 | 0.042 | 0.143 | 0.1746 | 0.2752 | 0.3494 | 1.0000 |
Total charges | 40 | 0.049 | 0.049 | 0.143 | 0.143 | 0.1136 | 0.2752 | 0.2610 | 1.0000 |
Nonvictimless charges | 40 | 0.025 | 0.037 | 0.063 | 0.091 | |$\mathbf { 0.0688 }$| | 0.2752 | 0.3433 | 1.0000 |
Currently employed | 19 | 0.050 | 0.164 | 0.224 | 0.290 | 0.2763 | 0.7413 | 0.7335 | 1.0000 |
Unemployed last year | 19 | 0.901 | 0.901 | 0.922 | 0.922 | 0.2471 | 0.7413 | 0.5214 | 1.0000 |
Jobless months (past 2 yrs) | 19 | 0.821 | 0.849 | 0.873 | 0.890 | 0.3161 | 0.7413 | 0.5369 | 1.0000 |
Currently employed | 27 | 0.268 | 0.295 | 0.485 | 0.512 | 0.3304 | 0.6608 | 0.6281 | 0.9799 |
Unemployed last year | 27 | 0.235 | 0.295 | 0.360 | 0.512 | 0.3393 | 0.6608 | 0.4793 | 0.9799 |
Jobless months (past 2 yrs) | 27 | 0.020 | 0.020 | 0.036 | 0.051 | |$\mathbf { 0.0866 }$| | 0.2599 | 0.3266 | 0.9799 |
Currently employed | 40 | 0.103 | 0.116 | 0.130 | 0.146 | |$\mathbf { 0.0595 }$| | 0.1784 | 0.1149 | 0.3446 |
Unemployed last year | 40 | 0.154 | 0.154 | 0.216 | 0.216 | 0.1694 | 0.1784 | 1.0000 | 1.0000 |
Jobless months (past 2 yrs) | 40 | 0.064 | 0.116 | 0.070 | 0.146 | |$\mathbf { 0.0779 }$| | 0.1784 | 0.1220 | 0.3446 |
Note: This table compares the inferences reported by Heckman et al. (2020) with those obtained using our worst-case tests. The first two columns list the blocks of outcomes analysed by Heckman et al. (2020). The next four columns reproduce their zero-U (U = 0) p-values and max-U p-values before and after adjusting for multiplicity of hypotheses. Since all of their tests are based on the studentized DIM estimate, we report our inferences (using the studentized DIM test statistic) side by side for comparison. The last four columns report our worst-case maximum p-values and worst-case de Haan p-values before and after adjusting for multiplicity of hypotheses. The unadjusted p-values are single-hypothesis p-values; the adjusted p-values are stepdown p-values that account for multiple testing.
Variable | Age | Heckman et al. (2020) U = 0 p-value (unadj.) | Heckman et al. (2020) U = 0 p-value (adjusted) | Heckman et al. (2020) Max-U p-value (unadj.) | Heckman et al. (2020) Max-U p-value (adjusted) | Our worst-case max. p (unadjusted) | Our worst-case max. p (adjusted) | Our worst-case de Haan p (unadjusted) | Our worst-case de Haan p (adjusted)
---|---|---|---|---|---|---|---|---|---
Stanford–Binet IQ | 4 | 0.008 | 0.008 | 0.020 | 0.020 | $\mathbf{0.0025}$ | $\mathbf{0.0174}$ | $\mathbf{0.0026}$ | $\mathbf{0.0181}$ |
Stanford–Binet IQ | 5 | 0.012 | 0.203 | 0.014 | 0.354 | $\mathbf{0.0208}$ | 0.1246 | $\mathbf{0.0635}$ | 0.3810 |
Stanford–Binet IQ | 6 | 0.094 | 0.164 | 0.160 | 0.346 | 0.1285 | 0.5141 | 0.3200 | 1.0000 |
Stanford–Binet IQ | 7 | 0.133 | 0.137 | 0.191 | 0.222 | $\mathbf{0.0796}$ | 0.3982 | 0.3847 | 1.0000 |
Stanford–Binet IQ | 8 | 0.152 | 0.164 | 0.339 | 0.346 | 0.1514 | 0.5141 | 0.6181 | 1.0000 |
Stanford–Binet IQ | 9 | 0.203 | 0.203 | 0.354 | 0.354 | 0.2102 | 0.5141 | 0.3197 | 1.0000 |
Stanford–Binet IQ | 10 | 0.203 | 0.203 | 0.267 | 0.354 | 0.1301 | 0.5141 | 0.7997 | 1.0000 |
CAT reading score | 14 | 0.078 | 0.082 | 0.136 | 0.167 | $\mathbf{0.0358}$ | $\mathbf{0.0715}$ | 0.1081 | 0.3244 |
CAT arithmetic score | 14 | 0.035 | 0.082 | 0.074 | 0.167 | 0.1046 | 0.1046 | 0.1704 | 0.3407 |
CAT language score | 14 | 0.008 | 0.070 | 0.020 | 0.144 | $\mathbf{0.0113}$ | $\mathbf{0.0566}$ | $\mathbf{0.0328}$ | 0.1640 |
CAT mechanics score | 14 | 0.047 | 0.082 | 0.097 | 0.167 | $\mathbf{0.0137}$ | $\mathbf{0.0566}$ | 0.1974 | 0.3407 |
CAT spelling score | 14 | 0.043 | 0.082 | 0.082 | 0.167 | $\mathbf{0.0115}$ | $\mathbf{0.0566}$ | $\mathbf{0.0434}$ | 0.1736 |
High school graduate | 19 | 0.008 | 0.008 | 0.020 | 0.020 | $\mathbf{0.0037}$ | $\mathbf{0.0148}$ | $\mathbf{0.0236}$ | $\mathbf{0.0709}$ |
Vocational training | 40 | 0.078 | 0.078 | 0.144 | 0.144 | 0.1085 | 0.1085 | 0.1872 | 0.1872 |
Highest grade completed | 19 | 0.070 | 0.070 | 0.113 | 0.113 | $\mathbf{0.0297}$ | $\mathbf{0.0593}$ | $\mathbf{0.0585}$ | 0.1169 |
Grade point average | 19 | 0.039 | 0.039 | 0.082 | 0.082 | $\mathbf{0.0086}$ | $\mathbf{0.0259}$ | $\mathbf{0.0151}$ | $\mathbf{0.0603}$ |
Total nonjuvenile arrests | 40 | 0.020 | 0.133 | 0.121 | 0.158 | 0.1245 | 0.2403 | 0.1625 | 0.4874 |
Total crime cost | 40 | 0.024 | 0.133 | 0.082 | 0.158 | $\mathbf{0.0601}$ | 0.2403 | $\mathbf{0.0983}$ | 0.3932 |
Total charges | 40 | 0.020 | 0.067 | 0.043 | 0.090 | 0.1005 | 0.2403 | 0.1661 | 0.4874 |
Nonvictimless charges | 40 | 0.125 | 0.133 | 0.158 | 0.158 | $\mathbf{0.0677}$ | 0.2403 | 0.2141 | 0.4874 |
Currently employed | 19 | 0.008 | 0.031 | 0.035 | 0.090 | $\mathbf{0.0562}$ | 0.1323 | $\mathbf{0.0899}$ | 0.2697 |
Unemployed last year | 19 | 0.024 | 0.031 | 0.074 | 0.090 | $\mathbf{0.0441}$ | 0.1323 | 0.1037 | 0.2697 |
Jobless months (past 2 yrs) | 19 | 0.125 | 0.125 | 0.206 | 0.206 | $\mathbf{0.0858}$ | 0.1323 | 0.2354 | 0.2697 |
Currently employed | 27 | 0.110 | 0.149 | 0.175 | 0.198 | $\mathbf{0.0760}$ | 0.2281 | 0.1810 | 0.3969 |
Unemployed last year | 27 | 0.078 | 0.149 | 0.128 | 0.175 | $\mathbf{0.0962}$ | 0.2281 | 0.1323 | 0.3969 |
Jobless months (past 2 yrs) | 27 | 0.110 | 0.149 | 0.166 | 0.198 | 0.1970 | 0.2281 | 0.2889 | 0.3969 |
Currently employed | 40 | 0.442 | 0.442 | 0.567 | 0.567 | 0.4818 | 0.8816 | 1.0000 | 1.0000 |
Unemployed last year | 40 | 0.047 | 0.070 | 0.113 | 0.160 | 0.1000 | 0.3001 | 0.2254 | 0.6761 |
Jobless months (past 2 yrs) | 40 | 0.352 | 0.367 | 0.540 | 0.540 | 0.4408 | 0.8816 | 0.5519 | 1.0000 |
Note: This table compares the inferences reported by Heckman et al. (2020) with those obtained using our worst-case tests. The first two columns list the blocks of outcomes analysed by Heckman et al. (2020). The next four columns reproduce their zero-U (U = 0) p-values and max-U p-values before and after adjusting for multiplicity of hypotheses. Since all of their tests are based on the studentized DIM estimate, we report our inferences (using the studentized DIM test statistic) side by side for comparison. The last four columns report our worst-case maximum p-values and worst-case de Haan p-values before and after adjusting for multiplicity of hypotheses. The unadjusted p-values are single-hypothesis p-values; the adjusted p-values are stepdown p-values that account for multiple testing.
Table 4 reports stepdown p-values for male outcomes. No estimated effect (after age 5) remains statistically significant at the 10% level after adjusting for multiple hypothesis testing using the worst-case maximum or worst-case de Haan p-values. However, in Table 5, which presents the stepdown analysis of female outcomes, the treatment effects on some post-programme outcomes (certain CAT scores and educational outcomes) are statistically significant at the 10% level using our worst-case maximum p-value. Nevertheless, all of these effects on female outcomes, except for two (high school graduation and grade point average), disappear when worst-case de Haan p-values are used.
Tables 2 to 5 use the studentized AIPW test statistic for inference. Heckman et al. (2020) use the studentized DIM test statistic instead. Tables 6 and 7 compare their inferences with ours using the same test statistic. The effects for males on post-programme outcomes remain statistically insignificant at the 10% level using stepdown worst-case de Haan p-values, whereas treatment effects on CAT scores are statistically significant in Heckman et al.’s (2020) analysis.
Heckman et al. (2020) do not analyse the Perry treatment effects on convictions for violent crime, which are substantial and play an important role in cost–benefit analyses of early childhood programmes (see Heckman et al., 2010b). Using administrative data on the criminal activity of participants, we illustrate their importance and, at the same time, the importance of long-term follow-up. Tables 8, 9, and 10 provide estimates and measures of statistical significance of treatment effects in the pooled sample (of all participants) on cumulative convictions for violent misdemeanors and felonies at various ages. Online Appendix S2 presents expanded versions of these tables, reporting inference for various estimators and test statistics for the pooled sample as well as the male and female subsamples. As shown in Table 9, the AIPW estimates of the treatment effect on cumulative violent misdemeanor convictions are below −0.5 at ages 30 and 40. These estimates are statistically significant at the 5% level before adjusting for multiple hypothesis testing and at the 10% level after adjusting.
Type | Age | Untreated mean | Treated mean | DIM estimate | Asymptotic p-value | Bootstrap p-value | Permutation p-value | Worst-case max. p | Worst-case de Haan p
---|---|---|---|---|---|---|---|---|---
Misdemeanor | 30 | 0.5231 | 0.0517 | −0.4714 | $\mathbf{0.0109}$ | $\mathbf{0.0021}$ | $\mathbf{0.0036}$ | $\mathbf{0.0135}$ | 0.1002 |
Misdemeanor | 40 | 0.6825 | 0.0877 | −0.5948 | $\mathbf{0.0033}$ | $\mathbf{0.0005}$ | $\mathbf{0.0004}$ | $\mathbf{0.0054}$ | $\mathbf{0.0092}$ |
Felony | 30 | 0.2846 | 0.1897 | −0.0950 | 0.2301 | 0.2263 | 0.2624 | 0.3867 | 0.6691 |
Felony | 40 | 0.4762 | 0.1930 | −0.2832 | $\mathbf{0.0333}$ | $\mathbf{0.0332}$ | $\mathbf{0.0384}$ | $\mathbf{0.0792}$ | 0.1362 |
Note: This table reports p-values for single hypothesis tests of treatment effects on cumulative misdemeanor and felony convictions for violent crime at ages 30 and 40, using the pooled sample of participants. The inferences are based on the studentized DIM (difference-in-means) test statistic.
Type | Age | Untreated mean | Treated mean | AIPW estimate | Asymptotic p-value | Bootstrap p-value | Permutation p-value | Worst-case max. p | Worst-case de Haan p
---|---|---|---|---|---|---|---|---|---
Misdemeanor | 30 | 0.5231 | 0.0517 | −0.5300 | $\mathbf{0.0064}$ | $\mathbf{0.0020}$ | $\mathbf{0.0024}$ | $\mathbf{0.0102}$ | $\mathbf{0.0267}$ |
Misdemeanor | 40 | 0.6825 | 0.0877 | −0.6491 | $\mathbf{0.0021}$ | $\mathbf{0.0010}$ | $\mathbf{0.0008}$ | $\mathbf{0.0051}$ | $\mathbf{0.0052}$ |
Felony | 30 | 0.2846 | 0.1897 | −0.0561 | 0.3174 | 0.3217 | 0.3488 | 0.4809 | 0.7310 |
Felony | 40 | 0.4762 | 0.1930 | −0.2052 | $\mathbf{0.0664}$ | $\mathbf{0.0778}$ | $\mathbf{0.0708}$ | 0.1376 | 0.2412 |
Note: This table reports p-values for single hypothesis tests of treatment effects on cumulative misdemeanor and felony convictions for violent crime at ages 30 and 40, using the pooled sample of participants. The inferences are based on the studentized AIPW test statistic.
Type | Age | Untreated mean | Treated mean | AIPW estimate | Asymptotic p-value | Bootstrap p-value | Permutation p-value | Worst-case max. p | Worst-case de Haan p
---|---|---|---|---|---|---|---|---|---
Misdemeanor | 30 | 0.5231 | 0.0517 | −0.5300 | $\mathbf{0.0192}$ | $\mathbf{0.0059}$ | $\mathbf{0.0072}$ | $\mathbf{0.0306}$ | $\mathbf{0.0800}$ |
Misdemeanor | 40 | 0.6825 | 0.0877 | −0.6491 | $\mathbf{0.0085}$ | $\mathbf{0.0039}$ | $\mathbf{0.0032}$ | $\mathbf{0.0204}$ | $\mathbf{0.0208}$ |
Felony | 30 | 0.2846 | 0.1897 | −0.0561 | 0.3174 | 0.3217 | 0.3488 | 0.4809 | 0.7310 |
Felony | 40 | 0.4762 | 0.1930 | −0.2052 | 0.1327 | 0.1556 | 0.1416 | 0.2752 | 0.4824 |
Note: This table reports Holm stepdown p-values for multiple hypothesis tests of treatment effects on cumulative misdemeanor and felony convictions for violent crime at ages 30 and 40, using the pooled sample of participants. The inferences are based on the studentized AIPW test statistic. All the above four variables, which represent cumulative crime outcomes at different ages, are treated as a block for multiple testing.
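The Holm stepdown adjustment named in this note can be sketched in a few lines. The following is a minimal illustration, not the code used to produce the tables: the function name `holm_stepdown` and the variable names are hypothetical, and the inputs are the unadjusted worst-case maximum p-values for the four crime outcomes, copied from the AIPW table of single-hypothesis tests.

```python
def holm_stepdown(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the hypotheses, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never falls below
        # the adjusted p-value of a hypothesis with a smaller raw p-value.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Unadjusted worst-case maximum p-values (AIPW, pooled sample), treated as one block.
labels = ["Misdemeanor, age 30", "Misdemeanor, age 40", "Felony, age 30", "Felony, age 40"]
worst_case_max_p = [0.0102, 0.0051, 0.4809, 0.1376]

adjusted = holm_stepdown(worst_case_max_p)
for label, p in zip(labels, adjusted):
    print(f"{label}: {p:.4f}")
```

With these inputs the sketch gives 0.0306, 0.0204, 0.4809, and 0.2752, which agree with the worst-case max. p column of the stepdown table above. Applying it to the other columns reproduces the reported values up to rounding of the four-decimal inputs.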
The choice of inferential method becomes more important in analysing treatment effects on cumulative convictions for felonies. At age 30, there are no statistically significant treatment effects. At age 40, as shown in Table 9, the magnitude of the treatment effect is larger, at about −0.21, which represents a reduction of more than four-tenths of the control mean. However, relying on simple difference-in-means estimates and conventional p-values can be misleading. With conventional p-values, the effect at age 40 appears to be statistically significant at the 10% level, as shown in Table 8. The design-based worst-case p-values, especially those associated with the AIPW estimate, are much higher: the worst-case de Haan p-values for the studentized DIM and AIPW estimates are about 0.136 and 0.241, respectively.
The four variables at ages 30 and 40 considered in Tables 8 and 9 are conceptually related, since they are cumulative crime outcomes measured at different ages. To account for this, we treat these outcomes as a single block of variables and conduct multiple hypothesis testing using the more conservative Holm stepdown procedure, with results reported in Table 10. After multiple testing, the effects on cumulative convictions for violent misdemeanors remain statistically significant at the 10% level at both ages 30 and 40, whereas the effects on violent felonies are insignificant at both ages. These analyses show that both the use of small-sample inference and the method used to account for compromised randomization matter in analysing the data. Failure to account for either can give a very positive spin to the Perry programme. Accounting for them qualifies such conclusions. We have not, however, established the superiority of our approach. We have established that a very cautious design-based approach produces conservative inference, which by itself is not surprising. Our reanalysis of Heckman et al. (2020) is very conservative. Nonetheless, a few conclusions survive. We test Fisher’s sharp null hypothesis |$\mathcal {H}_\mathcal {F}$| of no treatment effect for each participant. It may in fact be the case that there are treatment effects for many participants and yet we do not reject the sharp null hypothesis because of our worst-case approach.
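To make the multiplicity adjustment concrete, the following minimal Python sketch applies the textbook Holm stepdown adjustment to a block of single-hypothesis p-values. The function name `holm_adjust` and the input p-values are hypothetical and are not taken from Tables 8 to 10. The sketch assumes that the adjustment is applied directly to a given vector of p-values, which may differ in detail from the stepdown versions of the worst-case p-values reported in the paper.

```python
def holm_adjust(p_values):
    """Return Holm stepdown adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the hypotheses, sorted from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, k in enumerate(order):
        # The (rank + 1)-th smallest p-value is multiplied by (m - rank), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[k])
        # Enforce monotonicity: adjusted p-values never decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[k] = running_max
    return adjusted


# Hypothetical single-hypothesis p-values for a block of four outcomes.
raw_p_values = [0.010, 0.004, 0.200, 0.060]
print(holm_adjust(raw_p_values))
```

An outcome is then declared significant at the 10% level when its adjusted p-value does not exceed 0.10, which is how the entries of a table such as Table 10 would be read.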
6. CONCLUSION
In this paper, we develop and apply a design-based finite-sample inferential method for analysing social experiments with compromised randomization. Compromises come in many forms. They include incompletely documented rerandomization procedures used to improve baseline covariate balance between treatment and control groups. They also include reassignment of treatment status due to administrative constraints.
We build a behavioural model of satisficing experimenters who seek balance in baseline covariates across treatments and controls and who provide readers of their reports qualitative, and sometimes conflicting, summaries of the actual experimental protocols used. We model the randomization protocol as only partially known to the user of experimental data. The empirical researcher recognizes and tries to account for the guiding principles experimenters used in the reassignment of treatment status for balancing baseline covariates while operating under administrative constraints. We show how to partially identify model parameters and construct worst-case (least favourable) randomization tests over a set of possibilities for the actual treatment assignment mechanism.
Our analysis of the Perry programme serves as a proof-of-concept of the usefulness of our worst-case finite-sample testing approaches, which are applicable to other compromised experiments, such as those discussed by Bruhn and McKenzie (2009). Our approach is more portable than that of Heckman et al. (2020), which utilizes very specific features of the Perry randomization protocol. Application of our procedures results in conservative finite-sample inferences.
ACKNOWLEDGEMENTS
This paper was delivered as the 2019 Sargan Lecture at the Royal Economic Society Annual Conference at the University of Warwick, England. It has been subject to the usual refereeing standards of this journal. We thank the editor and anonymous referees for useful comments. We also thank Juan Pantano and Azeem Shaikh for comments on early drafts of this paper. We are grateful to the HighScope Educational Research Foundation for access to study data and source materials. This research was supported in part by: the Buffett Early Childhood Fund; NIH Grants R01AG042390, R01AG05334301, and R37HD065072. The views expressed in this paper are solely those of the authors and do not necessarily represent those of the funders or the official views of the National Institutes of Health.
Footnotes
See Schweinhart et al. (1985; 1993; 2005), Heckman et al. (2010a), and Appendix A for more background.
See Obama (2013).
These percentages are calculated by weighting each survey respondent by the number of experiments in which the respondent had participated.
See, e.g., Morgan and Rubin (2012; 2015) and Li et al. (2018). Morgan and Rubin (2012) state that they “only advocate rerandomization if the decision to rerandomize or not is based on a pre-specified criterion.” Their inferential methods require knowledge of such pre-specified criteria. Although rerandomization methods have the property that they reduce variance of the null distribution asymptotically in certain settings (Morgan and Rubin, 2012; 2015; Li et al., 2018), this property is not guaranteed in the finite-sample setting we consider.
According to Schweinhart et al. (2005), “4 children did not complete the preschool programme because they moved away and 1 child died [in a fire accident] shortly after the study began.” We are missing the following data (on some of these children) that are necessary for inference procedures. We do not know the mother’s working status at baseline of a subject in wave 0 (who has a sibling in wave 1) among the five children who dropped out of the original sample of 128 for extraneous reasons. We also do not know the gender of a subject in wave 1. (We use the Perry convention that wave 0 is the first wave and wave 4 is the last one.) The baseline information on these subjects is important in our formal model of the randomization protocol. We do not make assumptions regarding the mother’s working status at baseline of the subject in wave 0 and the gender of the other subject in wave 1. We run our testing procedures for each of the possible values of the variables. While we use the data on the five dropped children in our simulations of the randomization protocol for our worst-case tests, we treat the five participants as ignorable in our estimation of the treatment effects. Thus, our effective sample for estimation and inference is the core sample of 123 children.
Those in the treatment group of the first entry cohort (wave 0) were provided with the intervention for only one year, starting at age 4, and thus constitute an exception. Our estimates of treatment effects pool all five cohorts, even though the lower programme intensity in the first cohort might in principle attenuate the magnitudes of the effects downward.
See Appendix B. According to Schweinhart et al. (1993), “[The staff] exchanged several similarly ranked pair members so the two groups would be matched on [the baseline variables].” Even though the phrase “similarly ranked pair members” might suggest consecutively ranked members, this is not necessarily the case. In Appendix B, we use Perry data from wave 4 to demonstrate that the exchanges were not necessarily between consecutively ranked pairs.
This is also manifested in the observed data. For example, as explained later in Section 3.2, the number of singletons in wave 2 is 22, with 12 in the control group and 10 in the treatment group. If there were exchanges between the initial experimental groups instead of one-way transfers to the control group, there would have been 11 singletons in both the control and treatment groups instead of 12 and 10, respectively.
See Simon (1955), an early paper in behavioural economics that analyses satisficing behaviour.
Each of the cohorts corresponds to one of the five waves (labelled 0 to 4) of study participants recruited in the autumn seasons of 1962 to 1965. Waves 0 and 1 were randomized in the autumn of 1962, while waves 2, 3, and 4 were randomized in the autumn of 1963, 1964, and 1965, respectively. We follow the labelling convention of the Perry analysts, who designate the first cohort as “0.”
Note that the other participants in cohort c who are not singletons have older siblings already enrolled in the Perry experiment in a previous wave. The nonsingletons are not randomized but, rather, assigned to the same treatment status as their elder siblings already enrolled in the study.
Note that ⌈ · ⌉ ≡ ceil( · ) is the ceiling function and ⌊ · ⌋ ≡ floor( · ) is the floor function. They assign the least upper integer bound and greatest lower integer bound to the argument in the function, respectively.
An exchange means a swap between two participants belonging to different undesignated groups. Since the Perry experiment did not use a matched pair design, an exchange or swap is not restricted to occur between participants with consecutive IQ ranks. Exchanges between participants with nonconsecutive IQ ranks can occur. See Appendix B.
The Hotelling’s multivariate two-sample t-squared statistic |$\tau ^2_c$| maps a partition |$(\mathcal {A}, \mathcal {B})$| of |$\mathcal {S}_c$| (such that |$|\mathcal {A}| = \lceil |\mathcal {S}_c|/2 \rceil$| and |$\mathcal {B} = \mathcal {S}_c \setminus \mathcal {A}$|) to |$\mathbb {R}_{\ge 0}$| and is given by |$\tau ^2_c(\mathcal {A}, \mathcal {B}) = \left(\bar{Z}_{\mathcal {A}} - \bar{Z}_{\mathcal {B}} \right)^{\prime }(|\mathcal {A}|^{-1} \hat{\Sigma }_{\mathcal {A}} + |\mathcal {B}|^{-1} \hat{\Sigma }_{\mathcal {B}})^{-1}\left(\bar{Z}_{\mathcal {A}} - \bar{Z}_{\mathcal {B}} \right),$| where |$\bar{Z}_{\mathcal {A}} = |\mathcal {A}|^{-1}\sum _{i \in \mathcal {A}}Z_i$|, with Zi as the pre-programme covariate vector containing the i-th participant’s IQ, SES index, gender, and mother’s working status, |$\, \bar{Z}_{\mathcal {B}} = |\mathcal {B}|^{-1}\sum _{i \in \mathcal {B}}Z_i$|, and |$\hat{\Sigma }_{\mathcal {A}} = (|\mathcal {A}| - 1)^{-1}\sum _{i \in \mathcal {A}} (Z_i - \bar{Z}_{\mathcal {A}})(Z_i - \bar{Z}_{\mathcal {A}})^{\prime }$|, while |$\hat{\Sigma }_{\mathcal {B}} = (|\mathcal {B}| - 1)^{-1}\sum _{i \in \mathcal {B}} (Z_i - \bar{Z}_{\mathcal {B}})(Z_i - \bar{Z}_{\mathcal {B}})^{\prime }$|. We use this metric for dimensionality reduction and computational feasibility. Chung and Romano (2016) show, without assuming normality, that the permutation distribution of |$\tau ^2_c$| is asymptotically chi-squared. If adequate computational power were available, we could also incorporate into our model the raw mean differences in the four variables, their studentized versions, or other measures of mean differences between two groups. Of course, it is possible that the Perry staff were just looking at mean differences and did not use any formal metric.
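As an illustration of the balance metric defined in this footnote, the following Python sketch computes the two-sample Hotelling statistic with unpooled covariance matrices, using only the standard library. The function names and the example covariate values are hypothetical; the sketch is not the authors' implementation. It assumes that the combined covariance matrix is invertible, which can fail in very small groups, for example when a binary covariate is constant within a group.

```python
def mean_vector(rows):
    """Column means of a list of equal-length covariate tuples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows):
    """Sample covariance matrix with the (n - 1) divisor."""
    n, mu = len(rows), mean_vector(rows)
    k = len(mu)
    return [[sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in rows) / (n - 1)
             for b in range(k)] for a in range(k)]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    k = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("combined covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (aug[i][k] - tail) / aug[i][i]
    return x


def hotelling_tau2(group_a, group_b):
    """(mean_A - mean_B)' (Sigma_A/|A| + Sigma_B/|B|)^{-1} (mean_A - mean_B)."""
    diff = [a - b for a, b in zip(mean_vector(group_a), mean_vector(group_b))]
    cov_a, cov_b = covariance(group_a), covariance(group_b)
    k = len(diff)
    combined = [[cov_a[i][j] / len(group_a) + cov_b[i][j] / len(group_b)
                 for j in range(k)] for i in range(k)]
    weights = solve(combined, diff)
    return sum(d * w for d, w in zip(diff, weights))


# Hypothetical (IQ, SES index) pairs for two small groups.
group_a = [(80, 8.0), (76, 9.5), (83, 7.0), (71, 10.0)]
group_b = [(79, 8.5), (75, 9.0), (85, 7.5), (78, 11.0)]
print(hotelling_tau2(group_a, group_b))
```

In the paper's setting each row would hold the four pre-programme covariates (IQ, SES index, gender, and mother's working status) rather than the two used here.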
For cohort 0, the proportion of possible group formations with a lower Hotelling statistic is at least 29.24%. The corresponding numbers for cohorts 1, 2, 3, and 4 are 64.51%, 14.79%, 9.76%, and 75.56%, respectively. Similarly, the raw mean differences in baseline covariates for the initial groups also do not correspond to their minimum possible values.
The satisficing threshold δc is the maximum level of covariate imbalance that satisficed Perry staff. The threshold δc is unknown to the analyst but can be partially identified, as explained later. We assume a uniform probability over |$\mathbb {U}_c$| for the choice of the partition |$(\mathcal {A}^*_c, \mathcal {B}^*_c)$| for the purpose of keeping the model simple and computationally feasible. In general, we might suspect the following: given two partitions of |$\mathcal {S}_c$| with the same level of Hotelling’s statistic, there might have been a higher probability mass on the partition closer to the initial grouping based on odd and even IQ ranks. In addition, the staff might have also preferred not to make additional exchanges if they expected relatively insignificant reductions in covariate imbalance. In other words, the probability that the Perry staff chose a particular partition |$(\mathcal {A}^*_c, \mathcal {B}^*_c)$| could have depended on their preferences over substitution between two things: similarity of |$(\mathcal {A}^*_c, \mathcal {B}^*_c)$| to the initial IQ rank-based grouping; and the level of covariate imbalance (as measured by Hotelling’s statistic) resulting from the partition |$(\mathcal {A}^*_c, \mathcal {B}^*_c)$|. However, there is no unique way to formalize this notion. Such a general model may not even be computationally feasible.
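The uniform choice over the satisficing set described in this footnote can be sketched as follows. The sketch enumerates every partition whose first group has the ceiling of half the units, keeps those whose imbalance does not exceed the threshold, and draws one of them with equal probability. The function names, the IQ values, the threshold of 2.0, and the use of an absolute mean gap as a stand-in for Hotelling's statistic are all assumptions made for illustration.

```python
import itertools
import math
import random


def satisficing_set(units, imbalance, delta):
    """All partitions (A, B) with |A| = ceil(n / 2) and imbalance(A, B) <= delta."""
    size_a = math.ceil(len(units) / 2)
    admissible = []
    for chosen in itertools.combinations(range(len(units)), size_a):
        chosen_set = set(chosen)
        group_a = [units[i] for i in chosen]
        group_b = [units[i] for i in range(len(units)) if i not in chosen_set]
        if imbalance(group_a, group_b) <= delta:
            admissible.append((group_a, group_b))
    return admissible


def abs_mean_gap(group_a, group_b):
    """Stand-in imbalance measure: absolute difference in group means."""
    return abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))


# Hypothetical baseline IQ scores for a cohort of seven singletons.
iqs = [61, 71, 75, 76, 78, 79, 80]

all_partitions = satisficing_set(iqs, abs_mean_gap, math.inf)  # no balance requirement
acceptable = satisficing_set(iqs, abs_mean_gap, 2.0)           # satisficing threshold

# Uniform draw over the satisficing set, with a fixed seed for reproducibility.
rng = random.Random(0)
chosen_partition = rng.choice(acceptable)
print(len(all_partitions), len(acceptable), chosen_partition)
```

Full enumeration is feasible only for small cohorts; for larger ones, Monte Carlo draws combined with rejection of partitions above the threshold would play the same role.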
The Perry teachers conducted special home visits for working mothers at times other than weekday afternoons, when they visited the homes of nonworking mothers. Because of logistical and financial constraints, the teachers were able to visit the homes of only a limited number of working mothers at times other than weekday afternoons. Thus, the children of working mothers in the preliminary treatment group for whom these special arrangements could not be made were transferred to the control group.
Thus, |$\eta_c$| can be thought of as the number of slots available for special visits to the homes of working mothers. Equivalently, it is the number of children of working mothers who would remain in the final treatment group if all of them were placed in the preliminary treatment group.
Since cohorts 0 and 1 had a common set of teachers, they share the number of slots available for the special home visits. Thus, we pool these two cohorts while defining |$m_{0,1}$| and |$\eta_{0,1}$|. However, cohorts 2 to 4 have separate parameters for the slots available for special home visits.
It is possible that the Perry staff engaged in another round of satisficing at this step. In principle, this could be incorporated into our model but would increase its dimensionality. Since the published accounts do not mention another round of balancing, we do not add this feature to our model to keep it computationally feasible.
We are implicitly assuming that all working mothers would be able to send their children to preschool and participate in weekly home visits if special arrangements could be made for them. A model allowing for heterogeneity in availability of working mothers (for special arrangements) does not appear to be computationally feasible.
In other words, |$V_{i,c} = 0$| for the participants who were either initially placed in the control group or placed in the initial treatment group but have nonworking mothers.
Note that |$\omega_{m,d} \equiv \omega_{m,d,c}$| for all |$(m, d) \in \lbrace 0, 1\rbrace^2$|, but we suppress the subscript c for simplicity.
Specifically, |$\eta _{0,1} \in \lbrace \eta \in \lbrace 0,\dots ,\sum _{i \in \mathcal {S}_0 \cup \mathcal {S}_1} M_i\rbrace : \min (\eta ,\chi _0 + \chi _1 + \omega _{1,1}^{0,1})=\omega _{1,1}^{0,1}, \chi _0 \in \mathcal {C}_0, \chi _1 \in \mathcal {C}_1\rbrace$|, where |$\omega _{1,1}^{0,1} = \sum _{i \in \mathcal {S}_0 \cup \mathcal {S}_1} M_i\, D_i$| and |$\mathcal {C}_c = \lbrace \lceil |\mathcal {S}_c|/2 \rceil - \omega _{*,1,c},\max \lbrace 0,\lfloor |\mathcal {S}_c|/2 \rfloor - \omega _{*,1,c}\rbrace \rbrace$| for c ∈ {0, 1}. In our application, |$\eta _{0,1} \in \lbrace 3\rbrace$|. Since we do not make assumptions on the missing mother’s working status at baseline for a subject in wave 0 and the missing gender of another subject in wave 1 (among the five who dropped out of the initial sample of 128 for extraneous reasons), our partial identification of |$\delta _0$| and |$\delta _1$| depends on the values in the partially identified set for the missing variables. Making no assumptions on the two missing binary variables is a strength of our analysis, although it quadruples the computational cost. We also use known information that there was at least one transfer in wave 0 (Weikart et al., 1964) to narrow the partially identified set for that cohort.
In a set of 53 studies of randomized controlled trials published in some leading economics journals, Young (2019) also finds that experimental results obtained using asymptotic theory are misleading, relative to results based on randomization tests.
However, unless the permutation method reflects the method used for random assignment of the treatment, permutation tests do not in general allow us to test hypotheses about counterfactual outcomes of the individual Perry participants.
In practice, their approach relies on large-sample methods in using regression analysis to condition on covariates.
This is attributed to Neyman (1923).
While this formulation states that each individual treatment effect |$\tau_i$| is zero, the analyst may fix each |$\tau_i$| at a desired value for hypothesis testing. Such a hypothesis is often called sharp because it specifies one set of counterfactual outcomes for the participants.
Note that we observe either |$Y^1_i$| or |$Y^0_i$| for each participant |$i \in \mathcal {P}$|. Thus, under the null model (4.3), the other counterfactual outcome can be imputed according to the fact that |$Y^1_i = Y^0_i$|. In general, if |$\tau_i$| is hypothesized to be equal to a number |$\tau ^\circ _i$|, the counterfactual outcomes |$(Y^1_i, Y^0_i)$| under the null model are equal to |$(Y_i + \tau ^\circ _i, Y_i)$| if |$D_i = 0$| and to |$(Y_i, Y_i - \tau ^\circ _i)$| if |$D_i = 1$|, for all |$i \in \mathcal {P}$|.
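The imputation rule in this footnote can be written as a short function. The following Python sketch returns the pair of counterfactual outcomes implied by a sharp null for one participant; the function name and the numerical values are hypothetical.

```python
def impute_counterfactuals(y_observed, d, tau_null=0.0):
    """Return (Y1, Y0) implied by the sharp null that the unit's effect equals tau_null."""
    if d == 1:
        # Treated unit: Y1 is observed, so Y0 = Y - tau_null.
        return (y_observed, y_observed - tau_null)
    # Control unit: Y0 is observed, so Y1 = Y + tau_null.
    return (y_observed + tau_null, y_observed)


# Fisher's sharp null of no effect (tau_null = 0): both outcomes equal the observed one.
print(impute_counterfactuals(3.0, d=1))
# A sharp null with a hypothesized effect of -1 for a control participant.
print(impute_counterfactuals(3.0, d=0, tau_null=-1.0))
```

Because both counterfactual outcomes are known for every participant under a sharp null, the test statistic can be recomputed for any alternative treatment assignment vector.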
These tests are not strictly exact because our model simplifies the actual randomization procedure and can at best be considered a useful approximation of the true model of the protocol.
Since our randomization tests follow the standard Fisherian framework, they are conditional tests that exploit random variation in the treatment status but fix the other observed data. See Lehmann (1993).
Note that Ξ is a sharp identified set because we follow the Fisherian framework where the observed outcomes and baseline covariates in our sample are treated as fixed.
Specifically, |$\Omega _{Q,V_{\gamma ^*}} = \lbrace 0,1\rbrace ^5 \times \left(\times _{c \in \lbrace (0,1),2,3,4\rbrace } \times _{m = 1}^{M_c} \lbrace v \in \lbrace 0,1\rbrace ^m:||v||_1 = \min (\eta ^*_c,m)\rbrace \right)$|, where |$M_{0,1} = \sum _{i \in \mathcal {S}_0 \bigcup \mathcal {S}_1} M_i$| and |$M_c = \sum _{i \in \mathcal {S}_c}M_i$| for all c ∈ {2, 3, 4}.
We use 500,000 Monte Carlo draws from |$\mathbb {U}(\infty ,\dots ,\infty ) = \times _{c = 0}^4 \mathbb {U}_c(\infty )$|, a very large set, to approximate |$x(\gamma ^*)$|.
We use 400 Monte Carlo draws from |$\Lambda ^\mathcal {X}_{\gamma ^*}$| to approximate |$\mathbb {P}_{\Lambda ^\mathcal {X}_{\gamma ^*}}\lbrace T(\tilde{D}^\mathcal {X}_{\gamma ^*}) \ge T(D)\rbrace$|. This is effectively importance sampling. In addition, we use 2,600 Monte Carlo draws from |$\Lambda ^\mathcal {Y}_{\gamma ^\infty }$|, where |$\gamma ^\infty = (\infty ,\dots ,\infty ,\eta ^*_{0,1},\eta ^*_2,\eta ^*_3,\eta ^*_4)$|, and use rejection sampling to draw random samples from |$\Lambda ^\mathcal {Y}_{\gamma ^*}$| for approximating |$\mathbb {P}_{\Lambda ^\mathcal {Y}_{\gamma ^*}}\lbrace T(\tilde{D}^\mathcal {Y}_{\gamma ^*}) \ge T(D)\rbrace$|. It takes much longer to compute these tail probabilities than to compute |$x(\gamma ^*)$|. Limited computational power restricted the number of Monte Carlo draws.
Since the randomly sampled treatment status vectors are i.i.d. and uniformly distributed on corresponding sample spaces, for a given |$\gamma ^*$| the associated p-value stochastic approximations can be used to construct valid tests. For details, see section 4 of Romano (1989), section 3.2 of Romano and Wolf (2005), or section 15.2.1 of Lehmann and Romano (2005). Although this holds when |$\gamma ^*$| is taken as given, our main object of interest is the worst-case p-value in equation (4.4). Since it is infeasible to compute a p-value for each |$\gamma ^* \in \Xi$|, we also resort to stochastic approximations of the supremum in equation (4.4). In Section 4.3.2, we discuss how we account for uncertainty in the stochastic approximation of the worst-case p-value.
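For a given parameter value, the stochastic approximation of the tail probability can be sketched as below. The sketch uses a common construction that counts the observed assignment as one additional draw; whether the paper uses exactly this form is an assumption. The data, the function names, and the use of complete randomization as the assignment sampler are hypothetical stand-ins: in the paper the draws come from the distribution implied by the satisficing model for a given |$\gamma ^*$|.

```python
import random


def monte_carlo_p_value(statistic, observed_d, draw_assignment, n_draws, rng):
    """Approximate P{T(D_tilde) >= T(D)} from i.i.d. draws of the assignment vector."""
    t_obs = statistic(observed_d)
    exceed = sum(statistic(draw_assignment(rng)) >= t_obs for _ in range(n_draws))
    # Counting the observed assignment as one extra draw keeps the p-value positive.
    return (1 + exceed) / (1 + n_draws)


# Hypothetical data: outcomes are held fixed under the sharp null of no effect.
outcomes = [0, 1, 0, 2, 3, 0, 1, 4]
observed = [1, 0, 1, 0, 0, 1, 1, 0]


def control_minus_treated(d):
    """Difference in mean outcomes, control minus treated (large values favour treatment)."""
    treated = [y for y, di in zip(outcomes, d) if di == 1]
    control = [y for y, di in zip(outcomes, d) if di == 0]
    return sum(control) / len(control) - sum(treated) / len(treated)


def complete_randomization(rng):
    """Stand-in sampler: a uniformly random permutation of the observed assignment."""
    d = observed[:]
    rng.shuffle(d)
    return d


p_value = monte_carlo_p_value(
    control_minus_treated, observed, complete_randomization, 999, random.Random(1)
)
print(p_value)
```

A worst-case p-value would then be approximated by repeating this calculation over sampled parameter values and taking the largest result.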
Specifically, |$\Xi = \times _{c = 0}^4 [\delta ^\dagger _c, \infty ) \times \vartheta ^\eta _{0,1}\times \times _{c=2}^4\vartheta ^\eta _c$|, where |$\delta ^\dagger _c$| is the lower bound for the satisficing threshold δc, and |$\vartheta ^\eta _c$| is the finite partially identified set for the capacity constraint ηc.
In fact, we can further simplify the worst-case tail probability. Let |$\Gamma _c = \lbrace \tau ^2_c(\mathcal {A}, \mathcal {B}): (\mathcal {A}, \mathcal {B}) \in \mathbb {U}_c(\infty )\rbrace$|, which is a finite set, for all c ∈ {0, …, 4}, and let |$\Xi ^\Gamma = \lbrace \tilde{\gamma } \equiv (\tilde{\delta }_0, \dots , \tilde{\delta }_4,\tilde{\eta }_{0,1},\tilde{\eta }_2,\tilde{\eta }_3,\tilde{\eta }_4) \in \Xi ^\circ : \tilde{\delta }_c \in \Gamma _c \, \forall c\rbrace$|, which is also a finite set. Then, we have that |$p_{w}(D) = \max _{\gamma ^* \in \Xi ^\Gamma } \mathbb {P}_{\Lambda _{\gamma ^*}}\lbrace T(\tilde{D}_{\gamma ^*}) \ge T(D)\rbrace$|. However, even though the set |$\Xi ^\Gamma$| is finite, its size is too large in practice, making stochastic approximations still necessary.
Note that in our application, |$\eta_{0,1}$|, |$\eta_2$|, and |$\eta_3$| are point-identified while |$\eta_4$| is partially identified to be in the set {0, …, 4}. Thus, |$(\eta_{0,1}, \eta_2, \eta_3, \eta_4)$| has 5 possible values. In addition, since we do not know the mother’s working status at baseline for a subject in wave 0 and the gender of a subject in wave 1 (both of whom are among the 5 participants who dropped out of the study for extraneous reasons), there are 4 possible configurations of the two missing binary variables. Thus, in total there are L = 5 × 4 = 20 hyper-rectangles that make up |$\Xi^\circ$|.
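The count of 20 can be checked by enumerating the configurations directly. In the following Python sketch the variable names are hypothetical, and the 0/1 coding of the two missing binary variables is an assumption.

```python
import itertools

eta_4_values = range(5)        # eta_4 is partially identified to be in {0, ..., 4}
mother_works_wave0 = (0, 1)    # missing mother's working status (one wave-0 subject)
gender_wave1 = (0, 1)          # missing gender (one wave-1 subject)

# Each configuration indexes one hyper-rectangle of the parameter space.
configurations = list(
    itertools.product(eta_4_values, mother_works_wave0, gender_wave1)
)
print(len(configurations))
```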
To ensure that we are covering Ξ° and its edges well when sampling the random points, we use a normalization. We use the distribution |$F_{\tau ^2_c}$| of Hotelling statistics on |$\mathbb {U}_c(\infty )$| to normalize δc so that |$F_{\tau ^2_c}(\delta _c) \in [F_{\tau ^2_c}(\delta ^\dagger _c),\, 1]$|, a compact set, for all c ∈ {0, …, 4}. Thus, γ and |$\Xi ^\circ _l$| are monotonically transformed accordingly in practice. We can do this because |$\mathbb {U}_c(\infty )$| is a finite set and |$\mathbb {U}_c(\delta _c) \equiv \lbrace (\mathcal {A}, \mathcal {B}): \mathcal {A} \subset \mathcal {S}_c,\, \, \mathcal {B} = \mathcal {S}_c \setminus \mathcal {A},\, \, |\mathcal {A}| = \lceil |\mathcal {S}_c|/2 \rceil ,\, \, \tau ^2_c(\mathcal {A}, \mathcal {B}) \le \delta _c\rbrace$| is equivalent to the set |$\lbrace (\mathcal {A}, \mathcal {B}): \mathcal {A} \subset \mathcal {S}_c,\, \, \mathcal {B} = \mathcal {S}_c \setminus \mathcal {A},\, \, |\mathcal {A}| = \lceil |\mathcal {S}_c|/2 \rceil ,\, \, F_{\tau ^2_c}(\tau ^2_c(\mathcal {A}, \mathcal {B})) \le F_{\tau ^2_c}(\delta _c)\rbrace$|.
Specifically, |$K^l_{dH} = \left[0.9^{\upsilon ^l_{dH}} - 1\right]^{-1}$|, where |$\upsilon ^l_{dH} = -\ln [(p^l_{(3)} - p^l_{(\sqrt{S})})/(p^l_{(2)} - p^l_{(3)})]/\ln (\sqrt{S})$|, based on de Haan’s (1981) result. In the context of estimating the minimum of a function over a compact set using order statistics, de Haan (1981) proposes construction of a confidence band for the minimum. We apply this result without loss of generality in our context (estimation of the maximum rather than the minimum).
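The two formulas in this footnote can be evaluated as in the following Python sketch. It assumes that |$p^l_{(j)}$| denotes the j-th largest of S sampled p-values, that S is a perfect square of at least 16, and that the relevant order statistics are distinct; the function name and the input values are hypothetical. How the constant then enters the de Haan p-value is described in the main text and is not reproduced here.

```python
import math


def de_haan_constant(p_values, base=0.9):
    """Return (upsilon, K) computed from the upper order statistics of p_values."""
    ordered = sorted(p_values, reverse=True)   # ordered[0] is the largest p-value
    s = len(ordered)
    k = math.isqrt(s)                          # sqrt(S), assumed to be an integer
    if k * k != s or k < 4:
        raise ValueError("S must be a perfect square of at least 16")
    p2, p3, pk = ordered[1], ordered[2], ordered[k - 1]
    # upsilon = -ln[(p_(3) - p_(sqrt S)) / (p_(2) - p_(3))] / ln(sqrt S)
    upsilon = -math.log((p3 - pk) / (p2 - p3)) / math.log(k)
    # K = [base^upsilon - 1]^(-1), with base = 0.9 as in the footnote
    constant = 1.0 / (base ** upsilon - 1.0)
    return upsilon, constant


# Hypothetical p-values from S = 16 sampled parameter points.
sampled = [0.30, 0.28, 0.27, 0.25, 0.24, 0.23, 0.22, 0.21,
           0.20, 0.19, 0.18, 0.17, 0.16, 0.15, 0.14, 0.13]
upsilon, constant = de_haan_constant(sampled)
print(upsilon, constant)
```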
It is only partially observed in their model.
Our model is limited in the sense that it does not allow for heterogeneity among working mothers in their availability for special arrangements. We assume that the Perry administrators choose with equal probability which working mothers get special arrangements.
This is Step 4′ in their paper. Accordingly, their tests involve “permuting treatment status among those families with the same observed and unobserved characteristics (defined by the characteristics of the eldest child in the case of families with multiple children).” In practice, they discretize SES into a binary indicator of above-median SES.
In the Perry context, it consists of the four pre-programme covariates used during the randomization phase, i.e., Stanford–Binet IQ, index of SES, gender, and mother’s working status.
Both OLS and DIM estimators can be studentized using their cluster-robust asymptotic standard errors, allowing for correlation between error terms of the participant-siblings in the Perry experiment.
We estimate the propensity scores using a logit specification and the penalized maximum likelihood method of Greenland and Mansournia (2015), which circumvents the issue of separation in small samples.
The AIPW estimator also assumes conditional independence of the counterfactual outcomes and the treatment status, i.e., |$(Y^1_i,Y^0_i) \, {\perp \!\!\!\perp }\, D_i\, |\, Z_i$|, which is valid because of the random assignment of the treatment status conditional on pre-programme variables. Note that the propensity score model used in the AIPW estimator is a direct consequence of the law of conditional probability: |$\mathrm{Pr}(R_i = 1, D_i = d\, |\, Z_i) = \mathrm{Pr}(R_i = 1\, |\, Z_i, D_i = d)\, \mathrm{Pr}(D_i = d \, |\, Z_i)$| for d ∈ {0, 1}. In the econometrics literature, the AIPW estimator is better known as a type of efficient influence function (EIF) estimator (Cattaneo, 2010). The estimator given by equation (5.3) can be studentized using the empirical sandwich standard error (Lunceford and Davidian, 2004). For studentization, we use a cluster-robust version of this asymptotic standard error, given by the following formula: |$\frac{1}{N_\mathcal {P}} [\sum _{j \in J} (\sum _{i \in \mathcal {F}_j} \hat{\pi }^1_i - \hat{\pi }^0_i - \hat{\Pi }_\mathrm{AIPW})^2]^{1/2}[|J|/(|J| - 1)]^{1/2}$|, where |$\mathcal {F}_j$| represents a cluster of participant-siblings in the set J of clusters. Our studentized test statistics are based on the asymptotic standard error mainly for computational ease, but studentization based on the bootstrap standard error would be superior in theory.
See Robins et al. (1994), Lunceford and Davidian (2004), and Kang and Schafer (2007). The double robustness property (consistency despite certain forms of misspecification) is easier to understand by rewriting equation (5.4) as follows: |$\hat{\pi }^d_i = Y^d_i + (\hat{\lambda }^d_{i}\, \hat{\phi }^d_i)^{-1}(\mathbb {I}\lbrace R_i = 1,\, D_i = d\rbrace - \hat{\lambda }^d_{i}\, \hat{\phi }^d_i)(Y^d_i - \hat{Y}^d_i)$| for d ∈ {0, 1}. If either the propensity score models or the counterfactual outcome models are correctly specified, the sample average of the whole second term (in the rewritten expression for |$\hat{\pi }^d_i$|) converges in probability to zero. Thus, the AIPW estimator remains consistent for the average treatment effect even if either the propensity score models or the counterfactual outcome models are misspecified.
However, we present estimates from all of these procedures in the online appendices as a form of sensitivity analysis. The AIPW estimator can become unstable if both the propensity score models and the counterfactual outcome models are misspecified (Kang and Schafer, 2007). Thus, we do not solely rely on the AIPW estimator but use it in conjunction with the DIM and OLS estimators.
Since AIPW clearly has an asymptotic justification, it is not strictly a small-sample procedure from an estimation perspective. Nevertheless, we can conduct inference using its finite-sample worst-case randomization null distribution using our design-based methods.
In theory, we could bound the LATE estimate by considering all possible values for each observation’s initial treatment status, and then we could use the LATE bound as a test statistic for inference. However, this is very demanding computationally and thus not feasible in practice.
In these online appendices, for each outcome we include the conventional p-values (i.e., asymptotic, bootstrap, and permutation p-values) and design-based p-values (i.e., worst-case maximum and worst-case de Haan p-values) associated with each of the DIM, OLS, and AIPW estimators of treatment effects. We also include permutation and worst-case p-values based on both nonstudentized and studentized test statistics. In addition, we include stepdown versions of the worst-case p-values.
The corresponding worst-case de Haan (single) p-values are 0.427, 0.343, 0.348, 0.236, and 0.459, respectively.
Those in the treatment group of the first entry cohort (wave 0) were provided with the intervention for only one year, starting at age 4, and thus were an exception. In our estimation of treatment effects, we pool all five cohorts, even though the lower programme intensity in the first cohort might in principle attenuate the magnitudes of the effects downward.
The initial eligibility criteria specified that the IQs, as measured by the Stanford–Binet IQ test according to 1960's norming, be between 70 and 85, which was one standard deviation below the population average. However, in practice, the IQ range was 61 to 88. Only about two-thirds of the participants had IQs in the range specified initially.
REFERENCES
SUPPORTING INFORMATION
Additional Supporting Information may be found in the online version of this article at the publisher’s website:
Online Appendix
Replication Package
Notes
Managing editor Jaap Abbring handled this manuscript.
APPENDIX A
BACKGROUND AND ELIGIBILITY CRITERIA OF PERRY PROGRAMME
The Perry Preschool Project was carried out in five waves between autumn 1962 and autumn 1965 near a public school—the Perry Elementary School in Ypsilanti, a small city near Detroit in Michigan. Data collection took place at the baseline age of 3 years and through surveys that were administered annually until age 15. The participants were additionally followed up around ages 19, 27, 40, and 55. Various measures were obtained over the years, including information on education, crime, and other economic outcomes.
Intensity of the programme was low relative to several later early education programmes.60 Starting at age 3, treatment in the following two years included preschool for 2.5 hours per day on weekdays during the academic year. Another major component of the programme consisted of 1.5-hour weekly home visits by the Perry teachers to promote parental engagement with the child.61 The Perry curriculum fostered active child-centered learning through intensive interactions between the children and programme teachers (Weikart et al., 1978; Schweinhart et al., 1993).
Door-to-door canvassing and referrals were used to survey and identify disadvantaged families among those of the Perry Elementary School students. To be eligible for participation in the Perry Preschool Project, the children had to: (i) be African American; (ii) have low Stanford–Binet IQ scores at baseline;62 and (iii) be socioeconomically disadvantaged according to an index of socioeconomic status based on employment and education levels of the parents as well as the number of persons per room at home. The Perry families were more disadvantaged than a majority of African American families at that time in the United States. However, the Perry families were, by and large, representative of a substantial fraction of the underprivileged African American population (Heckman et al., 2010a).
Even when compared with the children living in the area surrounding the Perry Elementary School, the Perry participants were especially disadvantaged (Heckman et al., 2010a). Since the parents of all children eligible for the programme participated in the study (Weikart et al., 1978), issues of noncompliance are not a concern. As there were no substitutes for the Perry programme, such as Head Start, available when the Perry experiment was implemented, control group contamination is also not a problem in our experimental setting.
APPENDIX B
EXCHANGES WERE NOT BASED ON CONSECUTIVE IQ SCORES
We use Perry data from wave 4 as an example to show that the exchanges were not necessarily between consecutively ranked pairs. In wave 4, there were 19 participants, excluding any younger siblings in the programme. Their IQs, which involve many ties, were: 61, 71, 75, 76, 76, 76, 78, 78, 79, 79, 80, 80, 81, 82, 83, 83, 83, 85, 88. Regardless of which method was used to break the ties, a pure ranking procedure would have given the staff two initial groups: one with IQs {61, 75, 76, 78, 79, 80, 81, 83, 83, 88} and another with IQs {71, 76, 76, 78, 79, 80, 82, 83, 85}. The final observed treatment group has IQs in the set {61, 75, 76, 78, 80, 81, 83, 83, 83, 88}; relative to the first ranked group, the person with IQ 79 is replaced by a person with IQ 83. The final observed control group has IQs in the set {71, 76, 76, 78, 79, 79, 80, 82, 85}; relative to the second ranked group, the person with IQ 83 is replaced by a person with IQ 79. The final observed groups are the same as the initial treatment and control groups, since there were no transfers in the fifth step of the protocol, as explained in Example 3 of the paper. We can therefore conclude that an exchange happened between participants with IQs 79 and 83, who do not comprise a consecutively ranked pair. Hence, after the IQ rank ordering, the exchanges between the two initial groups were not always between consecutively ranked IQ pairs, and the Perry staff did not strictly implement a matched pair design.
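The wave 4 comparison can be checked mechanically. The following is a minimal sketch in Python, not code from the paper or its replication package. It assumes that the "pure ranking procedure" means sorting participants by IQ and assigning alternate ranks to two groups; the variable names (`group_a`, `group_b`, and so on) are illustrative. Because the groups are compared as multisets of IQ values, the result does not depend on how ties are broken.

```python
from collections import Counter

# Wave 4 IQ scores as listed in the text (19 participants).
iqs = [61, 71, 75, 76, 76, 76, 78, 78, 79, 79,
       80, 80, 81, 82, 83, 83, 83, 85, 88]

# Assumed pure ranking procedure: sort by IQ, then alternate ranks.
ranked = sorted(iqs)
group_a = ranked[0::2]  # ranks 1, 3, 5, ...
group_b = ranked[1::2]  # ranks 2, 4, 6, ...

# Final observed groups as reported in the text.
final_treatment = [61, 75, 76, 78, 80, 81, 83, 83, 83, 88]
final_control = [71, 76, 76, 78, 79, 79, 80, 82, 85]

# Multiset differences show who left and who entered each group.
left_a = Counter(group_a) - Counter(final_treatment)
entered_a = Counter(final_treatment) - Counter(group_a)
left_b = Counter(group_b) - Counter(final_control)
entered_b = Counter(final_control) - Counter(group_b)

# Smallest possible rank gap between an IQ-79 and an IQ-83 participant,
# taken over every way of breaking ties.
ranks_79 = [i for i, v in enumerate(ranked) if v == 79]
ranks_83 = [i for i, v in enumerate(ranked) if v == 83]
min_rank_gap = min(abs(i - j) for i in ranks_79 for j in ranks_83)

print("left group A:", dict(left_a), "entered group A:", dict(entered_a))
print("left group B:", dict(left_b), "entered group B:", dict(entered_b))
print("minimum rank gap between IQ 79 and IQ 83:", min_rank_gap)
```

A rank gap of 1 would be needed for the exchanged participants to form a consecutively ranked pair. The gap computed here is larger under any tie-breaking rule, which is consistent with the conclusion that the exchange was not within a matched pair.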