We look at conventional methods for removing endogeneity bias in regression models, including the linear model and the probit model. It is known that the usual Heckman two-step procedure should not be used in the probit model: from a theoretical perspective, it is unsatisfactory, and likelihood methods are superior. However, serious numerical problems occur when standard software packages try to maximize the biprobit likelihood function, even if the number of covariates is small. We draw conclusions for statistical practice. Finally, we prove the conditions under which parameters in the model are identifiable. The conditions for identification are delicate; we believe these results are new.
Suppose a linear regression model describes responses to treatment and to covariates. If subjects self-select into treatment, the process being dependent on the error term in the model, endogeneity bias is likely. Similarly, we may have a linear model that is to be estimated on sample data; if subjects self-select into the sample, endogeneity becomes an issue.
Heckman (1978, 1979) suggested a simple and ingenious two-step method for taking care of endogeneity, which works under the conditions described in those papers. This method is widely used. Some researchers have applied the method to probit response models. However, the extension is unsatisfactory. The nonlinearity in the probit model is an essential difficulty for the two-step correction, which will often make bias worse. It is well-known that likelihood techniques are to be preferred—although, as we show here, the numerics are delicate.
In the balance of this article, we define models for (1) self-selection into treatment or control and (2) self-selection into the sample, with simulation results to delineate the statistical issues. In the simulations, the models are correct. Thus, anomalies in the behavior of estimators are not to be explained by specification error. Numerical issues are explored. We explain the motivation for the two-step estimator and draw conclusions for statistical practice. We derive the conditions under which parameters in the models are identifiable; we believe these results are new. The literature on models for self-selection is huge, and so is the literature on probits; we conclude with a brief review of a few salient papers.
To define the models and estimation procedures, consider n subjects, indexed by i = 1, …, n. Subjects are assumed to be independent and identically distributed. For each subject, there are two manifest variables Xi, Zi and two latent variables Ui, Vi. Assume that (Ui, Vi) are bivariate normal, with mean 0, variance 1, and correlation ρ. Assume further that (Xi, Zi) is independent of (Ui, Vi), that is, the manifest variables are exogenous. For ease of exposition, we take (Xi, Zi) as bivariate normal, although that is not essential. Until further notice, we set the means to 0, the variances to 1, the correlation between Xi and Zi to 0.40, and sample size n to 1000.
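As a concreteness check, the joint draws described above can be sketched in Python. This is our own illustration, not code from the paper; it uses the Cholesky trick for sampling a standard bivariate normal pair with a given correlation.

```python
import math
import random

def bivariate_normal_pair(corr, rng):
    """Draw (A, B) standard bivariate normal with correlation corr,
    via the Cholesky factor of the 2x2 correlation matrix."""
    s = rng.gauss(0.0, 1.0)
    t = rng.gauss(0.0, 1.0)
    a = s
    b = corr * s + math.sqrt(1.0 - corr * corr) * t
    return a, b

rng = random.Random(0)
n = 20000
xs, zs = zip(*(bivariate_normal_pair(0.40, rng) for _ in range(n)))

# the sample correlation should be close to the target 0.40
mx = sum(xs) / n
mz = sum(zs) / n
cov = sum((x - mx) * (z - mz) for x, z in zip(xs, zs)) / n
vx = sum((x - mx) ** 2 for x in xs) / n
vz = sum((z - mz) ** 2 for z in zs) / n
r = cov / math.sqrt(vx * vz)
```

The same construction, applied with correlation ρ, produces the latent pair (Ui, Vi).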
A Probit Response Model with an Endogenous Regressor
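For orientation, here is the latent-index form that the surrounding text implies for the selection equation (1) and the response equation (2); this is our rendering, with parameter names matching the notes to Table 1.

```latex
C_i = \begin{cases} 1 & \text{if } a + bX_i + U_i > 0,\\ 0 & \text{otherwise;} \end{cases} \tag{1}

Y_i = \begin{cases} 1 & \text{if } c + dZ_i + eC_i + V_i > 0,\\ 0 & \text{otherwise.} \end{cases} \tag{2}
```

The dummy Ci from (1) appears as a regressor in (2), which is what makes it endogenous when ρ ≠ 0.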
Endogeneity bias is likely in (2). Indeed, Ci is endogenous due to the correlation ρ between the latent variables Ui and Vi. A two-step correction for endogeneity is sometimes used (although it should not be):
Step 1. Estimate the probit model (1) by likelihood techniques.
Step 2. To estimate (2), fit the expanded probit model
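In Heckman-style corrections, the extra regressor fitted in Step 2 is built from the ratio φ/Φ, the inverse Mills ratio discussed in the literature review, evaluated at the fitted selection index. A minimal sketch; the fitted values a_hat and b_hat are hypothetical placeholders, set here at the true values for illustration.

```python
import math

def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def inverse_mills(t):
    """phi(t)/Phi(t): the correction regressor added in Step 2."""
    return phi(t) / Phi(t)

# hypothetical Step 1 fits (illustration only; true a = 0.5, b = 1)
a_hat, b_hat = 0.5, 1.0
x_values = [-1.0, 0.0, 1.0]
m = [inverse_mills(a_hat + b_hat * x) for x in x_values]
```

The ratio is strictly decreasing in its argument, so subjects with low selection indices get large corrections.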
The operating characteristics of the two-step correction were determined in a simulation study that drew 500 independent samples of size n = 1000. Each sample was constructed as described above. We set a = 0.50, b = 1, and ρ = 0.60. These choices create an environment favorable to correction.
Endogeneity is moderately strong: ρ = 0.60. So there should be some advantage to removing endogeneity bias. The dummy variable Ci is 1 with probability about 0.64, so it has appreciable variance. Furthermore, half the variance on the right hand side of (1) can be explained: var(bXi) = var(Ui). The correlation between the regressors is only 0.40: making that correlation higher exposes the correction to well-known instabilities.
The sample is large: n = 1000. Regressors are exogenous by construction. Subjects are independent and identically distributed. Somewhat arbitrarily, we set the true value of c in the response equation (2) to −1, whereas d = 0.75 and e = 0.50. As it turned out, these choices were favorable too.
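The data-generating process just described can be sketched as follows. This is our own re-implementation in Python for illustration (the paper's computations were done in R, VGAM, and Stata), with the stated parameter values.

```python
import math
import random

def simulate(n, a=0.5, b=1.0, c=-1.0, d=0.75, e=0.5,
             rho=0.60, r_xz=0.40, seed=0):
    """One sample from the model: selection equation (1) and
    response equation (2), with correlated latents (U, V)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # exogenous pair (X, Z) with correlation r_xz
        s1, s2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = s1
        z = r_xz * s1 + math.sqrt(1 - r_xz ** 2) * s2
        # latent pair (U, V) with correlation rho
        t1, t2 = rng.gauss(0, 1), rng.gauss(0, 1)
        u = t1
        v = rho * t1 + math.sqrt(1 - rho ** 2) * t2
        ci = 1 if a + b * x + u > 0 else 0            # selection, equation (1)
        yi = 1 if c + d * z + e * ci + v > 0 else 0   # response, equation (2)
        data.append((x, z, ci, yi))
    return data

sample = simulate(20000)
p_treated = sum(ci for _, _, ci, _ in sample) / len(sample)
```

Since a + bX + U is N(0, 2) shifted by a = 0.5, P(C = 1) = Φ(0.5/√2) ≈ 0.64, matching the figure quoted in the text.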
Table 1 summarizes results for three kinds of estimates:

raw (ignoring endogeneity);

the two-step correction; and

full maximum likelihood.

For each kind of estimate and each parameter, the table reports the mean of the estimates across the 500 repetitions. Subtracting the true value of the parameter measures the bias in the estimator. Similarly, the standard deviation (SD) across the repetitions, also shown in the table, measures the likely size of the random error.
Notes. Correcting endogeneity bias when the response is binary probit. There are 500 repetitions. The sample size is 1000. The correlation between latents is ρ = 0.60. The parameters in the selection equation (1) are set at a = 0.50 and b = 1. The parameters in the response equation (2) are set at c = –1, d = 0.75, and e = 0.50. The response equation includes the endogenous dummy Ci defined by (1). The correlation between the exogenous regressors is 0.40. MLE computed by VGAM 0.7-6.
The “raw estimates” in Table 1 are obtained by fitting the probit model
The two-step estimates are obtained via (3–4), with â and b̂ obtained by fitting (1). We focus on d and e, as the parameters in equation (2) that may be given causal interpretations. Without correction, d̂ averages about 0.72; with correction, 0.83 (see Table 1). Correction doubles the bias. Without correction, ê averages 1.33; with correction, 0.54. Correction helps a great deal, but some bias remains.
With the two-step correction, the SD of ê is about 0.21. Thus, random error in the estimates is appreciable, even with n = 1000. On the other hand, the standard error (SE) across the 500 repetitions is 0.21/√500 ≈ 0.01. The bias in ê cannot be explained in terms of random error in the simulation: increasing the number of repetitions will not make any appreciable change in the estimated biases.
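The arithmetic behind this claim is simple: the Monte Carlo SE of a mean over 500 repetitions is the SD divided by √500. Using the SD of 0.21 and the corrected average of 0.54 reported in the text (true e = 0.50):

```python
import math

reps = 500
sd = 0.21                   # SD of the two-step estimate across repetitions
se = sd / math.sqrt(reps)   # Monte Carlo SE of the reported average
bias = 0.54 - 0.50          # mean two-step estimate minus the true e
z = bias / se               # bias measured in Monte Carlo SEs
```

The bias is roughly four Monte Carlo SEs wide, so it is not a simulation artifact.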
Heckman (1978) also suggested the possibility of fitting the full model—equations (1) and (2)—by maximum likelihood. The full model is a “bivariate probit” or “biprobit” model. Results are shown in the last two lines of Table 1. The MLE is essentially unbiased. The MLE is better than the two-step correction, although random error remains a concern.
We turn to some variations on the setup described in Table 1. The simulations reported there generated new versions of the regressors on each repetition. Freezing the regressors makes almost no difference in the results: SDs would be smaller in the third decimal place.
The results in Table 1 depend on ρ, the correlation between the latent variables in the selection equation and the response equation. If ρ is increased from 0.60 to 0.80, say, the performance of the two-step correction is substantially degraded. Likewise, increasing the correlation between the exogenous regressors degrades the performance.
When ρ = 0.80 and the correlation between the regressors is 0.60, the bias in the two-step correction (3–4) for d̂ is about 0.15; for ê, about 0.20. Figure 1 plots the bias in ê against ρ, with the correlation between regressors set at 0.40 or 0.60, other parameters being fixed at their values for Table 1. The wiggles in the graph reflect variance in the Monte Carlo (there are “only” 500 replicates). The MLE is less sensitive to increasing correlations (data not shown).
Results are also sensitive to the distribution of the exogenous regressors. As the variance in the regressors goes down, bias goes up—in the two-step estimates and in the MLE. Furthermore, numerical issues become acute. There is some explanation: dividing the SD of X by 10, say, is equivalent to dividing b by 10 in equation (1); similarly for Z and d in equation (2). For small values of b and d, parameters are barely identifiable.
Figure 2 plots the bias in ê against the common SD of X and Z, which is set to values ranging from 0.1 to 1.0 (other parameters are set as in Table 1). The light line represents the MLE. Some of the “bias” in the MLE is indeed small-sample bias—when the SD is 0.1, a sample with n = 1000 is a small sample. Some of the bias, however, reflects a tendency of likelihood maximizers to quit before finding the global maximum.
The heavy line represents the two-step correction. (With an SD of 0.1, data for the two-step correction are not shown, because there are huge outliers; even the median bias is quite changeable from one set of 500 repetitions to another, but 0.2 may be a representative figure.) Curiously, the two-step correction is better than the MLE when the SD of the exogenous regressors is set to 0.2 or to 0.3. This is probably due to numerical issues in maximizing the likelihood functions.
We believe the bias in the two-step correction (Figs 1 and 2) reflects the operating characteristics of the estimator, rather than those of the software. Beyond an SD of 1.0, the bias in the MLE seems to be negligible. Beyond 1.5, the bias in the two-step estimator for e is minimal, but d̂ continues to be a little problematic.
As noted above, changing the scale of X is equivalent to changing b. Similarly, changing the scale of Z is equivalent to changing d (see equations (1) and (2)). Thus, in Fig. 2, we could leave the SDs at 1 and run through a series of (b, d) pairs:
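Concretely, under this equivalence each SD value s in Fig. 2 corresponds to the pair (b, d) = (s·b0, s·d0) with unit-SD regressors, where b0 = 1 and d0 = 0.75 are the default slopes. A sketch of the correspondence; the underlying identity b·(sX) = (sb)·X is exact.

```python
b0, d0 = 1.0, 0.75
sds = [s / 10.0 for s in range(1, 11)]    # SDs 0.1, 0.2, ..., 1.0 as in Fig. 2
pairs = [(s * b0, s * d0) for s in sds]   # equivalent (b, d) pairs, unit-SD regressors

# the equivalence itself: b * (s * x) equals (s * b) * x for any draw x
x = 1.37
checks = [abs(b0 * (s * x) - (s * b0) * x) for s in sds]
```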
The number of regressors should also be considered. With a sample size of 1000, practitioners would often use a substantial number of covariates. Increasing the number of regressors is likely to have a negative impact on performance.
A Probit Model with Endogenous Sample Selection
Fitting (6) to the observed data raises the question of endogeneity bias. Sample subjects have relatively high values of Ui; hence, high values of Vi. (This assumes ρ > 0.) Again, there is a proposed solution that involves two steps.
Step 1. Estimate the probit model (5) by likelihood techniques.
Step 2. Fit the expanded probit model
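The source of the bias can be seen directly: conditioning on Ci = 1 selects subjects with large Ui and hence, when ρ > 0, large Vi. A Monte Carlo sketch of E(V | C = 1) under the settings of Table 2; this is our re-implementation, for illustration only.

```python
import math
import random

def mean_v_given_selected(n=50000, a=0.5, b=1.0, rho=0.60, seed=1):
    """Average V among selected subjects (C = 1), where selection
    follows C = 1 iff a + bX + U > 0, and corr(U, V) = rho."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n):
        x = rng.gauss(0, 1)
        u = rng.gauss(0, 1)
        v = rho * u + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        if a + b * x + u > 0:   # subject enters the sample
            total += v
            count += 1
    return total / count

mv = mean_v_given_selected()
```

The conditional mean of V in the selected sample is on the order of 0.25 rather than 0, which is exactly the endogeneity that the correction is meant to remove.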
Notes. Correcting endogeneity bias in sample selection when the response is binary probit. There are 500 repetitions. The sample size is 1000. The correlation between latents is ρ = 0.60. The parameters in the selection equation (5) are set at a = 0.50 and b = 1. The parameters in the response equation (6) are set at c = −1, and d = 0.75. Response data are observed only when Ci = 1, as determined by the selection equation. This will occur for about 64% of the subjects. The correlation between the exogenous regressors is 0.40. MLE computed using Stata 9.2.
Increasing the sample size from 1000 to 5000 in the simulations barely changes the averages, but reduces the SDs by a factor of about √5, as might be expected. This comment applies to both Table 1 and Table 2 (data not shown), but not to the MLE results in Table 2; increasing n would have made the Stata code prohibitively slow to run.
Many applications of Heckman's method feature a continuous response variable rather than a binary one. Here, the two-step correction is on firmer ground, and parallel simulations (data not shown) indicate that the correction removes most of the endogeneity bias when the parameters are set as in Tables 1 and 2. However, residual bias is large when the SD of the regressors is set to 0.1 and the sample size is “only” 1000; the issues resolve when n = 10,000. The problem with n = 1000 is created by (1) large random errors in the first-step estimates, coupled with (2) poorly conditioned design matrices. In more complicated situations, there may be additional problems.
Exploratory computations were done in several versions of Matlab, R, and Stata. In the end, to avoid confusion and chance capitalization, we redid the computations in a more unified way, with R 2.7 for the raw estimates and the two-step correction; VGAM 0.7-6 for the MLE in (1–2); and Stata 9.2 for the MLE in (5–6). Why do we focus on the behavior of R and Stata? R is widely used in the statistical community, and Stata is almost the lingua franca of quantitative social scientists.
Let b0 and d0 be the default values of b and d, namely, 1 and 0.75. As b and d decrease from the defaults, VGAM in R handled the maximization less and less well (Fig. 2). We believe VGAM had problems computing the Hessian, even for the base case in Table 1: its internally generated SEs were too small by a factor of about 2 for ê.
By way of counterpoint, Stata did somewhat better when we used it to redo the MLE in (1–2). However, if we multiply the default b0 and d0 by 0.3 or 0.4, bias in Stata becomes noticeable. If we multiply by 0.1 or 0.2, many runs fail to converge, and the runs that do converge produce aberrant estimates, particularly for a multiplier of 0.1. For multipliers of 0.2 to 0.4, the bias in ê is upwards in R but downwards in Stata. In Table 2, Stata did well. However, if we scale b0 and d0 by 0.1 or 0.2, Stata has problems. In defense of R and Stata, we can say that they produce abundant warning messages when they get into difficulties.
In multidimensional problems, even the best numerical analysis routines find spurious maxima for the likelihood function. Our models present three kinds of problems: (1) flat spots on the log likelihood surface, (2) ill-conditioned maxima, where the eigenvalues of the Hessian are radically different in size, and (3) ill-conditioned saddle points with one small positive eigenvalue and several large negative eigenvalues. The maximizers in VGAM and Stata simply give up before finding anything like the maximum of the likelihood surface. This is a major source of the biases reported above.
The model defined by (1–2) is a harder challenge for maximum likelihood than (5–6), due to the extra parameter e. Our computations suggest that most of the difficulty lies in the joint estimation of three parameters, c, e, ρ. Indeed, we can fix a, b, d at the default values for Table 1 and maximize the likelihood over the remaining three parameters c, e, ρ. VGAM and Stata still have convergence issues. The problems are the same as with six parameters. For example, we found a troublesome sample where the Hessian of the log likelihood had eigenvalues 4.7, −1253.6, −2636.9. (We parameterize the correlation between the latents by log(1 + ρ) − log(1 − ρ) rather than ρ, since that is how binom2.rho in VGAM does things.)
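The reparameterization in the parenthetical is the Fisher z-transform: θ = log(1 + ρ) − log(1 − ρ) = 2 atanh(ρ), with inverse ρ = tanh(θ/2). It maps (−1, 1) onto the whole real line, which suits unconstrained maximizers. A sketch of the round trip:

```python
import math

def rho_to_theta(rho):
    """Unconstrained parameter used by binom2.rho-style maximizers."""
    return math.log(1.0 + rho) - math.log(1.0 - rho)

def theta_to_rho(theta):
    """Inverse map, back to a correlation in (-1, 1)."""
    return math.tanh(theta / 2.0)

vals = [-0.95, -0.5, 0.0, 0.6, 0.8, 0.99]
roundtrip = [theta_to_rho(rho_to_theta(r)) for r in vals]
```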
One of us has an improved likelihood maximizer called GENOUD (Sekhon and Mebane 1998). GENOUD seems to do much better at the maximization, and its internally generated SEs are reasonably good. Results for GENOUD and Stata are not reported here; they are available from the authors.
Implications for Practice
There are two main conclusions from the simulations and the analytic results. One caveat before stating them: the models analyzed here are very simple, with one covariate in each of (1–2) and (5–6). In real examples, the number of covariates may be quite large, and numerical behavior will be correspondingly more problematic.
Under ordinary circumstances, the two-step correction should not be used in probit response models. In some cases, the correction will reduce bias, but in many other cases, the correction will increase bias.
If the bivariate probit model is used, special care should be taken with the numerics. Conventional likelihood maximization algorithms can produce estimates that are far from the MLE. Even if the MLE has good operating characteristics, the “MLE” found by the software package may not. Results from VGAM 0.7-6 should be treated with caution. Results from Stata 9.2 may be questionable for various combinations of parameters.
Of course, there is a question more salient than the numerics: what is it that justifies probit models and the like as descriptions of behavior? For additional discussion, see Freedman (2005), which has further cites to the literature on this point.
Motivating the Estimator
Identifiability means that parameters are determined by the joint distribution of the observables; parameters that are not identifiable cannot be estimated. In the model defined by (1–2), the parameters are a, b, c, d, e and the correlation ρ between the latents; the observables are Xi, Zi, Ci, and Yi. In the model defined by (5–6), the parameters are a, b, c, d and the correlation ρ between the latents; observables are Xi, Ci, Zi*, Yi*, where Zi* = Zi and Yi* = Yi when Ci = 1, whereas Zi* = Yi* = ℳ when Ci = 0. Here, ℳ is just a special symbol that denotes “missing.”
Results are summarized as Propositions 1 and 2. The statements involve the sign of d, which is +1 if d > 0, 0 if d = 0, and −1 if d < 0. Since subjects are independent and identically distributed, only i = 1 need be considered. The variables (X1, Z1) are taken as bivariate normal, with a correlation strictly between −1 and +1. This assumption is discussed below.
Proposition 1. Consider the model defined by (1–2). The parameters a and b in (1) are identifiable, and the sign of d in (2) is identifiable. If b ≠ 0, the parameters c, d, e, ρ in (2) are identifiable. If b = 0 but d ≠ 0, the parameters c, d, e, ρ are still identifiable. However, if b = d = 0, the remaining parameters c, e, ρ are not identifiable.
Proposition 2. Consider the model defined by (5–6). The parameters a and b in (5) are identifiable, and the sign of d in (6) is identifiable. If b ≠ 0, the parameters c, d, ρ in (6) are identifiable. If b = 0 but d ≠ 0, the parameters c, d, ρ are still identifiable. However, if b = d = 0, the remaining parameters c, ρ are not identifiable.
Clearly, the joint distribution of C1 and X1 determines a and b, so we may consider these as given. The distributions of X1 and Z1 are determined (this is not so helpful). We can take the conditional distribution of Y1 given X1 = x and Z1 = z as known. In other words, suppose (U, V) are bivariate normal with mean 0, variance 1, and correlation ρ.
Fix x at any convenient value, and consider z > 0. Then P(Y1 = 1 | X1 = x, Z1 = z), as a function of z, is strictly decreasing, constant, or strictly increasing, according as d < 0, d = 0, or d > 0. The sign of d is therefore determined. The rest of the proof, alas, consists of a series of cases.
The case b ≠ 0 and d > 0: Let u = −a − bx, v = −z, ξ = U, and ζ = (V + c)/d. Then (ξ, ζ) are bivariate normal, with unknown correlation ρ. We know ξ has mean 0 and variance 1. The mean and variance of ζ are unknown, being c/d and 1/d², respectively. But
The case b ≠ 0 and d < 0 is the same, except that .
Of course, Λ′(0) = E(Ua) = −φ(−a)/Φ(−a). Thus, κ1 = −φ(−a)/Φ(−a) + c/d, which determines c/d. Next, Λ″(0) = var(Ua), so κ2 = (ρ/d)²var(Ua) + (1 − ρ²)/d² is determined. Finally, κ3 = Λ′′′(0) is the third central moment of Ua. Since Ua has a skewed distribution, Λ′′′(0) ≠ 0. We can compute (ρ/d)³ from κ3, and then ρ/d. Next, we get 1/d² from κ2, and then 1/d. (We are looking at the case d > 0.) Finally, c comes from κ1. Thus, c, d, ρ are determined, and e comes from ψ1 in (11). This completes the argument for the case b = 0 and d > 0.
The case b = 0 and d < 0 follows by the same argument.
The case b = d = 0: The three remaining parameters, c, e, and ρ, are not identifiable.
Here, we know the joint distribution of (X1, C1), which determines a, b. We also know the joint distribution of (X1, Z1, Y1) given C1 = 1; we do not know this joint distribution given C1 = 0. As in (11), suppose (U, V) are bivariate normal with mean 0, variance 1, and correlation ρ. The joint distributions of the observables determine a, b and the function
The random variable Ua was defined in the course of proving Proposition 1. If desired, the moments of Ua can be obtained explicitly in terms of φ and Φ, using repeated integration by parts.
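For example, with Ua the law of U given U < −a (the convention matching the mean formula used above), integration by parts gives E(Ua) = −φ(−a)/Φ(−a) and E(Ua²) = 1 + aφ(−a)/Φ(−a). A Monte Carlo sanity check of both φ/Φ expressions, in our notation:

```python
import math
import random

def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

a = 0.5
t = -a                                  # Ua is U truncated to U < -a
m_exact = -phi(t) / Phi(t)              # E(Ua)
m2_exact = 1.0 - t * phi(t) / Phi(t)    # E(Ua^2)

# accept/reject sampling from the truncated normal
rng = random.Random(2)
draws = []
while len(draws) < 50000:
    u = rng.gauss(0, 1)
    if u < t:
        draws.append(u)
m_mc = sum(draws) / len(draws)
m2_mc = sum(u * u for u in draws) / len(draws)
```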
The Laplace transform of Ua is easily obtained by completing the square, and
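Completing the square gives E(e^{sU}; U < t) = e^{s²/2}Φ(t − s), so the conditional transform of Ua (with t = −a) is e^{s²/2}Φ(t − s)/Φ(t). A deterministic check of the sub-probability version, by numerical integration:

```python
import math

def phi(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def Phi(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def laplace_numeric(s, t, lo=-12.0, steps=200000):
    """Trapezoidal approximation of E(exp(sU); U < t); the tail
    below lo is negligible for moderate s."""
    h = (t - lo) / steps
    total = 0.5 * (math.exp(s * lo) * phi(lo) + math.exp(s * t) * phi(t))
    for i in range(1, steps):
        u = lo + i * h
        total += math.exp(s * u) * phi(u)
    return total * h

s, t = 0.7, -0.5
closed_form = math.exp(0.5 * s * s) * Phi(t - s)   # completing the square
numeric = laplace_numeric(s, t)
```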
3. The argument for the case b = 0 and d > 0 in Proposition 1 is somewhat intricate, but it actually covers all values of b, whether zero or non-zero. The argument shows that for any particular real α, the values of c, d, ρ are determined by the number P(α + U < 0) and the function
4. Likewise, the argument for the case b ≠ 0 and d = 0 proves more. If we know P(U < u) and P(U < u and γ + V > 0) for all real u, that determines γ and ρ.
5. In (17), for example, if α = 1/2, then ρ = −1; but c can be anywhere in the range [0, ∞).
6. The propositions can easily be extended to cover vector-valued exogenous variables.
7. Our proof of the propositions really does depend on the assumption of an imperfect correlation between Xi and Zi. We hope to consider elsewhere the case where Zi ≡ Xi. The assumption of normality is not material; it is enough if the joint distributions have full support, although positive densities are probably easier to think about.
8. The assumption of bivariate normality for the latent variables is critical. If this is wrong, estimates are likely to be inconsistent.
9. Suppose (U, V) are bivariate normal with correlation ρ, and −1 < ρ < 1. Then
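One standard fact in this setting: given U = u, V is normal with mean ρu and variance 1 − ρ², so, for instance, E(V | U > t) = ρE(U | U > t) = ρφ(t)/Φ(−t). A Monte Carlo sanity check, in our notation:

```python
import math
import random

def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

rho, t = 0.6, 0.5
exact = rho * phi(t) / Phi(-t)   # E(V | U > t)

rng = random.Random(3)
total, count = 0.0, 0
for _ in range(100000):
    u = rng.gauss(0, 1)
    v = rho * u + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
    if u > t:
        total += v
        count += 1
mc = total / count
```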
Some Relevant Literature
Cumulants are discussed by Rao (1973, 101). The ratio φ/Φ in (8) is usually called the “inverse Mills ratio,” in reference to Mills (1926)—although Mills tabulates [1 − Φ(x)]/φ(x) for x ≥ 0. Heckman (1978, 1979) proposes the use of Mi to correct for endogeneity and selection bias in the linear case, with a very clear explanation of the issues. He also describes potential use of the MLE. Meng and Schmidt (1985) discuss cases where the bivariate probit MLE is fragile. Rivers and Vuong (1988) propose an interesting alternative to the Heckman estimator. Their estimator (perhaps confusingly) is also called a two-step procedure. It seems most relevant when the endogenous variable is continuous; ours is binary.
For other estimation strategies and discussion, see Angrist (2001). Bhattacharya, Goldman, and McCaffrey (2006) discuss several “two-step” algorithms, including a popular Instrumental-Variables Least Squares regression estimator that turns out to be inconsistent; they do not seem to consider the particular two-step estimator of concern in our paper. Muthen (1979) discusses identifiability in a model with latent causal variables. The VGAM manual (Yee 2007) notes difficulties in computing standard errors. According to Stata (2005), its maximum likelihood routine “provides consistent, asymptotically efficient estimates for all the parameters in [the] models.”
Van de Ven and Van Praag (1981) found little difference between the MLE and the two-step correction; the difference doubtless depends on the model under consideration. Instabilities in the two-step correction are described by Winship and Mare (1992), Copas and Li (1997), and Briggs (2004), among others. For additional citations, see Dunning and Freedman (2007). Ono (2007) uses the two-step correction with probit response in a study of the Japanese labor market; X and Z are multidimensional. The sample size is 10,000, but only 300 subjects select into the treatment condition. Bushway, Johnson, and Slocum (2007) describe many overenthusiastic applications of the two-step correction in the criminology literature: binary response variables are among the least of the sins.
We do not suggest that finding the true maximum of the likelihood function guarantees the goodness of the estimator, because there are situations where the MLE performs rather badly. Freedman (2007) has a brief review of the literature on this topic. However, we would suggest that spurious maxima are apt to perform even less well, particularly with the sort of models considered here.