## Abstract

We look at conventional methods for removing endogeneity bias in regression models, including the linear model and the probit model. It is known that the usual Heckman two-step procedure should not be used in the probit model: from a theoretical perspective, it is unsatisfactory, and likelihood methods are superior. However, serious numerical problems occur when standard software packages try to maximize the biprobit likelihood function, even if the number of covariates is small. We draw conclusions for statistical practice. Finally, we prove the conditions under which parameters in the model are identifiable. The conditions for identification are delicate; we believe these results are new.

## Introduction

Suppose a linear regression model describes responses to treatment and to covariates. If subjects self-select into treatment, the process being dependent on the error term in the model, endogeneity bias is likely. Similarly, we may have a linear model that is to be estimated on sample data; if subjects self-select into the sample, endogeneity becomes an issue.

Heckman (1978, 1979) suggested a simple and ingenious two-step method for taking care of endogeneity, which works under the conditions described in those papers. This method is widely used. Some researchers have applied the method to probit response models. However, the extension is unsatisfactory. The nonlinearity in the probit model is an essential difficulty for the two-step correction, which will often make bias worse. It is well-known that likelihood techniques are to be preferred—although, as we show here, the numerics are delicate.

In the balance of this article, we define models for (1) self-selection into treatment or control and (2) self-selection into the sample, with simulation results to delineate the statistical issues. In the simulations, the models are correct. Thus, anomalies in the behavior of estimators are not to be explained by specification error. Numerical issues are explored. We explain the motivation for the two-step estimator and draw conclusions for statistical practice. We derive the conditions under which parameters in the models are identifiable; we believe these results are new. The literature on models for self-selection is huge, and so is the literature on probits; we conclude with a brief review of a few salient papers.

To define the models and estimation procedures, consider *n* subjects, indexed by *i* = 1, …, *n*. Subjects are assumed to be independent and identically distributed. For each subject, there are two manifest variables *X _{i}*, *Z _{i}* and two latent variables *U _{i}*, *V _{i}*. Assume that (*U _{i}*, *V _{i}*) are bivariate normal, with mean 0, variance 1, and correlation ρ. Assume further that (*X _{i}*, *Z _{i}*) is independent of (*U _{i}*, *V _{i}*); that is, the manifest variables are exogenous. For ease of exposition, we take (*X _{i}*, *Z _{i}*) as bivariate normal, although that is not essential. Until further notice, we set the means to 0, the variances to 1, the correlation between *X _{i}* and *Z _{i}* to 0.40, and the sample size *n* to 1000.

## A Probit Response Model with an Endogenous Regressor

There are two equations in the model. The first is the selection equation:

$$C_i = 1 \text{ if } a + bX_i + U_i > 0; \qquad C_i = 0 \text{ otherwise.} \quad (1)$$

Here, *C _{i}* = 1 means that subject *i* self-selects into treatment. The second equation defines the subject's response to treatment:

$$Y_i = 1 \text{ if } c + dZ_i + eC_i + V_i > 0; \qquad Y_i = 0 \text{ otherwise.} \quad (2)$$

Notice that *Y _{i}* is binary rather than continuous. The data are the observed values of *X _{i}*, *Z _{i}*, *C _{i}*, *Y _{i}*. For example, the treatment variable *C _{i}* may indicate whether subject *i* graduated from college; the response *Y _{i}*, whether *i* has a full-time job.
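The data-generating process just described can be sketched in Python (the paper's own computations used R and Stata); the parameter values below are the Table 1 defaults, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
a, b = 0.50, 1.0            # selection equation (1)
c, d, e = -1.0, 0.75, 0.50  # response equation (2)
rho = 0.60                  # correlation between latents U, V

# Exogenous regressors: bivariate normal, mean 0, variance 1, correlation 0.40.
X, Z = rng.multivariate_normal([0, 0], [[1, 0.4], [0.4, 1]], size=n).T

# Latent variables: bivariate normal, mean 0, variance 1, correlation rho.
U, V = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T

# Selection equation (1): C_i = 1 if a + b*X_i + U_i > 0.
C = (a + b * X + U > 0).astype(int)

# Response equation (2): Y_i = 1 if c + d*Z_i + e*C_i + V_i > 0.
Y = (c + d * Z + e * C + V > 0).astype(int)

print(C.mean())  # roughly 0.64, as noted below
```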

Endogeneity bias is likely in (2). Indeed, *C _{i}* is endogenous due to the correlation ρ between the latent variables *U _{i}* and *V _{i}*. A two-step correction for endogeneity is sometimes used (although it should not be):

Step 1. Estimate the probit model (1) by likelihood techniques.

Step 2. To estimate (2), fit the expanded probit model

$$P(Y_i = 1 \mid X_i, Z_i, C_i) = \Phi(c + dZ_i + eC_i + fM_i) \quad (3)$$

to the data, where

$$M_i = \frac{\phi(a + bX_i)}{\Phi(a + bX_i)} \text{ if } C_i = 1, \qquad M_i = -\frac{\phi(a + bX_i)}{1 - \Phi(a + bX_i)} \text{ if } C_i = 0. \quad (4)$$

In practice, *a* and *b* in (4) would be unknown. These parameters are replaced by maximum likelihood estimates (MLEs) obtained from Step 1. The motivation for *M _{i}* is explained in Section 6 below. Identifiability is discussed in Section 7: according to Proposition 1, parameters are identifiable unless *b* = *d* = 0.
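A minimal sketch of the two steps, assuming the model of (1–2); `fit_probit` is a hypothetical helper that maximizes the probit log likelihood with scipy, not the paper's R or Stata code:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_probit(y, design):
    """Probit MLE: minimize the negative log likelihood by BFGS."""
    def negloglik(beta):
        p = np.clip(norm.cdf(design @ beta), 1e-10, 1 - 1e-10)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return minimize(negloglik, np.zeros(design.shape[1]), method="BFGS").x

# Synthetic data with the Table 1 parameter values (seed is arbitrary).
rng = np.random.default_rng(1)
n = 1000
X, Z = rng.multivariate_normal([0, 0], [[1, 0.4], [0.4, 1]], size=n).T
U, V = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n).T
C = (0.5 + X + U > 0).astype(int)
Y = (-1.0 + 0.75 * Z + 0.5 * C + V > 0).astype(int)

# Step 1: probit of C on X yields (a_hat, b_hat).
a_hat, b_hat = fit_probit(C, np.column_stack([np.ones(n), X]))

# Step 2: form M_i as in (4) and fit the expanded probit of Y on (1, Z, C, M).
lin = a_hat + b_hat * X
M = np.where(C == 1,
             norm.pdf(lin) / norm.cdf(lin),
             -norm.pdf(lin) / (1 - norm.cdf(lin)))
c2, d2, e2, f2 = fit_probit(Y, np.column_stack([np.ones(n), Z, C, M]))
```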

The operating characteristics of the two-step correction were determined in a simulation study that drew 500 independent samples of size *n* = 1000. Each sample was constructed as described above. We set *a* = 0.50, *b* = 1, and ρ = 0.60. These choices create an environment favorable to correction.

Endogeneity is moderately strong: ρ = 0.60. So there should be some advantage to removing endogeneity bias. The dummy variable *C _{i}* is 1 with probability about 0.64, so it has appreciable variance. Furthermore, half the variance on the right hand side of (1) can be explained: var(*bX _{i}*) = var(*U _{i}*). The correlation between the regressors is only 0.40: making that correlation higher exposes the correction to well-known instabilities.

The sample is large: *n* = 1000. Regressors are exogenous by construction. Subjects are independent and identically distributed. Somewhat arbitrarily, we set the true value of *c* in the response equation (2) to −1, whereas *d* = 0.75 and *e* = 0.50. As it turned out, these choices were favorable too.

Table 1 summarizes results for three kinds of estimates:

- raw (ignoring endogeneity);
- the two-step correction; and
- full maximum likelihood.

For each kind of estimate and each parameter, the table reports the mean of the estimates across the 500 repetitions. Subtracting the true value of the parameter measures the bias in the estimator. Similarly, the standard deviation (SD) across the repetitions, also shown in the table, measures the likely size of the random error.

**Table 1**

| | *c* | *d* | *e* | ρ |
| --- | --- | --- | --- | --- |
| True values | −1.0000 | 0.7500 | 0.5000 | 0.6000 |
| Raw estimates: Mean | −1.5901 | 0.7234 | 1.3285 | |
| Raw estimates: SD | 0.1184 | 0.0587 | 0.1276 | |
| Two-step: Mean | −1.1118 | 0.8265 | 0.5432 | |
| Two-step: SD | 0.1581 | 0.0622 | 0.2081 | |
| MLE: Mean | −0.9964 | 0.7542 | 0.4964 | 0.6025 |
| MLE: SD | 0.1610 | 0.0546 | 0.1899 | 0.0900 |


*Notes*. Correcting endogeneity bias when the response is binary probit. There are 500 repetitions. The sample size is 1000. The correlation between latents is ρ = 0.60. The parameters in the selection equation (1) are set at *a* = 0.50 and *b* = 1. The parameters in the response equation (2) are set at *c* = –1, *d* = 0.75, and *e* = 0.50. The response equation includes the endogenous dummy *C _{i}* defined by (1). The correlation between the exogenous regressors is 0.40. MLE computed by VGAM 0.7-6.

The “raw estimates” in Table 1 are obtained by fitting the probit model

$$P(Y_i = 1 \mid Z_i, C_i) = \Phi(c + dZ_i + eC_i),$$

which simply ignores the endogeneity of *C _{i}*. The two-step estimates are obtained via (3–4), with $\hat{a}$ and $\hat{b}$ obtained by fitting (1). We focus on *d* and *e*, as the parameters in equation (2) that may be given causal interpretations. Without correction, $\hat{d}$ averages about 0.72; with correction, 0.83 (see Table 1). Correction doubles the bias. Without correction, $\hat{e}$ averages 1.33; with correction, 0.54. Correction helps a great deal, but some bias remains.

With the two-step correction, the SD of $\hat{e}$ is about 0.21. Thus, random error in the estimates is appreciable, even with *n* = 1000. On the other hand, the standard error (SE) across the 500 repetitions is about $0.21/\sqrt{500} \approx 0.01$. The bias in the two-step estimates cannot be explained in terms of random error in the simulation: increasing the number of repetitions will not make any appreciable change in the estimated biases.
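The Monte Carlo standard error here is simple arithmetic, sketched under the stated values (SD ≈ 0.21 across 500 repetitions):

```python
import math

# Monte Carlo standard error of an estimated mean: SE = SD / sqrt(repetitions).
sd_e_hat = 0.21      # SD of the two-step estimate across repetitions (Table 1)
n_reps = 500
se = sd_e_hat / math.sqrt(n_reps)
print(round(se, 4))  # 0.0094
```

So biases of order 0.05 or more cannot be attributed to simulation noise.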

Heckman (1978) also suggested the possibility of fitting the full model—equations (1) and (2)—by maximum likelihood. The full model is a “bivariate probit” or “biprobit” model. Results are shown in the last two lines of Table 1. The MLE is essentially unbiased. The MLE is better than the two-step correction, although random error remains a concern.

We turn to some variations on the setup described in Table 1. The simulations reported there generated new versions of the regressors on each repetition. Freezing the regressors makes almost no difference in the results: SDs would be smaller in the third decimal place.

The results in Table 1 depend on ρ, the correlation between the latent variables in the selection equation and the response equation. If ρ is increased from 0.60 to 0.80, say, the performance of the two-step correction is substantially degraded. Likewise, increasing the correlation between the exogenous regressors degrades the performance.

When ρ = 0.80 and the correlation between the regressors is 0.60, the bias in the two-step correction (3–4) for $\hat{d}$ is about 0.15; for $\hat{e}$, about 0.20. Figure 1 plots the bias in $\hat{d}$ against ρ, with the correlation between regressors set at 0.40 or 0.60, other parameters being fixed at their values for Table 1. The wiggles in the graph reflect variance in the Monte Carlo (there are “only” 500 replicates). The MLE is less sensitive to increasing correlations (data not shown).

Results are also sensitive to the distribution of the exogenous regressors. As the variance in the regressors goes down, bias goes up—in the two-step estimates and in the MLE. Furthermore, numerical issues become acute. There is some explanation: dividing the SD of *X* by 10, say, is equivalent to dividing *b* by 10 in equation (1); similarly for *Z* and *d* in equation (2). For small values of *b* and *d*, parameters are barely identifiable.

Figure 2 plots the bias in $\hat{d}$ against the common SD of *X* and *Z*, which is set to values ranging from 0.1 to 1.0 (other parameters are set as in Table 1). The light line represents the MLE. Some of the “bias” in the MLE is indeed small-sample bias: when the SD is 0.1, a sample with *n* = 1000 is a small sample. Some of the bias, however, reflects a tendency of likelihood maximizers to quit before finding the global maximum.

The heavy line represents the two-step correction. (With an SD of 0.1, data for the two-step correction are not shown, because there are huge outliers; even the median bias is quite changeable from one set of 500 repetitions to another, but 0.2 may be a representative figure.) Curiously, the two-step correction is better than the MLE when the SD of the exogenous regressors is set to 0.2 or to 0.3. This is probably due to numerical issues in maximizing the likelihood functions.

We believe the bias in the two-step correction (Figs 1 and 2) reflects the operating characteristics of the estimator, rather than those of the software. Beyond an SD of 1.0, the bias in the MLE seems to be negligible. Beyond an SD of 1.5, the bias in the two-step estimator for *e* is minimal, but *d* continues to be a little problematic.

As noted above, changing the scale of *X* is equivalent to changing *b*. Similarly, changing the scale of *Z* is equivalent to changing *d* (see equations (1) and (2)). Thus, in Fig. 2, we could leave the SDs at 1 and run through a series of (*b*, *d*) pairs scaled down from the initial choices *b*_{0} = 1 and *d*_{0} = 0.75 for Table 1.

The number of regressors should also be considered. With a sample size of 1000, practitioners would often use a substantial number of covariates. Increasing the number of regressors is likely to have a negative impact on performance.

## A Probit Model with Endogenous Sample Selection

Consider next the situation where a probit model is fitted to a sample, but subjects self-select into the sample by an endogenous process. The selection equation is

$$C_i = 1 \text{ if } a + bX_i + U_i > 0; \qquad C_i = 0 \text{ otherwise.} \quad (5)$$

The response equation is

$$Y_i = 1 \text{ if } c + dZ_i + V_i > 0; \qquad Y_i = 0 \text{ otherwise,} \quad (6)$$

where *Y _{i}* and *Z _{i}* are observed only when *C _{i}* = 1. Thus, the data are the observed values of (*X _{i}*, *C _{i}*) for all *i*, as well as (*Z _{i}*, *Y _{i}*) when *C _{i}* = 1. When *C _{i}* = 0, however, *Z _{i}* and *Y _{i}* remain unobserved. Notice that *Y _{i}* is binary rather than continuous. Notice too that *C _{i}* is omitted from (6); indeed, when (6) can be observed, *C _{i}* ≡ 1.

Fitting (6) to the observed data raises the question of endogeneity bias. Sample subjects have relatively high values of *U _{i}*; hence, high values of *V _{i}*. (This assumes ρ > 0.) Again, there is a proposed solution that involves two steps.

Step 1. Estimate the probit model (5) by likelihood techniques.

Step 2. Fit the expanded probit model

$$P(Y_i = 1 \mid X_i, Z_i, C_i = 1) = \Phi(c + dZ_i + fM_i)$$

to the data for subjects *i* with *C _{i}* = 1. This time, *M _{i}* = φ(*a* + *bX _{i}*)/Φ(*a* + *bX _{i}*), with *a* and *b* replaced by the estimates from Step 1.
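This time the correction term uses only one branch of the inverse Mills ratio, since the fit is restricted to selected subjects. A small Monte Carlo check of the underlying conditional-expectation identity, with illustrative values:

```python
import numpy as np
from scipy.stats import norm

# Among selected subjects, E(U | X = x, C = 1) = phi(a + b*x) / Phi(a + b*x):
# the inverse Mills ratio evaluated at the selection index.
def mills_selected(a, b, x):
    lin = a + b * x
    return norm.pdf(lin) / norm.cdf(lin)

# Monte Carlo check at one x value (a = 0.5, b = 1 as in Table 2; x and the
# seed are arbitrary).
rng = np.random.default_rng(2)
a, b, x = 0.5, 1.0, 0.0
u = rng.standard_normal(1_000_000)
selected = a + b * x + u > 0       # C = 1 iff a + b*x + U > 0
empirical = u[selected].mean()
print(empirical, mills_selected(a, b, x))
```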

**Table 2**

| | *c* | *d* | ρ |
| --- | --- | --- | --- |
| True values | −1.0000 | 0.7500 | 0.6000 |
| Raw estimates: Mean | −0.7936 | 0.7299 | |
| Raw estimates: SD | 0.0620 | 0.0681 | |
| Two-step: Mean | −1.0751 | 0.8160 | |
| Two-step: SD | 0.1151 | 0.0766 | |
| MLE: Mean | −0.9997 | 0.7518 | 0.5946 |
| MLE: SD | 0.0757 | 0.0658 | 0.1590 |


*Notes*. Correcting endogeneity bias in sample selection when the response is binary probit. There are 500 repetitions. The sample size is 1000. The correlation between latents is ρ = 0.60. The parameters in the selection equation (5) are set at *a* = 0.50 and *b* = 1. The parameters in the response equation (6) are set at *c* = −1 and *d* = 0.75. Response data are observed only when *C _{i}* = 1, as determined by the selection equation. This will occur for about 64% of the subjects. The correlation between the exogenous regressors is 0.40. MLE computed using *Stata* 9.2.

Increasing the sample size from 1000 to 5000 in the simulations barely changes the averages, but reduces the SDs by a factor of about $\sqrt{5} \approx 2.2$, as might be expected. This comment applies both to Tables 1 and 2 (data not shown), but not to the MLE results in Table 2: increasing *n* would have made the *Stata* code prohibitively slow to run.

Many applications of Heckman's method feature a continuous response variable rather than a binary variable. Here, the two-step correction is on firmer ground, and parallel simulations (data not shown) indicate that the correction removes most of the endogeneity bias when the parameters are set as in Tables 1 and 2. However, residual bias is large when the SD of the regressors is set to 0.1 and the sample size is “only” 1000; the issues resolve when *n* = 10,000. The problem with *n* = 1000 is created by (1) large random errors in the first-step estimates, coupled with (2) poorly conditioned design matrices. In more complicated situations, there may be additional problems.

## Numerical Issues

Exploratory computations were done in several versions of *Matlab*, *R*, and *Stata*. In the end, to avoid confusion and chance capitalization, we redid the computations in a more unified way: *R* 2.7 for the raw estimates and the two-step correction; VGAM 0.7-6 for the MLE in (1–2); and *Stata* 9.2 for the MLE in (5–6). Why do we focus on the behavior of *R* and *Stata*? *R* is widely used in the statistical community, and *Stata* is almost the lingua franca of quantitative social scientists.

Let *b*_{0} and *d*_{0} be the default values of *b* and *d*, namely, 1 and 0.75. As *b* and *d* decreased from the defaults, VGAM in *R* handled the maximization less and less well (Fig. 2). We believe VGAM had problems computing the Hessian, even for the base case in Table 1: its internally generated SEs were too small by a factor of about 2.

By way of counterpoint, *Stata* did somewhat better when we used it to redo the MLE in (1–2). However, if we multiply the default *b*_{0} and *d*_{0} by 0.3 or 0.4, bias in *Stata* becomes noticeable. If we multiply by 0.1 or 0.2, many runs fail to converge, and the runs that do converge produce aberrant estimates, particularly for a multiplier of 0.1. For multipliers of 0.2 to 0.4, the resulting bias is upwards in *R* but downwards in *Stata*. In Table 2, *Stata* did well. However, if we scale *b*_{0} and *d*_{0} by 0.1 or 0.2, *Stata* has problems. In defense of *R* and *Stata*, we can say that they produce abundant warning messages when they get into difficulties.

In multidimensional problems, even the best numerical analysis routines can find spurious maxima for the likelihood function. Our models present three kinds of problems: (1) flat spots on the log likelihood surface, (2) ill-conditioned maxima, where the eigenvalues of the Hessian are radically different in size, and (3) ill-conditioned saddle points with one small positive eigenvalue and several large negative eigenvalues. The maximizers in VGAM and *Stata* simply give up before finding anything like the maximum of the likelihood surface. This is a major source of the biases reported above.

The model defined by (1–2) is a harder challenge for maximum likelihood than (5–6), due to the extra parameter *e*. Our computations suggest that most of the difficulty lies in the joint estimation of three parameters, *c*, *e*, ρ. Indeed, we can fix *a*, *b*, *d* at the default values for Table 1 and maximize the likelihood over the remaining three parameters *c*, *e*, ρ. VGAM and *Stata* still have convergence issues. The problems are the same as with six parameters. For example, we found a troublesome sample where the Hessian of the log likelihood had eigenvalues 4.7, −1253.6, −2636.9. (We parameterize the correlation between the latents by log(1 + ρ) − log(1 − ρ) rather than ρ, since that is how binom2.rho in VGAM does things.)
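The reparameterization just mentioned can be written as a pair of inverse transforms; this is a standard Fisher-type transform, sketched here rather than taken from the VGAM source:

```python
import math

# Map the constrained correlation rho in (-1, 1) to an unconstrained
# theta = log(1 + rho) - log(1 - rho), as binom2.rho in VGAM does, so the
# maximizer can work on the whole real line.
def theta_from_rho(rho):
    return math.log(1 + rho) - math.log(1 - rho)

def rho_from_theta(theta):
    return (math.exp(theta) - 1) / (math.exp(theta) + 1)

theta = theta_from_rho(0.60)
print(theta, rho_from_theta(theta))  # log(4) = 1.386..., then 0.6 again
```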

One of us has developed an improved likelihood maximizer called GENOUD (Sekhon and Mebane 1998). GENOUD seems to do much better at the maximization, and its internally generated SEs are reasonably good. Results for GENOUD and *Stata* are not reported here but are available from the authors.

## Implications for Practice

There are two main conclusions from the simulations and the analytic results.

First, under ordinary circumstances, the two-step correction should not be used in probit response models. In some cases, the correction will reduce bias, but in many other cases, the correction will increase bias.

Second, if the bivariate probit model is used, special care should be taken with the numerics. Conventional likelihood maximization algorithms can produce estimates that are far away from the MLE. Even if the MLE has good operating characteristics, the “MLE” found by the software package may not. Results from VGAM 0.7-6 should be treated with caution. Results from *Stata* 9.2 may be questionable for various combinations of parameters.

The models analyzed here are very simple, with one covariate in each of (1–2) and (5–6). In real examples, the number of covariates may be quite large, and numerical behavior will be correspondingly more problematic.

Of course, there is a question more salient than the numerics: what is it that justifies probit models and the like as descriptions of behavior? For additional discussion, see Freedman (2005), which has further cites to the literature on this point.

## Motivating the Estimator

Consider (1–2). We can represent *V _{i}* as $\rho U_i + \sqrt{1 - \rho^2}\, W_i$, where *W _{i}* is an *N*(0, 1) random variable, independent of *U _{i}*. Then

$$P\{C_i = 1 \mid X_i = x\} = P\{U_i > -a - bx\} = P\{U_i < a + bx\} = \Phi(a + bx).$$

Likewise, $E\{U_i \mid X_i, C_i\} = M_i$ as in (4), so $E\{V_i - \rho M_i \mid X_i, C_i\} = 0$. If (2) were a linear regression equation, then OLS estimates would be unbiased, the coefficient of *M _{i}* being nearly ρ. (These remarks take *a* and *b* as known, with the variance of the error term in the linear regression normalized to 1.) However, (2) is not a linear regression equation: (2) is a probit model. That is the source of the problem.
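The key identity E{V − ρM | X, C} = 0 can be checked by Monte Carlo at one fixed x; the values below are illustrative, not from the paper:

```python
import numpy as np
from scipy.stats import norm

# Check E{V - rho*M | X = x, C} = 0 on both branches of the correction term
# in (4). a, b, rho, x, and the seed are arbitrary illustrative choices.
rng = np.random.default_rng(3)
a, b, rho, x = 0.5, 1.0, 0.6, 0.3
N = 2_000_000
U = rng.standard_normal(N)
V = rho * U + np.sqrt(1 - rho**2) * rng.standard_normal(N)

lin = a + b * x
C = U > -lin                                   # selection at X = x
M1 = norm.pdf(lin) / norm.cdf(lin)             # M when C = 1
M0 = -norm.pdf(lin) / (1 - norm.cdf(lin))      # M when C = 0

print(V[C].mean() - rho * M1)    # near 0
print(V[~C].mean() - rho * M0)   # near 0
```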

## Identifiability

Identifiability means that parameters are determined by the joint distribution of the observables; parameters that are not identifiable cannot be estimated. In the model defined by (1–2), the parameters are *a*, *b*, *c*, *d*, *e* and the correlation ρ between the latents; the observables are *X _{i}*, *Z _{i}*, *C _{i}*, and *Y _{i}*. In the model defined by (5–6), the parameters are *a*, *b*, *c*, *d* and the correlation ρ between the latents; the observables are *X _{i}*, *C _{i}*, $Z_i^*$, and $Y_i^*$, where $Z_i^* = Z_i$ and $Y_i^* = Y_i$ when *C _{i}* = 1, whereas $Z_i^* = Y_i^* = \mathcal{M}$ when *C _{i}* = 0. Here, ℳ is just a special symbol that denotes “missing.”

_{i}Results are summarized as Propositions 1 and 2. The statements involve the sign of *d*, which is +1 if *d* > 0, 0 if *d* = 0, and −1 if *d* < 0. Since subjects are independent and identically distributed, only *i* = 1 need be considered. The variables (*X*_{1}, *Z*_{1}) are taken as bivariate normal, with a correlation strictly between −1 and +1. This assumption is discussed below.

**Proposition 1.** Consider the model defined by (1–2). The parameters *a* and *b* in (1) are identifiable, and the sign of *d* in (2) is identifiable. If *b* ≠ 0, the parameters *c*, *d*, *e*, ρ in (2) are identifiable. If *b* = 0 but *d* ≠ 0, the parameters *c*, *d*, *e*, ρ are still identifiable. However, if *b* = *d* = 0, the remaining parameters *c*, *e*, ρ are not identifiable.

**Proposition 2.** Consider the model defined by (5–6). The parameters *a* and *b* in (5) are identifiable, and the sign of *d* in (6) is identifiable. If *b* ≠ 0, the parameters *c*, *d*, ρ in (6) are identifiable. If *b* = 0 but *d* ≠ 0, the parameters *c*, *d*, ρ are still identifiable. However, if *b* = *d* = 0, the remaining parameters *c*, ρ are not identifiable.

*Proof of Proposition 1.* Clearly, the joint distribution of *C*_{1} and *X*_{1} determines *a* and *b*, so we may consider these as given. The distributions of *X*_{1} and *Z*_{1} are determined (this is not so helpful). We can take the conditional distribution of *Y*_{1} given *X*_{1} = *x* and *Z*_{1} = *z* as known. In other words, suppose (*U*, *V*) are bivariate normal with mean 0, variance 1, and correlation ρ. The joint distribution of the observables determines *a*, *b*, and two functions ψ_{0}, ψ_{1} of *x*, *z*:

$$\psi_0(x, z) = P(U < -a - bx,\; V > -c - dz), \quad (10)$$

$$\psi_1(x, z) = P(U > -a - bx,\; V > -c - dz - e). \quad (11)$$

These are the chances, given *X*_{1} = *x* and *Z*_{1} = *z*, that *Y*_{1} = 1 with *C*_{1} = 0 or *C*_{1} = 1, respectively.

Fix *x* at any convenient value, and consider *z* > 0. Then $z \mapsto \psi_0(x, z)$ is strictly decreasing, constant, or strictly increasing, according as *d* < 0, *d* = 0, or *d* > 0. The sign of *d* is therefore determined. The rest of the proof, alas, consists of a series of cases.
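This monotonicity claim is easy to check numerically. The sketch below uses our reading of ψ_0, namely ψ_0(x, z) = P(U < −a − bx, V > −c − dz), with illustrative parameter values:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# psi_0(x, z) = P(U < u, V > v) with u = -a - b*x and v = -c - d*z, computed
# from the bivariate normal CDF. For fixed x, psi_0 should be strictly
# increasing in z when d > 0 and strictly decreasing when d < 0.
def psi0(x, z, a, b, c, d, rho):
    u, v = -a - b * x, -c - d * z
    bvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    return norm.cdf(u) - bvn.cdf(np.array([u, v]))  # P(U < u) - P(U < u, V < v)

a, b, c, rho, x = 0.5, 1.0, -1.0, 0.6, 0.2
zs = np.linspace(0.0, 1.5, 4)
up = np.array([psi0(x, z, a, b, c, 0.75, rho) for z in zs])
down = np.array([psi0(x, z, a, b, c, -0.75, rho) for z in zs])
print(np.all(np.diff(up) > 0), np.all(np.diff(down) < 0))  # True True
```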

The case *b* ≠ 0 and *d* > 0: Let *u* = −*a* − *bx*, *v* = −*z*, ξ = *U*, and ζ = (*V* + *c*)/*d*. Then (ξ, ζ) are bivariate normal, with unknown correlation ρ. We know ξ has mean 0 and variance 1. The mean and variance of ζ are unknown, being *c*/*d* and 1/*d*^{2}, respectively. But

$$\psi_0(x, z) = P(\xi < u,\; \zeta > v) \quad (12)$$

is known for all (*u*, *v*). Does this determine ρ, *c*, *d*? Plainly so, because (12) determines the joint distribution of ξ, ζ. We can then compute ρ, $d = 1/\sqrt{\operatorname{var}(\zeta)}$, and *c* = *dE*(ζ). Finally, ψ_{1} in (11) determines *e*. This completes the argument for the case *b* ≠ 0 and *d* > 0.

The case *b* ≠ 0 and *d* < 0 is the same, except that $d = -1/\sqrt{\operatorname{var}(\zeta)}$.

The case *b* ≠ 0 and *d* = 0: Here, we know

$$\psi_0(x, z) = P(U < u,\; V > -c) \text{ for all real } u, \quad (13)$$

where *u* = −*a* − *bx*. Let *u* → ∞: the marginal distribution of *V* determines *c*. Furthermore, from (13), we can compute *P*(*V* > −*c* | *U* = *u*) for all *u*. Given *U* = *u*, we know that *V* is distributed as $\rho u + \sqrt{1 - \rho^2}\, W$, where *W* is *N*(0, 1). If ρ = ±1, the conditional distribution of *V* given *U* = *u* is degenerate, and *P*(*V* > −*c* | *U* = *u*) is 0 or 1 for every *u*; this case can be recognized. Otherwise,

$$P(V > -c \mid U = u) = \Phi\!\left(\frac{c + \rho u}{\sqrt{1 - \rho^2}}\right). \quad (14)$$

Vary *u* to see that (14) determines $\rho/\sqrt{1 - \rho^2}$. This is a 1–1 function of ρ. Thus, ρ can be determined, and then *c*; finally, *e* is obtained from ψ_{1} in (11). This completes the argument for the case *b* ≠ 0 and *d* = 0.

The case *b* = 0 and *d* > 0: As above, let *W* be independent of *U* and *N*(0, 1); represent *V* as $\rho U + \sqrt{1 - \rho^2}\, W$. Let *G* = {*U* < −*a*}. From ψ_0 and *a*, we compute

$$P\!\left(\frac{\rho}{d}\, U_a + \frac{\sqrt{1 - \rho^2}}{d}\, W + \frac{c}{d} > -z\right) = \frac{\psi_0(x, z)}{\Phi(-a)}, \quad (15)$$

where *U _{a}* is distributed as *U* conditioned so that *U* < −*a*. The right hand side of (15), as a function of *z*, determines the distribution function of the sum of three terms: two independent random variables, (ρ/*d*)*U _{a}* and $(\sqrt{1 - \rho^2}/d)\, W$, where *W* is standard normal, plus the constant *c*/*d*. This distribution is therefore known, although it depends on the three unknowns, *c*, *d*, ρ.

Write Λ for the log Laplace transform of *U _{a}*. This is a known function. Now compute the log Laplace transform of the distribution in (15). This is

$$t \mapsto \Lambda\!\left(\frac{\rho}{d}\, t\right) + \frac{1 - \rho^2}{2 d^2}\, t^2 + \frac{c}{d}\, t, \quad (16)$$

where *c*, *d*, and ρ are unknown. Consider the expansion of (16) as a power series near 0, of the form κ_{1}*t* + κ_{2}*t*^{2}/2! + κ_{3}*t*^{3}/3! + ⋯. The κ's are the “cumulants” or “semi-invariants” of the distribution in (15). These are known quantities because the function in (16) is known: κ_{1} is the mean of the distribution given by (15), whereas κ_{2} is the variance, and κ_{3} is the central third moment.

Of course, Λ′(0) = *E*(*U _{a}*) = −φ(−*a*)/Φ(−*a*). Thus, κ_{1} = (ρ/*d*)Λ′(0) + *c*/*d*. Next, Λ″(0) = var(*U _{a}*), so κ_{2} = (ρ/*d*)^{2} var(*U _{a}*) + (1 − ρ^{2})/*d*^{2}. Finally, Λ′′′(0) is the third central moment of *U _{a}*, and κ_{3} = (ρ/*d*)^{3} Λ′′′(0). Since *U _{a}* has a skewed distribution, Λ′′′(0) ≠ 0. We can compute (ρ/*d*)^{3} from κ_{3}, and then ρ/*d*. Next, we get 1/*d*^{2} from κ_{2}, and then 1/*d*. (We are looking at the case *d* > 0.) Finally, *c* comes from κ_{1}. Thus, *c*, *d*, ρ are determined, and *e* comes from ψ_{1} in (11). This completes the argument for the case *b* = 0 and *d* > 0.
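The inversion from cumulants to parameters can be carried out numerically. The sketch below builds the population cumulants of the sum in (15) from Λ and then recovers (*c*, *d*, ρ); the finite-difference step size is our choice, and the formula for Λ follows from completing the square:

```python
import numpy as np
from scipy.stats import norm

# Cumulant argument for the case b = 0, d > 0. Lambda is the log Laplace
# transform of U_a (U conditioned on U < -a):
#   Lambda(t) = t^2/2 + log Phi(-a - t) - log Phi(-a).
a, c, d, rho = 0.5, -1.0, 0.75, 0.6

def Lam(t):
    return 0.5 * t**2 + norm.logcdf(-a - t) - norm.logcdf(-a)

# Lambda'(0), Lambda''(0), Lambda'''(0) by central finite differences.
h = 1e-3
L1 = (Lam(h) - Lam(-h)) / (2 * h)
L2 = (Lam(h) - 2 * Lam(0.0) + Lam(-h)) / h**2
L3 = (Lam(2*h) - 2*Lam(h) + 2*Lam(-h) - Lam(-2*h)) / (2 * h**3)

# Population cumulants of (rho/d)*U_a + (sqrt(1-rho^2)/d)*W + c/d, as in (16).
r = rho / d
k1 = r * L1 + c / d
k2 = r**2 * L2 + (1 - rho**2) / d**2
k3 = r**3 * L3

# Invert, as in the proof: kappa_3 gives rho/d; then kappa_2 gives 1/d;
# then kappa_1 gives c.
r_hat = np.cbrt(k3 / L3)
d_hat = 1 / np.sqrt(k2 - r_hat**2 * (L2 - 1))
rho_hat = r_hat * d_hat
c_hat = d_hat * (k1 - r_hat * L1)
print(round(c_hat, 3), round(d_hat, 3), round(rho_hat, 3))  # -1.0 0.75 0.6
```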

The case *b* = 0 and *d* < 0 follows by the same argument.

The case *b* = *d* = 0: The three remaining parameters, *c*, *e*, and ρ, are not identifiable. For simplicity, take *a* = 0, although this is not essential. The joint distribution of the observables is determined by

$$P(U < 0,\; V > -c) = \alpha \quad (17)$$

and

$$P(U > 0,\; V > -c - e) = \beta. \quad (18)$$

These two equations do not pin down *c*, *e*, ρ. Choose any particular ρ with −1 ≤ ρ ≤ 1. Choose *c* so that (17) holds and *e* so that (18) holds. The upshot: there are infinitely many *c*, *e*, ρ triplets yielding the same joint distribution for the observables. This completes the argument for the case *b* = *d* = 0, and so for Proposition 1.

*Proof of Proposition 2.* Here, we know the joint distribution of (*X*_{1}, *C*_{1}), which determines *a*, *b*. We also know the joint distribution of (*X*_{1}, *Z*_{1}, *Y*_{1}) given *C*_{1} = 1; we do not know this joint distribution given *C*_{1} = 0. As in (11), suppose (*U*, *V*) are bivariate normal with mean 0, variance 1, and correlation ρ. The joint distributions of the observables determine *a*, *b* and the function

$$\psi_0(x, z) = P(U > -a - bx,\; V > -c - dz). \quad (19)$$

Most of the argument is the same as before, or even a little easier. We consider in detail only one case.

The case *b* = *d* = 0: The two remaining parameters, *c* and ρ, are not identifiable. Again, take *a* = 0. Fix any α with 0 < α < 1/2. Suppose

$$P(U > 0,\; V > -c) = \alpha. \quad (20)$$

This single equation does not pin down *c* and ρ: fix any ρ with −1 ≤ ρ ≤ 1 and solve (20) for *c*. There are infinitely many *c*, ρ pairs giving the same joint distribution for the observables when *b* = *d* = 0. This completes our discussion of Proposition 2.

### Remarks

1. The random variable *U _{a}* was defined in the course of proving Proposition 1. If desired, the moments of *U _{a}* can be obtained explicitly in terms of φ and Φ, using repeated integration by parts.

2. The Laplace transform of *U _{a}* is easily obtained by completing the square:

$$E\{e^{tU_a}\} = e^{t^2/2}\, \frac{\Phi(-a - t)}{\Phi(-a)}.$$

3. The argument for the case *b* = 0 and *d* > 0 in Proposition 1 is somewhat intricate, but it actually covers all values of *b*, whether zero or non-zero. The argument shows that for any particular real α, the values of *c*, *d*, ρ are determined by the number *P*(α + *U* < 0) and the function $z \mapsto P(\alpha + U < 0,\; c + dz + V > 0)$.

4. Likewise, the argument for the case *b* ≠ 0 and *d* = 0 proves more. If we know *P*(*U* < *u*) and *P*(*U* < *u* and γ + *V* > 0) for all real *u*, that determines γ and ρ.

5. In (17), for example, if α = 1/2, then ρ = −1; but *c* can be anywhere in the range [0, ∞).

6. The propositions can easily be extended to cover vector-valued exogenous variables.

7. Our proof of the propositions really does depend on the assumption of an imperfect correlation between *X _{i}* and *Z _{i}*. We hope to consider elsewhere the case where *Z _{i}* ≡ *X _{i}*. The assumption of normality is not material; it is enough if the joint distributions have full support, although positive densities are probably easier to think about.

8. The assumption of bivariate normality for the latent variables is critical. If this is wrong, estimates are likely to be inconsistent.

9. Suppose (*U*, *V*) are bivariate normal with correlation ρ, and −1 < ρ < 1. Then *P*(*U* < *u*, *V* < *v*) is, for each fixed (*u*, *v*), a continuous and strictly increasing function of ρ.
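The completed-square form of the transform in Remark 2 can be verified by simulation. The closed form used below, E{e^{tU_a}} = e^{t²/2} Φ(−a − t)/Φ(−a), is our derivation; *a* and *t* are illustrative:

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo check of E{exp(t*U_a)} = exp(t^2/2) * Phi(-a - t) / Phi(-a),
# where U_a is U conditioned on U < -a. The seed is arbitrary.
rng = np.random.default_rng(4)
a, t = 0.5, 0.7
u = rng.standard_normal(4_000_000)
ua = u[u < -a]                      # samples from U_a

empirical = np.exp(t * ua).mean()
exact = np.exp(t**2 / 2) * norm.cdf(-a - t) / norm.cdf(-a)
print(empirical, exact)             # the two values should agree closely
```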

## Some Relevant Literature

Cumulants are discussed by Rao (1973, 101). The ratio φ/Φ in (8) is usually called the “inverse Mills ratio,” in reference to Mills (1926)—although Mills tabulates [1 − Φ(*x*)]/φ(*x*) for *x* ≥ 0. Heckman (1978, 1979) proposes the use of *M _{i}* to correct for endogeneity and selection bias in the linear case, with a very clear explanation of the issues. He also describes potential use of the MLE. Meng and Schmidt (1985) discuss cases where the bivariate probit MLE is fragile. Rivers and Vuong (1988) propose an interesting alternative to the Heckman estimator. Their estimator (perhaps confusingly) is also called a two-step procedure. It seems most relevant when the endogenous variable is continuous; ours is binary.

For other estimation strategies and discussion, see Angrist (2001). Bhattacharya, Goldman, and McCaffrey (2006) discuss several “two-step” algorithms, including a popular Instrumental-Variables Least Squares regression estimator that turns out to be inconsistent; they do not seem to consider the particular two-step estimator of concern in our paper. Muthén (1979) discusses identifiability in a model with latent causal variables. The VGAM manual (Yee 2007) notes difficulties in computing standard errors. According to Stata (2005), its maximum likelihood routine “provides consistent, asymptotically efficient estimates for all the parameters in [the] models.”

Van de Ven and Van Praag (1981) found little difference between the MLE and the two-step correction; the difference doubtless depends on the model under consideration. Instabilities in the two-step correction are described by Winship and Mare (1992), Copas and Li (1997), and Briggs (2004), among others. For additional citations, see Dunning and Freedman (2007). Ono (2007) uses the two-step correction with probit response in a study of the Japanese labor market; *X* and *Z* are multidimensional. The sample size is 10,000, but only 300 subjects select into the treatment condition. Bushway, Johnson, and Slocum (2007) describe many overenthusiastic applications of the two-step correction in the criminology literature: binary response variables are among the least of the sins.

We do not suggest that finding the true maximum of the likelihood function guarantees the goodness of the estimator, because there are situations where the MLE performs rather badly. Freedman (2007) has a brief review of the literature on this topic. However, we would suggest that spurious maxima are apt to perform even less well, particularly with the sort of models considered here.


## Author notes

*Authors’ note:* Derek Briggs, Allan Dafoe, Thad Dunning, Joe Eaton, Eric Lawrence, Walter Mebane, Jim Powell, Rocío Titiunik, and Ed Vytlacil made helpful comments. Errors and omissions remain the responsibility of the authors.