Self-selection or sample selection models are applied when the individuals in the sample are not randomly chosen from the population from which one would like to draw inferences. In the prototypical self-selection model, interest centers on estimating the parameters of a regression equation (often labeled ``outcome equation'') from observations on individuals who self-selected into the sample on the basis of a criterion that is correlated with the dependent variable of the outcome equation.
To illustrate, suppose we are interested in estimating the expected income of a randomly chosen individual if she were working as a lawyer. Computing the average income of those who actually work as lawyers is likely to produce an upward-biased estimate, since those observed as lawyers probably chose their profession because they have a talent for this line of work and expect to earn a relatively high income.
The solution to the self-selection problem proposed by Heckman (1979) is to specify and estimate a model of the self-selection decision. That is, Heckman's solution adds a ``decision equation'' to the outcome equation. Formally, the model consists of the following two equations:

$$y_i = x_i^\top \beta + u_{2i}, \qquad q_i = \mathbf{1}(z_i^\top \gamma + u_{1i} \geq 0), \qquad (12.16)$$

where the outcome $y_i$ is observed only for individuals with $q_i = 1$. The self-selection problem arises if $u_{1i}$ and $u_{2i}$ are correlated, i.e. if the (unobservable part of the) decision to select into the sample is correlated with the (unobservable part of the) outcome of interest.
In Heckman's (1979) classical solution to the problem it is assumed that $u_{1i}$ and $u_{2i}$ are jointly normally distributed:

$$\begin{pmatrix} u_{1i} \\ u_{2i} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} \right). \qquad (12.18)$$

Under (12.18), the decision equation is a Probit model (with the normalization $\sigma_1 = 1$, as in the simulation below),

$$P(q_i = 1 \mid z_i) = \Phi(z_i^\top \gamma), \qquad (12.19)$$

and the regression function of the observed outcomes acquires an additional term involving the inverse Mills ratio $\lambda(t) = \varphi(t)/\Phi(t)$:

$$E(y_i \mid x_i, z_i, q_i = 1) = x_i^\top \beta + \sigma_{12}\, \lambda(z_i^\top \gamma).$$

The quantlet heckman estimates the model by Heckman's two-step method: a first-step Probit fit of (12.19) yields $\widehat\gamma$, and a second-step OLS regression of $y_i$ on $x_i$ and $\lambda(z_i^\top \widehat\gamma)$, using the selected observations only, yields estimates of $\beta$ and $\sigma_{12}$. It returns the second-step coefficients $\widehat\beta$ (b), the coefficient on the inverse Mills ratio $\widehat\sigma_{12}$ (s), and the first-step Probit coefficients $\widehat\gamma$ (g).
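The same two-step logic is easy to write down outside XploRe. The following Python fragment is a minimal sketch, assuming numpy, scipy, and statsmodels; it re-implements the textbook estimator and is not the heckman quantlet itself, and the function name heckman_two_step is hypothetical:

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def heckman_two_step(x, y, z, q):
    """Textbook two-step estimates for the sample selection model.

    x, z : regressor arrays; y : outcome; q : 0/1 selection indicator."""
    # Step 1: Probit of the selection indicator on z gives gamma_hat.
    gamma_hat = sm.Probit(q, z).fit(disp=0).params
    # Inverse Mills ratio lambda(t) = phi(t)/Phi(t) at the estimated index.
    index = z @ gamma_hat
    mills = norm.pdf(index) / norm.cdf(index)
    # Step 2: OLS of y on x and the Mills ratio, selected observations only.
    sel = q == 1
    X2 = np.column_stack([x[sel], mills[sel]])
    coefs, *_ = np.linalg.lstsq(X2, y[sel], rcond=None)
    beta_hat, s12_hat = coefs[:-1], coefs[-1]  # Mills coefficient estimates sigma_12
    return beta_hat, s12_hat, gamma_hat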
We illustrate heckman with simulated data where the error terms in the decision and outcome equations are strongly correlated:
library("metrics") randomize(10178) n = 500 s1 = 1 s2 = 1 s12 = 0.7 ss = #(s1,s12)~#(s12,s2) ev = eigsm(ss) va = ev.values ve = ev.vectors ll = diag(va) ll = sqrt(ll) sh = ve*ll*ve' u = normal(n,2)*sh' z = 2*normal(n,2) g = #(1,2) q = (z*g+u[,1].>=0) x = matrix(n)~aseq(1, n ,0.25) b = #(-9, 1) y = x*b+u[,2] y = y.*(q.>0) heckit = heckman(x,y,z,q) heckit.b heckit.s heckit.g
Contents of b
[1,]  -8.9835
[2,]  0.99883
Contents of s
[1,]  0.77814
Contents of g
[1,]   1.0759
[2,]   2.1245

Since the data generation process satisfies the assumptions of the parametric self-selection model, it is not surprising that the estimates are close to the true parameter values $\beta = (-9, 1)^\top$, $\sigma_{12} = 0.7$, and $\gamma = (1, 2)^\top$.
The distributional assumption (12.18) of the parametric self-selection model is, more than anything else, made for convenience. If, however, (12.18) is violated, then the heckman estimator is not consistent. Hence, there is ample reason to develop consistent estimators for self-selection models with weaker distributional assumptions.
Powell (1987) considers a semiparametric self-selection model that combines the two-equation structure of (12.16) with the following weak assumption about the joint distribution of the error terms:

$$(u_{1i}, u_{2i}) \text{ is independent of } (x_i, z_i), \text{ with its joint distribution left unspecified.} \qquad (12.20)$$
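The practical content of (12.20) is that the selection bias enters the outcome regression only through the single index $z_i^\top\gamma$. A short derivation, assuming the form of (12.20) reconstructed above:

$$E(y_i \mid x_i, z_i, q_i = 1) = x_i^\top\beta + E(u_{2i} \mid x_i, z_i,\, u_{1i} \geq -z_i^\top\gamma) = x_i^\top\beta + E(u_{2i} \mid u_{1i} \geq -z_i^\top\gamma) = x_i^\top\beta + g(z_i^\top\gamma),$$

where the second equality uses the independence of $(u_{1i}, u_{2i})$ and $(x_i, z_i)$. In the parametric model, $g(t) = \sigma_{12}\,\varphi(t)/\Phi(t)$; here $g$ remains unknown.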
Note that for any two observations $i$ and $j$ with $z_i^\top\gamma = z_j^\top\gamma$ but $x_i \neq x_j$, we can difference out the unknown function $g(\cdot)$ by subtracting the regression functions for $i$ and $j$:

$$E(y_i \mid x_i, z_i, q_i = 1) - E(y_j \mid x_j, z_j, q_j = 1) = (x_i - x_j)^\top \beta.$$

Powell's estimator of the slope coefficients turns this idea into a kernel-weighted least squares regression of the pairwise differences $y_i - y_j$ on $x_i - x_j$, where a pair receives a large weight if the estimated indices of its two observations are close:

$$\widehat\beta = \left[ \sum_{i<j} K\!\left(\frac{\widehat I_i - \widehat I_j}{h}\right)(x_i - x_j)(x_i - x_j)^\top \right]^{-1} \sum_{i<j} K\!\left(\frac{\widehat I_i - \widehat I_j}{h}\right)(x_i - x_j)(y_i - y_j). \qquad (12.21)$$

In (12.21), we tacitly assume that we have already obtained an estimate $\widehat I_i = z_i^\top\widehat\gamma$ of the index $z_i^\top\gamma$.
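The mechanics of (12.21) are easy to sketch. The following Python fragment is illustrative only, not the powell quantlet; the quartic kernel and the function name powell_slopes are my choices:

import numpy as np

def powell_slopes(x, y, index, h):
    """Kernel-weighted pairwise-difference estimator of the slopes in (12.21).

    x : n x k regressors (no constant); y : outcomes (selected sample);
    index : estimated first-step index; h : bandwidth."""
    k = x.shape[1]
    sxx = np.zeros((k, k))
    sxy = np.zeros(k)
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            t = (index[i] - index[j]) / h
            if abs(t) < 1:  # quartic (biweight) kernel has support [-1, 1]
                w = 15.0 / 16.0 * (1.0 - t * t) ** 2
                dx = x[i] - x[j]
                sxx += w * np.outer(dx, dx)
                sxy += w * dx * (y[i] - y[j])
    return np.linalg.solve(sxx, sxy)

The bandwidth h plays the same role as the second element of the bandwidth vector passed to the quantlet select below.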
Under assumption (12.20), we get a single index model for the decision equation in place of the Probit model (12.19) of the parametric case:

$$P(q_i = 1 \mid z_i) = G(z_i^\top \gamma),$$

where the link function $G$ is unknown. The index coefficients $\gamma$ can therefore be estimated (up to scale) by single index methods, such as the density-weighted average derivative estimator implemented in the quantlet dwade, which is used in the example below.
It should be pointed out that (12.21) does not produce an estimate of the constant term, because the constant term is eliminated by taking differences. An estimator of the constant term can be obtained in a third step by a procedure suggested by Andrews and Schafgans (1997). The idea is that for observations with a very large index $z_i^\top\gamma$, the selection probability is close to one and the bias term $g(z_i^\top\gamma)$ is close to zero, so that averaging the residuals $y_i - x_i^\top\widehat\beta$ over such observations estimates the constant term. Their estimator is, in its simplest form, defined as

$$\widehat\mu = \frac{\sum_{i=1}^n q_i\,(y_i - x_i^\top\widehat\beta)\;\mathbf{1}\{\widehat I_i > c\}}{\sum_{i=1}^n q_i\,\mathbf{1}\{\widehat I_i > c\}},$$

where $c$ is a threshold that selects the observations with the largest index values.
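A corresponding Python sketch of this third step (again illustrative, not the andrews quantlet; the threshold argument mirrors the first element of the bandwidth vector h used by select below):

import numpy as np

def intercept_estimate(x, y, index, beta_hat, threshold):
    """Average the slope residuals over observations with a large selection index."""
    resid = y - x @ beta_hat  # y_i - x_i' beta_hat on the selected sample
    keep = index > threshold  # threshold: e.g. an upper quantile of the index
    return resid[keep].mean()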
The quantlet select combines the second and third steps of the semiparametric estimation procedure. Both steps are also separately available in the quantlets powell and andrews. select takes as inputs the data for the outcome equation (x and y, where x must not contain a vector of ones), the vector id containing the estimated first-step index, and the bandwidth vector h. The first element of h is the threshold parameter used for estimating the intercept coefficient $\mu$, while the second element is the bandwidth used for estimating the slope coefficients $\beta$ according to (12.21).
We illustrate select with simulated data where, as before, the error terms in the decision and outcome equations are strongly correlated:
library("metrics") randomize(66666) n = 200 ss1 = #(1,0.9)~#(0.9,1) g = #(1) b = #(-9, 1) u = gennorm(n, #(0,0), ss1) ss2 = #(1,0.4)~#(0.4,1) xz = gennorm(n, #(0,0), ss2) z = xz[,2] q = (z*g+u[,1].>=0) hd = 0.1*(max(z) - min(z)) d = dwade(z,q,hd)*(2*sqrt(3)*pi) id = z*d h = (quantile(id, 0.7))|(0.2*(max(id)-min(id))) x = matrix(n)~xz[,1] y = x*b+u[,2] zz = paf(y~x~id, q) y = zz[,1] x = zz[,3:(cols(zz)-1)] id = zz[,cols(zz)] {a,b} = select(x,y,id,h) d~a~b
Contents of _tmp
[1,]  0.81368  -8.7852   1.0471

Note that these estimates (the first-step index coefficient, the intercept, and the slope) are quite close to the values of the corresponding population parameters $\gamma = 1$, $\mu = -9$, and $\beta = 1$. Yet, since the data generation process satisfies the assumptions of the parametric model, one could have obtained more efficient estimates using the parametric heckman estimator.