12.3 Self-Selection Models

Self-selection or sample selection models are applied when the individuals in the sample are not randomly chosen from the population from which one would like to draw inferences. In the prototypical self-selection model, interest centers on estimating the parameters of a regression equation (often labeled ``outcome equation'') from observations on individuals who self-selected into the sample on the basis of a criterion that is correlated with the dependent variable of the outcome equation.

To illustrate, suppose we are interested in estimating the expected income of a randomly chosen individual if she were working as a lawyer. Computing the average income of those who actually work as lawyers is likely to yield an upward-biased estimate, since those observed as lawyers probably chose their profession because they have a talent for this line of work and expect to earn a relatively high income.

The solution to the self-selection problem proposed by Heckman (1979) is to specify and estimate a model of the self-selection decision. That is, Heckman's solution adds a ``decision equation'' to the outcome equation. Formally, the model consists of the following two equations:

\begin{displaymath}
\begin{array}{lcll}
I^* & = & z^T\gamma + \delta & \qquad \textrm{decision equation},\\
Y^* & = & x^T\beta + \varepsilon & \qquad \textrm{outcome equation}.
\end{array}
\end{displaymath} (12.16)

Here $ I^*$ is the unobserved propensity to select into the sample, $ z$ a vector of observable explanatory variables and $ \delta$ an unobservable error term. In the outcome equation, all quantities have the usual meaning. Note, however, that we denote the dependent variable as $ Y^*$ because it is not the observed dependent variable but rather the ``potential'' dependent variable. In our lawyer example, $ Y^*$ is the potential income as a lawyer for a randomly chosen individual. $ Y^*$ is observed only for those who actually choose to be lawyers. Formally,

\begin{displaymath}
Y = Y^* \cdot {\boldsymbol{I}}(I^*>0) = \left\{\begin{array}{ll}
0 & \qquad \textrm{if}\quad I^*\le 0,\\
Y^* & \qquad \textrm{if}\quad I^* > 0.
\end{array}\right.
\end{displaymath} (12.17)

That is, potential income $ Y^*$ and observed income $ Y$ are equal only if the propensity to select into the sample (e.g. to become a lawyer) is positive ($ I^*>0$). For those not selecting into the sample ($ I^*\le 0$), $ Y^*$ is not observed and $ Y$ is set equal to 0. Regarding the propensity to select into the sample, we are only able to observe whether it is positive (in which case $ {\boldsymbol{I}}(I^*>0\vert z)=1$) or nonpositive ( $ {\boldsymbol{I}}(I^*>0\vert z)=0$).

The self-selection problem arises if $ \delta$ and $ \varepsilon$ are correlated, i.e. the (unobservable part of the) decision to select into the sample is correlated with the (unobservable part of the) outcome of interest.
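To see the problem in numbers, consider the following small simulation (a minimal sketch in Python, assuming numpy is available; it is not part of the XploRe code of this section). With positively correlated errors, the mean outcome among the self-selected observations clearly overstates the population mean:

  import numpy as np

  rng = np.random.default_rng(0)
  n = 100000

  # errors (delta, epsilon) with covariance 0.7
  cov = np.array([[1.0, 0.7], [0.7, 1.0]])
  delta, eps = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

  z = rng.normal(size=n)            # decision-equation regressor
  ystar = 2.0 + eps                 # potential outcome, population mean 2
  selected = z + delta > 0          # I* = z*gamma + delta with gamma = 1

  print(ystar.mean())               # close to 2: population mean
  print(ystar[selected].mean())     # clearly above 2: self-selection bias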


12.3.1 Parametric Model


heckit = heckman(x, y, z, q)
two-step estimation of a parametric self-selection model

In Heckman's (1979) classical solution to the problem, it is assumed that $ \delta$ and $ \varepsilon$ are jointly normally distributed:

\begin{displaymath}
\left(\begin{array}{c}\varepsilon\\ \delta\end{array}\right) \sim
N\left(\left(\begin{array}{c}0\\ 0\end{array}\right),
\left(\begin{array}{cc}\sigma_{\varepsilon}^2 & \sigma_{\delta,\varepsilon}\\
\sigma_{\delta,\varepsilon} & 1\end{array}\right)\right).
\end{displaymath} (12.18)

The variance of $ \delta$ is not identifiable and is therefore set to $ 1$. Under this assumption, the regression function for the observed dependent variable $ Y$ can be written as
\begin{eqnarray*}
E(Y\vert x)=E(Y^*\vert x,I^*>0) & = & E(x^T\beta\vert x,I^*>0) + E(\varepsilon\vert x,I^*>0)\\
& = & x^T\beta + E(\varepsilon\vert x,z^T\gamma > -\delta)\\
& = & x^T\beta + \sigma_{\delta,\varepsilon}\,\phi(z^T\gamma)/\Phi(z^T\gamma),
\end{eqnarray*}

where $ \sigma_{\delta,\varepsilon}$ is the covariance between $ \delta$ and $ \varepsilon.$ The parameters of the model ($ \beta,$ $ \gamma,$ $ \sigma_{\delta,\varepsilon}$) may be estimated by the following two-step procedure, implemented in heckman (a schematic code sketch follows the list):
  1. Probit step
    Estimate $ \gamma$ by fitting the probit model

    $\displaystyle P({\boldsymbol{I}}(I^*>0\vert z)=1)=\Phi(z^T\gamma)$ (12.19)

    using all observations, i.e. those with $ {\boldsymbol{I}}(I^*>0\vert z)=1$ (the lawyers, in the example) and $ {\boldsymbol{I}}(I^*>0\vert z)=0$ (the nonlawyers).
  2. OLS step
    Use only the observations with $ {\boldsymbol{I}}(I^*>0\vert z)=1$ to estimate the regression function

    $\displaystyle E(Y\vert x)=x^T\beta+\sigma_{\delta,\varepsilon}\,\phi(z^T\gamma)/\Phi(z^T\gamma) $

    by an OLS regression of the observed $ Y$s on $ x$ and $ \phi(z^T\widehat{\gamma})/\Phi(z^T\widehat{\gamma}),$ where $ \widehat{\gamma}$ is the first-step estimate of $ \gamma.$
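To make the two steps concrete, here is a minimal sketch in Python (assuming numpy, scipy and statsmodels are available); it mirrors the logic of heckman on simulated data, but it is not the quantlet's implementation:

  import numpy as np
  import statsmodels.api as sm
  from scipy.stats import norm

  rng = np.random.default_rng(1)
  n = 500

  # simulate the model: I* = z'gamma + delta, Y* = x'beta + epsilon
  cov = np.array([[1.0, 0.7], [0.7, 1.0]])      # cov(delta, epsilon) = 0.7
  delta, eps = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
  z = np.column_stack([np.ones(n), rng.normal(size=n)])
  x = np.column_stack([np.ones(n), rng.normal(size=n)])
  gamma, beta = np.array([0.5, 1.0]), np.array([-9.0, 1.0])
  q = (z @ gamma + delta > 0).astype(float)     # selection indicator
  y = (x @ beta + eps) * q                      # observed outcome

  # probit step: estimate gamma from all observations
  gamma_hat = sm.Probit(q, z).fit(disp=0).params

  # OLS step: regress y on x and the inverse Mills ratio phi/Phi,
  # using the selected observations only
  v = z @ gamma_hat
  mills = norm.pdf(v) / norm.cdf(v)
  sel = q == 1
  X2 = np.column_stack([x[sel], mills[sel]])
  coef = np.linalg.lstsq(X2, y[sel], rcond=None)[0]
  print(coef)   # estimates of beta, then of sigma_{delta,epsilon}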
heckman takes as inputs the data, i.e. observations on $ Y,$ $ x,$ $ z,$ and $ {\boldsymbol{I}}(I^*>0\vert z)$ (the latter are labeled q) and returns estimates of $ \beta$ (stored in heckit.b), $ \sigma_{\delta,\varepsilon}$ (heckit.s) and $ \gamma$ (heckit.g).

We illustrate heckman with simulated data where the error terms in the decision and outcome equations are strongly correlated:

  library("metrics")
  randomize(10178)
  n      =  500
  s1     =  1
  s2     =  1
  s12    =  0.7
  ss     =  #(s1,s12)~#(s12,s2)
  ev     =  eigsm(ss)
  va     =  ev.values
  ve     =  ev.vectors
  ll     =  diag(va)
  ll     =  sqrt(ll)
  sh     =  ve*ll*ve'
  u      =  normal(n,2)*sh'
  z      =  2*normal(n,2)
  g      =  #(1,2)
  q      =  (z*g+u[,1].>=0)
  x      =  matrix(n)~aseq(1, n ,0.25)
  b      =  #(-9, 1)
  y      =  x*b+u[,2]
  y      =  y.*(q.>0)
  heckit =  heckman(x,y,z,q)
  heckit.b
  heckit.s
  heckit.g
XLGmetric05.xpl

The estimates of $ \beta$, $ \sigma_{\delta,\varepsilon}$ and $ \gamma$ are displayed in the XploRe output window:
  Contents of b
  [1,]  -8.9835 
  [2,]  0.99883 
  Contents of s
  [1,]  0.77814 
  Contents of g
  [1,]   1.0759 
  [2,]   2.1245
Since the data generation process satisfies the assumptions of the parametric self-selection model, it is not surprising that the heckman coefficient estimates are quite close to the true coefficients.


12.3.2 Semiparametric Model


{a, b} = select(x, y, id, h)
three-step estimation of a semiparametric self-selection model

The distributional assumption (12.18) of the parametric self-selection model is, more than anything else, made for convenience. If, however, (12.18) is violated, then the heckman estimator is not consistent. Hence, there is ample reason to develop consistent estimators for self-selection models with weaker distributional assumptions.

Powell (1987) considers a semiparametric self-selection model that combines the two-equation structure of (12.16) with the following weak assumption about the joint distribution of the error terms:

$\displaystyle f(\delta,\varepsilon\vert z)= f(\delta,\varepsilon\vert z^T\gamma).$ (12.20)

That is, all we assume about the joint density of $ \delta,\varepsilon$ (conditional on $ z$) is that it is a smooth but unknown function $ f(\bullet)$ that depends on $ z$ only through the linear index $ z^T\gamma.$ Under these assumptions the regression function for the observed outcomes $ Y$ takes the following form:
\begin{eqnarray*}
E(Y\vert x)=E(Y^*\vert x,I^*>0) & = & x^T\beta + E(\varepsilon\vert x,z^T\gamma > -\delta)\\
& = & x^T\beta + \lambda(z^T\gamma),
\end{eqnarray*}

where $ \lambda(\bullet)$ is an unknown smooth function.

Note that for any two observations $ i$ and $ j$ with $ x_{i}\neq x_{j}$ but $ z_{i}^T\gamma \, = \, z_{j}^T\gamma$ we can difference out the unknown function $ \lambda(\bullet)$ by subtracting the regression functions for $ i$ and $ j:$

\begin{eqnarray*}
E(Y_i\vert x=x_{i})-E(Y_j\vert x=x_{j}) & = & (x_{i}-x_j)^T \beta + \lambda(z_{i}^T\gamma)-\lambda(z_{j}^T\gamma)\\
& = & (x_{i}-x_j)^T \beta.
\end{eqnarray*}

This is the basic idea underlying the estimator of $ \beta$ proposed by Powell (1987): regress differences in $ Y$ on differences in $ x,$ giving large weights to pairs for which $ z_{i}^T\widehat{\gamma}$ and $ z_{j}^T\widehat{\gamma}$ are close together (and hence $ \lambda(z_{i}^T\widehat{\gamma})-\lambda(z_{j}^T\widehat{\gamma})\approx 0$). That is, $ \beta$ is estimated by the following weighted least squares estimator:
\begin{eqnarray*}
\widehat{\beta} & = & \left[ {n \choose 2}\sum_{i=1}^{N}\sum_{j=i+1}^{N}\widehat{w}_{ijN}\,(x_i-x_j)(x_i-x_j)^T \right]^{-1}\\
& & \quad\times\left[{n \choose 2}\sum_{i=1}^{N}\sum_{j=i+1}^{N}\widehat{w}_{ijN}\,(x_i-x_j)(y_i-y_j)\right]\,,
\end{eqnarray*} (12.21)

where $ \widehat{w}_{ijN}=\frac{1}{h}\,K\left[(z_i^T\widehat{\gamma}-z_j^T\widehat{\gamma})/h \right]$ with symmetric kernel function $ K(\bullet)$ and bandwidth $ h.$
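In code, the weighted pairwise-difference estimator (12.21) can be sketched as follows (an illustrative Python implementation assuming a Gaussian kernel; the function name and interface are hypothetical):

  import numpy as np
  from scipy.stats import norm

  def powell_slopes(x, y, index, h):
      # x:     (n, p) regressors WITHOUT a constant column
      # y:     (n,) observed outcomes of the selected sample
      # index: (n,) first-step index estimates z'gamma_hat
      # h:     bandwidth
      xtx = np.zeros((x.shape[1], x.shape[1]))
      xty = np.zeros(x.shape[1])
      for i in range(len(y)):
          dx = x[i] - x[i + 1:]            # differences for all pairs (i, j), j > i
          dy = y[i] - y[i + 1:]
          w = norm.pdf((index[i] - index[i + 1:]) / h) / h
          xtx += (w[:, None] * dx).T @ dx  # accumulate w * dx dx'
          xty += dx.T @ (w * dy)           # accumulate w * dx dy
      return np.linalg.solve(xtx, xty)

Note that the constant column must be excluded from x: a constant differences out, and including it would render the weighted cross-product matrix singular.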

In (12.21), we tacitly assume that we have already obtained $ \widehat{\gamma},$ an estimate of $ \gamma.$ Under assumption (12.20), we get a single index model for the decision equation in place of the Probit model (12.19) in the parametric case:

$\displaystyle P({\boldsymbol{I}}(I^*>0\vert z)=1)=g(z^T\gamma).$ (12.22)

Here, $ g(\bullet )$ is an unknown, smooth function. Estimators for $ \gamma$ in this model have been discussed in Subsections 12.1.4-12.1.6. Any of these methods may be applied to get an estimate of $ \gamma.$ This is the first step of the semiparametric procedure. Given $ \widehat{\gamma},$ the second step consists of estimating $ \beta$ using (12.21).
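In the example below, the first step is carried out with dwade, XploRe's density-weighted average derivative estimator. To illustrate the underlying idea (not dwade's actual implementation), a Powell-Stock-Stoker-type estimator for a scalar regressor might be sketched in Python as:

  import numpy as np

  def dwade_1d(z, q, h):
      # estimates a quantity proportional to gamma in P(q=1|z) = g(z*gamma)
      # by -2/n * sum_i q_i * f'(z_i), where f' is a leave-one-out
      # Gaussian-kernel estimate of the density derivative
      n = len(z)
      u = (z[:, None] - z[None, :]) / h
      kd = -u * np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # phi'(u)
      np.fill_diagonal(kd, 0.0)                              # leave one out
      fprime = kd.sum(axis=1) / ((n - 1) * h**2)
      return -2.0 * np.mean(q * fprime)

Since $ \gamma$ is identified only up to scale in (12.22), rescalings such as the factor 2*sqrt(3)*pi in the example below merely fix this scale.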

It should be pointed out that (12.21) does not produce an estimate of the constant term, since any constant is eliminated by the differencing. An estimator of the constant term can be obtained in a third step by a procedure suggested by Andrews and Schafgans (1997). Their estimator is defined as

$\displaystyle \widehat{\beta}_0=\frac{\sum_{i=1}^{N}(Y_{i}-x_{i}^T\widehat{\beta}) \ K(z_i^T\widehat{\gamma}-k)} {\sum_{i=1}^{N}K(z_i^T\widehat{\gamma}-k)}\,,$ (12.23)

where $ K(\bullet)$ is a nondecreasing $ [0,1]$-valued weight function that is set equal to zero for all negative values of its argument, and $ k$ is a threshold parameter that has to increase with increasing $ n$ for the estimator to be consistent. The basic idea behind (12.23) is that for large values of $ z^T\gamma,$ $ E(\varepsilon\vert x,z^T\gamma > -\delta)=\lambda(z^T\gamma)\approx0$ and the intercept of the $ \lambda(\bullet)$ function can be separated from the ``classical'' constant term of the regression equation.
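A sketch of (12.23) in Python might look as follows; the particular weight function chosen here (zero for nonpositive arguments, increasing smoothly to one) is only one admissible choice and an assumption of this illustration:

  import numpy as np

  def intercept_estimate(y, x, beta_hat, index, k):
      # y, x:     outcomes and slope regressors of the selected sample
      # beta_hat: slope estimates from the pairwise-difference step
      # index:    first-step index estimates z'gamma_hat
      # k:        threshold; only observations with a large index get weight
      u = index - k
      w = np.where(u > 0.0, 1.0 - np.exp(-u), 0.0)   # one admissible K
      resid = y - x @ beta_hat
      return np.sum(resid * w) / np.sum(w)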

The quantlet select combines the second and third steps of the semiparametric estimation procedure. Both steps are also separately available in the quantlets powell and andrews.

select takes as inputs the data for the outcome equation (x and y, where x must not contain a vector of ones), the vector id containing the estimated first-step index $ z^T\widehat{\gamma},$ and the bandwidth vector h. The first element of h is the threshold parameter $ k$ used for estimating the intercept coefficient, while the second element is the bandwidth $ h$ used for estimating the slope coefficients according to (12.21).

We illustrate select using simulated data:

  library("metrics")
  randomize(66666)
  n      =  200                         
  ss1    =  #(1,0.9)~#(0.9,1)           
  g      =  #(1)                                
  b      =  #(-9, 1)                    
  u      =  gennorm(n, #(0,0), ss1)     
  ss2    =  #(1,0.4)~#(0.4,1)            
  xz     =  gennorm(n, #(0,0), ss2)  
  z      =  xz[,2]                      
  q      =  (z*g+u[,1].>=0)             
  hd     =  0.1*(max(z) - min(z))               
  d      =  dwade(z,q,hd)*(2*sqrt(3)*pi)        
  id     =  z*d                         
  h      =  (quantile(id, 0.7))|(0.2*(max(id)-min(id)))
  x      =  matrix(n)~xz[,1]    
  y      =  x*b+u[,2]           
  zz     =  paf(y~x~id, q)              
  y      =  zz[,1]
  x      =  zz[,3:(cols(zz)-1)]
  id     =  zz[,cols(zz)]
  {a,b}  =  select(x,y,id,h)
  d~a~b
XLGmetric06.xpl

select returns the second- and third-step coefficient estimates; the final line of the example displays them together with the first-step estimate d. For the data at hand, they turn out to be equal to
  Contents of _tmp
  [1,]  0.81368  -8.7852   1.0471
Note that these estimates are quite close to the values of the corresponding population parameters. Yet, since the data generation process satisfies the assumptions of the parametric model, one could have obtained more efficient estimates using the parametric heckman estimator.