6.2 Estimation

When estimating a SIM, we have to take into account that the functional form of the link function is unknown. Moreover, since the shape of $ g(\bullet)$ will determine the value of

$\displaystyle E(Y\vert{\boldsymbol{X}}) = g\left\{ v_{\boldsymbol{\beta}}({\boldsymbol{X}})\right\}$

for a given index $ v_{\boldsymbol{\beta}}({\boldsymbol{X}})$, estimation of the index weights $ {\boldsymbol{\beta}}$ will have to adjust to a specific estimate of the link function to yield a ``correct'' regression value. Thus, in SIM both the index and the link function have to be estimated, even though only the link function is of nonparametric character.

Recall that $ {\boldsymbol{X}}$ is the vector of explanatory variables, $ {\boldsymbol{\beta}}$ is a $ d$-dimensional vector of unknown coefficients (or weights), and $ g(\bullet)$ is an arbitrary smooth function. Let

$\displaystyle \varepsilon=Y-m({\boldsymbol{X}})=Y-E(Y\vert{\boldsymbol{X}}) $

be the deviation of $ Y$ from its conditional expectation w.r.t.  $ {\boldsymbol{X}}$. Using $ \varepsilon$ we can write down the single index model as

$\displaystyle Y=g\left\{ v_{\boldsymbol{\beta}}({\boldsymbol{X}})\right\}+\varepsilon .$ (6.4)

Our goal is to find efficient estimators for $ {\boldsymbol{\beta}}$ and $ g(\bullet)$. As $ {\boldsymbol{\beta}}$ is inside the nonparametric link, the challenge is to find an appropriate estimator for $ {\boldsymbol{\beta}}$, in particular one that reaches the $ \sqrt{n}$-rate of convergence. (Recall that $ \sqrt{n}$-convergence is typically achieved by parametric estimators.)

Two essentially different approaches exist for this purpose: iterative methods that optimize a semiparametric objective function, namely semiparametric least squares and pseudo maximum likelihood (Sections 6.2.1 and 6.2.2), and direct, non-iterative methods based on (weighted) average derivatives (Section 6.2.3).

In both cases the estimation procedure can be summarized as:
SIM Algorithm

(1) estimate $ {\boldsymbol{\beta}}$ by some $ \widehat{{\boldsymbol{\beta}}}$,
(2) compute the index values $ \widehat{\eta}_{i}={\boldsymbol{X}}_{i}^\top\widehat{{\boldsymbol{\beta}}}$,
(3) estimate the link function $ g(\bullet)$ by a univariate nonparametric method for the regression of $ Y$ on $ \widehat\eta$.

The final step is relevant for all SIM estimators considered in the following sections. More precisely: Suppose that $ {{\boldsymbol{\beta}}}$ has already been estimated by $ \widehat{{\boldsymbol{\beta}}}$. Set

$\displaystyle \widehat{\eta_{i}}={\boldsymbol{X}}_{i}^\top\widehat{{\boldsymbol{\beta}}}, \quad i=1,\ldots,n.$ (6.5)

Once we have created these ``observations'' of the new dependent variable $ \widehat{\eta}={\boldsymbol{X}}^\top\widehat{{\boldsymbol{\beta}}}$ we have a standard univariate regression problem. For example, a possible and convenient choice for estimating $ g(\bullet)$ would be the Nadaraya-Watson estimator introduced in Chapter 4.1:

$\displaystyle \widehat{g}_{{h}}(z) =\frac{ \sum_{i=1}^{n} {K}_{{h}}(z-\widehat{\eta}_{i})Y_{i} } { \sum_{i=1}^{n} {K}_{{h}}(z-\widehat{\eta}_{i}) }.$ (6.6)

It can be shown that if $ \widehat{{\boldsymbol{\beta}}}$ is $ \sqrt{n}$-consistent, $ \widehat{g}_{{h}}(z)$ converges for $ {h}\sim n^{-1/5}$ at rate $ \sqrt{nh}=n^{2/5}$, i.e. like a univariate kernel regression estimator:

$\displaystyle n^{2/5} \left\{\widehat{g}_{h}(z)-g(z)\right\}
\mathrel{\mathop{\longrightarrow}\limits_{}^{L}} N( b_{z}, v_{z}) $

with bias $ b_{z}$ and variance $ v_{z}$, see e.g. Powell et al. (1989).

As in the first part of this book we concentrate on kernel based methods here. The choice of $ h$ and $ K$ in (6.6) is independent of the choice of these parameters in the following sections. For other than kernel based SIM estimators we refer to the bibliographic notes.
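To fix ideas, the following Python sketch carries out this final step: it computes the index values (6.5) for a given coefficient vector and evaluates the Nadaraya-Watson link estimator (6.6) on a grid. The data, the plugged-in coefficient vector and the bandwidth are purely illustrative placeholders.

```python
import numpy as np

def nw_link(z, eta_hat, y, h):
    """Nadaraya-Watson estimate of the link g at the points z,
    regressing y on the estimated index values eta_hat, cf. (6.5)-(6.6)."""
    z = np.atleast_1d(z).astype(float)
    u = (z[:, None] - eta_hat[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / (np.sqrt(2 * np.pi) * h)   # Gaussian kernel K_h
    return (K @ y) / K.sum(axis=1)

# purely illustrative data from a known single index model
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
beta_hat = np.array([1.0, -0.5])        # placeholder for an estimate of beta
y = np.sin(X @ beta_hat) + 0.2 * rng.normal(size=n)

eta_hat = X @ beta_hat                  # index values, eq. (6.5)
grid = np.linspace(-2.0, 2.0, 5)
print(nw_link(grid, eta_hat, y, h=0.3)) # estimated link on a grid, eq. (6.6)
```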

6.2.1 Semiparametric Least Squares

As indicated in the introduction we now concentrate on the estimation of $ {\boldsymbol{\beta}}$. The methods considered here, semiparametric least squares (SLS) and pseudo maximum likelihood estimation (PMLE), have the following idea in common: establish an appropriate objective function from which $ {\boldsymbol{\beta}}$ can be estimated at the parametric $ \sqrt{n}$-rate. Naturally, this objective function involves the conditional distribution of $ Y$, or the link function $ g$, or both of them. As these are unknown they need to be replaced by nonparametric estimates. The objective function is then maximized (or minimized) with respect to the parameter $ {\boldsymbol{\beta}}$.

Why is this extension of least squares or maximum likelihood not a trivial one? The reason is that when $ {\boldsymbol{\beta}}$ changes, the link $ g$ (respectively its nonparametric substitute) may change simultaneously. Thus, it is not obvious that the necessary iteration converges, nor, if it does, that it converges to a unique optimum.

SLS and its weighted version (WSLS) have been introduced by Ichimura (1993). As SLS is just the special case of WSLS with the weight function identically equal to one, we concentrate here on WSLS. An objective function of least squares type can be motivated by minimizing the variation in the data that cannot be explained by the fitted regression. This ``left over'' variation can be written as

$\displaystyle \mathop{\mathit{Var}}\{Y\vert v_{\boldsymbol{\beta}}({\boldsymbol{X}})\} = E\left[ \left\{ Y-E[Y\vert v_{\boldsymbol{\beta}}({\boldsymbol{X}})] \right\}^2 \vert v_{\boldsymbol{\beta}}({\boldsymbol{X}}) \right] $

with $ v_{\boldsymbol{\beta}}(\bullet)$ being the index function specified up to $ {\boldsymbol{\beta}}$. The previous equation leads us to a variant of the well known LS criterion

$\displaystyle \min_{\boldsymbol{\beta}}E \left[ Y - E\{Y\vert v_{\boldsymbol{\beta}}({\boldsymbol{X}})\} \right]^2$ (6.7)

in which the inner conditional expectation $ E\{Y\vert v_{\boldsymbol{\beta}}({\boldsymbol{X}})\}$ has to be replaced by a (consistent) nonparametric estimate. The outer expectation in (6.7) can be replaced by an average.

We can account for possible heteroscedasticity by using proper weights. This motivates

$\displaystyle \min_{\boldsymbol{\beta}}\frac 1n \sum_{i=1}^n \left[ Y_i - E\{Y\vert v_{\boldsymbol{\beta}}( {\boldsymbol{X}}_i )\} \right]^2 w( {\boldsymbol{X}}_i)$ (6.8)

with $ w(\bullet )$ an appropriate weight function. Next, employ the nonparametric technique to substitute the (inner) unknown conditional expectation. As the index function $ v_\beta$ is univariate, we could take any univariate consistent smoother. For simplicity of the presentation, we will use a Nadaraya-Watson type estimator here.

Define the WSLS estimator as

$\displaystyle \widehat{\boldsymbol{\beta}}= \arg\min_{\boldsymbol{\beta}}\frac 1n \sum_{i=1}^n \left\{ Y_i - \widehat{m}_{\boldsymbol{\beta}}({\boldsymbol{X}}_i) \right\}^2 w( {\boldsymbol{X}}_i)\, \Ind ({\boldsymbol{X}}_i\in{\mathcal{X}})$ (6.9)

where $ \Ind ({\boldsymbol{X}}_i\in {{\mathcal{X}}})$ is a trimming factor and $ \widehat{m}_{\boldsymbol{\beta}}$ a leave-one-out estimator of $ m$ under the assumption that the parameter $ {\boldsymbol{\beta}}$ were known. In more detail, $ \widehat{m}_{\boldsymbol{\beta}}$ is a (weighted) Nadaraya-Watson estimator

$\displaystyle \widehat{m}_{\boldsymbol{\beta}}({\boldsymbol{X}}_i) = \frac{\sum_{j\neq i} Y_j\, K_h\left\{ v_{\boldsymbol{\beta}}({\boldsymbol{X}}_i)-v_{\boldsymbol{\beta}}({\boldsymbol{X}}_j) \right\} w({\boldsymbol{X}}_j) \Ind ({\boldsymbol{X}}_j\in {{\mathcal{X}}}_n ) }{\sum_{j\neq i} K_h\left\{ v_{\boldsymbol{\beta}}({\boldsymbol{X}}_i)-v_{\boldsymbol{\beta}}({\boldsymbol{X}}_j) \right\} w({\boldsymbol{X}}_j) \Ind ({\boldsymbol{X}}_j\in {{\mathcal{X}}}_n ) }\,$ (6.10)

with $ h$ denoting a bandwidth and $ K_h$ a scaled (compact support) kernel. The trimming factor $ \Ind ({\boldsymbol{X}}_i\in {{\mathcal{X}}})$ has been introduced to guarantee that the density of the index $ v$ is bounded away from zero. $ {{\mathcal{X}}}$ has to be chosen accordingly. The set $ {{\mathcal{X}}}_n$ in the trimming factor of (6.10) is constructed in such a way that all boundary points of $ {{\mathcal{X}}}$ are interior to $ {{\mathcal{X}}}_n$:

$\displaystyle {\mathcal{X}}_n=\{x: \Vert x-z\Vert\le 2h \textrm{ for some }z\in{\mathcal{X}}\}.$

In practice trimming can often be skipped, but it is helpful for establishing the asymptotic theory. The choice of the leave-one-out version of the Nadaraya-Watson estimator in (6.10) is made for the same reason. For the estimator defined in (6.9), asymptotic results can be proved as stated in Theorem 6.1 below; for the details see Ichimura (1993).
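Before turning to the asymptotic theory, the following Python sketch shows how the WSLS criterion can be implemented and minimized numerically. It is only a minimal illustration under simplifying assumptions: a linear index $ v_{\boldsymbol{\beta}}({\boldsymbol{X}})={\boldsymbol{X}}^\top{\boldsymbol{\beta}}$ with the first coefficient fixed to one (to pin down the scale of the index), weight $ w\equiv 1$, a Gaussian kernel and no trimming.

```python
import numpy as np
from scipy.optimize import minimize

def loo_nw(index, y, h):
    """Leave-one-out Nadaraya-Watson fit of y on the index values;
    simplified version of (6.10) with w = 1 and no trimming."""
    u = (index[:, None] - index[None, :]) / h
    K = np.exp(-0.5 * u ** 2)                 # Gaussian kernel (scaling cancels)
    np.fill_diagonal(K, 0.0)                  # leave observation i out
    return (K @ y) / np.maximum(K.sum(axis=1), 1e-12)

def wsls_objective(b_free, X, y, h):
    """WSLS criterion (6.9) for beta = (1, b_free); fixing the first
    coefficient to one pins down the scale of the linear index."""
    beta = np.concatenate(([1.0], b_free))
    m_hat = loo_nw(X @ beta, y, h)
    return np.mean((y - m_hat) ** 2)

# purely illustrative data
rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 3))
beta_true = np.array([1.0, -0.5, 2.0])
y = np.tanh(X @ beta_true) + 0.1 * rng.normal(size=n)

res = minimize(wsls_objective, x0=np.zeros(2), args=(X, y, 0.3),
               method="Nelder-Mead")
print(np.concatenate(([1.0], res.x)))         # estimated index coefficients
```

With a Gaussian kernel the criterion is smooth in $ {\boldsymbol{\beta}}$, so a gradient-based optimizer could replace the Nelder-Mead search; the choice here is only for simplicity.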

THEOREM 6.1  
Assume that $ Y$ has a $ \kappa$th absolute moment ( $ \kappa\geq 3$) and follows model (6.4). Suppose further that the bandwidth $ h$ converges to 0 at a certain rate. Then, under regularity conditions on the link $ g$, the error term $ \varepsilon$, the regressors $ {\boldsymbol{X}}$ and the index function $ v$, the WSLS estimator (6.9) fulfills

$\displaystyle \sqrt{n}(\widehat{\boldsymbol{\beta}}-{\boldsymbol{\beta}})\mathrel{\mathop{\longrightarrow}\limits_{}^{L}}N(0,{\mathbf{V}}^{-1} {\boldsymbol{\Sigma}}{\mathbf{V}}^{-1}),$

where, using the notation

$\displaystyle \gradi_v=\frac{\partial v_{\boldsymbol{\beta}}({\boldsymbol{X}})}{\partial{\boldsymbol{\beta}}} - E\left(\frac{\partial v_{\boldsymbol{\beta}}({\boldsymbol{X}})}{\partial{\boldsymbol{\beta}}} \,\Big\vert\, \{v_{\boldsymbol{\beta}}({\boldsymbol{X}}),{\boldsymbol{X}}\in{\mathcal{X}}\}\right)$

and $ \sigma^2({\boldsymbol{X}})=\mathop{\mathit{Var}}(Y\vert{\boldsymbol{X}})$, the matrices $ {\mathbf{V}}$ and $ {\boldsymbol{\Sigma}}$ are defined by
$\displaystyle {\mathbf{V}}= E \left\{ w({\boldsymbol{X}}) \gradi_v{\gradi_v}^\top \big\vert {\boldsymbol{X}}\in {{\mathcal{X}}}\right\}\,,$

$\displaystyle {\boldsymbol{\Sigma}}= E \left\{ w^2({\boldsymbol{X}}) \sigma^2({\boldsymbol{X}}) \gradi_v{\gradi_v}^\top \big\vert {\boldsymbol{X}}\in {{\mathcal{X}}}\right\}\,.$

As we can see, the estimator $ \widehat{\boldsymbol{\beta}}$ is unbiased and converges at parametric $ \sqrt{n}$-rate. Additionally, by choosing the weight

$\displaystyle w({\boldsymbol{X}}) =\frac{1}{\sigma^{2}({\boldsymbol{X}})}\,,$

the WSLS reaches the efficiency bound for the parameter estimates in SIM (Newey, 1990). This means that $ \widehat{{\boldsymbol{\beta}}}$ is an unbiased parameter estimator with optimal rate and asymptotically efficient covariance matrix. Unfortunately, $ \sigma^2(\bullet)$ is an unknown function as well. In practice, one therefore uses an appropriate pre-estimate of the variance function.

How do we estimate the (asymptotic) variance of the estimator? Not surprisingly, the expressions $ {\mathbf{V}}$ and $ {\boldsymbol{\Sigma}}$ are estimated consistently by their empirical analogs. Denote

$\displaystyle \widehat{\gradi}_m=\frac{\partial \widehat{m}_{\boldsymbol{\beta}}({\boldsymbol{X}}_i)}{\partial {\boldsymbol{\beta}}} \Big\vert_{\widehat{\boldsymbol{\beta}}}\,.$

Ichimura (1993) proves that
$\displaystyle \widehat {\mathbf{V}}= \frac 1n \sum_{i=1}^n w({\boldsymbol{X}}_i) \Ind ({\boldsymbol{X}}_i\in {{\mathcal{X}}}_n) \widehat{\gradi}_m {\widehat{\gradi}_m}^\top\,,$

$\displaystyle \widehat {\boldsymbol{\Sigma}}= \frac 1n \sum_{i=1}^n w^2({\boldsymbol{X}}_i) \Ind ({\boldsymbol{X}}_i\in {{\mathcal{X}}}_n) \left\{ Y_i - \widehat{m}_{\widehat{\boldsymbol{\beta}}}({\boldsymbol{X}}_i) \right\}^2 \widehat{\gradi}_m {\widehat{\gradi}_m}^\top$

are consistent estimators for $ {\mathbf{V}}$ and $ {\boldsymbol{\Sigma}}$ under appropriate bandwidth conditions.
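Continuing the WSLS sketch from above (and reusing its loo_nw function), these empirical analogs can be approximated as follows. Gradients of $ \widehat m_{\boldsymbol{\beta}}$ with respect to the free coefficients are obtained by finite differences, and $ w\equiv 1$ without trimming is again assumed, so this is only a rough numerical illustration of the formulas above.

```python
import numpy as np

def wsls_sandwich(X, y, b_free, h, eps=1e-5):
    """Approximate sandwich covariance of the free coefficients b_free in
    beta = (1, b_free): V^{-1} Sigma V^{-1} / n, with gradients of m_hat
    w.r.t. beta obtained by finite differences. Reuses loo_nw() from the
    WSLS sketch above; again w = 1 and no trimming."""
    n = X.shape[0]
    b_free = np.asarray(b_free, dtype=float)
    beta = np.concatenate(([1.0], b_free))
    m0 = loo_nw(X @ beta, y, h)
    grads = np.empty((n, b_free.size))
    for k in range(b_free.size):
        b = beta.copy()
        b[k + 1] += eps
        grads[:, k] = (loo_nw(X @ b, y, h) - m0) / eps
    V = grads.T @ grads / n
    Sigma = (grads * ((y - m0) ** 2)[:, None]).T @ grads / n
    V_inv = np.linalg.inv(V)
    return V_inv @ Sigma @ V_inv / n           # approximate Cov(b_free)
```

Standard errors for the free coefficients are the square roots of the diagonal entries of the returned matrix.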

6.2.2 Pseudo Likelihood Estimation

For the motivation of pseudo maximum likelihood estimation (PMLE) we rely on the ideas developed in the previous section. Let us first discuss why pseudo maximum likelihood always leads to unbiased $ \sqrt{n}$-consistent estimators with minimum achievable variance: In fact, the computation of the PMLE reduces to a formal parametric MLE problem with as many parameters as observations. In this case (as we have also seen above), the inverse Fisher information turns out to be a consistent estimator for the covariance matrix of the PMLE. Gill (1989) and Gill & van der Vaart (1993) explain this as follows: a sensibly defined nonparametric MLE can be seen as an MLE in any parametric submodel which happens to include or pass through the point given by the PMLE. For smooth parametric submodels, the MLE solves the likelihood equations. Consequently, also in nonparametric problems the PMLE can be interpreted as the solution of the likelihood equations for every parametric submodel passing through it.

You may have realized that these properties coincide with the results obtained for the WSLS above. Indeed, looking closer at the objective function in (6.8) we could re-interpret it as the result of a maximum likelihood consideration: we only have to set the weight $ w$ equal to the inverse of the (known) variance function (compare the discussion of optimal weighting). We refer to the bibliographic notes for more details.

To finally introduce the PMLE, let us come back to the case of a binary response model, i.e., $ Y$ takes values in $ \{0,1\}$ only. Recall that this means

$\displaystyle Y = \left\{ \begin{array}{ll} 1 \quad & \textrm{ if } v_{\boldsymbol{\beta}}({\boldsymbol{X}}) > \varepsilon, \\ 0 & \textrm{ otherwise, } \end{array} \right.$ (6.11)

where the index function $ v_{\boldsymbol{\beta}}$ is known up to $ {\boldsymbol{\beta}}$. We further assume that

$\displaystyle E(Y\vert{\boldsymbol{X}}) = E\{Y\vert v_{\boldsymbol{\beta}}({\boldsymbol{X}})\}.$

This is indeed an additional restriction, but it still allows for multiplicative heteroscedasticity as discussed around equation (6.3). Unfortunately, under heteroscedasticity -- when the variance function of the error term depends on the index to be estimated -- a change of $ {\boldsymbol{\beta}}$ changes the variance function and thus implicitly affects equation (6.11) through $ \varepsilon$. This has consequences for the PMLE as the likelihood function is determined by (6.11) and the error distribution given the index $ v_{\boldsymbol{\beta}}(\bullet)$. For this reason we recall the notation

$\displaystyle \varepsilon = \varpi\{ v_{\boldsymbol{\beta}}({\boldsymbol{X}}) \} \cdot \zeta$

where $ \varpi$ is the variance function and $ \zeta$ an error term independent of $ {\boldsymbol{X}}$ and $ v$. In the case of homoscedasticity we would have indeed $ \varepsilon = \zeta$. From (5.7) we know that

$\displaystyle E\{Y\vert v_{\boldsymbol{\beta}}({\boldsymbol{X}})\} = G_{\varepsilon \vert {\boldsymbol{X}}} \{ v_{\boldsymbol{\beta}}({\boldsymbol{X}})\}, $

where $ G_{\varepsilon \vert {\boldsymbol{X}}}$ is the conditional distribution of the error term. Since $ Y$ is binary, the log-likelihood function for this model (cf. (5.17)) is given by

$\displaystyle \frac 1n \sum_{i=1}^n \left( Y_i \log \left[ G_{\varepsilon \vert {\boldsymbol{X}}_i } \{ v_{\boldsymbol{\beta}}( {\boldsymbol{X}}_i ) \} \right] + (1-Y_i) \log \left[ 1- G_{\varepsilon \vert {\boldsymbol{X}}_i } \{ v_{\boldsymbol{\beta}}( {\boldsymbol{X}}_i ) \} \right] \right) .$ (6.12)

The problem is now to obtain a substitute for the unknown function $ G_{\varepsilon \vert {\boldsymbol{X}}_i }$. We see that

$\displaystyle G_{\varepsilon \vert x } (v) = P \left(\varepsilon < v \vert v \right) = P\left(\varepsilon < v \right) \,\frac{\psi_{\varepsilon < v} (v)}{\psi(v)}$ (6.13)

with $ \psi$ being the pdf of the index $ v_{\boldsymbol{\beta}}({\boldsymbol{X}})$ and $ \psi_{\varepsilon<v}$ the conditional pdf of $ v_{\boldsymbol{\beta}}({\boldsymbol{X}})$ given $ \varepsilon<v_{\boldsymbol{\beta}}({\boldsymbol{X}})$. Since $ \varepsilon<v_{\boldsymbol{\beta}}({\boldsymbol{X}})$ if and only if $ Y=1$, this is equivalent to

$\displaystyle G_{\varepsilon \vert {\boldsymbol{x}}} (v) = P(Y=1)\, \frac{\psi_{Y=1}(v)}{\psi(v)} .$

In the last expression all terms can be estimated nonparametrically. Instead of estimating $ P(Y=1)$ by $ \overline{Y}$, Klein & Spady (1993) propose to consider $ \psi_{y} = P(Y=y)\, \psi_{Y=y}$ and $ \psi= \psi_{0}+\psi_{1}$. One can estimate

$\displaystyle \widehat \psi_{y}(v) = \frac{1}{n-1} \sum_{j\neq i} \Ind (Y_j=y) K_h \left\{ v_{\boldsymbol{\beta}}({\boldsymbol{X}}_j)-v \right\} ,$ (6.14)

where $ K_h( \bullet)$ denotes the scaled kernel as before. Hence, an estimate for $ G_{\varepsilon \vert {\boldsymbol{X}}_i }$ in (6.12) is given by

$\displaystyle \widehat{G}_{\varepsilon \vert {\boldsymbol{X}}_i }\{v_{\boldsymbol{\beta}}({\boldsymbol{X}}_i)\} = \frac{\sum_{j\neq i} \Ind (Y_j=1)\, K_h \left\{ v_{\boldsymbol{\beta}}({\boldsymbol{X}}_j)-v_{\boldsymbol{\beta}}({\boldsymbol{X}}_i) \right\}}{\sum_{j\neq i} K_h \left\{ v_{\boldsymbol{\beta}}({\boldsymbol{X}}_j)-v_{\boldsymbol{\beta}}({\boldsymbol{X}}_i) \right\}}\,.$

To obtain the $ \sqrt{n}$-rate for $ \widehat{\boldsymbol{\beta}}$, one uses either bias reduction via higher order kernels or an adaptive undersmoothing bandwidth $ h$. Problems in the denominator, arising when the density estimate becomes small, can be avoided by adding small terms to both the denominator and the numerator. These terms have to vanish at a faster rate than that of the convergence of the densities.

We can now define the pseudo log-likelihood version of (6.12) by

$\displaystyle \frac 1n \sum_{i=1}^n w({\boldsymbol{X}}_i) \left\{ Y_i \log \left[ \widehat{G}_{\varepsilon \vert {\boldsymbol{X}}_i } \{ v_{\boldsymbol{\beta}}( {\boldsymbol{X}}_i) \} \right]^2 + (1-Y_i) \log \left[ 1- \widehat{G}_{\varepsilon \vert {\boldsymbol{X}}_i } \{ v_{\boldsymbol{\beta}}( {\boldsymbol{X}}_i) \} \right]^2 \right\}$ (6.15)

The weight function $ w(\bullet )$ can be introduced for numerical or technical reasons. Taking the squares inside the logarithms avoids numerical problems when using higher order kernels. (Otherwise these terms could be negative.) The estimator $ \widehat{\boldsymbol{\beta}}$ is found by maximizing (6.15).
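As an illustration, here is a minimal Python sketch of this maximization. It assumes a linear index with the first coefficient fixed to one for scale identification, uses $ w\equiv 1$ and a Gaussian kernel (so the squaring device is not needed and omitted), and regularizes numerator and denominator by a small constant as described above; data and bandwidth are purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def ks_negative_loglik(b_free, X, y, h, eps=1e-10):
    """Negative Klein & Spady pseudo log-likelihood, cf. (6.12)-(6.15);
    simplified: w = 1, Gaussian kernel, no squaring device, no trimming."""
    beta = np.concatenate(([1.0], b_free))      # scale fixed for identification
    v = X @ beta
    u = (v[:, None] - v[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / (np.sqrt(2 * np.pi) * h)
    np.fill_diagonal(K, 0.0)                    # leave-one-out as in (6.14)
    psi1 = K @ y / (len(y) - 1)                 # estimate of psi_1 at v_i
    psi0 = K @ (1.0 - y) / (len(y) - 1)         # estimate of psi_0 at v_i
    G = (psi1 + eps) / (psi0 + psi1 + 2 * eps)  # regularized G-hat
    return -np.mean(y * np.log(G) + (1.0 - y) * np.log(1.0 - G))

# purely illustrative binary response data
rng = np.random.default_rng(2)
n = 600
X = rng.normal(size=(n, 3))
beta_true = np.array([1.0, 2.0, -1.0])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)

res = minimize(ks_negative_loglik, x0=np.zeros(2), args=(X, y, 0.3),
               method="Nelder-Mead")
print(np.concatenate(([1.0], res.x)))           # index coefficients, first fixed to 1
```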

Klein & Spady (1993) prove the following result on the asymptotic behavior of the PMLE $ \widehat{\boldsymbol{\beta}}$. More details about conditions for consistency, the choice of the weight function $ w$ and the appropriate adaptive bandwidth $ h$ can be found in their article. We summarize the asymptotic distribution in the following theorem.

THEOREM 6.2  
Under some regularity conditions it holds

$\displaystyle \sqrt{n}\left( \widehat{\boldsymbol{\beta}}-{\boldsymbol{\beta}}\right) \mathrel{\mathop{\longrightarrow}\limits_{}^{L}} N(0,{\boldsymbol{\Sigma}})$

where

$\displaystyle {\boldsymbol{\Sigma}}^{-1} = E\left\{ \frac{\partial\Gamma}{\partial {\boldsymbol{\beta}}} \frac{\partial\Gamma}{\partial {\boldsymbol{\beta}}^\top}\, \frac{1}{\Gamma (1-\Gamma)} \right\} \quad \textrm{with} \quad \Gamma = G_{\varepsilon\vert {\boldsymbol{X}}_i} \{ v_{\boldsymbol{\beta}}( {\boldsymbol{X}}_i)\} . $

It turns out that the Klein & Spady estimator $ \widehat{\boldsymbol{\beta}}$ reaches the efficiency bound for SIM estimators. An estimator for $ {\mathbf{\Sigma}}$ can be obtained by its empirical analog.

As the derivation shows, the Klein & Spady PMLE $ \widehat{\boldsymbol{\beta}}$ only works for binary response models. In contrast, the WSLS was given for an arbitrary distribution of $ Y$. For that reason we now consider an estimator that generalizes the idea of Klein & Spady to arbitrary distributions of $ Y$.

Typically, the pseudo log-likelihood is based on the density $ f$ (if $ Y$ is continuous) or on the probability function $ f$ (if $ Y$ is discrete). The main idea of the following estimator first proposed by Delecroix et al. (2003) is that the function $ f$ which defines the distribution of $ Y$ given $ {\boldsymbol{X}}$ only depends on the index function $ v$, i.e.,

$\displaystyle f(y\vert{\boldsymbol{x}}) = f\{y\vert v_{\boldsymbol{\beta}}({\boldsymbol{x}})\}.$

In other words, the index function $ v$ contains all relevant information about $ {\boldsymbol{X}}$. The objective function to maximize is

$\displaystyle E\left[ \log f \{Y\vert v_{\boldsymbol{\beta}}({\boldsymbol{X}})\}\right].$

As for SLS we proxy this expectation by averaging and introduce a trimming function:

$\displaystyle \frac 1n \sum_{i=1}^n \Ind \{(Y_i,{\boldsymbol{X}}_i)\in {\mathcal{S}}\} \log f \{Y_i \vert v_{\boldsymbol{\beta}}({\boldsymbol{X}}_i)\}$ (6.16)

where $ {\mathcal{S}}$ denotes a suitable subset of the support of $ (Y,{\boldsymbol{X}})$. Here, all we have to do is to estimate the conditional density or probability mass function $ f\{y\vert v_{\boldsymbol{\beta}}({\boldsymbol{x}})\}$. An estimator is given by:

$\displaystyle \widehat f \{ Y_i\vert v_{\boldsymbol{\beta}}({\boldsymbol{X}}_i)\} = \frac{\sum_{j\neq i} K_h( Y_i - Y_j)\, K_h\left\{v_{\boldsymbol{\beta}}({\boldsymbol{X}}_i)-v_{\boldsymbol{\beta}}({\boldsymbol{X}}_j)\right\}}{\sum_{j\neq i} K_h\left\{v_{\boldsymbol{\beta}}({\boldsymbol{X}}_i)-v_{\boldsymbol{\beta}}({\boldsymbol{X}}_j)\right\}}\,.$ (6.17)

To reach the $ \sqrt{n}$-rate, fourth order kernels and a bandwidth of rate $ n^{-\delta}, \delta \in (1/8,1/6)$, need to be used. Delecroix et al. (2003) present their result for the linear index function

$\displaystyle v_{\boldsymbol{\beta}}({\boldsymbol{X}})= {\boldsymbol{X}}^\top {\boldsymbol{\beta}}.$

Therefore the following theorem only considers the asymptotic results for that special, but most common case.

THEOREM 6.3  
Let $ \widehat{\boldsymbol{\beta}}$ be the vector that maximizes (6.16) under the use of (6.17). Then under the above mentioned and further regularity conditions it holds

$\displaystyle \sqrt{n}\left( \widehat{\boldsymbol{\beta}}-{\boldsymbol{\beta}}\right) \mathrel{\mathop{\longrightarrow}\limits_{}^{L}} N(0,{\mathbf{V}}^{-1}{\boldsymbol{\Sigma}}{\mathbf{V}}^{-1}) $

where

$\displaystyle {\boldsymbol{\Sigma}}= E\left\{ \frac{\partial\log f(Y\vert{\boldsymbol{X}}^\top{\boldsymbol{\beta}})}{\partial {\boldsymbol{\beta}}}\, \frac{\partial\log f(Y\vert{\boldsymbol{X}}^\top{\boldsymbol{\beta}})}{\partial {\boldsymbol{\beta}}^\top }\, \Ind \left\{(Y,{\boldsymbol{X}})\in {\mathcal{S}}\right\} \right\}
$

and

$\displaystyle {\mathbf{V}}= E \left\{ -\frac{\partial^2 \log f(Y\vert{\boldsymbol{X}}^\top{\boldsymbol{\beta}})}{\partial {\boldsymbol{\beta}}\, \partial {\boldsymbol{\beta}}^\top } \,\Ind \left\{(Y,{\boldsymbol{X}})\in {\mathcal{S}}\right\} \right\}.$

Again, it can be shown that the variance of this estimator reaches the lower efficiency bound for SIM. That means this procedure provides efficient estimators for $ {\boldsymbol{\beta}}$, too. Estimates for the matrices $ {\mathbf{V}}$ and $ {\mathbf{\Sigma}}$ can be found by their empirical analogs.
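A minimal Python sketch of this pseudo likelihood approach for a continuous response may look as follows. It assumes a linear index with the first coefficient fixed to one, replaces the fourth order kernel by a simple Gaussian kernel with one common bandwidth for response and index, and omits the trimming set $ {\mathcal{S}}$; all data and tuning constants are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def kh(u, h):
    """Scaled Gaussian kernel K_h(u)."""
    return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2 * np.pi) * h)

def pml_objective(b_free, X, y, h, eps=1e-10):
    """Negative pseudo log-likelihood (6.16) with a leave-one-out kernel
    estimate of f(y | index) in the spirit of (6.17); no trimming, one
    common bandwidth, Gaussian instead of a fourth order kernel."""
    beta = np.concatenate(([1.0], b_free))      # scale fixed for identification
    v = X @ beta
    Kv = kh(v[:, None] - v[None, :], h)         # kernel in the index
    Ky = kh(y[:, None] - y[None, :], h)         # kernel in the response
    np.fill_diagonal(Kv, 0.0)                   # leave observation i out
    f_hat = ((Kv * Ky).sum(axis=1) + eps) / (Kv.sum(axis=1) + eps)
    return -np.mean(np.log(f_hat))

# purely illustrative data with a continuous response
rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 3))
beta_true = np.array([1.0, -1.0, 0.5])
y = np.sin(X @ beta_true) + 0.3 * rng.normal(size=n)

res = minimize(pml_objective, x0=np.zeros(2), args=(X, y, 0.4),
               method="Nelder-Mead")
print(np.concatenate(([1.0], res.x)))
```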

6.2.3 Weighted Average Derivative Estimation

We will now turn to a different type of estimator with two advantages: (a) we do not need any distributional assumption on $ Y$ and (b) the resulting estimator is direct, i.e. non-iterative. The basic idea is to identify $ {\boldsymbol{\beta}}$ as the average derivative and thus the studied estimator is called average derivative estimator (ADE) or weighted average derivative estimator (WADE). The advantages of ADE or WADE estimators come at a cost, though, as they are inefficient. Furthermore, the average derivative method is only directly applicable to models with continuous explanatory variables.

At the end of this section we will discuss how to estimate the coefficients of discrete explanatory variables in SIM, a method that can be combined with the ADE/WADE method. For this reason we introduce the notation

$\displaystyle {\boldsymbol{X}}=({\boldsymbol{T}},{\boldsymbol{U}})$

for the regressors. Here, $ {\boldsymbol{T}}$ ($ q$-dimensional) refers explicitly to continuous variables and $ {\boldsymbol{U}}$ ($ p$-dimensional) to discrete (or categorical) variables.

Let us first consider a model with a $ q$-dimensional vector of continuous variables only, i.e.,

$\displaystyle E(Y\vert{\boldsymbol{T}})=m({\boldsymbol{T}})=g({\boldsymbol{T}}^\top {\boldsymbol{\gamma}}).$ (6.18)

Then, the vector of weighted average derivatives $ {\boldsymbol{\delta}}$ is given by

$\displaystyle {\boldsymbol{\delta}}=E\left\{\gradi_m({\boldsymbol{T}})\,w({\boldsymbol{T}})\right\} = E\left\{g'({\boldsymbol{T}}^\top {\boldsymbol{\gamma}})\,w({\boldsymbol{T}})\right\}\;{\boldsymbol{\gamma}},$ (6.19)

where $ \gradi_m({\boldsymbol{T}})=\left(\partial_{1} m({\boldsymbol{T}}), \ldots,\partial_{q} m({\boldsymbol{T}}) \right)^\top $ is the vector of partial derivatives of $ m(\bullet)$, $ g'$ the derivative of $ g(\bullet)$ and $ w(\bullet )$ a weight function. (By $ \partial_{k}$ we denote the partial derivative w.r.t. the $ k$th argument of the function.)

Looking at (6.19) shows that $ {\boldsymbol{\delta}}$ equals $ {\boldsymbol{\gamma}}$ up to scale. Hence, if we find a way to estimate $ {\boldsymbol{\delta}}$ then we can also estimate $ {\boldsymbol{\gamma}}$ up to scale. The approach studied in Powell et al. (1989) uses the density $ f(\bullet)$ of $ {\boldsymbol{T}}$ as the weight function:

$\displaystyle w(t)=f(t).$

This estimator is sometimes referred to as density weighted ADE or DWADE. We will concentrate on this particular weight function. Generalizations to other weight functions are possible.

For deriving the estimator, it is instructive to write (6.19) in more detail:

$\displaystyle {\boldsymbol{\delta}} = \int_{\mathbb{R}^q} \gradi_m({\boldsymbol{t}})\, f^2({\boldsymbol{t}})\,d{\boldsymbol{t}} = \int\cdots\int \left(\partial_{1} m({\boldsymbol{t}}),\ldots,\partial_{q} m({\boldsymbol{t}})\right)^\top f^2({\boldsymbol{t}})\,dt_{1}\cdots dt_{q}.$

Partial integration yields

$\displaystyle {\boldsymbol{\delta}} = \left(\begin{array}{c} -2\int \cdots \int m({\boldsymbol{t}})\, \partial_{1}f({\boldsymbol{t}})\, f({\boldsymbol{t}})\,dt_{1}\cdots dt_{q}\\ \vdots \\ -2\int \cdots \int m({\boldsymbol{t}})\, \partial_{q}f({\boldsymbol{t}})\, f({\boldsymbol{t}})\,dt_{1}\cdots dt_{q}\end{array}\right) = -2\, E\{\gradi_f({\boldsymbol{T}})\,m({\boldsymbol{T}})\},$ (6.20)

if we assume that $ f({\boldsymbol{t}})\,m({\boldsymbol{t}})\to 0$ for $ \Vert{\boldsymbol{t}}\Vert \to \infty$. Noting that $ m({\boldsymbol{t}})=E(Y\vert{\boldsymbol{t}})$ and using the law of iterated expectations we finally arrive at

$\displaystyle E \{\gradi_f({\boldsymbol{T}}) m({\boldsymbol{T}})\}=E [E\{\gradi_f({\boldsymbol{T}}) Y\vert{\boldsymbol{T}}\}] =E\{\gradi_f({\boldsymbol{T}}) Y\}.$ (6.21)

We can now estimate $ {\boldsymbol{\delta}}$ by using the sample analog of the right hand side of (6.21):

$\displaystyle \widehat{{\boldsymbol{\delta}}} =- \frac{2}{n} \sum_{i=1}^{n} \widehat{\gradi}_{f_{{\boldsymbol{h}}}}({\boldsymbol{T}}_{i}) Y_{i},$ (6.22)

where we estimate $ \gradi_f$ by

$\displaystyle \widehat{\gradi}_{f_{{\boldsymbol{h}}}}({\boldsymbol{t}}) = \left(\partial_{1} \widehat{f}_{{\boldsymbol{h}}}({\boldsymbol{t}}), \ldots, \partial_{q} \widehat{f}_{{\boldsymbol{h}}}({\boldsymbol{t}})\right)^\top .$ (6.23)

Here, $ \partial_{k} \widehat{f}_{{\boldsymbol{h}}}({\boldsymbol{t}})$ denotes the $ k$th partial derivative of the multivariate kernel density estimator from Section 3.6, i.e.,

$\displaystyle \partial_{k} \widehat{f}_{{\boldsymbol{h}}}({\boldsymbol{t}}) = \frac{1}{n\, h_{1}\cdots h_{q}} \sum_{j=1}^{n} \frac{1}{h_{k}}\, \partial_{k}{\mathcal{K}}\left( \frac{t_{1}-T_{j1}}{h_1},\ldots,\frac{t_{q}-T_{jq}}{h_{q}}\right). $
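A compact Python sketch of the density weighted ADE (6.22) is given below. It uses a product Gaussian kernel with a single bandwidth (so the kernel gradient is available in closed form), leave-one-out sums and simulated data; all of these choices are illustrative simplifications rather than the exact construction of Powell et al. (1989).

```python
import numpy as np

def dwade(T, y, h):
    """Density weighted average derivative estimator (6.22) using a
    product Gaussian kernel with common bandwidth h and leave-one-out
    kernel density gradients (cf. (6.23))."""
    n, q = T.shape
    diff = (T[:, None, :] - T[None, :, :]) / h            # (i, j, k)
    K = np.exp(-0.5 * diff ** 2) / (np.sqrt(2 * np.pi) * h)
    prodK = K.prod(axis=2)                                 # product kernel value
    gradK = -(diff / h) * prodK[:, :, None]                # d/dt_k of the kernel
    idx = np.arange(n)
    gradK[idx, idx, :] = 0.0                               # leave observation i out
    grad_f = gradK.sum(axis=1) / (n - 1)                   # estimated grad f(T_i)
    return -2.0 * np.mean(grad_f * y[:, None], axis=0)     # eq. (6.22)

# purely illustrative data
rng = np.random.default_rng(4)
n = 400
T = rng.normal(size=(n, 3))
gamma = np.array([1.0, -0.5, 2.0])
y = 1.0 / (1.0 + np.exp(-(T @ gamma))) + 0.1 * rng.normal(size=n)

delta = dwade(T, y, h=0.5)
print(delta / delta[0])     # delta is proportional to gamma; rescale to compare
```

Since $ \widehat{{\boldsymbol{\delta}}}$ estimates $ {\boldsymbol{\gamma}}$ only up to scale, the last line rescales it so that its first component equals one.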

Regarding the sampling distribution of the estimator $ \widehat{{\boldsymbol{\delta}}}$ Powell et al. (1989) have shown the following theorem.

THEOREM 6.4  
Under regularity conditions we have

$\displaystyle \sqrt{n} (\widehat{{\boldsymbol{\delta}}}-{\boldsymbol{\delta}}) \mathrel{\mathop{\longrightarrow}\limits_{}^{L}}
N( 0, {\mathbf{\Sigma}}). $

The covariance matrix is given by $ {\mathbf{\Sigma}}= 4\, E({\boldsymbol{r}}-{\boldsymbol{\delta}})({\boldsymbol{r}}-{\boldsymbol{\delta}})^\top$ where $ {\boldsymbol{r}}$ is given by $ {\boldsymbol{r}}=f({\boldsymbol{T}})\gradi_m({\boldsymbol{T}})-\{Y-m({\boldsymbol{T}})\}\gradi_f({\boldsymbol{T}})$.

Note that although $ \widehat{{\boldsymbol{\delta}}}$ is based on a multidimensional kernel density estimator, it achieves $ \sqrt{n}$-convergence, just as the SIM estimators considered previously, which were all based on univariate kernel smoothing.

EXAMPLE 6.1  
We cite here an example on unemployment after completion of an apprenticeship in West Germany which was first analyzed in Proença & Werwatz (1995). The data comprise individuals from the first nine waves of the German Socio-Economic Panel (GSOEP), see GSOEP (1991). The dependent variable $ Y$ takes on the value 1 if the individual is unemployed one year after completing the apprenticeship. Explanatory variables are gross monthly earnings as an apprentice (EARNINGS), city size (CITY SIZE), percentage of people apprenticed in a certain occupation divided by the percentage of people employed in this occupation in the entire company (DEGREE) and unemployment rate in the state where the individual lived (URATE).


Table 6.1: WADE fit of unemployment data

                GLM              WADE
                (logit)   h=1      h=1.25   h=1.5    h=1.75   h=2
  constant      -5630     --       --       --       --       --
  EARNINGS      -1.00     -1.00    -1.00    -1.00    -1.00    -1.00
  CITY SIZE     -0.72     -0.23    -0.47    -0.66    -0.81    -0.91
  DEGREE        -1.79     -1.06    -1.52    -1.93    -2.25    -2.47
  URATE         363.03    169.63   245.75   319.48   384.46   483.31

Table 6.1 shows the results of a GLM fit (using the logistic link function) and the WADE coefficients (for different bandwidths). For easier comparison the coefficients are rescaled such that all coefficients of EARNINGS are equal to $ -1$. To eliminate the possible effects of the correlation between the variables and to standardize the data, a Mahalanobis transformation had been applied before computing the WADE. Note that in particular for $ h=1.5$ the coefficients of both the GLM and WADE are very close. One may thus argue that the parametric logit model is not grossly misspecified. $ \Box$

Let us now turn to the problem of estimating coefficients for discrete or categorical variables. By definition, derivatives can only be calculated if the variable under study is continuous. Thus, the WADE method fails for discrete explanatory variables. Before presenting a more general solution, let us explain how the coefficient of one dichotomous variable is introduced to the model. We extend model (6.18) by an additional term:

$\displaystyle E(Y\vert{\boldsymbol{T}},{\boldsymbol{U}}) = g({\boldsymbol{T}}^\top {\boldsymbol{\gamma}}+ {\boldsymbol{U}}^\top {\boldsymbol{\beta}})$ (6.24)

with $ {\boldsymbol{T}}$ the continuous and $ {\boldsymbol{U}}$ the discrete part of the regressors. In the simplest case we suppose that the discrete part is a univariate binary variable $ U$ and that $ Y$ is binary as well. In this case, the model ``splits'' into two submodels

$\displaystyle E(Y\vert{\boldsymbol{T}},U)=P(Y=1\vert{\boldsymbol{T}},U)= \left\{ \begin{array}{ll} g({\boldsymbol{T}}^\top {\boldsymbol{\gamma}}) \quad &\textrm{if }\ U=0, \\ g({\boldsymbol{T}}^\top {\boldsymbol{\gamma}}+\beta ) &\textrm{if }\ U=1. \end{array} \right.$

There are in fact two models to be estimated, one for $ U=0$ and one for $ U=1$. Note that $ {\boldsymbol{\gamma}}$ alone could be estimated from the first equation only.

Theoretically, the same $ {\boldsymbol{T}}_i$ can be associated with either $ U_i=0$ yielding an index value of $ {\boldsymbol{T}}_i^\top{\boldsymbol{\gamma}}$ or with $ U_i=1$ leading to an index value of $ {\boldsymbol{T}}_i^\top{\boldsymbol{\gamma}}+\beta$. Thus the difference between the two indices is exactly $ \beta$, see the left panel of Figure 6.2.

Figure 6.2: The horizontal and the integral approach

In practice finding these horizontal differences will be rather difficult. A common approach is based on the observation that the integral difference between the two link functions also equals $ \beta$, see the right panel of Figure 6.2.

A very simple estimator is proposed in Korostelev & Müller (1995). Essentially, the coefficient of the binary explanatory variable can be estimated by

$\displaystyle \widehat{\beta }=\widehat{J}^{(1)}-\widehat{J}^{(0)}$

with

$\displaystyle \widehat{J}^{(0)}=\sum_{i=0}^{n_{0}} Y^{(0)}_{i} ({\boldsymbol{T}}^{(0)}_{i+1}-{\boldsymbol{T}}^{(0)}_{i})^\top{\boldsymbol{\gamma}}, \qquad \widehat{J}^{(1)}=\sum_{i=0}^{n_{1}} Y^{(1)}_{i} ({\boldsymbol{T}}^{(1)}_{i+1}-{\boldsymbol{T}}^{(1)}_{i})^\top{\boldsymbol{\gamma}},$

where the superscripts $ ^{(0)}$ and $ ^{(1)}$ denote the observations from the subsamples with $ U=0$ and $ U=1$, ordered by their index values $ {{\boldsymbol{T}}_i^{(k)}}^\top{\boldsymbol{\gamma}}$. In the simplest case of a binary $ Y$ variable the estimator is $ \sqrt{n}$-consistent, and its efficiency can be improved by a one-step estimator (Korostelev & Müller, 1995).
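A minimal Python sketch of this construction, under the assumptions just mentioned (observations sorted by their index values, $ \widehat{{\boldsymbol{\gamma}}}$ already available from the continuous part, and both subsamples covering a comparable index range), could look as follows.

```python
import numpy as np

def integrated_link(T, y, gamma):
    """Riemann-sum approximation of the integrated link over the observed
    index range of one subsample: sum of Y_i times the increments of the
    sorted index values (cf. the definition of J-hat above)."""
    eta = T @ gamma
    order = np.argsort(eta)
    eta, y = eta[order], y[order]
    return np.sum(y[:-1] * np.diff(eta))

def binary_u_coefficient(T, y, u, gamma):
    """Integral-difference estimate of the coefficient of a binary U;
    assumes both subsamples cover a comparable index range."""
    J0 = integrated_link(T[u == 0], y[u == 0], gamma)
    J1 = integrated_link(T[u == 1], y[u == 1], gamma)
    return J1 - J0
```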

Horowitz & Härdle (1996) extend this approach to multivariate multi-categorical $ {\boldsymbol{U}}$ and an arbitrary distribution of $ Y$. Recall the model (6.24)

$\displaystyle E(Y\vert{\boldsymbol{T}},{\boldsymbol{U}}) = g({\boldsymbol{T}}^\top {\boldsymbol{\gamma}}+ {\boldsymbol{U}}^\top {\boldsymbol{\beta}}).$

Again, the approach for this model is based on a split of the whole sample into subsamples according to the categories of $ {\boldsymbol{U}}$. However, this subsampling makes the estimator infeasible for more than one or two discrete variables. To compute integral differences of the link functions according to the realizations of $ {\boldsymbol{U}}$, we consider the truncated link function

$\displaystyle \widetilde{g}=c_{0}\,\Ind(g<c_{0}) + g\,\Ind(c_{0}\le g \le c_{1}) + c_{1}\,\Ind(g>c_{1}).$

Denote now by $ {\boldsymbol{u}}^{(k)}$ a possible realization of $ {\boldsymbol{U}}$; then the integrated link function conditional on $ {\boldsymbol{u}}^{(k)}$ is

$\displaystyle J^{(k)}=\int_{v_{0}}^{v_{1}} \widetilde{g}(v+{{\boldsymbol{u}}^{(k)}}^\top{\boldsymbol{\beta}})\,dv.$

Now compare the integrated link functions for all $ {\boldsymbol{U}}$-categories $ {\boldsymbol{u}}^{(k)}$ ( $ k=1,\ldots,M$) to the first $ {\boldsymbol{U}}$-category $ {\boldsymbol{u}}^{(0)}$. It holds

$\displaystyle J^{(k)}-J^{(0)} = (c_{1}-c_{0})\,\left\{{\boldsymbol{u}}^{(k)}-{\boldsymbol{u}}^{(0)}\right\}^\top {\boldsymbol{\beta}},$

hence with

$\displaystyle \Delta J = \left(\begin{array}{c} J^{(1)}-J^{(0)}\\ \vdots \\ J^{(M)}-J^{(0)} \end{array} \right), \qquad \Delta {\boldsymbol{u}}= \left(\begin{array}{c} \{{\boldsymbol{u}}^{(1)}-{\boldsymbol{u}}^{(0)}\}^\top\\ \vdots \\ \{{\boldsymbol{u}}^{(M)}-{\boldsymbol{u}}^{(0)}\}^\top \end{array} \right)$

we obtain $ \Delta J = (c_{1}-c_{0})\,\Delta {\boldsymbol{u}}\,{\boldsymbol{\beta}}$. This finally yields

$\displaystyle {\boldsymbol{\beta}}= (c_{1}-c_{0})^{-1} (\Delta {\boldsymbol{u}}^\top \Delta {\boldsymbol{u}})^{-1} \Delta {\boldsymbol{u}}^\top \Delta J$ (6.25)

to determine the coefficients $ {\boldsymbol{\beta}}$. The estimation of $ {\boldsymbol{\beta}}$ is based on replacing $ J^{(k)}$ in (6.25) by

$\displaystyle \widehat{J}^{(k)}=\int_{v_{0}}^{v_{1}} \widehat{\widetilde{g}} (v+{\boldsymbol{\beta}}^\top {\boldsymbol{u}}^{(k)})\,dv$

with $ \widehat{\widetilde{g}}$ a nonparametric estimate of the truncated link function $ \widetilde{g}$. This estimate is obtained by a univariate regression of $ Y_i^{(k)}$ on the estimated ``continuous'' index values $ \widehat{{\boldsymbol{\gamma}}}^\top {\boldsymbol{T}}_i^{(k)}$. Horowitz & Härdle (1996) show that, using a $ \sqrt{n}$-consistent estimate $ \widehat{{\boldsymbol{\gamma}}}$ and a Nadaraya-Watson estimator $ \widehat{\widetilde{g}}$, the estimated coefficient $ \widehat{{\boldsymbol{\beta}}}$ is itself $ \sqrt{n}$-consistent and has an asymptotic normal distribution.
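To illustrate the last step, the relation (6.25) reduces to a small least squares computation once the integrated link values $ \widehat{J}^{(k)}$ and the category vectors $ {\boldsymbol{u}}^{(k)}$ are available. The following sketch shows this computation with purely hypothetical numbers and truncation bounds.

```python
import numpy as np

def beta_from_integrated_links(J, U_cats, c0, c1):
    """Solve (6.25) for the coefficients of the discrete regressors, given
    the integrated truncated link values J[k] for the categories u^(k),
    k = 0,...,M, with category 0 as the reference."""
    J = np.asarray(J, dtype=float)
    U_cats = np.asarray(U_cats, dtype=float)
    delta_J = J[1:] - J[0]                     # J^(k) - J^(0)
    delta_u = U_cats[1:] - U_cats[0]           # rows u^(k) - u^(0)
    # least squares solution of delta_J = (c1 - c0) * delta_u @ beta
    beta, *_ = np.linalg.lstsq(delta_u, delta_J, rcond=None)
    return beta / (c1 - c0)

# hypothetical illustration: one binary U (M = 1), truncation bounds 0 and 1
print(beta_from_integrated_links(J=[2.0, 2.8], U_cats=[[0.0], [1.0]],
                                 c0=0.0, c1=1.0))      # -> [0.8]
```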