When estimating a SIM, we have to take into account that the functional form of the link function is unknown. Moreover, since the shape of $g$ will determine the value of $\widehat{\beta}$, the link and the coefficients have to be estimated jointly. Recall that $X$ is the vector of explanatory variables, $\beta$ is a $d$-dimensional vector of unknown coefficients (or weights), and $g(\bullet)$ is an arbitrary smooth function. Two essentially different approaches exist for this purpose: iterative methods that optimize an objective function over $\beta$ (such as the SLS and PMLE methods below), and direct, non-iterative methods (such as the average derivative estimators considered later in this section).
The final step is relevant for all SIM estimators considered in the following sections. More precisely: suppose that $\beta$ has already been estimated by $\widehat{\beta}$. Set

$$\widehat{Z} = X^\top \widehat{\beta} \qquad (6.5)$$

and estimate the link function $g$ by a univariate nonparametric regression of $Y$ on $\widehat{Z}$, for instance the Nadaraya-Watson estimator

$$\widehat{g}_h(z) = \frac{\sum_{i=1}^{n} K_h\big(z - \widehat{Z}_i\big)\, Y_i}{\sum_{i=1}^{n} K_h\big(z - \widehat{Z}_i\big)}. \qquad (6.6)$$
As in the first part of this book, we concentrate on kernel-based methods here. The choice of the bandwidth $h$ and the kernel $K$ in (6.6) is independent of the choice of these parameters in the following sections. For other than kernel-based SIM estimators we refer to the bibliographic notes.
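This final smoothing step can be sketched in a few lines of Python. Everything below (the simulated model, the bandwidth, the assumed value of $\widehat{\beta}$) is a hypothetical illustration, not part of the text:

```python
import math, random

def nw_link(z0, z, y, h):
    """Nadaraya-Watson estimate of the link g at index value z0 (Gaussian kernel)."""
    w = [math.exp(-0.5 * ((z0 - zi) / h) ** 2) for zi in z]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

random.seed(0)
n = 200
beta_hat = (1.0, 2.0)  # pretend beta has already been estimated
x = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
y = [math.sin(0.3 * (x1 + 2.0 * x2)) + random.gauss(0, 0.1) for x1, x2 in x]

# fitted index values Z_i = X_i' beta_hat, then a univariate smooth of Y on Z
z = [beta_hat[0] * x1 + beta_hat[1] * x2 for x1, x2 in x]
g_hat = nw_link(0.0, z, y, h=0.4)
print(g_hat)  # should be close to the true g(0) = sin(0) = 0
```

Note that only a one-dimensional smoothing problem remains once the index values are computed, which is why this step does not suffer from the curse of dimensionality.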
As indicated in the introduction, we now concentrate on the estimation of $\beta$. The methods that we consider here, semiparametric least squares (SLS) and pseudo maximum likelihood estimation (PMLE), have the following idea in common: establish an appropriate objective function to estimate $\beta$ with the parametric $\sqrt{n}$-rate. Inside the objective function we use the conditional distribution of $Y$ given $X$, or the link function $g$, or both of them. As these are unknown, they need to be replaced by nonparametric estimates. The objective function is then maximized (or minimized) with respect to the parameter $\beta$.
Why is this extension of least squares or maximum likelihood not a trivial one? The reason is that when $\beta$ changes, the link function $g$ (respectively its nonparametric substitute) may change simultaneously. Thus, it is not clear whether the necessary iteration will converge and, even if it does, whether it converges to a unique optimum.
SLS and its weighted version (WSLS) were introduced by Ichimura (1993). As SLS is just a special case of WSLS with a weighting equal to the identity matrix, we concentrate here on WSLS. An objective function of least squares type can be motivated by minimizing the variation in the data that cannot be explained by the fitted regression. This "left-over" variation can be written as

$$E\big[\{Y - E(Y \mid X^\top\beta)\}^{2}\big].$$
We can account for possible heteroscedasticity by using proper weights $w(\bullet)$. This motivates

$$\min_{\beta}\; E\big[w(X)\,\{Y - E(Y \mid X^\top\beta)\}^{2}\big]. \qquad (6.7)$$
Define the WSLS estimator as

$$\widehat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} w(X_i)\,\big\{Y_i - \widehat{g}_{-i}(X_i^\top\beta)\big\}^{2}\, \mathbf{1}\{X_i \in \mathcal{X}_n\}, \qquad (6.9)$$

where $\mathbf{1}\{X_i \in \mathcal{X}_n\}$ is a trimming factor that restricts the estimation to a suitable subset $\mathcal{X}_n$ of the support and $\widehat{g}_{-i}$ is the leave-one-out Nadaraya-Watson estimator in (6.10).
In practice trimming can often be skipped, but it is helpful for establishing the asymptotic theory. The choice of the leave-one-out version of the Nadaraya-Watson estimator in (6.10) is made for the same reason. For the estimator given in (6.9) the following asymptotic results can be proved; for the details see Ichimura (1993).
$$\sqrt{n}\,\big(\widehat{\beta} - \beta\big) \;\overset{L}{\longrightarrow}\; N\big(0,\; V^{-1}\Sigma\, V^{-1}\big),$$

where, denoting $Z = X^\top\beta$ and $\widetilde{X} = X - E(X \mid Z)$,

$$V = E\big[w(X)\,\{g'(Z)\}^{2}\,\widetilde{X}\widetilde{X}^{\top}\big], \qquad \Sigma = E\big[w(X)^{2}\,\sigma^{2}(X)\,\{g'(Z)\}^{2}\,\widetilde{X}\widetilde{X}^{\top}\big],$$

with $\sigma^{2}(X) = \mathop{\rm Var}(Y \mid X)$.
How do we estimate the (asymptotic) variance of the estimator? Not surprisingly, the expressions $V$ and $\Sigma$ are estimated consistently by their empirical analogs. Denote $\widehat{Z}_i = X_i^\top\widehat{\beta}$, $\widehat{\varepsilon}_i = Y_i - \widehat{g}(\widehat{Z}_i)$ and $\widehat{\widetilde{X}}_i = X_i - \widehat{E}(X \mid \widehat{Z}_i)$. Then

$$\widehat{V} = \frac{1}{n}\sum_{i=1}^{n} w(X_i)\,\{\widehat{g}'(\widehat{Z}_i)\}^{2}\,\widehat{\widetilde{X}}_i\widehat{\widetilde{X}}_i^{\top}, \qquad \widehat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} w(X_i)^{2}\,\widehat{\varepsilon}_i^{\,2}\,\{\widehat{g}'(\widehat{Z}_i)\}^{2}\,\widehat{\widetilde{X}}_i\widehat{\widetilde{X}}_i^{\top},$$

and estimate the asymptotic covariance by $\widehat{V}^{-1}\widehat{\Sigma}\,\widehat{V}^{-1}$.
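To make the procedure concrete, here is a minimal pure-Python sketch of the (unweighted) SLS idea: a leave-one-out Nadaraya-Watson fit on the index inside a least-squares criterion, minimized by a simple grid search over the free coefficient. The simulated model, the bandwidth and the grid are hypothetical choices, and trimming is skipped, as the text notes is common in practice:

```python
import math, random

def loo_nw(z, y, h):
    """Leave-one-out Nadaraya-Watson fit at each observed index value."""
    n, fits = len(z), []
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            if j == i:
                continue  # leave observation i out, as in (6.10)
            k = math.exp(-0.5 * ((z[i] - z[j]) / h) ** 2)
            num += k * y[j]
            den += k
        fits.append(num / den)
    return fits

def sls_criterion(b2, x, y, h):
    """Sum of squared leave-one-out residuals for beta = (1, b2).

    The first coefficient is fixed to 1 for identification (beta is
    only identified up to scale in a SIM).
    """
    z = [x1 + b2 * x2 for x1, x2 in x]
    g = loo_nw(z, y, h)
    return sum((yi - gi) ** 2 for yi, gi in zip(y, g))

random.seed(1)
n = 200
x = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
y = [math.sin(0.3 * (x1 + 2.0 * x2)) + random.gauss(0, 0.1) for x1, x2 in x]

grid = [i / 10 for i in range(5, 41)]  # candidate values 0.5, ..., 4.0 for beta_2
b2_hat = min(grid, key=lambda b: sls_criterion(b, x, y, h=0.5))
print(b2_hat)  # typically close to the true value 2.0
```

The grid search stands in for the iterative optimization discussed above; it sidesteps, but does not solve, the convergence issues raised there.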
For motivation of pseudo maximum likelihood estimation (PMLE)
we rely on the ideas developed in the previous section.
Let us first discuss why pseudo maximum likelihood always leads to unbiased, $\sqrt{n}$-consistent estimators with minimum achievable variance: in fact, the computation of the PMLE
reduces to a formal parametric MLE problem with as many parameters
as observations. In this case (as we have also seen above),
the inverse Fisher information turns out to be a consistent estimator for the
covariance matrix of the PMLE. Gill (1989) and
Gill & van der Vaart (1993) explain this as follows:
a sensibly defined nonparametric MLE can be seen as an MLE in any parametric submodel which happens to include or pass through the point given by the PMLE. For smooth parametric submodels, the MLE solves the likelihood equations. Consequently, in nonparametric problems too, the PMLE can be interpreted as the solution of the likelihood equations for every parametric submodel passing through it.
You may have realized that the properties just mentioned coincide with the results obtained for the WSLS above. Indeed, looking closer at the objective function in (6.7), we could re-interpret it as the result of a maximum likelihood consideration. We only have to set the weight equal to the inverse of the (known) variance function (compare the discussion of optimal weighting). We refer to the bibliographic notes for more details.
To finally introduce the PMLE, let us come back to the case of a binary response model, i.e., $Y$ takes values in $\{0,1\}$ only. Recall that this means

$$E(Y \mid X) = P(Y = 1 \mid X) = g(X^\top\beta). \qquad (6.14)$$
We can now define the pseudo log-likelihood version of (6.12) by

$$\ell(\beta) = \sum_{i=1}^{n}\Big[Y_i \log \widehat{g}_{-i}(X_i^\top\beta) + (1 - Y_i)\log\big\{1 - \widehat{g}_{-i}(X_i^\top\beta)\big\}\Big],$$

where $\widehat{g}_{-i}$ denotes a leave-one-out kernel estimator of the link function.
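A minimal sketch of the resulting procedure follows. The simulated data, bandwidth and grid are hypothetical; the clipping of the estimated probabilities away from 0 and 1 is a crude stand-in for the trimming and weighting discussed by Klein & Spady, and grid search replaces a proper optimizer:

```python
import math, random

def loo_nw(z, y, h):
    """Leave-one-out kernel estimate of P(Y=1 | index) at each z[i]."""
    n, fits = len(z), []
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            if j == i:
                continue
            k = math.exp(-0.5 * ((z[i] - z[j]) / h) ** 2)
            num += k * y[j]
            den += k
        fits.append(num / den if den > 0 else 0.5)
    return fits

def pseudo_loglik(b2, x, y, h):
    """Binary-response pseudo log-likelihood for beta = (1, b2)."""
    z = [x1 + b2 * x2 for x1, x2 in x]
    p = loo_nw(z, y, h)
    eps = 1e-3  # clip fitted probabilities away from 0/1
    return sum(yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1 - pi, eps))
               for yi, pi in zip(y, p))

random.seed(4)
n = 300
x = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
y = [1.0 if random.random() < 1 / (1 + math.exp(-(x1 + 2.0 * x2))) else 0.0
     for x1, x2 in x]

grid = [i / 4 for i in range(2, 17)]  # candidate values 0.5, ..., 4.0 for beta_2
b2_hat = max(grid, key=lambda b: pseudo_loglik(b, x, y, h=0.4))
print(b2_hat)  # typically close to the true value 2.0
```

As with the SLS sketch, the first coefficient is fixed to one, since $\beta$ is identified only up to scale.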
Klein & Spady (1993) prove the following result on the asymptotic behavior of the PMLE $\widehat{\beta}$. More details about the conditions for consistency, the choice of the weight function and the appropriate adaptive bandwidth can be found in their article. We summarize the asymptotic distribution in the following theorem.
It turns out that the Klein & Spady estimator reaches the efficiency bound for SIM estimators. An estimator for the asymptotic covariance matrix can be obtained by its empirical analog.
As the derivation shows, the Klein & Spady PMLE only works for binary response models. In contrast, the WSLS was given for an arbitrary distribution of $Y$. For that reason we now consider an estimator that generalizes the idea of Klein & Spady to arbitrary distributions of $Y$.
Typically, the pseudo log-likelihood is based on the density (if $Y$ is continuous) or on the probability function (if $Y$ is discrete). The main idea of the following estimator, first proposed by Delecroix et al. (2003), is that the function which defines the distribution of $Y$ given $X$ depends on $X$ only through the index function $X^\top\beta$, i.e.,

$$f(y \mid X = x) = f\big(y \mid X^\top\beta = x^\top\beta\big).$$
To reach the $\sqrt{n}$-rate, fourth order kernels and a bandwidth of appropriate rate need to be used. Delecroix et al. (2003) present their result for the linear index function $x^\top\beta$.
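To illustrate what "fourth order" means, the following sketch checks numerically that the Gaussian-based kernel $K(u) = \tfrac{1}{2}(3-u^2)\varphi(u)$ integrates to one, has vanishing second moment and a nonzero fourth moment. This particular kernel is one common construction, not necessarily the one used by Delecroix et al. (2003):

```python
import math

def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def k4(u):
    """A fourth-order kernel built from the Gaussian: (3 - u^2)/2 * phi(u)."""
    return 0.5 * (3.0 - u * u) * phi(u)

def moment(p, lo=-10.0, hi=10.0, step=1e-3):
    """Numeric integral of u^p * k4(u) by the midpoint rule."""
    s, u = 0.0, lo + step / 2.0
    while u < hi:
        s += (u ** p) * k4(u) * step
        u += step
    return s

m0, m2, m4 = moment(0), moment(2), moment(4)
# order-four property: unit mass, zero second moment, nonzero fourth moment (-3)
print(abs(m0 - 1.0) < 1e-4, abs(m2) < 1e-4, abs(m4 + 3.0) < 1e-3)
```

The vanishing second moment is what reduces the smoothing bias enough for the parametric rate; the price is that $K$ takes negative values for $|u| > \sqrt{3}$.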
Again, it can be shown that the variance of this estimator reaches the lower efficiency bound for SIM. That means this procedure provides efficient estimators for $\beta$, too. Estimates for the matrices in the asymptotic variance can be obtained from their empirical analogs.
We will now turn to a different type of estimator with two advantages: (a) we do not need any distributional assumptions on $Y$, and (b) the resulting estimator is direct, i.e. non-iterative. The basic idea is to identify $\beta$ (up to scale) as an average derivative, and the studied estimator is therefore called the average derivative estimator (ADE) or weighted average derivative estimator (WADE). The advantages of ADE and WADE come at a cost, though, as these estimators are inefficient. Furthermore, the average derivative method is only directly applicable to models with continuous explanatory variables.
At the end of this section we will discuss how to estimate the coefficients of discrete explanatory variables in SIM, a method that can be combined with the ADE/WADE approach.
Let us first consider a model with a $d$-dimensional vector $X$ of continuous variables only, i.e.,

$$E(Y \mid X) = m(X) = g(X^\top\beta). \qquad (6.18)$$

For a weight function $w$, the weighted average derivative is

$$\delta = E\big\{w(X)\,\nabla m(X)\big\} = E\big\{w(X)\, g'(X^\top\beta)\big\}\,\beta. \qquad (6.19)$$
Looking at (6.19) shows that $\delta$ equals $\beta$ up to scale. Hence, if we find a way to estimate $\delta$, then we can also estimate $\beta$ up to scale.
The approach studied in Powell et al. (1989) uses the density $f$ of $X$ as the weight function:

$$\delta = E\big\{f(X)\,\nabla m(X)\big\}. \qquad (6.20)$$
For deriving the estimator, it is instructive to write (6.19) with $w = f$ in more detail. Integration by parts (assuming $f$ vanishes on the boundary of its support) gives

$$\delta = E\big\{f(X)\,\nabla m(X)\big\} = -2\,E\big\{Y\,\nabla f(X)\big\}. \qquad (6.21)$$
We can now estimate $\delta$ by using the sample analog of the right hand side of (6.21):

$$\widehat{\delta} = -\frac{2}{n}\sum_{i=1}^{n} Y_i\, \widehat{\nabla f}_{-i}(X_i), \qquad (6.22)$$

where $\widehat{\nabla f}_{-i}$ is the gradient of the leave-one-out kernel density estimator:

$$\widehat{\nabla f}_{-i}(x) = \frac{1}{(n-1)\,h^{d+1}}\sum_{j \neq i} \nabla K\!\left(\frac{x - X_j}{h}\right). \qquad (6.23)$$
Regarding the sampling distribution of the estimator, Powell et al. (1989) have shown the following theorem. Note that although $\widehat{\delta}$ is based on a multidimensional kernel density estimator, it achieves $\sqrt{n}$-convergence, just like the SIM estimators considered previously, which were all based on univariate kernel smoothing.
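The double-sum form of $\widehat{\delta}$ translates directly into code. The sketch below (a hypothetical simulated two-dimensional model, Gaussian product kernel, ad-hoc bandwidth) recovers the index direction up to scale via the ratio of the two components:

```python
import math, random

def grad_kh(u1, u2, h):
    """Gradient of the bivariate Gaussian product kernel K_h(u) = h^{-2} K(u/h)."""
    k = math.exp(-0.5 * (u1 * u1 + u2 * u2) / (h * h)) / (2.0 * math.pi * h * h)
    return (-u1 / (h * h) * k, -u2 / (h * h) * k)

def wade(x, y, h):
    """Sample analog -2/(n(n-1)) * sum_{i != j} grad K_h(X_i - X_j) Y_i."""
    n = len(x)
    d1 = d2 = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # leave-one-out, as in (6.22)/(6.23)
            g1, g2 = grad_kh(x[i][0] - x[j][0], x[i][1] - x[j][1], h)
            d1 += g1 * y[i]
            d2 += g2 * y[i]
    c = -2.0 / (n * (n - 1))
    return c * d1, c * d2

random.seed(2)
n = 400
x = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
y = [math.sin(0.3 * (x1 + 2.0 * x2)) + random.gauss(0, 0.1) for x1, x2 in x]

d1, d2 = wade(x, y, h=0.7)
print(d2 / d1)  # the ratio estimates beta_2 / beta_1, whose true value is 2
```

Note that no iteration over $\beta$ is needed: one pass over the pairs of observations yields the estimate, which is the "direct" property emphasized above.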
| | GLM (logit) | WADE ($h_1$) | WADE ($h_2$) | WADE ($h_3$) | WADE ($h_4$) | WADE ($h_5$) |
|---|---|---|---|---|---|---|
| constant | -5630 | -- | -- | -- | -- | -- |
| EARNINGS | -1.00 | -1.00 | -1.00 | -1.00 | -1.00 | -1.00 |
| CITY SIZE | -0.72 | -0.23 | -0.47 | -0.66 | -0.81 | -0.91 |
| DEGREE | -1.79 | -1.06 | -1.52 | -1.93 | -2.25 | -2.47 |
| URATE | 363.03 | 169.63 | 245.75 | 319.48 | 384.46 | 483.31 |

Table 6.1: GLM (logit) fit and WADE coefficients for five different bandwidths $h_1, \dots, h_5$.
Table 6.1 shows the results of a GLM fit (using the logistic link function) and the WADE coefficients (for different bandwidths). For easier comparison the coefficients are rescaled such that all coefficients of EARNINGS are equal to $-1$. To eliminate possible effects of the correlation between the variables and to standardize the data, a Mahalanobis transformation was applied before computing the WADE. Note that for moderate bandwidths the coefficients of the GLM and the WADE are very close. One may thus argue that the parametric logit model is not grossly misspecified.
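The rescaling used in Table 6.1 is simply a division by the negative of the EARNINGS coefficient; a toy sketch with made-up raw coefficients (not the values behind Table 6.1):

```python
def rescale(coefs, anchor="EARNINGS"):
    """Rescale index coefficients so that the anchor coefficient equals -1."""
    s = -coefs[anchor]
    return {name: value / s for name, value in coefs.items()}

# hypothetical raw WADE output
raw = {"EARNINGS": -0.004, "CITY SIZE": -0.002, "DEGREE": -0.006}
scaled = rescale(raw)
print(scaled["EARNINGS"])  # -1.0 by construction
```

Such a normalization is harmless precisely because $\beta$ in a SIM is identified only up to scale.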
Let us now turn to the problem of estimating coefficients of discrete or categorical explanatory variables. By definition, derivatives can only be calculated for continuous variables, so the WADE method fails for discrete explanatory variables. Before presenting a more general solution, let us explain how the coefficient of one dichotomous variable is introduced into the model. We extend model (6.18) by an additional term:

$$E(Y \mid X, Z) = g\big(X^\top\beta + \gamma Z\big), \qquad Z \in \{0,1\}. \qquad (6.24)$$
Theoretically, the same value of the link function can be associated either with $Z = 0$, yielding an index value of $X^\top\beta$, or with $Z = 1$, leading to an index value of $X^\top\beta + \gamma$. Thus the difference between the two indices is exactly $\gamma$; see the left panel of Figure 6.2. In practice, finding these horizontal differences will be rather difficult. A common approach is based on the observation that the integral difference between the two link functions also equals $\gamma$; see the right panel of Figure 6.2.
A very simple estimator was proposed by Korostelev & Müller (1995): essentially, the coefficient of the binary explanatory variable is estimated by the empirical analog of this integral difference between the two link functions.
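A sketch of this integral-difference idea on simulated data (the model, bandwidth and integration grid are illustrative choices, not Korostelev & Müller's exact estimator): estimate the two link functions by Nadaraya-Watson on the subsamples with the binary variable equal to 0 and to 1, then integrate their difference numerically:

```python
import math, random

def nw(z0, zs, ys, h):
    """Nadaraya-Watson estimate of E[Y | index = z0]."""
    num = den = 0.0
    for zj, yj in zip(zs, ys):
        k = math.exp(-0.5 * ((z0 - zj) / h) ** 2)
        num += k * yj
        den += k
    return num / den

random.seed(3)
G = lambda v: 1.0 / (1.0 + math.exp(-v))  # true link: logistic CDF, runs from 0 to 1
c_true = 1.5                              # coefficient of the binary variable

data = []
for _ in range(600):
    x1 = random.uniform(-5.0, 5.0)        # continuous index variable
    x2 = random.randint(0, 1)             # binary explanatory variable
    y = 1.0 if random.random() < G(x1 + c_true * x2) else 0.0
    data.append((x1, x2, y))

z0s = [x1 for x1, x2, _ in data if x2 == 0]
y0s = [y for _, x2, y in data if x2 == 0]
z1s = [x1 for x1, x2, _ in data if x2 == 1]
y1s = [y for _, x2, y in data if x2 == 1]

# integral of (g_1 - g_0) over a grid approximates the binary coefficient
step, total, v = 0.1, 0.0, -4.0
while v <= 4.0:
    total += (nw(v, z1s, y1s, 0.4) - nw(v, z0s, y0s, 0.4)) * step
    v += step
print(total)  # roughly c_true = 1.5
```

The approximation relies on the link running from 0 to 1 over the integration range, which is why a link of CDF type (as in a binary response model) is the natural setting here.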
Horowitz & Härdle (1996) extend this approach to multivariate, multi-categorical discrete explanatory variables and an arbitrary distribution of $Y$. Recall the model (6.24)