As we pointed out in the preceding sections, for some nonparametric estimators at least an asymptotic connection can be made to kernel regression estimators. Hence, in this section we focus on finding a good way of choosing the smoothing parameter of kernel regression estimators, namely the bandwidth $h$.
What conditions do we require for a bandwidth selection rule to
be ``good''?
First of all it should have theoretically desirable properties. Secondly,
it has to be applicable in practice.
Regarding the first condition, a number of criteria have been proposed that measure in one way or another how close the estimate $\widehat{m}_{h}$ is to the true curve $m$, among them the integrated squared error ($ISE$), the averaged squared error ($ASE$) and their expected values, $MISE$ and $MASE$.
Which discrepancy measure should be used to derive a rule for choosing $h$? A natural choice would be $MISE(h)$ or its asymptotic version $AMISE(h)$, since we have some experience of its optimization from the density case. The $AMISE$ in the regression case, however, involves more unknown quantities than the $AMISE$ in the density estimator. As a result, plug-in approaches are mainly used for the local linear estimator due to its simpler bias formula; see for instance Wand & Jones (1995, pp. 138-139) for some examples.
We will discuss two approaches of rather general applicability: cross-validation and penalty terms. For the sake of simplicity, we restrict ourselves to bandwidth selection for the Nadaraya-Watson estimator here. For that estimator it has been shown (Marron & Härdle, 1986) that $ASE$, $ISE$ and $MISE$ lead asymptotically to the same level of smoothing. Hence, we can use the criterion that is easiest to calculate and manipulate: the discrete averaged squared error $ASE(h)$.
We want to find the bandwidth $h$ that minimizes $ASE(h)$. For easy reference, let us write down $ASE(h)$ in more detail:
$$
ASE(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{\widehat{m}_{h}(X_{i})-m(X_{i})\right\}^{2}w(X_{i}),
$$
where $w(\cdot)$ denotes a weight function.
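Note that $ASE(h)$ requires the true curve $m(\cdot)$ and can therefore only be evaluated in a simulated setting. The following minimal sketch (Python with the quartic kernel used throughout this chapter; the regression function, noise level and weight function are illustrative choices of ours, not the book's SPM quantlets) evaluates $ASE(h)$ for the Nadaraya-Watson estimator over a grid of bandwidths:

```python
import numpy as np

def quartic(u):
    """Quartic (biweight) kernel, the kernel used for the figures in this chapter."""
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

def nw(x, X, Y, h):
    """Nadaraya-Watson estimate m_hat_h evaluated at the points x."""
    K = quartic((x[:, None] - X[None, :]) / h)   # kernel weights K((x - X_j)/h)
    return K @ Y / K.sum(axis=1)

def ase(h, X, Y, m, w):
    """Discrete averaged squared error ASE(h); needs the true m, so it is
    only computable in a simulation where m is known."""
    return np.mean((nw(X, X, Y, h) - m(X)) ** 2 * w(X))

# illustrative simulation setup (our own choice, not the design behind the figures)
rng = np.random.default_rng(0)
n = 150
m = lambda x: np.sin(2 * np.pi * x)
w = lambda x: ((x > 0.05) & (x < 0.95)).astype(float)  # trim the boundary region
X = rng.uniform(0, 1, n)
Y = m(X) + rng.normal(scale=0.3, size=n)

grid = np.linspace(0.02, 0.3, 50)
ase_vals = [ase(h, X, Y, m, w) for h in grid]
print("ASE-optimal bandwidth:", grid[int(np.argmin(ase_vals))])
```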
We already pointed out that $ASE(h)$ is a random variable. Its conditional expectation, $MASE(h)=E\{ASE(h)\,|\,X_{1},\ldots,X_{n}\}$, is given by
$$
MASE(h)=b(h)+v(h)
$$
with squared bias
$$
b(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{\frac{1}{n}\sum_{j=1}^{n}K_{h}(X_{i}-X_{j})\,m(X_{j})}{\widehat{f}_{h}(X_{i})}-m(X_{i})\right\}^{2}w(X_{i}) \qquad (4.44)
$$
and variance
$$
v(h)=\frac{1}{n}\sum_{i=1}^{n}\left[\frac{1}{n^{2}}\sum_{j=1}^{n}\left\{\frac{K_{h}(X_{i}-X_{j})}{\widehat{f}_{h}(X_{i})}\right\}^{2}\sigma^{2}(X_{j})\right]w(X_{i}). \qquad (4.45)
$$
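Since $b(h)$ and $v(h)$ involve the unknown $m(\cdot)$ and $\sigma^{2}(\cdot)$, they too can only be computed in a simulation where both are known. The following sketch evaluates (4.44) and (4.45) for a few bandwidths under an illustrative design (regression function, error variance and weight function are our own assumptions):

```python
import numpy as np

def quartic(u):
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

def bias2_and_var(h, X, m, sigma2, w):
    """Squared-bias term b(h) of (4.44) and variance term v(h) of (4.45)
    for the Nadaraya-Watson estimator; both require the true m and sigma^2."""
    n = len(X)
    Kh = quartic((X[:, None] - X[None, :]) / h) / h   # K_h(X_i - X_j)
    fhat = Kh.mean(axis=1)                            # kernel density estimate f_hat_h(X_i)
    b = np.mean((Kh @ m(X) / n / fhat - m(X)) ** 2 * w(X))
    v = np.mean((Kh / fhat[:, None]) ** 2 @ sigma2(X) / n**2 * w(X))
    return b, v

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 150)
m = lambda x: np.sin(2 * np.pi * x)                   # illustrative true curve
sigma2 = lambda x: 0.09 * np.ones_like(x)             # homoscedastic error variance
w = lambda x: ((x > 0.05) & (x < 0.95)).astype(float)

for h in (0.05, 0.1, 0.2):
    b, v = bias2_and_var(h, X, m, sigma2, w)
    print(f"h={h:.2f}  b(h)={b:.4f}  v(h)={v:.4f}  MASE={b + v:.4f}")
```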
The following example shows the dependence of squared bias, variance and their sum $MASE(h)$ on the bandwidth $h$.
Figure 4.10: $MASE(h)$ (thick line), squared bias (thin solid line) and variance part (thin dashed line) for simulated data.  SPMsimulmase
EXAMPLE 4.12
The squared bias is increasing in $h$, as can be seen in Figure 4.10 where $b(h)$ is plotted along with the decreasing variance $v(h)$ and their sum $MASE(h)$ (thick line). Apparently, there is the familiar trade-off: increasing $h$ will reduce the variance but increase the squared bias. The minimum of $MASE(h)$ is attained at the bandwidth where these two effects balance.
You may wonder how we are able to compute these quantities since they involve the unknown $m(\cdot)$. The answer is simple: we have generated the data ourselves, determining the regression function $m(\cdot)$ beforehand. The data have been generated according to the model $Y_{i}=m(X_{i})+\varepsilon_{i}$; see Figure 4.11.

Figure 4.11: Simulated data with true and estimated curve.  SPMsimulmase
What is true for $MASE(h)$ is also true for $ASE(h)$: it involves $m(\cdot)$, the function we want to estimate. Therefore, we have to replace $m(X_{i})$ with an approximation that can be computed from the data.
A naive way of replacing $m(X_{i})$ would be to use the observations $Y_{i}$ instead, i.e.
$$
p(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{Y_{i}-\widehat{m}_{h}(X_{i})\right\}^{2}w(X_{i}), \qquad (4.46)
$$
which is called the resubstitution estimate and is essentially a weighted residual sum of squares ($RSS$).
However, there is a problem with this approach: $Y_{i}$ is used in $\widehat{m}_{h}(X_{i})$ to predict itself. As a consequence, $p(h)$ can be made arbitrarily small by letting $h\to 0$ (in which case $\widehat{m}_{h}$ becomes an interpolation of the $Y_{i}$s).
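This degeneracy is easy to demonstrate numerically. In the following sketch (again with illustrative simulated data of our own choosing, not the book's quantlets) the resubstitution criterion $p(h)$ collapses towards zero as the bandwidth shrinks:

```python
import numpy as np

def quartic(u):
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

def nw_fit(X, Y, h):
    """Nadaraya-Watson fit evaluated at the observation points themselves."""
    K = quartic((X[:, None] - X[None, :]) / h)
    return K @ Y / K.sum(axis=1)

def resubstitution(h, X, Y, w):
    """p(h) of (4.46): weighted residual sum of squares of the NW fit."""
    return np.mean((Y - nw_fit(X, Y, h)) ** 2 * w(X))

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 100))
Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=100)   # illustrative data
w = lambda x: np.ones_like(x)

for h in (0.2, 0.1, 0.05, 0.01, 0.002):
    print(f"h={h:<6} p(h)={resubstitution(h, X, Y, w):.5f}")
# p(h) shrinks as h -> 0: at tiny h the fit interpolates the Y_i,
# so minimizing p(h) over h is useless as a bandwidth selector.
```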
To gain more insight into this matter, let us expand $p(h)$ by adding and subtracting $m(X_{i})$:
$$
p(h)=\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}^{2}\,w(X_{i})+ASE(h)-\frac{2}{n}\sum_{i=1}^{n}\varepsilon_{i}\left\{\widehat{m}_{h}(X_{i})-m(X_{i})\right\}w(X_{i}), \qquad (4.47)
$$
where $\varepsilon_{i}=Y_{i}-m(X_{i})$.
Note that the first term of (4.47) does not depend on $h$, and the second term is $ASE(h)$. Hence, minimizing $p(h)$ would surely lead to the same result as minimizing $ASE(h)$ if it weren't for the third term. In fact, if we calculate the conditional expectation of $p(h)$,
$$
E\left\{p(h)\,|\,X_{1},\ldots,X_{n}\right\}=\frac{1}{n}\sum_{i=1}^{n}\sigma^{2}(X_{i})\,w(X_{i})+MASE(h)-\frac{2}{n^{2}}\sum_{i=1}^{n}W_{hi}(X_{i})\,\sigma^{2}(X_{i})\,w(X_{i}), \qquad (4.48)
$$
we observe that the third term of (4.48) (recall the definition of $W_{hi}$ in (4.7)), which is the conditional expectation of the third term of (4.47), tends to zero at the same rate as the variance $v(h)$ in (4.45) and has a negative sign. Therefore, $p(h)$ is downwardly biased as an estimate of $ASE(h)$, just as the bandwidth minimizing $p(h)$ is downwardly biased as an estimate of the bandwidth minimizing $ASE(h)$.
In the following two sections we will examine two ways out of this dilemma. The method of cross-validation replaces $\widehat{m}_{h}(X_{i})$ in (4.46) with the leave-one-out estimator $\widehat{m}_{h,-i}(X_{i})$. In a different approach, $p(h)$ is multiplied by a penalizing function that corrects for the downward bias of the resubstitution estimate.
We already familiarized ourselves with cross-validation in the context of bandwidth selection in kernel density estimation. This time around, we will use it as a remedy for the problem that in
$$
p(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{Y_{i}-\widehat{m}_{h}(X_{i})\right\}^{2}w(X_{i}) \qquad (4.49)
$$
each $Y_{i}$ is used in $\widehat{m}_{h}(X_{i})$ to predict itself.
Cross-validation solves this problem by employing the leave-one-out estimator
$$
\widehat{m}_{h,-i}(X_{i})=\frac{\sum_{j\neq i}K_{h}(X_{i}-X_{j})\,Y_{j}}{\sum_{j\neq i}K_{h}(X_{i}-X_{j})}. \qquad (4.50)
$$
That is, in estimating the regression function at $X_{i}$ the $i$th observation is left out (as reflected in the subscript ``$-i$'').
This leads to the cross-validation function
$$
CV(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{Y_{i}-\widehat{m}_{h,-i}(X_{i})\right\}^{2}w(X_{i}). \qquad (4.51)
$$
In terms of the analysis of the previous section, it can be shown that the conditional expectation of the third term of (4.47) is equal to zero if we use $\widehat{m}_{h,-i}(X_{i})$ instead of $\widehat{m}_{h}(X_{i})$, i.e.
$$
E\left[\frac{2}{n}\sum_{i=1}^{n}\varepsilon_{i}\left\{\widehat{m}_{h,-i}(X_{i})-m(X_{i})\right\}w(X_{i})\,\Big|\,X_{1},\ldots,X_{n}\right]=0.
$$
This means that minimizing $CV(h)$ is (on average) equivalent to minimizing $ASE(h)$, since the first term in (4.47) is independent of $h$.
We can conclude that with the bandwidth selection rule ``choose $h$ to minimize $CV(h)$'' we have found a rule that is both theoretically desirable and applicable in practice.
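Implementing this rule only requires removing the $i$th observation from the kernel sums before predicting at $X_{i}$. The sketch below (Python, quartic kernel, illustrative data and weight function; not the SPM quantlet used for the figures) selects the bandwidth by a simple grid search over $CV(h)$:

```python
import numpy as np

def quartic(u):
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

def cv(h, X, Y, w):
    """Cross-validation function CV(h) of (4.51) for the Nadaraya-Watson estimator:
    observation i is removed from the kernel sums before predicting at X_i."""
    K = quartic((X[:, None] - X[None, :]) / h)
    np.fill_diagonal(K, 0.0)              # leave observation i out
    denom = K.sum(axis=1)
    ok = denom > 0                        # points with at least one neighbour within h
    resid = Y[ok] - (K @ Y)[ok] / denom[ok]
    return np.mean(resid**2 * w(X)[ok])

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=200)   # illustrative data
w = lambda x: np.ones_like(x)

grid = np.linspace(0.03, 0.3, 60)
scores = np.array([cv(h, X, Y, w) for h in grid])
print("cross-validated bandwidth:", grid[int(np.argmin(scores))])
```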
Figure 4.12: Nadaraya-Watson kernel regression with cross-validated bandwidth, U.K. Family Expenditure Survey 1973.  SPMnadwaest
EXAMPLE 4.13
Let us now apply the cross-validation method to the Engel curve example. Figure 4.12 shows the Nadaraya-Watson kernel regression curve (recall that we always used the quartic kernel for the figures) with the bandwidth chosen by minimizing the cross-validation criterion $CV(h)$.
Figure 4.13: Local polynomial regression ($p=1$) with cross-validated bandwidth, U.K. Family Expenditure Survey 1973.  SPMlocpolyest
For comparison purposes, let us consider bandwidth selection for a different nonparametric smoothing method. You can easily see that applying the cross-validation approach to local polynomial regression presents no problem. This is what we have done in Figure 4.13, which shows the local linear estimate with cross-validated bandwidth for the same data. As we already pointed out in Subsection 4.1.3, the estimate shows more stable behavior in the high net-income region (a region with a small number of observations) and outperforms the Nadaraya-Watson estimate at the boundaries.
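In practice one does not have to code this from scratch. As an illustration, the following sketch uses the `KernelReg` class from statsmodels, which offers both local constant (Nadaraya-Watson) and local linear regression with a least-squares cross-validation bandwidth (`bw='cv_ls'`). Note that it works with a Gaussian kernel rather than the quartic kernel used for our figures, and the data below are simulated stand-ins for the net-income/food-expenditure pairs, not the Family Expenditure Survey:

```python
import numpy as np
from statsmodels.nonparametric.kernel_regression import KernelReg

# illustrative data standing in for the net-income / food-expenditure pairs
rng = np.random.default_rng(3)
x = rng.uniform(0, 3, 300)
y = 0.4 * np.log1p(x) + rng.normal(scale=0.1, size=300)

# local linear fit (reg_type='ll') with least-squares cross-validated bandwidth
ll = KernelReg(endog=y, exog=x, var_type='c', reg_type='ll', bw='cv_ls')
# Nadaraya-Watson fit (local constant) with the same bandwidth rule, for comparison
lc = KernelReg(endog=y, exog=x, var_type='c', reg_type='lc', bw='cv_ls')

grid = np.linspace(x.min(), x.max(), 100)
ll_fit, _ = ll.fit(grid)
lc_fit, _ = lc.fit(grid)
print("cross-validated bandwidths (ll, lc):", ll.bw, lc.bw)
```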
Recall the formula (4.48) for the conditional expectation of $p(h)$. That is,
$$
E\left\{p(h)\,|\,X_{1},\ldots,X_{n}\right\}\neq MASE(h).
$$
You might argue that this inequality is not all that important as long as the bandwidth minimizing $E\{p(h)\,|\,X_{1},\ldots,X_{n}\}$ is equal to the bandwidth minimizing $MASE(h)$. Unfortunately, one of the two terms causing the inequality, the last term of (4.48), depends on $h$ and is causing the downward bias.
The penalizing function approach corrects for the downward bias by multiplying $p(h)$ by a correction factor that penalizes bandwidths $h$ that are too small. The ``corrected version'' of $p(h)$ can be written as
$$
G(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{Y_{i}-\widehat{m}_{h}(X_{i})\right\}^{2}\,\Xi\!\left(\frac{1}{nh}\,\frac{K(0)}{\widehat{f}_{h}(X_{i})}\right)w(X_{i}) \qquad (4.52)
$$
with a correction function $\Xi$.
As we will see in a moment, a penalizing function $\Xi$ with first-order Taylor expansion $\Xi(u)=1+2u+O(u^{2})$ for $u\to 0$ will work well. Using this Taylor expansion we can write (4.52) as
$$
G(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{Y_{i}-\widehat{m}_{h}(X_{i})\right\}^{2}\left\{1+\frac{2}{nh}\,\frac{K(0)}{\widehat{f}_{h}(X_{i})}\right\}w(X_{i})+O\!\left\{(nh)^{-2}\right\}. \qquad (4.53)
$$
Multiplying out and ignoring terms of higher order, we get
$$
G(h)\approx\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}^{2}\,w(X_{i})+ASE(h)-\frac{2}{n}\sum_{i=1}^{n}\varepsilon_{i}\left\{\widehat{m}_{h}(X_{i})-m(X_{i})\right\}w(X_{i})+\frac{2}{n}\sum_{i=1}^{n}\frac{K(0)}{nh\,\widehat{f}_{h}(X_{i})}\,\varepsilon_{i}^{2}\,w(X_{i}). \qquad (4.54)
$$
The first term in (4.54) does not depend on $h$, and the second term is $ASE(h)$. The expectation of the third term, conditional on $X_{1},\ldots,X_{n}$, is equal to the last term of (4.48), which in turn is just the conditional expectation of the last term in (4.54) with a negative sign in front. Hence, the last two terms cancel each other out asymptotically and $G(h)$ is roughly equal to $ASE(h)$, up to a shift that does not depend on $h$.
The following list presents a number of penalizing functions that satisfy the expansion $\Xi(u)=1+2u+O(u^{2})$ (a code sketch of these functions follows the list):

- (1) Shibata's model selector (Shibata, 1981), $\Xi_{S}(u)=1+2u$;
- (2) Generalized cross-validation (Craven and Wahba, 1979; Li, 1985), $\Xi_{GCV}(u)=(1-u)^{-2}$;
- (3) Akaike's Information Criterion (Akaike, 1970), $\Xi_{AIC}(u)=\exp(2u)$;
- (4) Finite Prediction Error (Akaike, 1974), $\Xi_{FPE}(u)=\frac{1+u}{1-u}$;
- (5) Rice's $T$ (Rice, 1984), $\Xi_{T}(u)=(1-2u)^{-1}$.
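The penalized criterion $G(h)$ of (4.52) is straightforward to implement for any of these $\Xi$. The sketch below (quartic kernel, illustrative simulated data; the helper names are our own, not the SPM quantlets) compares the bandwidths selected by the five penalties:

```python
import numpy as np

# the five penalizing functions; all satisfy Xi(u) = 1 + 2u + O(u^2)
XI = {
    "Shibata": lambda u: 1 + 2 * u,
    "GCV":     lambda u: (1 - u) ** -2,
    "AIC":     lambda u: np.exp(2 * u),
    "FPE":     lambda u: (1 + u) / (1 - u),
    "Rice":    lambda u: (1 - 2 * u) ** -1,
}

def quartic(u):
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

def G(h, X, Y, w, xi):
    """Penalized criterion G(h) of (4.52) for the Nadaraya-Watson estimator,
    with penalizing function xi evaluated at u_i = K(0) / (n h fhat_h(X_i))."""
    n = len(X)
    K = quartic((X[:, None] - X[None, :]) / h)
    fhat = K.sum(axis=1) / (n * h)            # f_hat_h(X_i)
    mhat = K @ Y / K.sum(axis=1)
    u = quartic(0.0) / (n * h * fhat)         # K(0) = 15/16 for the quartic kernel
    return np.mean((Y - mhat) ** 2 * xi(u) * w(X))

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=200)
w = lambda x: np.ones_like(x)
grid = np.linspace(0.03, 0.3, 60)

for name, xi in XI.items():
    scores = [G(h, X, Y, w, xi) for h in grid]
    print(f"{name:8s} h_hat = {grid[int(np.argmin(scores))]:.3f}")
```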
To see how these various functions differ in the degree of penalizing small values of $h$, consider Figure 4.14.
Figure 4.14: Penalizing functions $\Xi$ as a function of the bandwidth $h$.  SPMpenalize
The functions differ in the relative weight they give to the variance and the bias of $\widehat{m}_{h}$. Rice's $T$ gives the most weight to variance reduction, while Shibata's model selector stresses bias reduction the most. The differences displayed in the graph are not substantial, however.
If we denote the bandwidth minimizing $G(h)$ by $\widehat{h}$ and the minimizer of $ASE(h)$ by $\widehat{h}_{0}$, then for $n\to\infty$
$$
\frac{ASE(\widehat{h})}{ASE(\widehat{h}_{0})}\stackrel{P}{\longrightarrow}1
\quad\textrm{and}\quad
\frac{\widehat{h}}{\widehat{h}_{0}}\stackrel{P}{\longrightarrow}1.
$$
Thus, regardless of which specific penalizing function we use, we can assume that with an increasing number of observations $\widehat{h}$ approximates the $ASE$-minimizing bandwidth $\widehat{h}_{0}$. Hence, choosing the bandwidth that minimizes $G(h)$ is another ``good'' rule for bandwidth selection in kernel regression estimation.
Note that
$$
CV(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{Y_{i}-\widehat{m}_{h}(X_{i})\right\}^{2}\left\{\frac{Y_{i}-\widehat{m}_{h,-i}(X_{i})}{Y_{i}-\widehat{m}_{h}(X_{i})}\right\}^{2}w(X_{i})
$$
and
$$
\frac{Y_{i}-\widehat{m}_{h,-i}(X_{i})}{Y_{i}-\widehat{m}_{h}(X_{i})}=\left\{1-\frac{K(0)}{nh\,\widehat{f}_{h}(X_{i})}\right\}^{-1}.
$$
Hence
$$
CV(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{Y_{i}-\widehat{m}_{h}(X_{i})\right\}^{2}\,\Xi_{GCV}\!\left(\frac{1}{nh}\,\frac{K(0)}{\widehat{f}_{h}(X_{i})}\right)w(X_{i}),
$$
i.e. $CV(h)$ is of the form (4.52) with $\Xi=\Xi_{GCV}$. An analogous result is possible for local polynomial regression, see Härdle & Müller (2000). Therefore the cross-validation approach is equivalent to the penalizing functions concept and has the same asymptotic properties. (Note that this equivalence does not hold in general for other smoothing approaches.)
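This identity is easy to verify numerically: computing the leave-one-out residuals directly and via the penalizing-function representation with $\Xi_{GCV}$ gives the same value of the criterion. A sketch with illustrative data (quartic kernel, weight function identically one):

```python
import numpy as np

def quartic(u):
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

rng = np.random.default_rng(5)
n = 200
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=n)
h = 0.15

K = quartic((X[:, None] - X[None, :]) / h)
fhat = K.sum(axis=1) / (n * h)
mhat = K @ Y / K.sum(axis=1)

# leave-one-out residuals computed directly ...
K0 = K.copy()
np.fill_diagonal(K0, 0.0)
loo_resid = Y - K0 @ Y / K0.sum(axis=1)
cv = np.mean(loo_resid ** 2)

# ... and via the penalizing-function identity with Xi_GCV(u) = (1 - u)^(-2)
u = quartic(0.0) / (n * h * fhat)
g_gcv = np.mean((Y - mhat) ** 2 * (1 - u) ** -2)

print(cv, g_gcv)   # the two numbers agree up to floating-point error
```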