# 4.3 Smoothing Parameter Selection

As we pointed out in the preceding sections, for some nonparametric estimators at least an asymptotic connection can be made to kernel regression estimators. Hence, in this section we focus on finding a good way of choosing the smoothing parameter of kernel regression estimators, namely the bandwidth $h$.

What conditions do we require for a bandwidth selection rule to be "good"? First of all, it should have theoretically desirable properties. Secondly, it has to be applicable in practice. Regarding the first condition, a number of criteria have been proposed that measure in one way or another how close the estimate is to the true curve. It will be instructive to go through these measures one by one:

• We are already familiar with the mean squared error

$$
\mathrm{MSE}(x,h)=E\left[\{\widehat m_h(x)-m(x)\}^2\right]. \tag{4.38}
$$

$m(x)$ is just an unknown constant, but the estimator $\widehat m_h(x)$ is a random variable. Hence, the expectation in (4.38) is taken with respect to the distribution of this random variable. Since $\widehat m_h(x)$ is a function of the random variables $Y_1,\ldots,Y_n$, it follows that

$$
\mathrm{MSE}(x,h)=E_{Y_1,\ldots,Y_n\mid X_1,\ldots,X_n}\left[\{\widehat m_h(x)-m(x)\}^2\right].
$$

The MSE measures the squared deviation of the estimate $\widehat m_h(x)$ from $m(x)$ at a single point $x$. If we are interested in how well we estimate an entire curve then we should really use a global measure of closeness of the estimate to the true curve. Moreover, in the previous section, where we derived an approximate formula for the MSE, we have already mentioned that the MSE-optimal bandwidth is a complicated function of unknown quantities like $m''(x)$ or $\sigma^2(x)$. One might argue that these unknowns may be replaced by consistent estimates, but to obtain such estimates the choice of a bandwidth is required, which is the very problem we are trying to solve.
• The integrated squared error

$$
\mathrm{ISE}(h)=\int\{\widehat m_h(x)-m(x)\}^2\,w(x)\,f_X(x)\,dx \tag{4.39}
$$

is a global discrepancy measure. But it is still a random variable, as different samples will produce different estimates $\widehat m_h(\cdot)$, and thereby different values of $\mathrm{ISE}(h)$. The weight function $w(\cdot)$ may be used to assign less weight to observations in regions of sparse data (to reduce the variance in such regions) or in the tails of the distribution of $X$ (to trim away boundary effects).

• The mean integrated squared error

$$
\mathrm{MISE}(h)=E\{\mathrm{ISE}(h)\}
$$

is not a random variable. It is the expected value of the random variable $\mathrm{ISE}(h)$, with the expectation being taken with respect to all possible samples of $X$ and $Y$.

• The averaged squared error

$$
\mathrm{ASE}(h)=\frac{1}{n}\sum_{i=1}^{n}\{\widehat m_h(X_i)-m(X_i)\}^2\,w(X_i) \tag{4.40}
$$

is a discrete approximation to $\mathrm{ISE}(h)$, and just like the ISE it is both a random variable and a global measure of discrepancy.

• The mean averaged squared error

$$
\mathrm{MASE}(h)=E\{\mathrm{ASE}(h)\mid X_1,\ldots,X_n\} \tag{4.41}
$$

is the conditional expectation of $\mathrm{ASE}(h)$, where the expectation is taken w.r.t. the joint distribution of $Y_1,\ldots,Y_n$. If we view the $X_i$ as random variables then $\mathrm{MASE}(h)$ is a random variable, too.

Which discrepancy measure should be used to derive a rule for choosing $h$? A natural choice would be $\mathrm{MISE}(h)$ or its asymptotic version $\mathrm{AMISE}(h)$ since we have some experience of its optimization from the density case. The AMISE in the regression case, however, involves more unknown quantities than the AMISE in the density case. As a result, plug-in approaches are mainly used for the local linear estimator, due to its simpler bias formula. See, for instance, Wand & Jones (1995, pp. 138-139) for some examples.

We will discuss two approaches of rather general applicability: cross-validation and penalty terms. For the sake of simplicity, we restrict ourselves to bandwidth selection for the Nadaraya-Watson estimator here. For that estimator it has been shown (Marron & Härdle, 1986) that ASE, ISE and MISE lead asymptotically to the same level of smoothing. Hence, we can use the criterion that is easiest to calculate and manipulate: the discrete $\mathrm{ASE}(h)$.
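Since the rest of this section works with the discrete ASE, it may help to have a computable version at hand. The following is a minimal Python sketch (the names `nw_estimate` and `ase` are our own, not from the text); it uses the Quartic kernel on which the chapter's figures are based and sets the weight function $w \equiv 1$:

```python
import numpy as np

def nw_estimate(x, X, Y, h):
    """Nadaraya-Watson estimate at the points in x, Quartic kernel."""
    u = (x[:, None] - X[None, :]) / h
    K = np.where(np.abs(u) <= 1, (15 / 16) * (1 - u**2) ** 2, 0.0)
    return K @ Y / K.sum(axis=1)

def ase(h, X, Y, m):
    """Averaged squared error (4.40) with weight function w = 1;
    computable only when the true curve m is known (e.g. in simulations)."""
    return np.mean((nw_estimate(X, X, Y, h) - m(X)) ** 2)
```

In a simulation, `ase` can be evaluated over a grid of bandwidths to locate the ASE-optimal $h$; with real data, $m$ is of course unknown, which is precisely the problem the following subsections address.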

## 4.3.1 A Closer Look at the Averaged Squared Error

We want to find the bandwidth that minimizes $\mathrm{ASE}(h)$. For easy reference, let us write down $\mathrm{ASE}(h)$ in more detail:

$$
\mathrm{ASE}(h)=\frac{1}{n}\sum_{i=1}^{n}\{\widehat m_h(X_i)-m(X_i)\}^2\,w(X_i). \tag{4.42}
$$

We already pointed out that $\mathrm{ASE}(h)$ is a random variable. Its conditional expectation, $\mathrm{MASE}(h)$, is given by

$$
\mathrm{MASE}(h)=\frac{1}{n}\sum_{i=1}^{n}E\left[\{\widehat m_h(X_i)-m(X_i)\}^2\,\Big|\,X_1,\ldots,X_n\right]w(X_i)=b^2(h)+v(h) \tag{4.43}
$$

with squared bias

$$
b^2(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{1}{n}\sum_{j=1}^{n}W_{hj}(X_i)\,m(X_j)-m(X_i)\right\}^2 w(X_i) \tag{4.44}
$$

and variance

$$
v(h)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{n^2}\sum_{j=1}^{n}W_{hj}^2(X_i)\,\sigma^2(X_j)\,w(X_i), \tag{4.45}
$$

where $W_{hj}(\cdot)$ denote the weights of the Nadaraya-Watson estimator.
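In a simulation setting, where $m$ and $\sigma^2$ are known, the decomposition (4.43)-(4.45) can be computed directly from the matrix of Nadaraya-Watson weights. A minimal sketch (our own function names; Quartic kernel, $w \equiv 1$, homoskedastic errors assumed):

```python
import numpy as np

def nw_weight_matrix(X, h):
    """Row i holds W_hj(X_i)/n: normalized Quartic-kernel weights."""
    u = (X[:, None] - X[None, :]) / h
    K = np.where(np.abs(u) <= 1, (15 / 16) * (1 - u**2) ** 2, 0.0)
    return K / K.sum(axis=1, keepdims=True)

def mase_parts(h, X, m, sigma2):
    """Squared bias (4.44) and variance (4.45), conditional on the X_i."""
    W = nw_weight_matrix(X, h)
    b2 = np.mean((W @ m(X) - m(X)) ** 2)      # weights applied to the true m
    v = sigma2 * np.mean((W**2).sum(axis=1))  # constant sigma^2 pulled out
    return b2, v
```

The sum `b2 + v` is $\mathrm{MASE}(h)$; evaluating both parts over a grid of bandwidths reproduces the bias-variance trade-off illustrated in the example below.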

The following example shows the dependence of squared bias, variance and their sum $\mathrm{MASE}(h)$ on the bandwidth $h$.

EXAMPLE 4.12
The squared bias is increasing in $h$, as can be seen in Figure 4.10, where $b^2(h)$ is plotted along with the decreasing variance $v(h)$ and their sum $\mathrm{MASE}(h)$ (thick line). Apparently, there is the familiar trade-off: increasing $h$ will reduce the variance but increase the squared bias. The minimum of $\mathrm{MASE}(h)$ is achieved at an intermediate value of $h$.

You may wonder how we are able to compute these quantities since they involve the unknown regression function $m(\cdot)$. The answer is simple: we have generated the data ourselves, fixing the regression function $m(\cdot)$ beforehand. The data have been generated according to the model

$$
Y_i=m(X_i)+\varepsilon_i,
$$

see Figure 4.11.

What is true for $\mathrm{MASE}(h)$ is also true for $\mathrm{ASE}(h)$: it involves $m(\cdot)$, the function we want to estimate. Therefore, we have to replace $\mathrm{ASE}(h)$ with an approximation that can be computed from the data. A naive way of replacing $m(X_i)$ would be to use the observations $Y_i$ instead, i.e.

$$
p(h)=\frac{1}{n}\sum_{i=1}^{n}\{Y_i-\widehat m_h(X_i)\}^2\,w(X_i), \tag{4.46}
$$

which is called the resubstitution estimate and is essentially a weighted residual sum of squares (RSS). However, there is a problem with this approach since $Y_i$ is used in $\widehat m_h(X_i)$ to predict itself. As a consequence, $p(h)$ can be made arbitrarily small by letting $h\to 0$ (in which case $\widehat m_h$ becomes an interpolation of the $Y_i$s).
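This degeneracy is easy to demonstrate numerically. In the sketch below (our own function names; Quartic kernel, $w \equiv 1$), once $h$ falls below the smallest gap between design points the fit interpolates the $Y_i$ and $p(h)$ collapses to zero:

```python
import numpy as np

def nw_fit(X, Y, h):
    """Nadaraya-Watson fit at the design points, Quartic kernel."""
    u = (X[:, None] - X[None, :]) / h
    K = np.where(np.abs(u) <= 1, (15 / 16) * (1 - u**2) ** 2, 0.0)
    return K @ Y / K.sum(axis=1)

def p_resub(h, X, Y):
    """Resubstitution estimate (4.46), i.e. a weighted RSS with w = 1."""
    return np.mean((Y - nw_fit(X, Y, h)) ** 2)
```

Minimizing `p_resub` over $h$ therefore just drives the bandwidth toward zero, which is why $p(h)$ cannot be used as a selection criterion as it stands.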

To gain more insight into this matter let us expand $p(h)$ by adding and subtracting $m(X_i)$:

$$
\begin{aligned}
p(h)&=\frac{1}{n}\sum_{i=1}^{n}\left[\{Y_i-m(X_i)\}+\{m(X_i)-\widehat m_h(X_i)\}\right]^2 w(X_i)\\
&=\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2\,w(X_i)+\mathrm{ASE}(h)-\frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\{\widehat m_h(X_i)-m(X_i)\}\,w(X_i),
\end{aligned} \tag{4.47}
$$

where $\varepsilon_i=Y_i-m(X_i)$. Note that the first term of (4.47) does not depend on $h$, and the second term is $\mathrm{ASE}(h)$. Hence, minimizing $p(h)$ would surely lead to the same result as minimizing $\mathrm{ASE}(h)$ if it weren't for the third term. In fact, if we calculate the conditional expectation of $p(h)$,

$$
E\{p(h)\mid X_1,\ldots,X_n\}=\frac{1}{n}\sum_{i=1}^{n}\sigma^2(X_i)\,w(X_i)+\mathrm{MASE}(h)-\frac{2}{n}\sum_{i=1}^{n}\frac{1}{n}W_{hi}(X_i)\,\sigma^2(X_i)\,w(X_i), \tag{4.48}
$$

we observe that the third term of (4.48) (recall the definition of $W_{hi}$ in (4.7)), which is the conditional expectation of the third term of (4.47), tends to zero at the same rate as the variance $v(h)$ in (4.45) and has a negative sign. Therefore, $p(h)$ is downwardly biased as an estimate of $\mathrm{ASE}(h)$, just as the bandwidth minimizing $p(h)$ is downwardly biased as an estimate of the bandwidth minimizing $\mathrm{ASE}(h)$.
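The downward bias is visible in a small simulation. The sketch below (our own names; a hypothetical curve $m(x)=\sin(2\pi x)$; Quartic kernel, $w \equiv 1$) compares the grid minimizers of $p(h)$ and $\mathrm{ASE}(h)$; the resubstitution criterion favors a smaller bandwidth:

```python
import numpy as np

def nw_fit(X, Y, h):
    # Nadaraya-Watson fit at the design points, Quartic kernel
    u = (X[:, None] - X[None, :]) / h
    K = np.where(np.abs(u) <= 1, (15 / 16) * (1 - u**2) ** 2, 0.0)
    return K @ Y / K.sum(axis=1)

rng = np.random.default_rng(1)
m = lambda x: np.sin(2 * np.pi * x)          # hypothetical true curve
X = np.sort(rng.uniform(0, 1, 100))
Y = m(X) + rng.normal(0, 0.3, 100)

grid = np.linspace(0.03, 0.3, 28)
p_vals = [np.mean((Y - nw_fit(X, Y, h)) ** 2) for h in grid]      # p(h), (4.46)
ase_vals = [np.mean((nw_fit(X, Y, h) - m(X)) ** 2) for h in grid] # ASE(h), (4.40)

h_p = grid[int(np.argmin(p_vals))]     # resubstitution pick: undersmooths
h_ase = grid[int(np.argmin(ase_vals))] # ASE-optimal pick
```

Because $p(h)$ shrinks toward zero as the fit approaches interpolation, `h_p` ends up at or near the smallest grid bandwidth, below `h_ase`.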

In the following two sections we will examine two ways out of this dilemma. The method of cross-validation replaces $\widehat m_h(X_i)$ in (4.46) with the leave-one-out estimator $\widehat m_{h,-i}(X_i)$. In a different approach, $p(h)$ is multiplied by a penalizing function which corrects for the downward bias of the resubstitution estimate.

## 4.3.2 Cross-Validation

We already familiarized ourselves with cross-validation in the context of bandwidth selection in kernel density estimation. This time around, we will use it as a remedy for the problem that in

$$
p(h)=\frac{1}{n}\sum_{i=1}^{n}\{Y_i-\widehat m_h(X_i)\}^2\,w(X_i) \tag{4.49}
$$

the observation $Y_i$ is used in $\widehat m_h(X_i)$ to predict itself. Cross-validation solves this problem by employing the leave-one-out estimator

$$
\widehat m_{h,-i}(X_i)=\frac{\sum_{j\neq i}K_h(X_i-X_j)\,Y_j}{\sum_{j\neq i}K_h(X_i-X_j)}. \tag{4.50}
$$

That is, in estimating $\widehat m_h(\cdot)$ at $X_i$ the $i$th observation is left out (as reflected in the subscript "$-i$"). This leads to the cross-validation function

$$
CV(h)=\frac{1}{n}\sum_{i=1}^{n}\{Y_i-\widehat m_{h,-i}(X_i)\}^2\,w(X_i). \tag{4.51}
$$

In terms of the analysis of the previous section, it can be shown that the conditional expectation of the third term of (4.47) is equal to zero if we use $\widehat m_{h,-i}(X_i)$ instead of $\widehat m_h(X_i)$, i.e.

$$
E\left[\frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\{\widehat m_{h,-i}(X_i)-m(X_i)\}\,w(X_i)\,\Big|\,X_1,\ldots,X_n\right]=0.
$$

This means minimizing $CV(h)$ is (on average) equivalent to minimizing $\mathrm{ASE}(h)$, since the first term in (4.47) is independent of $h$. We can conclude that with the bandwidth selection rule "choose $h$ to minimize $CV(h)$" we have found a rule that is both theoretically desirable and applicable in practice.
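The rule "choose $h$ to minimize $CV(h)$" translates directly into code. A minimal sketch for the Nadaraya-Watson case (our own function names; Quartic kernel, $w \equiv 1$), where the leave-one-out fits are obtained by zeroing the diagonal of the kernel matrix:

```python
import numpy as np

def cv_score(h, X, Y):
    """Cross-validation criterion (4.51): each Y_i is predicted by the
    leave-one-out estimator m_{h,-i}(X_i) of (4.50)."""
    u = (X[:, None] - X[None, :]) / h
    K = np.where(np.abs(u) <= 1, (15 / 16) * (1 - u**2) ** 2, 0.0)
    np.fill_diagonal(K, 0.0)          # leave the i-th observation out
    denom = K.sum(axis=1)
    if np.any(denom == 0.0):          # h too small: some point has no neighbor
        return np.inf
    return np.mean((Y - K @ Y / denom) ** 2)

def cv_bandwidth(X, Y, grid):
    """Grid-search minimizer of CV(h)."""
    return min(grid, key=lambda h: cv_score(h, X, Y))
```

In practice the grid should span from just above the largest design-point gap up to the range of the data; a too-small $h$ is automatically rejected here by the `inf` guard.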

EXAMPLE 4.13
Let us now apply the cross-validation method to the Engel curve example. Figure 4.12 shows the Nadaraya-Watson kernel regression curve (recall that we always used the Quartic kernel for the figures) with the bandwidth chosen by minimizing the cross-validation criterion $CV(h)$.

For comparison purposes, let us consider bandwidth selection for a different nonparametric smoothing method. You can easily see that applying the cross-validation approach to local polynomial regression presents no problem. This is what we have done in Figure 4.13. Here we show the local linear estimate with cross-validated bandwidth for the same data. As we already pointed out in Subsection 4.1.3, the local linear estimate shows more stable behavior in the high net-income region (a region with a small number of observations) and outperforms the Nadaraya-Watson estimate at the boundaries.

## 4.3.3 Penalizing Functions

Recall the formula (4.48) for the conditional expectation of $p(h)$. That is,

$$
E\{p(h)\mid X_1,\ldots,X_n\}\neq \mathrm{MASE}(h).
$$

You might argue that this inequality is not all that important as long as the bandwidth minimizing $E\{p(h)\mid X_1,\ldots,X_n\}$ is equal to the bandwidth minimizing $\mathrm{MASE}(h)$. Unfortunately, one of the two terms causing the inequality, the last term of (4.48), depends on $h$ and is causing the downward bias. The penalizing function approach corrects for the downward bias by multiplying $p(h)$ by a correction factor that penalizes too small values of $h$. The "corrected version" of $p(h)$ can be written as

$$
G(h)=\frac{1}{n}\sum_{i=1}^{n}\{Y_i-\widehat m_h(X_i)\}^2\,\Xi\!\left(\frac{1}{n}W_{hi}(X_i)\right)w(X_i), \tag{4.52}
$$

with a correction function $\Xi$. As we will see in a moment, a penalizing function $\Xi$ with first-order Taylor expansion

$$
\Xi(u)=1+2u+O(u^2),\quad u\to 0,
$$

will work well. Using this Taylor expansion we can write (4.52) as

$$
G(h)=\frac{1}{n}\sum_{i=1}^{n}\left[\varepsilon_i^2+2\varepsilon_i\{m(X_i)-\widehat m_h(X_i)\}+\{m(X_i)-\widehat m_h(X_i)\}^2\right]\left\{1+\frac{2}{n}W_{hi}(X_i)+O(n^{-2})\right\}w(X_i). \tag{4.53}
$$

Multiplying out and ignoring terms of higher order, we get

$$
G(h)\approx\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2\,w(X_i)+\mathrm{ASE}(h)-\frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\{\widehat m_h(X_i)-m(X_i)\}\,w(X_i)+\frac{2}{n}\sum_{i=1}^{n}\frac{1}{n}W_{hi}(X_i)\,\varepsilon_i^2\,w(X_i). \tag{4.54}
$$

The first term in (4.54) does not depend on $h$. The expectation of the third term, conditional on $X_1,\ldots,X_n$, is equal to the negative value of the last term of (4.48). But this is just the conditional expectation of the last term in (4.54), with a negative sign in front. Hence, the last two terms of (4.54) cancel each other out asymptotically and $G(h)$ is, up to a constant, roughly equal to $\mathrm{ASE}(h)$.

The following list presents a number of penalizing functions that satisfy the expansion $\Xi(u)=1+2u+O(u^2)$:

(1)
Shibata's model selector (Shibata, 1981),
$$
\Xi_S(u)=1+2u;
$$

(2)
Generalized cross-validation (Craven and Wahba, 1979; Li, 1985),
$$
\Xi_{GCV}(u)=(1-u)^{-2};
$$

(3)
Akaike's Information Criterion (Akaike, 1970),
$$
\Xi_{AIC}(u)=\exp(2u);
$$

(4)
Finite Prediction Error (Akaike, 1974),
$$
\Xi_{FPE}(u)=\frac{1+u}{1-u};
$$

(5)
Rice's $T$ (Rice, 1984),
$$
\Xi_T(u)=(1-2u)^{-1}.
$$
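The five correction functions are one-liners; the sketch below collects them and forms the penalized criterion $G(h)$ of (4.52) (our own function names; Quartic kernel, $w \equiv 1$). Their common first-order behavior $\Xi(u)=1+2u+O(u^2)$ is easy to verify numerically:

```python
import numpy as np

# Correction functions Xi(u); each has Taylor expansion 1 + 2u + O(u^2).
PENALTIES = {
    "Shibata": lambda u: 1 + 2 * u,          # model selector
    "GCV":     lambda u: (1 - u) ** -2,      # generalized cross-validation
    "AIC":     lambda u: np.exp(2 * u),      # Akaike's information criterion
    "FPE":     lambda u: (1 + u) / (1 - u),  # finite prediction error
    "Rice":    lambda u: (1 - 2 * u) ** -1,  # Rice's T
}

def g_score(h, X, Y, penalty):
    """Penalized criterion G(h) of (4.52) with w = 1."""
    u = (X[:, None] - X[None, :]) / h
    K = np.where(np.abs(u) <= 1, (15 / 16) * (1 - u**2) ** 2, 0.0)
    S = K / K.sum(axis=1, keepdims=True)  # S[i, j] = W_hj(X_i)/n
    return np.mean((Y - S @ Y) ** 2 * penalty(np.diag(S)))
```

A bandwidth is then chosen by minimizing `g_score` over a grid, exactly as with the cross-validation criterion.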

To see how these various functions differ in the degree of penalizing small values of $h$, consider Figure 4.14.

The functions differ in the relative weight they give to variance and bias of $\widehat m_h$. Rice's $T$ gives the most weight to variance reduction while Shibata's model selector stresses bias reduction the most. The differences displayed in the graph are not substantial, however. If we denote the bandwidth minimizing $G(h)$ by $\widehat h$ and the minimizer of $\mathrm{ASE}(h)$ by $\widehat h_0$, then for $n\to\infty$

$$
\frac{\mathrm{ASE}(\widehat h)}{\mathrm{ASE}(\widehat h_0)}\overset{P}{\longrightarrow}1
\quad\text{and}\quad
\frac{\widehat h}{\widehat h_0}\overset{P}{\longrightarrow}1.
$$

Thus, regardless of which specific penalizing function we use, we can assume that with an increasing number of observations $\widehat h$ approximates the $\mathrm{ASE}$-minimizing bandwidth $\widehat h_0$. Hence, choosing the bandwidth minimizing $G(h)$ is another "good" rule for bandwidth selection in kernel regression estimation.

Note that

$$
Y_i-\widehat m_{h,-i}(X_i)=\{Y_i-\widehat m_h(X_i)\}\left\{1-\frac{1}{n}W_{hi}(X_i)\right\}^{-1}
$$

and

$$
\left\{1-\frac{1}{n}W_{hi}(X_i)\right\}^{-2}=\Xi_{GCV}\!\left(\frac{1}{n}W_{hi}(X_i)\right).
$$

Hence

$$
CV(h)=\frac{1}{n}\sum_{i=1}^{n}\{Y_i-\widehat m_h(X_i)\}^2\,\Xi_{GCV}\!\left(\frac{1}{n}W_{hi}(X_i)\right)w(X_i),
$$

i.e. $CV(h)=G(h)$ with $\Xi=\Xi_{GCV}$. An analogous result is possible for local polynomial regression, see Härdle & Müller (2000). Therefore the cross-validation approach is equivalent to the penalizing function concept and has the same asymptotic properties. (Note that this equivalence does not hold in general for other smoothing approaches.)
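The equivalence can also be checked numerically: brute-force leave-one-out residuals and the GCV-penalized criterion give the same value. A sketch under the conventions used throughout (our own helper names; Quartic kernel, $w \equiv 1$):

```python
import numpy as np

def _kernel_matrix(X, h):
    # Quartic-kernel matrix K_h(X_i - X_j)
    u = (X[:, None] - X[None, :]) / h
    return np.where(np.abs(u) <= 1, (15 / 16) * (1 - u**2) ** 2, 0.0)

def cv_explicit(h, X, Y):
    """CV(h) of (4.51) via explicit leave-one-out fits."""
    n = len(X)
    res = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        K = _kernel_matrix(X, h)[i, keep]
        res[i] = Y[i] - K @ Y[keep] / K.sum()
    return np.mean(res**2)

def g_gcv(h, X, Y):
    """G(h) of (4.52) with the GCV penalty Xi(u) = (1 - u)^{-2}."""
    K = _kernel_matrix(X, h)
    S = K / K.sum(axis=1, keepdims=True)  # S[i, i] = W_hi(X_i)/n
    return np.mean(((Y - S @ Y) / (1 - np.diag(S))) ** 2)
```

For any $h$ at which every point retains a neighbor after deletion, `cv_explicit` and `g_gcv` agree up to floating-point error, which is exactly the identity derived above.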