We have still not found a way to select the bandwidth that is both applicable in practice and theoretically desirable. In the following two subsections we introduce two of the most frequently used methods of bandwidth selection, the plug-in method and the method of cross-validation. For each method we describe one representative. In the case of the plug-in method we focus on the ``quick & dirty'' plug-in method introduced by Silverman. With regard to cross-validation we focus on least squares cross-validation. For more complete treatments of plug-in and cross-validation methods of bandwidth selection, see e.g. Härdle (1991) and Park & Turlach (1992).

Generally speaking, plug-in methods derive their name from their
underlying principle: if you have an expression involving an
unknown parameter, replace the unknown parameter with an estimate.
Take (3.19) as an example. The
expression on the right hand side involves the unknown quantity
$\|f''\|_2^2$.
Suppose we knew or assumed that the unknown density $f$
belongs to the family of normal distributions with mean $\mu$ and
variance $\sigma^2$;
then we have

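$$\|f''\|_2^2 = \sigma^{-5} \int \{\varphi''(x)\}^2\,dx \tag{3.21}$$

$$\phantom{\|f''\|_2^2} = \sigma^{-5}\,\frac{3}{8\sqrt{\pi}} \approx 0.212\,\sigma^{-5}, \tag{3.22}$$

where $\varphi$ denotes the pdf of the standard normal distribution. It remains to replace the unknown standard deviation $\sigma$ by an estimator $\hat\sigma$, such as

$$\hat\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar X)^2}.$$

To apply (3.19) in practice we also have to choose a kernel function. Taking the Gaussian kernel (which is identical to the standard normal pdf $\varphi$) we obtain the ``rule-of-thumb'' bandwidth

$$\hat h_{rot} = \left(\frac{\|\varphi\|_2^2}{\|\hat f''\|_2^2\,\mu_2^2(\varphi)\,n}\right)^{1/5} \tag{3.23}$$

$$\phantom{\hat h_{rot}} = \left(\frac{4\hat\sigma^5}{3n}\right)^{1/5} \approx 1.06\,\hat\sigma\,n^{-1/5} \tag{3.24}$$

with $\|\hat f''\|_2^2 = \hat\sigma^{-5}\,\frac{3}{8\sqrt{\pi}}$.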
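Under the normality assumption, the integral of the squared second derivative of the standard normal pdf has the closed form $3/(8\sqrt{\pi}) \approx 0.212$, the constant that enters (3.22). The following is a minimal numerical check of that constant, not part of the original text; the function names and the integration grid are illustrative assumptions.

```python
import numpy as np

def phi(x):
    """Standard normal pdf."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def phi_second_derivative(x):
    """Second derivative of the standard normal pdf: (x^2 - 1) * phi(x)."""
    return (x**2 - 1.0) * phi(x)

# Riemann sum on a fine uniform grid; the integrand is negligible beyond |x| = 10.
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
numeric = np.sum(phi_second_derivative(x) ** 2) * dx

closed_form = 3.0 / (8.0 * np.sqrt(np.pi))
print(numeric, closed_form)
```

Both printed values should agree to many digits, which confirms that only the scale factor $\sigma^{-5}$ remains unknown under the normal reference distribution.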

You may object by referring to what we said at the beginning of Chapter 2. Isn't assuming normality of $f$ just the opposite of the philosophy of nonparametric density estimation? Yes, indeed. If we knew that $X$ had a normal distribution then we could estimate its density much more easily and efficiently by simply estimating $\mu$ with the sample mean and $\sigma^2$ with the sample variance, and plugging these estimates into the formula of the normal density.

What we achieved by working under the normality assumption is an explicit,
applicable formula for bandwidth selection. In practice, we do not know
whether $X$ is normally distributed. If it is, then
$\hat h_{rot}$ in
(3.24) gives the optimal bandwidth. If not, then
$\hat h_{rot}$ in (3.24) will give a bandwidth not too far from the
optimum if the distribution of $X$ is not too different from the normal
distribution (the ``reference distribution'').
That's why we refer to (3.24) as a *rule-of-thumb
bandwidth* that will give reasonable results for all distributions that are
unimodal, fairly symmetric and do not have tails that are too fat.
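As a minimal sketch, assuming (3.24) is the familiar Gaussian-kernel rule $1.06\,\hat\sigma\,n^{-1/5}$, the rule-of-thumb bandwidth can be computed in a few lines. The function name `rule_of_thumb_bandwidth` and the simulated sample are hypothetical and serve only as an illustration.

```python
import numpy as np

def rule_of_thumb_bandwidth(x):
    """Rule-of-thumb bandwidth 1.06 * sigma_hat * n^(-1/5) for a Gaussian kernel.

    sigma_hat is the sample standard deviation with divisor n - 1.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    sigma_hat = x.std(ddof=1)
    return 1.06 * sigma_hat * n ** (-1.0 / 5.0)

# Illustrative usage on simulated standard normal data (an assumption).
rng = np.random.default_rng(0)
sample = rng.normal(size=200)
h_rot = rule_of_thumb_bandwidth(sample)
print(h_rot)
```

Because the bandwidth is proportional to $\hat\sigma$, rescaling the data rescales the bandwidth by the same factor.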

A practical problem with the rule-of-thumb bandwidth is its sensitivity to outliers. A single outlier may inflate the estimate of $\sigma$ and hence yield a bandwidth that is too large. A more robust estimator is obtained from the interquartile range

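$$R = X_{[0.75n]} - X_{[0.25n]}, \tag{3.25}$$

i.e. we simply calculate the sample interquartile range from the 75%-quantile $X_{[0.75n]}$ (upper quartile) and the 25%-quantile $X_{[0.25n]}$ (lower quartile). Still assuming that the true pdf is normal we know that $X \sim N(\mu,\sigma^2)$ and $Z = \frac{X-\mu}{\sigma} \sim N(0,1)$. Hence, asymptotically

$$\begin{aligned}
R &= X_{[0.75n]} - X_{[0.25n]} \\
  &= (\mu + \sigma Z_{[0.75n]}) - (\mu + \sigma Z_{[0.25n]}) \\
  &= \sigma\,(Z_{[0.75n]} - Z_{[0.25n]}) \\
  &\approx \sigma\,\{0.67 - (-0.67)\} = 1.34\,\sigma,
\end{aligned} \tag{3.26}$$

and thus

$$\hat\sigma = \frac{R}{1.34}. \tag{3.27}$$

This relation can be plugged into (3.24) to give

$$\hat h_{rot} = 1.06\,\frac{R}{1.34}\,n^{-1/5} \approx 0.79\,R\,n^{-1/5}. \tag{3.28}$$

We can combine (3.24) and (3.28) into a ``better rule of thumb''

$$\hat h_{rot} = 1.06\,\min\left\{\hat\sigma, \frac{R}{1.34}\right\} n^{-1/5}. \tag{3.29}$$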

Again, both (3.24) and (3.29) will work quite well if the true density resembles the normal distribution. If, however, the true density deviates substantially from the shape of the normal distribution (by being multimodal, for instance), estimates based on the rule-of-thumb bandwidths may be considerably misleading.
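The robust variant can be sketched as follows, assuming the ``better rule of thumb'' takes the smaller of the sample standard deviation and the interquartile range divided by 1.34. The function name is hypothetical, and `np.percentile` with its default linear interpolation is used as a stand-in for the order-statistic quartiles, so results may differ slightly from a definition based on $X_{[0.75n]}$ and $X_{[0.25n]}$.

```python
import numpy as np

def better_rule_of_thumb(x):
    """Robust rule-of-thumb bandwidth: 1.06 * min(sigma_hat, R / 1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    sigma_hat = x.std(ddof=1)
    q25, q75 = np.percentile(x, [25, 75])  # interpolated sample quartiles
    iqr = q75 - q25
    return 1.06 * min(sigma_hat, iqr / 1.34) * n ** (-1.0 / 5.0)

# Illustrative data: a normal sample contaminated by one gross outlier.
rng = np.random.default_rng(1)
contaminated = np.concatenate([rng.normal(size=200), [1000.0]])

h_plain = 1.06 * contaminated.std(ddof=1) * contaminated.size ** (-0.2)
h_robust = better_rule_of_thumb(contaminated)
print(h_plain, h_robust)
```

In this contaminated sample the standard deviation is dominated by the single outlier, while the interquartile range is hardly affected, so the robust bandwidth stays close to what the clean data would suggest.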

As mentioned earlier, we will focus on least squares cross-validation.
To get started, consider an alternative distance measure between
$\hat f_h$ and $f$, the *integrated squared error* ($ISE$):

$$ISE(h) = ISE\{\hat f_h\} = \int \{\hat f_h(x) - f(x)\}^2\,dx. \tag{3.30}$$

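Comparing (3.30) with the definition of the $MISE$ you will notice that, as the name suggests, the $MISE$ is indeed the expected value of the $ISE$. Our aim is to choose a value for $h$ that will make the $ISE$ as small as possible. Let us rewrite (3.30):

$$ISE(h) = \int \hat f_h^2(x)\,dx - 2\int \{\hat f_h f\}(x)\,dx + \int f^2(x)\,dx. \tag{3.31}$$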

Apparently, $\int f^2(x)\,dx$ does not depend on $h$ and can be ignored as far as minimization over $h$ is concerned. Moreover, $\int \hat f_h^2(x)\,dx$ can be calculated from the data. This leaves us with one term that depends on $h$ and involves the unknown quantity $f$.

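If we look at this term more closely, we observe that $\int \{\hat f_h f\}(x)\,dx$ is the expected value of $\hat f_h(X)$, where the expectation is calculated w.r.t. an independent random variable $X$. We can estimate this expected value by

$$\widehat{E\{\hat f_h(X)\}} = \frac{1}{n}\sum_{i=1}^{n} \hat f_{h,-i}(X_i), \tag{3.32}$$

where

$$\hat f_{h,-i}(x) = \frac{1}{n-1}\sum_{j=1,\,j\neq i}^{n} K_h(x - X_j). \tag{3.33}$$

Here $\hat f_{h,-i}(x)$ is the leave-one-out estimator. As the name of this estimator suggests, the $i$th observation is not used in the calculation of $\hat f_{h,-i}(X_i)$. This way we ensure that the observations used for calculating $\hat f_{h,-i}(\cdot)$ are independent of the observation $X_i$ at which the estimator is evaluated in (3.32). (See also Exercise 3.15.)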
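The leave-one-out average can be sketched as below. This is an illustration under stated assumptions: a Gaussian kernel is used, $K_h(u) = K(u/h)/h$, and the names `gaussian_kernel` and `leave_one_out_mean` are hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel (standard normal pdf)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def leave_one_out_mean(x, h):
    """Average of the leave-one-out density estimates evaluated at the data points.

    For each i, the estimate at X_i uses all observations except X_i itself.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    u = (x[:, None] - x[None, :]) / h    # pairwise scaled differences
    k = gaussian_kernel(u) / h           # K_h(X_i - X_j)
    np.fill_diagonal(k, 0.0)             # leave out the terms with j == i
    f_loo = k.sum(axis=1) / (n - 1)      # leave-one-out estimate at each X_i
    return f_loo.mean()

# Illustrative usage on simulated data (an assumption).
rng = np.random.default_rng(2)
sample = rng.normal(size=100)
print(leave_one_out_mean(sample, h=0.4))
```

Zeroing the diagonal is what distinguishes this from averaging the ordinary estimate $\hat f_h(X_i)$, which would include the term $K_h(0)$ for every observation and thereby favor very small bandwidths.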

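Let us repeat the formula of the integrated squared error ($ISE$), the criterion function we seek to minimize with respect to $h$:

$$ISE(h) = \int \hat f_h^2(x)\,dx - 2\,E\{\hat f_h(X)\} + \int f^2(x)\,dx. \tag{3.34}$$

As pointed out above, we do not have to worry about the third term of the sum since it does not depend on $h$. Hence, we might as well bring it to the left side of the equation and consider the criterion

$$ISE(h) - \int f^2(x)\,dx = \int \hat f_h^2(x)\,dx - 2\,E\{\hat f_h(X)\}. \tag{3.35}$$

Now we can reap the fruits of the work done above and plug in (3.32) and (3.33) for estimating $E\{\hat f_h(X)\}$. This gives the so-called *cross-validation criterion*

$$CV(h) = \int \hat f_h^2(x)\,dx - \frac{2}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1,\,j\neq i}^{n} K_h(X_i - X_j). \tag{3.36}$$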

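We have almost everything in place for an applicable formula that allows us to calculate an optimal bandwidth from a set of observations. It remains to replace $\int \hat f_h^2(x)\,dx$ by a term that employs sums rather than an integral. It can be shown (Härdle, 1991, p. 230ff) that

$$\int \hat f_h^2(x)\,dx = \frac{1}{n^2 h}\sum_{i=1}^{n}\sum_{j=1}^{n} K\star K\left(\frac{X_j - X_i}{h}\right), \tag{3.37}$$

where $K\star K(u)$ is the convolution of $K$ with itself, i.e. $K\star K(u) = \int K(u-v)\,K(v)\,dv$. Inserting (3.37) into (3.36) gives the following criterion to minimize w.r.t. $h$:

$$CV(h) = \frac{1}{n^2 h}\sum_{i=1}^{n}\sum_{j=1}^{n} K\star K\left(\frac{X_j - X_i}{h}\right) - \frac{2}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1,\,j\neq i}^{n} K_h(X_i - X_j). \tag{3.38}$$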

Thus, we have found a way to choose a bandwidth based on a reasonable criterion without having to make any assumptions about the family to which the unknown density belongs.
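A minimal sketch of least squares cross-validation follows. It assumes a Gaussian kernel, for which the convolution of the kernel with itself is the $N(0,2)$ density, and it minimizes the criterion by a plain grid search. The function names, the grid, and the simulated data are illustrative assumptions; a real application might use a finer grid or a numerical optimizer.

```python
import numpy as np

def cv_criterion(x, h):
    """Least squares cross-validation criterion for a Gaussian kernel."""
    x = np.asarray(x, dtype=float)
    n = x.size
    u = (x[:, None] - x[None, :]) / h
    # First term: integral of the squared estimate, written as a double sum
    # over the kernel convolution (here the N(0, 2) density).
    kk = np.exp(-0.25 * u**2) / np.sqrt(4.0 * np.pi)
    term1 = kk.sum() / (n**2 * h)
    # Second term: twice the leave-one-out average; the diagonal (j == i)
    # is removed from the double sum.
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi) / h
    term2 = 2.0 * (k.sum() - np.trace(k)) / (n * (n - 1))
    return term1 - term2

def cv_bandwidth(x, grid):
    """Return the grid value minimizing the criterion, plus all scores."""
    scores = np.array([cv_criterion(x, h) for h in grid])
    return grid[np.argmin(scores)], scores

# Illustrative usage on simulated data (an assumption).
rng = np.random.default_rng(3)
sample = rng.normal(size=200)
grid = np.linspace(0.05, 1.5, 30)
h_cv, scores = cv_bandwidth(sample, grid)
print(h_cv)
```

The criterion can have several local minima, which is one reason a grid search is a reasonable first choice; plotting `scores` against `grid` helps to judge whether the selected minimum is well defined.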

A nice feature of the cross-validation method is that the selected bandwidth automatically adapts to the smoothness of $f$. This is in contrast to plug-in methods like Silverman's rule-of-thumb or the refined methods presented in Subsection 3.3.3. Moreover, the cross-validation principle can analogously be applied to other density estimators (different from the kernel method). We will also see these advantages later in the context of regression function estimation.

Finally, it can be shown that the bandwidth selected by minimizing $CV$ fulfills an optimality property. Denote the bandwidth selected by the cross-validation criterion by $\hat h_{cv}$ and assume that the density $f$ is a bounded function. Stone (1984) proved that this bandwidth is asymptotically optimal in the following sense:

$$\frac{ISE(\hat h_{cv})}{\min_h ISE(h)} \stackrel{a.s.}{\longrightarrow} 1. \tag{3.39}$$

With Silverman's rule-of-thumb we introduced in Subsection 3.3.1 the simplest possible plug-in bandwidth. Recall that essentially we assumed a normal density for a simple calculation of $\|f''\|_2^2$. This procedure yields a relatively good estimate of the optimal bandwidth if the true density function is nearly normal. However, if this is not the case (as for multimodal densities) Silverman's rule-of-thumb will fail dramatically. A natural refinement consists of using nonparametric estimates for $\|f''\|_2^2$ as well. A further refinement is the use of a better approximation to $MISE$. The following approaches apply these ideas.

In contrast to the cross-validation method, plug-in bandwidth selectors try to find a bandwidth that minimizes $MISE$. This means we are looking at a different optimality criterion than in the previous section.

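A common method of assessing the quality of a selected bandwidth $\hat h$ is to compare it with $h_{opt}$, the $MISE$ optimal bandwidth, in relative value. We say that the convergence of $\hat h$ to $h_{opt}$ is of order $O_p(n^{-\alpha})$ if

$$\frac{\hat h - h_{opt}}{h_{opt}} = O_p(n^{-\alpha}).$$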

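Park & Marron (1990) proposed to estimate $\|f''\|_2^2$ in $MISE$ by using a nonparametric estimate of $f$ and taking the second derivative from this estimate. Suppose we use a bandwidth $g$ here, then the second derivative of $\hat f_g$ can be computed as

$$\hat f_g''(x) = \frac{1}{n g^3}\sum_{i=1}^{n} K''\left(\frac{x - X_i}{g}\right).$$

The squared norm $\|\hat f_g''\|_2^2$ is a biased estimate of $\|f''\|_2^2$, because the terms with $i = j$ in its double-sum representation contribute $\frac{1}{n g^5}\|K''\|_2^2$ regardless of $f$. Subtracting this contribution gives the bias-corrected estimate

$$\widehat{\|f''\|_2^2} = \|\hat f_g''\|_2^2 - \frac{1}{n g^5}\,\|K''\|_2^2.$$

Using this to replace $\|f''\|_2^2$ and optimizing w.r.t. $h$ in $AMISE$ yields the bandwidth selector

$$\hat h_{PM} = \left(\frac{\|K\|_2^2}{\widehat{\|f''\|_2^2}\;\mu_2^2(K)\,n}\right)^{1/5}.$$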

Park & Marron (1990) showed that $\hat h_{PM}$ has a relative rate of convergence to $h_{opt}$ of order $O_p(n^{-4/13})$, which means a rate of convergence $\alpha = 4/13 < 1/2$ to the optimal bandwidth $h_{opt}$. The performance of $\hat h_{PM}$ in simulation studies is usually quite good. A disadvantage is that for small bandwidths $g$, the estimator $\widehat{\|f''\|_2^2}$ may give negative results.
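The following sketch illustrates the idea of a refined plug-in selector in the spirit of Park & Marron (1990); it is not their full procedure. It assumes a Gaussian kernel, a pilot bandwidth `g` supplied by the user (how to choose it is not addressed here), numerical integration on a grid, and a correction that subtracts the $i = j$ contribution $\|K''\|_2^2/(n g^5)$ from the squared norm of the estimated second derivative. All function names are hypothetical.

```python
import numpy as np

def phi(u):
    """Gaussian kernel (standard normal pdf)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def plug_in_bandwidth(x, g, grid_size=2001):
    """Plug-in bandwidth using a kernel estimate of the second derivative of f.

    g is a pilot bandwidth (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # Evaluation grid wide enough that the estimate vanishes at both ends.
    t = np.linspace(x.min() - 8.0 * g, x.max() + 8.0 * g, grid_size)
    dt = t[1] - t[0]
    u = (t[:, None] - x[None, :]) / g
    # Second derivative of the kernel estimate; for the Gaussian kernel
    # K''(u) = (u^2 - 1) * phi(u).
    f2 = ((u**2 - 1.0) * phi(u)).sum(axis=1) / (n * g**3)
    norm_f2_sq = np.sum(f2**2) * dt                 # numerical squared L2 norm
    norm_k2_sq = 3.0 / (8.0 * np.sqrt(np.pi))       # ||K''||^2 for Gaussian K
    estimate = norm_f2_sq - norm_k2_sq / (n * g**5) # remove the i == j terms
    if estimate <= 0.0:
        raise ValueError("corrected estimate is not positive; try a larger g")
    norm_k_sq = 1.0 / (2.0 * np.sqrt(np.pi))        # ||K||^2 for Gaussian K
    mu2 = 1.0                                       # second moment of Gaussian K
    return (norm_k_sq / (estimate * mu2**2 * n)) ** 0.2

# Illustrative usage on simulated data with an arbitrary pilot bandwidth.
rng = np.random.default_rng(4)
sample = rng.normal(size=500)
h_pi = plug_in_bandwidth(sample, g=0.5)
print(h_pi)
```

The explicit check for a non-positive estimate reflects the disadvantage noted above: for small pilot bandwidths the corrected estimate can become negative, in which case no bandwidth can be computed from it.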

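In Subsection 3.3.2 we introduced

$$\frac{ISE(\hat h)}{\min_h ISE(h)} \stackrel{a.s.}{\longrightarrow} 1 \tag{3.41}$$

as a criterion of asymptotic optimality for a bandwidth selector $\hat h$. This property was fulfilled by the least squares cross-validation criterion, which tries to minimize $ISE$.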

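Most of the other existing bandwidth choice methods attempt to minimize $MISE$. A condition analogous to (3.41) for $MISE$ is usually much more complicated to prove. Hence, most of the literature is concerned with investigating the relative rate of convergence of a selected bandwidth $\hat h$ to $h_{opt}$. Fan & Marron (1992) derived a Fisher-type lower bound for the relative errors of a bandwidth selector. It is given by

$$\sqrt{n}\,\frac{\hat h - h_{opt}}{h_{opt}} \stackrel{L}{\longrightarrow} N(0, \sigma_f^2),$$

where the asymptotic variance $\sigma_f^2$ depends on the unknown density $f$.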

The biased cross-validation method of Hall et al. (1991) has this property; however, this selector is only superior for very large samples. Another $\sqrt{n}$-convergent method is the smoothed cross-validation method, but this selector pays with a larger asymptotic variance.

In summary: *one* best method does *not* exist! Moreover,
even asymptotically
optimal criteria may show bad behavior in simulations. See the
bibliographic notes for references on such simulation studies.
As a consequence, we recommend determining bandwidths by
different selection methods and comparing the resulting
density estimates.