# 3.3 Smoothing Parameter Selection

We have still not found a way to select the bandwidth that is both applicable in practice and theoretically desirable. In the following two subsections we will introduce two of the most frequently used methods of bandwidth selection: the plug-in method and the method of cross-validation. For each method we will describe one representative. In the case of the plug-in method we will focus on the "quick & dirty" plug-in method introduced by Silverman. With regard to cross-validation we will focus on least squares cross-validation. For more complete treatments of plug-in and cross-validation methods of bandwidth selection, see e.g. Härdle (1991) and Park & Turlach (1992).

## 3.3.1 Silverman's Rule of Thumb

Generally speaking, plug-in methods derive their name from their underlying principle: if you have an expression involving an unknown parameter, replace the unknown parameter with an estimate. Take (3.19) as an example. The expression on the right hand side involves the unknown quantity $\|f''\|_{2}^{2}$. Suppose we knew or assumed that the unknown density $f$ belongs to the family of normal distributions with mean $\mu$ and variance $\sigma^{2}$; then we have

$$\|f''\|_{2}^{2}=\sigma^{-5}\int \left\{\varphi''(x)\right\}^{2}dx \qquad (3.21)$$
$$=\sigma^{-5}\,\frac{3}{8\sqrt{\pi}}\approx 0.212\,\sigma^{-5}, \qquad (3.22)$$

where $\varphi$ denotes the pdf of the standard normal distribution. It remains to replace the unknown standard deviation $\sigma$ by an estimator $\hat{\sigma}$, such as

$$\hat{\sigma}=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}}.$$

To apply (3.19) in practice we have to choose a kernel function. Taking the Gaussian kernel (which is identical to the standard normal pdf and will therefore be denoted by $\varphi$, too) we get the following "rule-of-thumb" bandwidth

$$\hat{h}_{ROT}=\left(\frac{4\,\hat{\sigma}^{5}}{3n}\right)^{1/5} \qquad (3.23)$$
$$\approx 1.06\,\hat{\sigma}\,n^{-1/5}, \qquad (3.24)$$

with $\hat{\sigma}$ denoting the estimated standard deviation.

You may object by referring to what we said at the beginning of Chapter 2. Isn't assuming normality of $f$ just the opposite of the philosophy of nonparametric density estimation? Yes, indeed. If we knew that $X$ had a normal distribution then we could estimate its density much more easily and efficiently by simply estimating $\mu$ with the sample mean and $\sigma^{2}$ with the sample variance, and plugging these estimates into the formula of the normal density.

What we have achieved by working under the normality assumption is an explicit, applicable formula for bandwidth selection. In practice, we do not know whether $X$ is normally distributed. If it is, then $\hat{h}_{ROT}$ in (3.24) gives the optimal bandwidth. If not, then (3.24) will still give a bandwidth not too far from the optimum, provided the distribution of $X$ is not too different from the normal distribution (the "reference distribution"). That is why we refer to (3.24) as a rule-of-thumb bandwidth: it will give reasonable results for all distributions that are unimodal, fairly symmetric and do not have tails that are too fat.
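As a concrete illustration, the rule-of-thumb bandwidth (3.24) takes only a few lines of code. The following is a minimal sketch (the function name `rule_of_thumb_bandwidth` is ours):

```python
import numpy as np

def rule_of_thumb_bandwidth(x):
    """Silverman's rule-of-thumb bandwidth (3.24) for the Gaussian kernel:
    h = 1.06 * sigma_hat * n^(-1/5), with the normal reference distribution."""
    x = np.asarray(x, dtype=float)
    sigma_hat = x.std(ddof=1)              # sample standard deviation
    return 1.06 * sigma_hat * x.size ** (-1 / 5)

# the bandwidth shrinks like n^(-1/5) as the sample grows
rng = np.random.default_rng(0)
h_rot = rule_of_thumb_bandwidth(rng.standard_normal(400))
```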

A practical problem with the rule-of-thumb bandwidth is its sensitivity to outliers. A single outlier may cause a too large estimate of $\sigma$ and hence imply a too large bandwidth. A more robust estimate is obtained from the interquartile range

$$\hat{R}=X_{[0.75n]}-X_{[0.25n]}, \qquad (3.25)$$

i.e. we simply calculate the sample interquartile range from the 75%-quantile $X_{[0.75n]}$ (upper quartile) and the 25%-quantile $X_{[0.25n]}$ (lower quartile). Still assuming that the true pdf is normal, we know that the 75%- and 25%-quantiles of the standard normal distribution are $z_{0.75}\approx 0.675$ and $z_{0.25}\approx -0.675$. Hence, asymptotically

$$\hat{R}\approx \sigma\,(z_{0.75}-z_{0.25})=\sigma\,\{0.675-(-0.675)\}=1.349\,\sigma \qquad (3.26)$$

and thus

$$\hat{\sigma}=\frac{\hat{R}}{1.349}. \qquad (3.27)$$

This relation can be plugged into (3.24) to give

$$\hat{h}_{ROT}=1.06\,\frac{\hat{R}}{1.349}\,n^{-1/5}\approx 0.79\,\hat{R}\,n^{-1/5}. \qquad (3.28)$$

We can combine (3.24) and (3.28) into a "better rule of thumb"

$$\hat{h}_{ROT}=1.06\,\min\left\{\hat{\sigma},\,\frac{\hat{R}}{1.349}\right\}n^{-1/5}. \qquad (3.29)$$

Again, both (3.24) and (3.29) will work quite well if the true density resembles the normal distribution but if the true density deviates substantially from the shape of the normal distribution (by being multimodal for instance) we might be considerably misled by estimates using the rule-of-thumb bandwidths.
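The robust variant (3.29) is equally simple to implement. A minimal sketch, using the constant 1.349 from the normal reference distribution (the function and variable names are ours):

```python
import numpy as np

def better_rule_of_thumb(x):
    """The "better rule of thumb" (3.29):
    h = 1.06 * min(sigma_hat, R_hat / 1.349) * n^(-1/5),
    where R_hat is the interquartile range, guarding against outliers."""
    x = np.asarray(x, dtype=float)
    sigma_hat = x.std(ddof=1)
    r_hat = np.quantile(x, 0.75) - np.quantile(x, 0.25)   # sample IQR as in (3.25)
    return 1.06 * min(sigma_hat, r_hat / 1.349) * x.size ** (-1 / 5)

# one gross outlier inflates sigma_hat but barely moves the interquartile
# range, so the robust bandwidth changes only marginally
rng = np.random.default_rng(1)
x = rng.standard_normal(500)
x_out = np.append(x, 50.0)
```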

## 3.3.2 Cross-Validation

As mentioned earlier, we will focus on least squares cross-validation. To get started, consider an alternative distance measure between $\hat{f}_{h}$ and $f$, the integrated squared error ($ISE$):

$$ISE(h)=\int \left\{\hat{f}_{h}(x)-f(x)\right\}^{2}dx. \qquad (3.30)$$

Comparing (3.30) with the definition of the $MISE$ you will notice that, as the name suggests, the $MISE$ is indeed the expected value of the $ISE$. Our aim is to choose a value for $h$ that makes the $ISE$ as small as possible. Let us rewrite (3.30):

$$ISE(h)=\int \hat{f}_{h}^{2}(x)\,dx-2\int \hat{f}_{h}(x)\,f(x)\,dx+\int f^{2}(x)\,dx. \qquad (3.31)$$

Apparently, $\int f^{2}(x)\,dx$ does not depend on $h$ and can be ignored as far as minimization over $h$ is concerned. Moreover, $\int \hat{f}_{h}^{2}(x)\,dx$ can be calculated from the data. This leaves us with one term, $\int \hat{f}_{h}(x)\,f(x)\,dx$, that depends on $h$ and involves the unknown density $f$.

If we look at this term more closely, we observe that $\int \hat{f}_{h}(x)\,f(x)\,dx$ is the expected value of $\hat{f}_{h}(X)$, where the expectation is calculated w.r.t. an independent random variable $X$ with density $f$. We can estimate this expected value by

$$\widehat{E_{X}\{\hat{f}_{h}(X)\}}=\frac{1}{n}\sum_{i=1}^{n}\hat{f}_{h,-i}(X_{i}), \qquad (3.32)$$

where

$$\hat{f}_{h,-i}(X_{i})=\frac{1}{(n-1)\,h}\sum_{\substack{j=1\\ j\neq i}}^{n}K\!\left(\frac{X_{i}-X_{j}}{h}\right). \qquad (3.33)$$

Here $\hat{f}_{h,-i}$ is the leave-one-out estimator. As its name suggests, the $i$th observation is not used in the calculation of $\hat{f}_{h,-i}(X_{i})$. This way we ensure that the observations used for calculating $\hat{f}_{h,-i}(X_{i})$ are independent of the observation $X_{i}$ at which we estimate the expected value in (3.32). (See also Exercise 3.15.)
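The leave-one-out estimates (3.33) can be computed for all observations at once by zeroing the diagonal of the matrix of pairwise kernel evaluations. A sketch for the Gaussian kernel (the helper name `loo_density` is ours):

```python
import numpy as np

def loo_density(x, h):
    """Leave-one-out estimates (3.33) with the Gaussian kernel.

    Returns the vector of f_hat_{h,-i}(X_i): the i-th observation is
    excluded from its own density estimate.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    u = (x[:, None] - x[None, :]) / h              # pairwise scaled differences
    k = np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)   # Gaussian kernel values
    np.fill_diagonal(k, 0.0)                       # drop the j = i term
    return k.sum(axis=1) / ((n - 1) * h)

rng = np.random.default_rng(3)
x = rng.standard_normal(300)
est = loo_density(x, h=0.4)
# averaging est gives the estimate (3.32) of E_X{f_hat(X)}
```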

Let us repeat the formula of the integrated squared error ($ISE$), the criterion function we seek to minimize with respect to $h$:

$$ISE(h)=\int \hat{f}_{h}^{2}(x)\,dx-2\,E_{X}\{\hat{f}_{h}(X)\}+\int f^{2}(x)\,dx. \qquad (3.34)$$

As pointed out above, we do not have to worry about the third term of the sum since it does not depend on $h$. Hence, we might as well bring it to the left side of the equation and consider the criterion

$$ISE(h)-\int f^{2}(x)\,dx=\int \hat{f}_{h}^{2}(x)\,dx-2\,E_{X}\{\hat{f}_{h}(X)\}. \qquad (3.35)$$

Now we can reap the fruits of the work done above and plug in (3.32) and (3.33) to estimate $E_{X}\{\hat{f}_{h}(X)\}$. This gives the so-called cross-validation criterion

$$CV(h)=\int \hat{f}_{h}^{2}(x)\,dx-\frac{2}{n}\sum_{i=1}^{n}\hat{f}_{h,-i}(X_{i}). \qquad (3.36)$$

We have almost everything in place for an applicable formula that allows us to calculate an optimal bandwidth from a set of observations. It remains to replace $\int \hat{f}_{h}^{2}(x)\,dx$ by a term that employs sums rather than an integral. It can be shown (Härdle, 1991, p. 230ff) that

$$\int \hat{f}_{h}^{2}(x)\,dx=\frac{1}{n^{2}h}\sum_{i=1}^{n}\sum_{j=1}^{n}K\!\star\!K\!\left(\frac{X_{j}-X_{i}}{h}\right), \qquad (3.37)$$

where $K\!\star\!K$ denotes the convolution of $K$ with itself, i.e. $K\!\star\!K(u)=\int K(u-v)\,K(v)\,dv$. Inserting (3.37) into (3.36) gives the following criterion to minimize w.r.t. $h$:

$$CV(h)=\frac{1}{n^{2}h}\sum_{i=1}^{n}\sum_{j=1}^{n}K\!\star\!K\!\left(\frac{X_{j}-X_{i}}{h}\right)-\frac{2}{n(n-1)\,h}\sum_{i=1}^{n}\sum_{\substack{j=1\\ j\neq i}}^{n}K\!\left(\frac{X_{i}-X_{j}}{h}\right). \qquad (3.38)$$

Thus, we have found a way to choose a bandwidth based on a reasonable criterion without having to make any assumptions about the family to which the unknown density belongs.
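For the Gaussian kernel the convolution $K\!\star\!K$ is simply the $N(0,2)$ density, $K\!\star\!K(u)=\exp(-u^{2}/4)/(2\sqrt{\pi})$, so (3.38) can be evaluated directly and minimized over a grid of candidate bandwidths. A sketch (the names and the grid limits are our choices):

```python
import numpy as np

def cv_criterion(h, x):
    """Least squares cross-validation criterion (3.38), Gaussian kernel."""
    x = np.asarray(x, dtype=float)
    n = x.size
    u = (x[:, None] - x[None, :]) / h               # pairwise scaled differences
    conv = np.exp(-u ** 2 / 4) / (2 * np.sqrt(np.pi))   # K*K term, N(0, 2) density
    kern = np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)     # Gaussian kernel term
    np.fill_diagonal(kern, 0.0)                     # leave-one-out: drop j = i
    return conv.sum() / (n ** 2 * h) - 2 * kern.sum() / (n * (n - 1) * h)

# minimize CV(h) over a grid of candidate bandwidths
rng = np.random.default_rng(2)
x = rng.standard_normal(200)
grid = np.linspace(0.05, 1.5, 60)
h_cv = grid[np.argmin([cv_criterion(h, x) for h in grid])]
```

A more careful implementation would refine the grid around the minimum or use a one-dimensional optimizer, since $CV(h)$ can have local minima.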

A nice feature of the cross-validation method is that the selected bandwidth automatically adapts to the smoothness of $f$. This is in contrast to plug-in methods like Silverman's rule-of-thumb or the refined methods presented in Subsection 3.3.3. Moreover, the cross-validation principle can analogously be applied to other density estimators (different from the kernel method). We will also see these advantages later in the context of regression function estimation.

Finally, it can be shown that the bandwidth selected by minimizing $CV(h)$ fulfills an optimality property. Denote the bandwidth selected by the cross-validation criterion by $\hat{h}_{CV}$ and assume that the density $f$ is a bounded function. Stone (1984) proved that this bandwidth is asymptotically optimal in the following sense:

$$\frac{ISE(\hat{h}_{CV})}{\inf_{h} ISE(h)} \xrightarrow{a.s.} 1,$$

where $\xrightarrow{a.s.}$ indicates convergence with probability 1 (almost sure convergence). In other words, this means that the $ISE$ obtained with $\hat{h}_{CV}$ asymptotically coincides with the $ISE$ obtained with the bandwidth which minimizes $ISE$, i.e. the optimal bandwidth.

## 3.3.3 Refined Plug-in Methods

With Silverman's rule-of-thumb we introduced in Subsection 3.3.1 the simplest possible plug-in bandwidth. Recall that essentially we assumed a normal density for a simple calculation of $\|f''\|_{2}^{2}$. This procedure yields a relatively good estimate of the optimal bandwidth if the true density function is nearly normal. However, if this is not the case (as for multimodal densities) Silverman's rule-of-thumb will fail dramatically. A natural refinement consists of using a nonparametric estimate for $\|f''\|_{2}^{2}$ as well. A further refinement is the use of a better approximation to the $MISE$. The following approaches apply these ideas.

In contrast to the cross-validation method, plug-in bandwidth selectors try to find a bandwidth that minimizes $AMISE$. This means we are looking at a different optimality criterion from that of the previous subsection.

A common method of assessing the quality of a selected bandwidth $\hat{h}$ is to compare it with $h_{opt}$, the optimal bandwidth, in relative value. We say that the convergence of $\hat{h}$ to $h_{opt}$ is of order $n^{-\alpha}$ if

$$n^{\alpha}\left(\frac{\hat{h}-h_{opt}}{h_{opt}}\right) \xrightarrow{L} Z,$$

where $Z$ is some random variable (independent of $n$). If $\alpha=\frac{1}{2}$ then this is usually called $\sqrt{n}$-convergence, and this rate of convergence is also the best achievable, as Hall & Marron (1991) have shown.

Park & Marron (1990) proposed to estimate $\|f''\|_{2}^{2}$ in $h_{opt}$ by using a nonparametric estimate of $f$ and taking the second derivative of this estimate. Suppose we use a bandwidth $g$ here; then the second derivative of $\hat{f}_{g}$ can be computed as

$$\hat{f}''_{g}(x)=\frac{1}{n g^{3}}\sum_{i=1}^{n}K''\!\left(\frac{x-X_{i}}{g}\right).$$

Of course, this yields a bandwidth choice problem as well; we have only transferred our problem to bandwidth selection for the second derivative. However, we can now use a rule-of-thumb bandwidth in this first stage. A further problem occurs due to the bias of $\|\hat{f}''_{g}\|_{2}^{2}$ as an estimator of $\|f''\|_{2}^{2}$, which can be overcome by using the bias-corrected estimate

$$\widehat{\|f''\|_{2}^{2}}=\|\hat{f}''_{g}\|_{2}^{2}-\frac{1}{n g^{5}}\,\|K''\|_{2}^{2}. \qquad (3.39)$$

Using this to replace $\|f''\|_{2}^{2}$ and optimizing w.r.t. $h$ in the $AMISE$ formula yields the bandwidth selector

$$\hat{h}_{PM}=\left(\frac{\|K\|_{2}^{2}}{\widehat{\|f''\|_{2}^{2}}\,\{\mu_{2}(K)\}^{2}\,n}\right)^{1/5}. \qquad (3.40)$$

Park & Marron (1990) showed that $\hat{h}_{PM}$ has a relative rate of convergence to $h_{opt}$ of order $n^{-4/13}$, which means a slower rate of convergence to the optimal bandwidth than the best achievable $n^{-1/2}$. The performance of $\hat{h}_{PM}$ in simulation studies is usually quite good. A disadvantage is that for small bandwidths $g$, the estimator $\widehat{\|f''\|_{2}^{2}}$ may give negative results.
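The refined plug-in idea can be sketched as follows, with a Gaussian kernel, a crude pilot bandwidth $g$, and a simple Riemann sum for the norm. The pilot rule, the integration grid and all names are our simplifying assumptions, not the exact first-stage rule of Park & Marron:

```python
import numpy as np

def plug_in_bandwidth(x, g=None):
    """Refined plug-in bandwidth in the spirit of Park & Marron (1990).

    Estimates ||f''||^2 via a kernel estimate of the second derivative
    (Gaussian kernel, pilot bandwidth g), applies the bias correction
    (3.39) and plugs the result into the AMISE-optimal bandwidth formula.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if g is None:
        # crude first-stage pilot bandwidth (an assumption, not the book's rule)
        g = 1.06 * x.std(ddof=1) * n ** (-1 / 7)

    # second-derivative estimate on a grid: K''(u) = (u^2 - 1) * phi(u)
    grid = np.linspace(x.min() - 3 * g, x.max() + 3 * g, 400)
    u = (grid[:, None] - x[None, :]) / g
    f2 = ((u ** 2 - 1) * np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)).sum(axis=1) / (n * g ** 3)

    # ||f''_g||^2 by a Riemann sum, then bias correction (3.39);
    # ||K''||^2 = 3 / (8 sqrt(pi)) for the Gaussian kernel
    norm_f2 = (f2 ** 2).sum() * (grid[1] - grid[0])
    norm_f2 -= 3 / (8 * np.sqrt(np.pi)) / (n * g ** 5)
    norm_f2 = max(norm_f2, 1e-12)        # guard against negative estimates

    # h = (||K||^2 / (||f''||^2 mu_2(K)^2 n))^(1/5); ||K||^2 = 1/(2 sqrt(pi)), mu_2 = 1
    return (1 / (2 * np.sqrt(np.pi)) / (norm_f2 * n)) ** (1 / 5)

rng = np.random.default_rng(4)
h_pm = plug_in_bandwidth(rng.standard_normal(300))
```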

## 3.3.4 An Optimal Bandwidth Selector?!

In Subsection 3.3.2 we introduced

$$\frac{ISE(\hat{h})}{\inf_{h} ISE(h)} \xrightarrow{a.s.} 1 \qquad (3.41)$$

as a criterion of asymptotic optimality for a bandwidth selector $\hat{h}$. This property is fulfilled by the bandwidth selected by the least squares cross-validation criterion, which tries to minimize $ISE$.

Most of the other existing bandwidth choice methods attempt to minimize $MISE$. A condition analogous to (3.41) for $MISE$ is usually much more complicated to prove. Hence, most of the literature is concerned with investigating the relative rate of convergence of a selected bandwidth $\hat{h}$ to $h_{opt}$. Fan & Marron (1992) derived a Fisher-type lower bound for the relative errors of a bandwidth selector.

Considering the relative order of convergence to $h_{opt}$ as a criterion, the best selector should fulfill $\sqrt{n}$-convergence, i.e.

$$n^{1/2}\left(\frac{\hat{h}-h_{opt}}{h_{opt}}\right) \xrightarrow{L} Z$$

for some random variable $Z$.

The biased cross-validation method of Hall et al. (1991) has this property; however, this selector is only superior for very large samples. Another $\sqrt{n}$-convergent method is smoothed cross-validation, but this selector pays for the fast rate of convergence with a larger asymptotic variance.

In summary: one best method does not exist! Moreover, even asymptotically optimal criteria may show bad behavior in simulations. See the bibliographic notes for references on such simulation studies. As a consequence, we recommend determining bandwidths by different selection methods and comparing the resulting density estimates.