3.3 Smoothing Parameter Selection

We have still not found a way to select a bandwidth that is both applicable in practice and theoretically desirable. In the following two subsections we will introduce two of the most frequently used methods of bandwidth selection, the plug-in method and the method of cross-validation. For each method we will describe one representative approach. In the case of the plug-in method we will focus on the ``quick & dirty'' plug-in method introduced by Silverman. With regard to cross-validation we will focus on least squares cross-validation. For more complete treatments of plug-in and cross-validation methods of bandwidth selection, see e.g. Härdle (1991) and Park & Turlach (1992).

3.3.1 Silverman's Rule of Thumb

Generally speaking, plug-in methods derive their name from their underlying principle: if you have an expression involving an unknown parameter, replace the unknown parameter with an estimate. Take (3.19) as an example. The expression on the right hand side involves the unknown quantity $ \Vert f''\Vert^{2}_{2}$. Suppose we knew or assumed that the unknown density $ f$ belongs to the family of normal distributions with mean $ \mu$ and variance $ \sigma^{2}$; then we have

$\displaystyle \Vert f'' \Vert^{2}_{2}$ $\displaystyle =$ $\displaystyle \sigma^{-5} \int\left\{\varphi''(x)\right\}^{2}\,dx$ (3.21)
  $\displaystyle =$ $\displaystyle \sigma^{-5}\,\frac{3}{8\sqrt{\pi}} \approx 0.212\; \sigma^{-5},$ (3.22)

where $ \varphi(\bullet)$ denotes the pdf of the standard normal distribution. It remains to replace the unknown standard deviation $ \sigma$ by an estimator $ \widehat \sigma$, such as

$\displaystyle \widehat \sigma=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}.$

To apply (3.19) in practice we have to choose a kernel function. Taking the Gaussian kernel (which is identical to the standard normal pdf and will therefore be denoted by $ \varphi$, too), we get the following ``rule-of-thumb'' bandwidth $ \widehat{h}_{rot}$:
$\displaystyle \widehat{h}_{rot}$ $\displaystyle =$ $\displaystyle \left(\frac{\Vert \varphi \Vert^{2}_{2}}{\Vert \widehat{f''} \Vert^{2}_{2}\,\mu^{2}_{2}(\varphi)\,n}\right)^{1/5}$ (3.23)
  $\displaystyle =$ $\displaystyle \left(\frac{4\widehat{\sigma}^{5}}{3n}\right)^{1/5} \approx 1.06\;\widehat{\sigma}\, n^{-1/5},$ (3.24)

with $ \Vert \widehat{f''} \Vert^{2}_{2}=\widehat{\sigma}^{-5}\frac{3}{8\sqrt{\pi}}$.
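
The rule-of-thumb bandwidth (3.24) is straightforward to compute. The following Python sketch (the function name and the simulated sample are our own choices, not part of the text) evaluates $1.06\,\widehat{\sigma}\,n^{-1/5}$ and, as a check, the constant $(4/3)^{1/5}\approx 1.06$.

```python
import numpy as np

def rule_of_thumb_bandwidth(x):
    """Silverman's rule-of-thumb bandwidth (3.24) for a Gaussian kernel."""
    n = len(x)
    sigma_hat = np.std(x, ddof=1)            # sample standard deviation
    return 1.06 * sigma_hat * n ** (-1 / 5)

# sanity check of the constant in (3.24): (4/3)^(1/5) = 1.0592... ~ 1.06
print((4 / 3) ** (1 / 5))

# illustrative example with simulated data
x = np.random.normal(loc=0.0, scale=2.0, size=200)
print(rule_of_thumb_bandwidth(x))
```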

You may object by referring to what we said at the beginning of Chapter 2. Isn't assuming normality of $ f(x)$ just the opposite of the philosophy of nonparametric density estimation? Yes, indeed. If we knew that $ X$ had a normal distribution then we could estimate its density much more easily and efficiently by simply estimating $ \mu$ with the sample mean and $ \sigma^{2}$ with the sample variance, and plugging these estimates into the formula of the normal density.

What we achieved by working under the normality assumption is an explicit, applicable formula for bandwidth selection. In practice, we do not know whether $ X$ is normally distributed. If it is, then $ \widehat{h}_{rot}$ in (3.24) gives the optimal bandwidth. If not, then $ \widehat{h}_{rot}$ in (3.24) will give a bandwidth not too far from the optimum, provided the distribution of $ X$ is not too different from the normal distribution (the ``reference distribution''). That's why we refer to (3.24) as a rule-of-thumb bandwidth that will give reasonable results for all distributions that are unimodal, fairly symmetric and do not have tails that are too fat.

A practical problem with the rule-of-thumb bandwidth is its sensitivity to outliers. A single outlier may lead to a too large estimate of $ \sigma$ and hence to a too large bandwidth. A more robust estimator is obtained from the interquartile range

$\displaystyle {R}=X_{[0.75n]}-X_{[0.25n]},$ (3.25)

i.e. we simply calculate the sample interquartile range from the 75%-quantile $ X_{[0.75n]}$ (upper quartile) and the 25%-quantile $ X_{[0.25n]}$ (lower quartile). Still assuming that the true pdf is normal, we know that $ X\sim N(\mu,\sigma^{2})$ and $ Z=\frac{X-\mu}{\sigma}\sim N(0,1)$. Hence, asymptotically
$\displaystyle {R}$ $\displaystyle =$ $\displaystyle X_{[0.75n]}-X_{[0.25n]}$ (3.26)
  $\displaystyle =$ $\displaystyle (\mu+\sigma Z_{[0.75n]})-(\mu+\sigma Z_{[0.25n]})$  
  $\displaystyle =$ $\displaystyle \sigma(Z_{[0.75n]}-Z_{[0.25n]})$  
  $\displaystyle \approx$ $\displaystyle \sigma(0.67-(-0.67))= 1.34\;\sigma,$  

and thus

$\displaystyle \widehat\sigma=\frac{{R}}{1.34}.$ (3.27)

This relation can be plugged into (3.24) to give

$\displaystyle \widehat{h}_{rot}=1.06\;\frac{{R}}{1.34}\;n^{-1/5} \approx 0.79\;{R}\;n^{-1/5}.$ (3.28)
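
The constants appearing here can be checked numerically. A small Python snippet using scipy's normal quantile function confirms that $Z_{[0.75n]}$ tends to $z_{0.75}\approx 0.6745$, so $z_{0.75}-z_{0.25}\approx 1.35\approx 1.34$, and that $1.06/1.34\approx 0.79$.

```python
from scipy.stats import norm

z75 = norm.ppf(0.75)          # upper quartile of N(0,1), about 0.6745
print(z75 - norm.ppf(0.25))   # about 1.349, rounded to 1.34 in (3.27)
print(1.06 / 1.34)            # about 0.791, the constant 0.79 in (3.28)
```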

We can combine (3.24) and (3.28) into a ``better rule of thumb''

$\displaystyle \widehat{h}_{rot}=1.06\,\min\left\{\widehat{\sigma},\, \frac{{R}}{1.34}\right\}\,n^{-1/5}.$ (3.29)

Again, both (3.24) and (3.29) will work quite well if the true density resembles the normal distribution, but if the true density deviates substantially from the shape of the normal distribution (by being multimodal, for instance), we might be considerably misled by estimates using the rule-of-thumb bandwidths.
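
To illustrate the robustified rule (3.29), here is a minimal Python sketch (function name and simulated data are our own choices); it only needs the sample standard deviation and the interquartile range.

```python
import numpy as np

def better_rule_of_thumb_bandwidth(x):
    """'Better rule of thumb' (3.29): 1.06 * min(sigma_hat, R/1.34) * n^(-1/5)."""
    n = len(x)
    sigma_hat = np.std(x, ddof=1)                        # sample standard deviation
    iqr = np.percentile(x, 75) - np.percentile(x, 25)    # interquartile range R
    return 1.06 * min(sigma_hat, iqr / 1.34) * n ** (-1 / 5)

# a single large outlier inflates sigma_hat but barely affects the IQR-based part
x = np.append(np.random.normal(size=200), 50.0)
print(better_rule_of_thumb_bandwidth(x))
```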

3.3.2 Cross-Validation

As mentioned earlier, we will focus on least squares cross-validation. To get started, consider an alternative distance measure between $ \widehat{f}$ and $ f$, the integrated squared error ($ \ise$):

$\displaystyle \ise(h)=\ise\{\widehat f_{h}\}=\int \{\widehat{f}_{h}-f\}^{2}(x)\,dx =\int \{\widehat{f}_{h}(x)-f(x)\}^{2}\,dx.$ (3.30)

Comparing (3.30) with the definition of the $ \mise$, you will notice that, as the name suggests, the $ \mise$ is indeed the expected value of the $ \ise$. Our aim is to choose a value for $ h$ that will make the $ \ise$ as small as possible. Let us rewrite (3.30) as

$\displaystyle \ise(h)=\int \widehat{f}^{2}_{h}(x)\,dx - 2 \int \{\widehat{f}_{h}f\}(x)\,dx + \int f^{2}(x)\,dx.$ (3.31)

Obviously, $ \int f^{2}(x)dx$ does not depend on $ h$ and can be ignored as far as minimization over $ h$ is concerned. Moreover, $ \int \widehat{f}_{h}^{2}(x)dx$ can be calculated from the data. This leaves us with one term that depends on $ h$ and involves the unknown quantity $ f$.

If we look at this term more closely, we observe that $ \int \{\widehat{f}_{h}f\}(x)\;dx$ is the expected value of $ \widehat{f}_{h}(X)$, where the expectation is calculated w.r.t. a random variable $ X$ that is distributed according to $ f$ and independent of the sample. We can estimate this expected value by

$\displaystyle \widehat{E\{\widehat{f}_{h}(X)\}}=\frac{1}{n}\sum_{i=1}^{n}\widehat{f}_{ h,-i}(X_{i}),$ (3.32)

where

$\displaystyle \widehat{f}_{h,-i}(x)=\frac{1}{n-1}\sum_{j=1,i \neq j}^{n}K_{h}(x-X_{j}).$ (3.33)

Here $ \widehat{f}_{h,-i}(x)$ is the leave-one-out estimator. As the name of this estimator suggests, the $ i$th observation is not used in the calculation of $ \widehat{f}_{h,-i}(X_{i})$. This way we ensure that the observations used for calculating $ \widehat{f}_{h,-i}(\bullet)$ are independent of $ X_i$, the observation at which the estimator is evaluated in (3.32). (See also Exercise 3.15.)
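
To make the leave-one-out idea concrete, the following Python sketch implements (3.32) and (3.33) for a Gaussian kernel; the kernel choice and the function names are our own, for illustration only.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def loo_density(x, data, i, h):
    """Leave-one-out estimator f_hat_{h,-i}(x) from (3.33): drop observation i."""
    others = np.delete(data, i)
    return np.mean(gaussian_kernel((x - others) / h)) / h

def expected_fhat(data, h):
    """Estimate of E{f_hat_h(X)} as in (3.32): average of loo estimates at the X_i."""
    n = len(data)
    return np.mean([loo_density(data[i], data, i, h) for i in range(n)])
```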

Let us repeat the formula of the integrated squared error ($ \ise$), the criterion function we seek to minimize with respect to $ h$:

$\displaystyle \ise(h)=\int \widehat{f}^{2}_{h}(x)\,dx - 2E\{\widehat{f}_{h}(X)\} + \int f^{2}(x)\,dx.$ (3.34)

As pointed out above, we do not have to worry about the third term of the sum since it does not depend on $ h$. Hence, we might as well bring it to the left side of the equation and consider the criterion

$\displaystyle \ise(h)-\int f^{2}(x)\,dx=\int \widehat{f}^{2}_{h}(x)\,dx - 2 E\{\widehat{f}_{h}(X)\}.$ (3.35)

Now we can reap the fruits of the work done above and plug in (3.32) and (3.33) for estimating $ E\{\widehat{f}_{h}(X)\}$. This gives the so-called cross-validation criterion

$\displaystyle CV(h)=\int \widehat{f}^{2}_{h}(x)\,dx - \frac{2}{n(n-1)} \sum_{i=1}^{n}\sum_{j=1,i \neq j}^{n}K_{h}(X_{i}-X_{j}).$ (3.36)

We have almost everything in place for an applicable formula that allows us to calculate an optimal bandwidth from a set of observations. It remains to replace $ \int \widehat{f}^{2}_{h}(x)\,dx$ by a term that employs sums rather than an integral. It can be shown (Härdle, 1991, p. 230ff) that

$\displaystyle \int \widehat{f}^{2}_{h}(x)\, dx= \frac{1}{n^{2}h}\sum_{i=1}^{n} \sum_{j=1}^{n} K \star K \left( \frac{X_{j}-X_{i}}{h} \right) ,$ (3.37)

where $ K\star K(u)$ is the convolution of $ K$, i.e. $ K\star K (u) =\int K(u-v)K(v)\,dv$. Inserting (3.37) into (3.36) gives the following criterion to minimize w.r.t. $ h$

$\displaystyle CV(h)=\frac{1}{n^{2}h}\sum_{i} \sum_{j} K \star K \left( \frac{X_{j}-X_{i}}{h} \right) - \frac{2}{n(n-1)} \sum_{i}\sum_{j\neq i}K_{h}(X_{i}-X_{j}).$ (3.38)

Thus, we have found a way to choose a bandwidth based on a reasonable criterion without having to make any assumptions about the family to which the unknown density belongs.
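
As an illustration, the following Python sketch evaluates $CV(h)$ from (3.38) for the Gaussian kernel, for which the convolution has the closed form $K\star K(u)=\exp(-u^2/4)/(2\sqrt{\pi})$ (the $N(0,2)$ density), and then picks the minimizer over a bandwidth grid. The function names, the grid and the simulated sample are our own choices, not part of the text.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def cv_criterion(h, data):
    """Least squares cross-validation criterion CV(h) from (3.38), Gaussian kernel."""
    n = len(data)
    diff = data[:, None] - data[None, :]                          # X_i - X_j
    conv = np.exp(-(diff / h) ** 2 / 4) / (2 * np.sqrt(np.pi))    # K*K((X_j - X_i)/h)
    term1 = conv.sum() / (n ** 2 * h)                             # first sum in (3.38)
    kh = gaussian_kernel(diff / h) / h                            # K_h(X_i - X_j)
    np.fill_diagonal(kh, 0.0)                                     # exclude i = j terms
    term2 = 2.0 * kh.sum() / (n * (n - 1))                        # second sum in (3.38)
    return term1 - term2

def cv_bandwidth(data, grid):
    """Return the bandwidth on the grid that minimizes CV(h)."""
    scores = [cv_criterion(h, data) for h in grid]
    return grid[int(np.argmin(scores))]

# illustrative usage with a simulated sample
data = np.random.normal(size=100)
grid = np.linspace(0.05, 1.5, 60)
print(cv_bandwidth(data, grid))
```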

A nice feature of the cross-validation method is that the selected bandwidth automatically adapts to the smoothness of $ f(\bullet)$. This is in contrast to plug-in methods like Silverman's rule-of-thumb or the refined methods presented in Subsection 3.3.3. Moreover, the cross-validation principle can be applied analogously to other density estimators (different from the kernel method). We will also see these advantages later in the context of regression function estimation.

Finally, it can be shown that the bandwidth selected by minimizing $ CV$ fulfills an optimality property. Denote the bandwidth selected by the cross-validation criterion by $ \widehat{h}_{cv}$ and assume that the density $ f$ is a bounded function. Stone (1984) proved that this bandwidth is asymptotically optimal in the following sense

$\displaystyle \frac{\ise(\widehat{h}_{cv})}{ \min_h \ise(h) } \mathrel{\mathop{\longrightarrow}\limits_{}^{a.s.}} 1,$

where $ a.s.$ indicates convergence with probability $ 1$ (almost sure convergence). In other words, the $ \ise$ achieved with $ \widehat{h}_{cv}$ asymptotically coincides with the $ \ise$ achieved with the bandwidth that minimizes $ \ise(\bullet)$, i.e. the $ \ise$ optimal bandwidth.

3.3.3 Refined Plug-in Methods

With Silverman's rule-of-thumb we introduced in Subsection 3.3.1 the simplest possible plug-in bandwidth. Recall that essentially we assumed a normal density to obtain a simple expression for $ \Vert f''\Vert^2_2$. This procedure yields a relatively good estimate of the optimal bandwidth if the true density function $ f$ is nearly normal. However, if this is not the case (as for multimodal densities), Silverman's rule-of-thumb will fail dramatically. A natural refinement consists of using nonparametric estimates for $ \Vert f''\Vert^2_2$ as well. A further refinement is the use of a better approximation to $ \mise$. The following approaches apply these ideas.

In contrast to the cross-validation method, plug-in bandwidth selectors try to find a bandwidth that minimizes $ \mise$. This means we are looking at a different optimality criterion from the one used in the previous subsection.

A common method of assessing the quality of a selected bandwidth $ \widehat{h}$ is to compare it with $ h_{opt}$, the $ \mise$ optimal bandwidth, in relative terms. We say that the convergence of $ \widehat{h}$ to $ h_{opt}$ is of order $ O_p(n^{-\alpha})$ if

$\displaystyle n^\alpha\frac{\widehat{h}-h_{opt}}{h_{opt}} \mathrel{\mathop{\longrightarrow}\limits_{}^{P}} T,$

where $ T$ is some random variable (not depending on $ n$). If $ \alpha=\frac{1}{2}$, then this is usually called $ \sqrt{n}$-convergence; this rate of convergence is also the best achievable, as Hall & Marron (1991) have shown.

Park & Marron (1990) proposed to estimate $ \Vert f''\Vert^2_2$ in $ \mise$ by using a nonparametric estimate of $ f$ and taking the second derivative of this estimate. Suppose we use a bandwidth $ g$ here; then the second derivative of $ \widehat{f}_g$ can be computed as

$\displaystyle \widehat{f}''_g(x) = \frac{1}{ng^3} \sum_{i=1}^n K''\left(\frac{x-X_i}{g}\right).$

Of course, this yields a bandwidth choice problem as well; we have only transferred our problem to bandwidth selection for the second derivative. However, we can now use a rule-of-thumb bandwidth in this first stage. A further problem occurs due to the bias of $ \Vert\widehat{f}''_g\Vert^2_2$, which can be overcome by using the bias-corrected estimate

$\displaystyle \widehat{\Vert f''\Vert^2_2}=\Vert\widehat{f}_g''\Vert^2_2 - \frac{1}{ng^{5}} \Vert K''\Vert^2_2.$ (3.39)

Using this to replace $ \Vert f''\Vert^2_2$ and optimizing w.r.t. $ h$ in $ \amise$ yields the bandwidth selector

$\displaystyle \widehat{h}_{pm}=\left( \frac{\Vert K \Vert^{2}_{2}}{ \widehat{\Vert f'' \Vert^{2}_{2}}\,\mu^2_{2}(K)\,n}\right)^{1/5}.$ (3.40)

Park & Marron (1990) showed that $ \widehat h_{pm}$ has a relative rate of convergence to $ h_{opt}$ of order $ O_p(n^{-4/13})$, i.e. $ \alpha=\frac{4}{13}<\frac{1}{2}$. The performance of $ \widehat h_{pm}$ in simulation studies is usually quite good. A disadvantage is that for small bandwidths $ g$, the estimate $ \widehat{\Vert f'' \Vert^{2}_{2}}$ may become negative.
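
A minimal Python sketch of this plug-in idea might look as follows. It uses the Gaussian kernel, whose second derivative is $K''(u)=(u^2-1)\varphi(u)$ with $\Vert K''\Vert^2_2=3/(8\sqrt{\pi})$, $\Vert K\Vert^2_2=1/(2\sqrt{\pi})$ and $\mu_2(K)=1$; the squared norm in (3.39) is computed by numerical integration on a grid, and the pilot bandwidth $ g$ is taken from a rule of thumb. All names, the grid and the pilot choice are our own assumptions for illustration.

```python
import numpy as np

SQRT_PI = np.sqrt(np.pi)

def fhat_second_deriv(x, data, g):
    """Kernel estimate of f''(x), Gaussian kernel: K''(u) = (u^2 - 1) * phi(u)."""
    n = len(data)
    u = (x[:, None] - data[None, :]) / g
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return ((u ** 2 - 1) * phi).sum(axis=1) / (n * g ** 3)

def park_marron_bandwidth(data, g):
    """Plug-in bandwidth (3.40) with the bias-corrected estimate (3.39) of ||f''||_2^2."""
    n = len(data)
    grid = np.linspace(data.min() - 3 * g, data.max() + 3 * g, 2000)
    norm_f2 = np.trapz(fhat_second_deriv(grid, data, g) ** 2, grid)  # ||f_hat''_g||_2^2
    norm_f2 -= 3 / (8 * SQRT_PI) / (n * g ** 5)   # bias correction (3.39); can go negative for small g
    k_norm, mu2 = 1 / (2 * SQRT_PI), 1.0          # ||K||_2^2 and mu_2(K) for the Gaussian kernel
    return (k_norm / (norm_f2 * mu2 ** 2 * n)) ** (1 / 5)

# illustrative usage; the pilot bandwidth g is an ad-hoc rule-of-thumb choice
data = np.random.normal(size=200)
g = 1.06 * np.std(data, ddof=1) * len(data) ** (-1 / 5)
print(park_marron_bandwidth(data, g))
```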

3.3.4 An Optimal Bandwidth Selector?!

In Subsection 3.3.2 we introduced

$\displaystyle \frac{\ise(\widehat{h})}{ \min_h \ise({h}) } \mathrel{\mathop{\longrightarrow}\limits_{}^{a.s.}} 1$ (3.41)

as a criterion of asymptotic optimality for a bandwidth selector $ \widehat{h}$. This property is fulfilled by the least squares cross-validation criterion $ CV$, which aims to minimize $ \ise$.

Most of the other existing bandwidth choice methods attempt to minimize $ \mise$. A condition analogous to (3.41) for $ \mise$ is usually much more complicated to prove. Hence, most of the literature is concerned with investigating the relative rate of convergence of a selected bandwidth $ \widehat{h}$ to $ h_{opt}$. Fan & Marron (1992) derived a Fisher-type lower bound for the relative errors of a bandwidth selector. It is given by

$\displaystyle \sigma^2_f = \frac{4}{25} \left(\frac{\int f^{(4)}(x)^2 f(x) \,dx}{\Vert f''\Vert^2_2} - 1\right).$

Considering the relative order of convergence to $ h_{opt}$ as a criterion, the best selector should fulfill

$\displaystyle \sqrt{n} \left( \frac{\widehat{h}-h_{opt}}{h_{opt}} \right) \mathrel{\mathop{\longrightarrow}\limits_{}^{L}} N(0,\sigma_f^2).$

The biased cross-validation method of Hall et al. (1991) has this property; however, this selector is only superior for very large samples. Another $ \sqrt{n}$-convergent method is smoothed cross-validation, but this selector pays for its fast rate of convergence with a larger asymptotic variance.

In summary: a single best method does not exist! Moreover, even asymptotically optimal criteria may show bad behavior in simulations. See the bibliographic notes for references to such simulation studies. As a consequence, we recommend determining bandwidths by different selection methods and comparing the resulting density estimates.