3.4 Choosing the Kernel

3.4.1 Canonical Kernels and Bandwidths

To discuss the choice of the kernel we will consider equivalent kernels, i.e. kernel functions that, with suitably adjusted bandwidths, lead to exactly the same kernel density estimator. Consider a kernel function $ K(\bullet)$ and the following rescaled modification:

$\displaystyle K_{\delta}(\bullet)=\delta^{-1}K(\bullet/\delta).$

Now compare the kernel density estimate $ \widehat{f}_h$ using kernel $ K$ and bandwidth $ h$ with a kernel density estimate $ \widetilde{f}_{\widetilde{h}}$ using $ K_\delta$ and bandwidth $ \widetilde{h}$. It is easy to derive that

$\displaystyle \widehat{f}_h(x)
=\frac{1}{nh}\sum_{i=1}^n K\left(\frac{x-X_i}{h}\right)
=\frac{1}{n\widetilde{h}\delta}\sum_{i=1}^n K\left(\frac{x-X_i}{\widetilde{h}\delta}\right)
=\frac{1}{n\widetilde{h}}\sum_{i=1}^n K_{\delta}\left(\frac{x-X_i}{\widetilde{h}}\right)
=\widetilde{f}_{\widetilde{h}}(x)$

if the relation

$\displaystyle \widetilde{h}\delta=h$

holds. This means that all rescaled versions $ K_\delta$ of a kernel function $ K$ are equivalent, provided the bandwidth is adjusted accordingly.
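This equivalence is easy to verify numerically. The following sketch (our own illustration, using the Epanechnikov kernel and arbitrarily chosen values $ h=0.5$, $ \delta=2$; all names are ours, not from the text) estimates the same density once with $ K$ and bandwidth $ h$, and once with $ K_\delta$ and bandwidth $ \widetilde{h}=h/\delta$:

```python
import numpy as np

# Epanechnikov kernel K and its rescaled version K_delta(u) = K(u/delta)/delta
def K(u):
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def K_delta(u, delta):
    return K(u / delta) / delta

def kde(x, data, h, kernel):
    # kernel density estimator: (1/(n h)) * sum_i kernel((x - X_i)/h)
    return np.mean(kernel((x - data[:, None]) / h), axis=0) / h

rng = np.random.default_rng(0)
data = rng.normal(size=200)
x = np.linspace(-3, 3, 101)

h, delta = 0.5, 2.0
f1 = kde(x, data, h, K)                                    # kernel K, bandwidth h
f2 = kde(x, data, h / delta, lambda u: K_delta(u, delta))  # kernel K_delta, bandwidth h/delta

print(np.max(np.abs(f1 - f2)))  # agrees up to floating-point error
```

The two estimates coincide pointwise, as the derivation above shows they must whenever $ \widetilde{h}\delta=h$.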

Different values of $ \delta$ correspond to different members of an equivalence class of kernels. We will now show how Marron & Nolan (1988) use the equivalence class idea to uncouple the problems of choosing $ h$ and $ K$. Recall the $ \amise$ criterion, i.e.

$\displaystyle \amise =\frac{1}{nh}\Vert K \Vert^{2}_{2}+\frac{h^{4}}{4} \Vert f'' \Vert^{2}_{2} \mu_{2}^{2}(K).$ (3.42)

We rewrite this formula for some equivalence class of kernel functions $ K_{\delta}$:

$\displaystyle \amise(K_{\delta})=\frac{1}{nh}\Vert K_{\delta} \Vert^{2}_{2}+\frac{h^{4}}{4}\Vert f''\Vert^{2}_{2}\mu^{2}_{2}(K_{\delta}).$ (3.43)

Each of the two summands contains a term involving $ K_{\delta}$: $ \Vert K_{\delta} \Vert^{2}_{2}$ in the first and $ \mu^{2}_{2}(K_{\delta})$ in the second. The idea for separating the problems of choosing $ h$ and $ K$ is to find $ \delta$ such that

$\displaystyle \Vert K_{\delta} \Vert^{2}_{2} = \mu^{2}_{2}(K_{\delta}).$

This is fulfilled (see Exercise 3.7) if

$\displaystyle \delta_{0} =\left(\frac{\Vert K \Vert^{2}_{2}}{\mu^{2}_{2}(K)}\right)^{1/5}.$ (3.44)

The value $ \delta_0$ is called the canonical bandwidth corresponding to the kernel function $ K$.
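Formula (3.44) can be checked by numerical integration. The sketch below (our own illustration; the midpoint-rule helper and kernel definitions are not from the text) computes $ \delta_0$ for three kernels with support $ [-1,1]$ and reproduces the corresponding entries of Table 3.2:

```python
import numpy as np

# Numerical check of (3.44): delta_0 = (||K||_2^2 / mu_2^2(K))^(1/5).
# Moments are computed by the midpoint rule on the support [lo, hi].
def delta0(kernel, lo=-1.0, hi=1.0, n=1_000_000):
    du = (hi - lo) / n
    u = lo + (np.arange(n) + 0.5) * du
    k = kernel(u)
    norm2 = np.sum(k**2) * du      # ||K||_2^2
    mu2 = np.sum(u**2 * k) * du    # mu_2(K)
    return (norm2 / mu2**2) ** 0.2

uniform = lambda u: 0.5 * np.ones_like(u)
epan    = lambda u: 0.75 * (1 - u**2)
quartic = lambda u: (15/16) * (1 - u**2)**2

print(round(delta0(uniform), 4))   # ~ (9/2)^(1/5)  ~ 1.3510
print(round(delta0(epan), 4))      # ~ 15^(1/5)     ~ 1.7188
print(round(delta0(quartic), 4))   # ~ 35^(1/5)     ~ 2.0362
```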

What happens to $ \amise$ if we use the very member that corresponds to $ \delta _{0}$, namely the kernel $ K_{\delta_{0}}$? By construction, for $ \delta _{0}$ we have

$\displaystyle \Vert K_{\delta_{0}} \Vert^{2}_{2} = \frac{1}{\delta_{0}}\Vert K \Vert^{2}_{2} = \delta_{0}^{4}\, \mu^{2}_{2}(K) = \mu^{2}_{2}(K_{\delta_{0}}) = T(K),$

or equivalently, cf. (3.44),

$\displaystyle T(K)=\frac{1}{\delta_{0}} \Vert K \Vert^2_{2} = \left\{\Vert K \Vert^{8}_{2}\, \mu^{2}_{2}(K)\right\}^{1/5}.$ (3.45)

Hence, $ \amise$ becomes

$\displaystyle \amise[K_{\delta_{0}}] =\left\{ \frac{1}{nh}\,+\,\frac{1}{4}h^{4}\Vert f''\Vert^{2}_{2}\right\}\;T(K).$ (3.46)

Obviously, there is only one term left that involves $ K$, and this term is merely a multiplicative constant. This has an interesting implication: Even though $ T(K)$ is not the same for different kernels, it does not matter for the asymptotic behavior of $ \amise$ (since it is just a multiplicative constant). Hence, $ \amise$ will be asymptotically equal for different equivalence classes if we use $ K_{\delta_{0}}$ to represent each class. To put it differently, using $ K_{\delta_{0}}$ ensures that the degree of smoothness is asymptotically equal for different equivalence classes when we use the same bandwidth.
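Since $ T(K)$ enters (3.46) only as a multiplicative factor, the AMISE-minimizing bandwidth is the same for every canonical kernel, namely $ h=(n\Vert f''\Vert_2^2)^{-1/5}$. A small sketch (our own illustration; the grid and the value $ \Vert f''\Vert_2^2 = 3/(8\sqrt{\pi}) \approx 0.2116$ for a standard normal density are assumptions, not from the text) makes this visible:

```python
import numpy as np

# AMISE for a canonical kernel: {1/(n h) + h^4 ||f''||_2^2 / 4} * T(K).
# The constant T(K) cannot move the minimizer, so the AMISE-optimal
# bandwidth is identical for every canonical kernel.
def amise(h, n, f2norm2, TK):
    return (1 / (n * h) + h**4 * f2norm2 / 4) * TK

n, f2norm2 = 500, 0.2116                # ||f''||_2^2 for a standard normal f
h = np.linspace(0.05, 2.0, 4000)

h_epa = h[np.argmin(amise(h, n, f2norm2, 0.3491))]  # T(K) of Epanechnikov
h_gau = h[np.argmin(amise(h, n, f2norm2, 0.3633))]  # T(K) of Gaussian
h_opt = (n * f2norm2) ** (-0.2)                     # closed-form minimizer

# the first two coincide exactly; both match h_opt up to grid resolution
print(h_epa, h_gau, h_opt)
```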

Because of these unique properties Marron and Nolan call $ K_{\delta_{0}}$ the canonical kernel of an equivalence class. Table 3.2 gives the canonical bandwidths $ \delta _{0}$ for selected (equivalence classes of) kernels.


Table 3.2: $ \delta _{0}$ for different kernels

Kernel         $ \delta _{0}$
Uniform        $ \left(\frac{9}{2}\right)^{1/5} \approx 1.3510$
Epanechnikov   $ 15^{1/5} \approx 1.7188$
Quartic        $ 35^{1/5} \approx 2.0362$
Triweight      $ \left(\frac{9450}{143}\right)^{1/5} \approx 2.3122$
Gaussian       $ \left(\frac{1}{4\pi}\right)^{1/10} \approx 0.7764$


3.4.2 Adjusting Bandwidths across Kernels

In Subsection 3.1.4 we saw that the smoothness of two kernel density estimates with the same bandwidth but different kernel functions may be quite different. To obtain estimates based on two different kernel functions that have about the same degree of smoothness, we have to adjust one of the bandwidths by a multiplicative factor.

These adjustment factors can be easily computed from the canonical bandwidths. Suppose now that we have estimated an unknown density $ f$ using some kernel $ K^{A}$ and bandwidth $ h_{A}$ ($ A$ might stand for Epanechnikov, for instance). We consider estimating $ f$ with a different kernel, $ K^{B}$ ($ B$ might stand for Gaussian, say). Now we ask ourselves: what bandwidth $ h_{B}$ should we use in the estimation with kernel $ K^{B}$ when we want to get approximately the same degree of smoothness as we had in the case of $ K^{A}$ and $ h_{A}$? The answer is given by the following formula:

$\displaystyle h_{B}=h_{A}\frac{\delta_{0}^{B}}{\delta_{0}^{A}}.$ (3.47)

That is, we have to multiply $ h_{A}$ by the ratio of the canonical bandwidths $ \delta_{0}^{B}$ and $ \delta_{0}^{A}$ from Table 3.2.
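Rule (3.47) is a one-line computation. The sketch below (our own illustration; the dictionary of canonical bandwidths simply transcribes Table 3.2, and the function name is ours) converts a bandwidth chosen for one kernel into the equivalent bandwidth for another:

```python
import math

# canonical bandwidths delta_0, transcribed from Table 3.2
DELTA0 = {
    "uniform":      (9 / 2) ** 0.2,
    "epanechnikov": 15 ** 0.2,
    "quartic":      35 ** 0.2,
    "triweight":    (9450 / 143) ** 0.2,
    "gaussian":     (1 / (4 * math.pi)) ** 0.1,
}

def adjust_bandwidth(h_a, kernel_a, kernel_b):
    """Convert bandwidth h_a for kernel_a into the equivalent
    bandwidth for kernel_b via h_b = h_a * delta0_b / delta0_a, cf. (3.47)."""
    return h_a * DELTA0[kernel_b] / DELTA0[kernel_a]

# an Epanechnikov bandwidth of 0.5 corresponds to a Gaussian bandwidth of ~0.226
print(adjust_bandwidth(0.5, "epanechnikov", "gaussian"))
```

Note that converting from $ A$ to $ B$ and back recovers the original bandwidth, as it must for an equivalence relation.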

EXAMPLE 3.1  
As an example, suppose we want to compare an estimate based on the Epanechnikov kernel and bandwidth $ h_{E}$ with an estimate based on the Gaussian kernel. What bandwidth $ h_{G}$ should we use in the estimation with the Gaussian kernel? Using the values for $ \delta_{0}^{E}$ and $ \delta_{0}^{G}$ given in Table 3.2:

$\displaystyle h_{G}=\frac{\delta_{0}^{G}}{\delta_{0}^{E}}h_{E}=\frac{0.7764}{1.7188}\,h_{E}\approx 0.452\;h_{E}.$

An analogous calculation for Gaussian and Quartic kernel yields

$\displaystyle h_{Q}=\frac{\delta_{0}^{Q}}{\delta_{0}^{G}}\;h_{G}=\frac{2.0362}{0.7764}\,h_{G}\approx 2.623\;h_{G},$

which can be used to derive the rule-of-thumb bandwidth for the Quartic kernel. $ \Box$

The scaling factors $ \delta _{0}$ are also useful for finding an optimal kernel function (see Exercise 3.6). We turn our attention to this problem in the next section.

3.4.3 Optimizing the Kernel

Recall that if we use canonical kernels the $ \amise$ depends on $ K$ only through a multiplicative constant $ T(K)$ and we have effectively separated the choice of $ h$ from the choice of $ K$.

A question of immediate interest is to find the kernel that minimizes $ T(K)$ (this, of course, is also the kernel that minimizes $ \amise$ with respect to $ K$). Epanechnikov (1969, the person, not the kernel) has shown that among all nonnegative kernels with compact support, the kernel of the form

$\displaystyle K(u)=\frac{3}{4}\left(\frac{1}{15^{1/5}}\right) \left\{1-\left(\frac{u}{15^{1/5}}\right)^{2}\right\} \Ind\left(\vert u\vert\leq 15^{1/5}\right)$ (3.48)

minimizes the function $ T(K)$. You might recognize (3.48) as the canonical Epanechnikov kernel.

Does this mean that one should always use the Epanechnikov kernel? Before we can answer this question we should compare the values of $ T(K)$ of other kernels with the value of $ T(K)$ for the Epanechnikov kernel. Table 3.3 shows that using, say, the Quartic kernel will lead to an increase in $ T(K)$ of less than half a percent.


Table 3.3: Efficiency of kernels

Kernel         $ T(K)$   $ T(K)/T(K_{Epa})$
Uniform        0.3701    1.0602
Triangle       0.3531    1.0114
Epanechnikov   0.3491    1.0000
Quartic        0.3507    1.0049
Triweight      0.3699    1.0595
Gaussian       0.3633    1.0408
Cosine         0.3494    1.0004
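The entries of Table 3.3 follow directly from (3.45). As a sketch (our own illustration; the midpoint-rule helper and kernel definitions are assumptions, not from the text), we can recompute $ T(K)$ for two kernels with support $ [-1,1]$ and recover the efficiency ratio of the Quartic kernel:

```python
import numpy as np

# T(K) = {||K||_2^8 * mu_2^2(K)}^(1/5), cf. (3.45), via midpoint integration
def T(kernel, lo=-1.0, hi=1.0, n=1_000_000):
    du = (hi - lo) / n
    u = lo + (np.arange(n) + 0.5) * du
    k = kernel(u)
    norm2 = np.sum(k**2) * du      # ||K||_2^2
    mu2 = np.sum(u**2 * k) * du    # mu_2(K)
    return (norm2**4 * mu2**2) ** 0.2

# kernels restricted to their support [-1, 1], so no indicator is needed
epan    = lambda u: 0.75 * (1 - u**2)
quartic = lambda u: (15/16) * (1 - u**2)**2

print(round(T(epan), 4))               # ~ 0.3491, as in Table 3.3
print(round(T(quartic) / T(epan), 4))  # ~ 1.0049, as in Table 3.3
```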

All in all, we can conclude that for practical purposes the choice of the kernel function is almost irrelevant for the efficiency of the estimate.