5.4 Comparing bandwidths between laboratories (canonical kernels)

Observe that if one used a kernel of the form

\begin{displaymath}K_s (u)=s^{-1} K(u/s)\end{displaymath}

and rescaled the bandwidth by the factor $s$ one would obtain the same estimate as with the original kernel smoother. A kernel can therefore be seen as an equivalence class of functions $K$ with possible rescalings by $s$. A consequence of this scale dependence is that the bandwidth selection problem is not identifiable if the kernel $K$ is determined only up to scale. Which member of this equivalence class is ``most representative?"
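To see the scale dependence explicitly, note that the kernel enters the smoother only through the weights $h^{-1} K(\cdot /h)$, and

\begin{displaymath}h^{-1} K_s (u/h) = (sh)^{-1} K \left( {u \over sh} \right),\end{displaymath}

so the pair $(K_s, h)$ produces exactly the same estimate as the pair $(K, sh)$.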

More generally, consider the situation in which two statisticians analyze the same data set but use different kernels for their smoothers. They come up with some bandwidths that they like. Their smoothing parameters have been determined subjectively or automatically, but they have been computed for different kernels and therefore cannot be compared directly. In order to allow some comparison one needs a common scale for both bandwidths. How can we find such a ``common scale?"

A desirable property of such a scale is that two kernel smoothers with the same bandwidth ascribe the same amount of smoothing to the data. An approach to finding a representative member of each equivalence class of kernels has already been presented in Section 4.5: Epanechnikov (1969) selected kernels with kernel constant $d_K=1$. Another approach, taken by Gasser, Müller and Mammitzsch (1985), insists that the support of the kernel be $[-1,1]$. A drawback of both methods is that they are rather arbitrary and make no attempt to ensure the same amount of smoothing for different kernels.

Such a common scale is provided by the so-called canonical kernels, one from each class $K_s$ (Marron and Nolan 1988). The construction is based on the well-known expansion of the MSE for $d=1, p=2$ and $K=K_s$,

\begin{displaymath}d_M (h) \approx n^{-1} h^{-1} c_{K_s} C_1 + h^4 d_{K_s}^2 C_2 ,
\ h \to 0, \ nh \to \infty, \end{displaymath}

where $C_1, C_2$ denote constants depending on the unknown distribution of the data. A little algebra shows that this is equal to
\begin{displaymath}
n^{-1} h^{-1} C_1 (s^{-1} c_K) + h^4 C_2 (s^2 d_K )^2.
\end{displaymath} (5.4.19)
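Here the ``little algebra" is the substitution $v=u/s$ in the kernel constants:

\begin{displaymath}c_{K_s} = \int K_s^2 (u)\, du = s^{-1} \int K^2 (v)\, dv = s^{-1} c_K, \qquad
d_{K_s} = \int u^2 K_s (u)\, du = s^2 \int v^2 K(v)\, dv = s^2 d_K.\end{displaymath}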

Observe that the problems of selecting $K$ and $h$ are ``uncoupled" if

\begin{displaymath}s^{-1} c_K = ( s^2 d_K)^2.\end{displaymath}

This uncoupling can be achieved by simply defining

\begin{displaymath}s=s^*= \left[ {c_K \over d_K^2} \right]^{1/5}.\end{displaymath}

Hence, define the canonical kernel $K^*$ as that kernel of the class $K_s$ with $s=s^*$. For this canonical kernel one has

\begin{eqnarray*}
\left(\int u^2 K^* (u)du\right)^2 &=& \int (K^* (u))^2 du \cr
&=& (s^*)^{-1} c_K \cr
&=& \left[ {d_K^2 \over c_K} \right]^{1/5} c_K \cr
&=& {d_K^{2/5} \over c_K^{1/5} } c_K \cr
&=& d_K^{2/5} c_K^{4/5}. \end{eqnarray*}



Hence, for the canonical kernel,

\begin{displaymath}d_M (h) \approx (d_K)^{2/5} (c_K)^{4/5} [ n^{-1} h^{-1}
C_1 + h^4 C_2 ],\end{displaymath}

which shows again that the canonical kernel $K^*$ uncouples the problems of kernel and bandwidth selection. Note that $K^*$ does not depend on the starting choice of $K$: one could replace $K$ by any $K_s$ and $K^*$ would still be the same.
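As a numerical illustration (a sketch in Python, not from the text; the function names are my own), one can compute $s^*$ for a given kernel by numerical integration and check this invariance:

\begin{verbatim}
import numpy as np
from scipy.integrate import quad

def kernel_constants(K, lo, hi):
    """Return c_K = int K^2(u) du and d_K = int u^2 K(u) du."""
    c_K = quad(lambda u: K(u) ** 2, lo, hi)[0]
    d_K = quad(lambda u: u ** 2 * K(u), lo, hi)[0]
    return c_K, d_K

def canonical_scale(K, lo, hi):
    """Rescaling factor s* = (c_K / d_K^2)^(1/5)."""
    c_K, d_K = kernel_constants(K, lo, hi)
    return (c_K / d_K ** 2) ** (1 / 5)

quartic = lambda u: 15 / 16 * (1 - u ** 2) ** 2 * (abs(u) <= 1)
s_star = canonical_scale(quartic, -1, 1)
print(s_star, 35 ** 0.2)              # both approx 2.0362

# Invariance: start instead from K_s with s = 2;
# the resulting canonical kernel is the same.
s = 2.0
quartic_s = lambda u: quartic(u / s) / s
s2 = canonical_scale(quartic_s, -2, 2)
u = np.linspace(-3, 3, 7)
print(np.allclose(quartic_s(u / s2) / s2,
                  quartic(u / s_star) / s_star))   # True
\end{verbatim}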

The advantage of canonical kernels is that they allow simple comparison between different kernel classes. Suppose that $K_{(1)}$ and $K_{(2)}$ are the canonical kernels from each class and that one wants the two estimated curves to represent the same amount of smoothing, that is, the variance and bias$^2$ trade-off should be the same for both smoothers. This is simply achieved by using the same bandwidth for both estimators. If canonical kernels are used, the $d_M (h)$ functions will look different for the two kernels, as one is a multiple of the other, but each will have its minimum at the same place. The kernel class that has the lowest minimum is given by the ``optimal kernel" of order 2, the so-called Epanechnikov kernel.

One interesting family of kernels, which contains many of the kernels used in practice, is

\begin{displaymath}K^\alpha (u) = C_\alpha (1-u^2)^\alpha I(\vert u \vert \le 1),\end{displaymath}

where $C_\alpha$ makes $K^\alpha $ a probability density:

\begin{displaymath}C_\alpha = \Gamma(2\alpha +2) \Gamma(\alpha +1)^{-2}2^{-2\alpha-1}.\end{displaymath}

The first three columns of Table 5.2 show the values of $\alpha$ and $C_\alpha$ for the most common cases; the normal kernel is included as the limiting case $\alpha = \infty$. It is simple to check that the rescaling factor $s^*$ for each $K^\alpha $ is

\begin{displaymath}s^* = 2^{-1/5}\, \Gamma(\alpha +1)^{-4/5}\,(2\alpha +3)^{2/5}\,
\Gamma(2\alpha +2)^{2/5}\,\Gamma(2\alpha +1)^{2/5}\,\Gamma(4\alpha +2)^{-1/5}.\end{displaymath}


Table 5.2: Canonical kernels from the family $K^\alpha $

\begin{center}
\begin{tabular}{lccl}
Kernel & $\alpha$ & $C_\alpha$ & $s^*$ \\
\hline
Uniform      & $0$      & $1/2$   & $(9/2)^{1/5} \approx 1.3510$ \\
Epanechnikov & $1$      & $3/4$   & $15^{1/5} \approx 1.7188$ \\
Quartic      & $2$      & $15/16$ & $35^{1/5} \approx 2.0362$ \\
Triweight    & $3$      & $35/32$ & $(9450/143)^{1/5} \approx 2.3122$ \\
Gaussian     & $\infty$ & --      & $(1/(4\pi))^{1/10} \approx 0.7764$ \\
\end{tabular}
\end{center}
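The tabulated values can be reproduced from the Gamma-function formula for $s^*$ (a small Python check; not part of the original text):

\begin{verbatim}
from math import gamma, pi

def s_star(alpha):
    """s* for K^alpha via the Gamma-function formula above."""
    return (2 ** (-1) * gamma(alpha + 1) ** (-4)
            * (2 * alpha + 3) ** 2 * gamma(2 * alpha + 2) ** 2
            * gamma(2 * alpha + 1) ** 2 / gamma(4 * alpha + 2)) ** (1 / 5)

for alpha in (0, 1, 2, 3):
    print(alpha, round(s_star(alpha), 4))
# prints 1.3510, 1.7188, 2.0362, 2.3122

# Gaussian limit (alpha = infinity):
print(round((1 / (4 * pi)) ** 0.1, 4))   # 0.7764
\end{verbatim}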

In practice, one uses kernels that are not necessarily canonical, since one is used to thinking in terms of a certain scale of the kernel, for example, multiples of the standard deviation for the normal kernel. How does one then compare the smoothing parameters $h_1, h_2$ between laboratories? The following procedure is based on canonical kernels. First transform the scale of both kernel classes to the canonical kernel $K^*(u)=(s^*)^{-1}K(u/s^*)$. Then compare the bandwidths for the respective canonical kernels. More formally, this procedure is described in Algorithm 5.4.1.

Algorithm 5.4.1
Suppose that lab $j$ used kernel $K_j$ and bandwidth $h_j, \ j=1, 2$.

STEP 1.

Transform $h_j$ to canonical scale:

\begin{displaymath}h_j^* = h_j/s_j^* , \ j =1, 2. \end{displaymath}

STEP 2.

Decide from the relation of $ h_1^* $ to $ h_2^* $ whether both labs have produced the same smooth or whether one or the other has over- or undersmoothed.
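A direct transcription of this procedure might look as follows (a minimal sketch; the function name and kernel table are mine, with the $s^*$ values taken from Table 5.2):

\begin{verbatim}
# s* values from Table 5.2
S_STAR = {"uniform": 1.3510, "epanechnikov": 1.7188,
          "quartic": 2.0362, "triweight": 2.3122, "gaussian": 0.7764}

def compare_bandwidths(kernel1, h1, kernel2, h2):
    """Algorithm 5.4.1: put both bandwidths on the canonical scale."""
    h1_star = h1 / S_STAR[kernel1]    # STEP 1
    h2_star = h2 / S_STAR[kernel2]
    return h1_star, h2_star           # STEP 2: compare directly

# The example discussed below: Gaussian h1 = 0.05 vs. quartic h2 = 0.15
print(compare_bandwidths("gaussian", 0.05, "quartic", 0.15))
# -> (0.0644, 0.0737): lab 1 smoothed slightly less than lab 2
\end{verbatim}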

Suppose, for example, that laboratory 1 used the Gaussian kernel and came up with a bandwidth of, say, $h_1=0.05$ (see Figure 3.21). Another statistician in laboratory 2 used a quartic kernel and computed from cross-validation a bandwidth of $h_2=0.15$ (see Figure 5.4). A typical situation is depicted in Figure 5.19, showing the average squared error $d_A(h)$ for the Gaussian and the quartic kernel smoothers as applied to the simulated data set from Table 2 in the Appendix. Obviously, the bandwidth minimizing each of these functions gives the same amount of trade-off between bias$^2$ and variance.

Figure 5.19: The averaged squared error $\scriptstyle d_A(h)$ computed from the simulated data set (Table 3.2) for Gaussian (solid line, label 1) and quartic (dashed line, label 2) kernel smoothers with weight function $\scriptstyle w(u)=I(\left\vert u-0.5 \right\vert \leq 0.4)$. ANRsimase.xpl
\includegraphics[scale=0.7]{ANRsimase.ps}

Let me compute $s^*$ explicitly for this example. Since $d_K=1$ for the Gaussian kernel, the factor $s^*_1 = (c_K / d_K^2)^{1/5}$ reduces to

\begin{displaymath}s^*_1 = \left(\int\left({1 \over 2 \pi}\right) e^{-u^2} du \right)^{1/5} = (2
\sqrt \pi )^{- 1/5} \approx 0.776.\end{displaymath}

The bandwidth for the canonical normal kernel is therefore $h^*_1=h_1 /
0.776=0.0644$. The quartic kernel $K(u)=(15/16) (1-u^2)^2 I (
\left\vert u \right\vert \le 1)$ has $d_K = 1/7$ and $c_K = 15/21$; the ``canonical quartic kernel" is therefore determined by

\begin{displaymath}s^*_2= \left({15 \cdot 49 \over 21}\right)^{1/5} = 35^{1/5} = 2.036, \end{displaymath}

which means that $h^*_2=h_2/2.036=0.0736$.
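For completeness, the two quartic kernel constants follow by direct integration:

\begin{displaymath}d_K = {15 \over 16} \int_{-1}^{1} u^2 (1-u^2)^2\, du
= {15 \over 16} \left( {2 \over 3} - {4 \over 5} + {2 \over 7} \right) = {1 \over 7}, \qquad
c_K = \left( {15 \over 16} \right)^2 \int_{-1}^{1} (1-u^2)^4\, du
= {225 \over 256} \cdot {256 \over 315} = {5 \over 7}.\end{displaymath}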

In summary, the optimal bandwidth $\hat h_0= \arg\min_h [ d_A (h)]$ is 0.0736 (on the canonical kernel scale), which means that my subjective choice (Figure 3.21) of $h^*_1=0.0644$ for this simulated example resulted in slight undersmoothing.

Exercises

5.4.1 Compute the canonical kernel from the triangular kernel.

5.4.2 Derive the canonical kernels for the derivative kernels from Section 4.5.

5.4.3 Try kernel smoothing in practice and transform your bandwidth by the procedure of Algorithm 5.4.1. Compare with another kernel smooth and compute the bandwidth that gives the same amount of smoothing for both situations.