4.5 The accuracy as a function of the kernel

The effective weight function $\{ W_{hi}(x)\} $ of kernel smoothers is determined by the kernel $K$ and the bandwidth sequence $h=h_n$. The accuracy of the estimated curve ${\hat{m}}_h(x)$ is therefore not a function of the bandwidth alone; more precisely, it depends on the pair $(K,h)$. In this section the behavior of quadratic distance measures is studied as a function of the kernel $K$. The variation of these distance measures as a function of the kernel can be uncoupled from the problem of finding a good smoothing parameter, as is shown in what follows. The bottom line of this section is that for practical problems the choice of the kernel is not critical; the precision of ${\hat{m}}_h$ is much more a question of the choice of bandwidth. Recall the asymptotic equivalence of the squared error distances as described in Theorem 4.1.1. Given this equivalence, I concentrate on the behavior of MISE as a function of $K$.

In Section 3.1 we have seen that the MSE of ${\hat{m}}_h(x)$ can be written as

\begin{displaymath}
C_V c_K n^{-1}h^{-1}+C_B^2 d_K^2 h^4,
\end{displaymath} (4.5.26)

where $C_V, C_B$ are constants depending on the joint distribution of $(X,\allowbreak
Y)$. The bandwidth minimizing 4.5.26 is
\begin{displaymath}
h_0=\left({C_V \over 4 C_B^2}\right)^{1/5} \left({c_K \over
d_K^2}\right)^{1/5} n^{-1/5}.
\end{displaymath} (4.5.27)

This smoothing parameter results in the following MSE:

\begin{eqnarray*}
MSE_{\textrm{opt}}&=&n^{-4/5}\left[(4 C_B^2)^{1/5} C_V^{4/5} c_K^{4/5} d_K^{2/5}
+(4 C_B^2)^{-4/5} C_B^2\, C_V^{4/5} c_K^{4/5} d_K^{2/5}\right]\cr
&=&n^{-4/5}\, C_V^{4/5} C_B^{2/5}\, (4^{1/5}+4^{-4/5})\, c_K^{4/5} d_K^{2/5}.
\end{eqnarray*}



This minimal MSE depends on the kernel through the factor
\begin{displaymath}
V(K)B(K)=c_K^2 d_K=\left(\int K^2(u) d u\right)^2 \int u^2 K(u) d
u.
\end{displaymath} (4.5.28)
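The factor 4.5.28 is easily evaluated numerically. A minimal sketch for the Epanechnikov kernel, whose exact constants are $c_K=3/5$ and $d_K=1/5$, so that $c_K^2 d_K = 9/125$:

\begin{verbatim}
from scipy.integrate import quad

K = lambda u: 0.75 * (1 - u**2)             # Epanechnikov kernel on [-1, 1]

c_K = quad(lambda u: K(u)**2, -1, 1)[0]     # int K^2(u) du   = 3/5
d_K = quad(lambda u: u**2 * K(u), -1, 1)[0] # int u^2 K(u) du = 1/5
print(c_K**2 * d_K)                         # V(K)B(K) = 9/125 = 0.072
\end{verbatim}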

More generally, consider estimating the $k$th derivative, $m^{(k)}$, of a $p$-times differentiable $m$. If derivative kernels $K^{(k)}$ are used, this functional takes the form

\begin{displaymath}V(K)B(K)=\left[\int_{-1}^1 \left(K^{(k)}(u)\right)^2 d u\right]^{p-k} \left \vert \int_{-1}^1
K^{(k)}(u)u^p\, d u \right \vert^{2 k+1}.\end{displaymath}

How can this complicated expression be minimized as a function of $K$?

To answer this question, note first that the kernel has to be standardized somehow, since this functional of $K$ is invariant under the scale transformations

\begin{displaymath}
K^{(k)} (u)\ \to\ s^{-(k+1)}K^{(k)}(u/s).
\end{displaymath} (4.5.29)
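This invariance is easy to check numerically for $k=0$ (Exercise 4.5.3 asks for a proof); a sketch:

\begin{verbatim}
from scipy.integrate import quad

K = lambda u: 0.75 * (1 - u**2)            # Epanechnikov kernel on [-1, 1]

def VB(K, lo, hi):
    """V(K)B(K) = (int K^2)^2 int u^2 K, the case k = 0, p = 2."""
    c = quad(lambda u: K(u)**2, lo, hi)[0]
    d = quad(lambda u: u**2 * K(u), lo, hi)[0]
    return c**2 * d

for s in (0.5, 1.0, 2.0, 5.0):
    Ks = lambda u, s=s: K(u / s) / s       # scale transformation 4.5.29, k = 0
    print(s, VB(Ks, -s, s))                # the same value 0.072 for every s
\end{verbatim}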

There are several approaches to this standardization question. Here I present the approach by Gasser, Müller and Mammitzsch (1985), who propose to set the support of $K$ equal to $[-1,1]$. A possible drawback of this standardization is that one can lose the feeling of what the bandwidth is really doing to the data. Consider, for instance, the kernel function

\begin{displaymath}K(u)=C_\alpha (1-u^2)^\alpha\ I(\vert u \vert \le 1),\end{displaymath}

which has support $[-1,1]$ for all $\alpha$. For large $\alpha$ the kernels become steeper and steeper, and it becomes difficult to interpret the bandwidth as a multiple of the support. In Section 5.4, when I discuss the canonical kernels of Marron and Nolan (1988), I come back to this standardization question.

Gasser, Müller and Mammitzsch (1985) used variational methods to minimize $V(K)B(K)$ with respect to $K$. The answers are polynomials of degree $p$. Some of these ``optimal'' kernels are presented in Table 4.1.


Table 4.1: Kernel functions minimizing $V(K)B(K)$. Source: Gasser, Müller and Mammitzsch (1985).

\begin{tabular}{ccl}
$k$ & $p$ & kernel $K(u)$ \\
\hline
0 & 2 & $(3/4)(-u^2+1)\ I(\left \vert u \right \vert \le 1)$ \\
0 & 4 & $(15/32)(7 u^4-10 u^2+3)\ I(\left \vert u \right \vert \le 1)$ \\
1 & 3 & $(15/4)(u^3-u)\ I(\left \vert u \right \vert \le 1)$ \\
1 & 5 & $(105/32)(-9 u^5+14 u^3-5 u)\ I(\left \vert u \right \vert \le 1)$ \\
2 & 4 & $(105/16)(-5 u^4+6 u^2-1)\ I(\left \vert u \right \vert \le 1)$ \\
2 & 6 & $(315/64)(77 u^6-135 u^4+63 u^2-5)\ I(\left \vert u \right \vert \le 1)$ \\
\end{tabular}

A kernel $K$ is said to be of order $(k,p)$ if it satisfies the moment conditions

\begin{eqnarray*}
\int_{-1}^1 K(u)u^j\ d u&=&0, \qquad j=1,\ldots,p-k-1;\cr
\int_{-1}^1 K(u)u^{p-k}\ d u&=&C_K(-1)^k\ {(p-k)!\over k!}.
\end{eqnarray*}



Then $K^{(k)}$ satisfies

\begin{displaymath}\begin{array}{rll}
\int_{-1}^1 K^{(k)}(u)u^j\, du &=0, &j=0,\ldots,k-1,k+1,\ldots,p-1;\\
&=(-1)^k\, k!, &j=k;\\
&=C_K, &j=p.
\end{array} \end{displaymath}

The optimal kernels given in Table 4.1 are of order $(k,p)$; this is verified numerically for one case in the sketch below. Another important point can be seen from Table 4.1: derivatives of ``optimal'' kernels do not yield ``optimal'' kernels for estimating derivatives. For example, the kernel for $(k,p)=(1,3)$ is not the derivative of the one with $(k,p)=(0,4)$. Note, however, that the derivative of the latter kernel satisfies the above moment conditions with $(k,p)=(1,3)$.
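For instance, the following sketch verifies the moment conditions for the $(k,p)=(1,3)$ kernel of Table 4.1, which plays the role of $K^{(k)}$ with $k=1$:

\begin{verbatim}
from scipy.integrate import quad

K1 = lambda u: (15 / 4) * (u**3 - u)       # optimal kernel for (k, p) = (1, 3)

for j in range(4):
    mj = quad(lambda u, j=j: K1(u) * u**j, -1, 1)[0]
    print(j, round(mj, 6))
# j = 0, 2: the moments vanish
# j = k = 1: (-1)^1 1! = -1
# j = p = 3: C_K = -3/7
\end{verbatim}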

Figure 4.13 depicts two optimal kernels for $p=2, 4$ and $k=0$.

Figure 4.13: Two optimal kernels for estimating $m$ (from Table 4.1). Label 1 (solid line): $(k,p)=(0,2)$. Label 2 (dashed line): $(k,p)=(0,4)$. ANRoptker1.xpl
\includegraphics[scale=0.7]{ANRoptker1.ps}

Note that the kernel with $p=4$ has negative side lobes. The Epanechnikov kernel is ``optimal'' for estimating $m$ when $p=2$. The kernels for estimating the first derivative must be odd functions by construction; a plot of two kernels for estimating the first derivative of $m$ is given in Figure 4.14. The kernels for estimating second derivatives are even functions, as can be seen from Figure 4.15. A negative effect of using higher order kernels is that, by construction, they have negative side lobes. A kernel smooth computed with a higher order kernel can therefore be partly negative even though it is computed from purely positive response variables; a numerical illustration follows Figure 4.15 below. Such an effect is particularly undesirable in demand theory, where kernel smooths are used to approximate statistical Engel curves; see Bierens (1987).

Figure 4.14: Two optimal kernels for estimating $m'$, the first derivative of $m$ (from Table 4.1). Label 1 (solid line): $(k,p)=(1,3)$. Label 2 (dashed line): $(k,p)=(1,5)$. ANRoptker2.xpl
\includegraphics[scale=0.7]{ANRoptker2.ps}

Figure 4.15: Two optimal kernels for estimating $m''$, the second derivative of $m$ (from Table 4.1). Label 1 (solid line): $(k,p)=(2,4)$. Label 2 (dashed line): $(k,p)=(2,6)$. ANRoptker3.xpl
\includegraphics[scale=0.7]{ANRoptker3.ps}
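The following sketch illustrates the effect of the negative side lobes. It uses a Priestley-Chao type fixed-design smoother (an assumption made here only for simplicity; the effect occurs for the other kernel smoothers as well) with the fourth-order $(k,p)=(0,4)$ kernel from Table 4.1 and strictly positive responses; the resulting smooth nevertheless dips below zero.

\begin{verbatim}
import numpy as np

# Fixed design with strictly positive responses and a sharp peak
n = 400
x = np.linspace(-1, 1, n)
y = 0.1 + 5 * np.exp(-50 * x**2)                 # y_i > 0 everywhere

# Fourth-order kernel, (k, p) = (0, 4), from Table 4.1
K4 = lambda u: np.where(np.abs(u) <= 1,
                        (15/32) * (7*u**4 - 10*u**2 + 3), 0.0)

# Priestley-Chao type kernel smoother
h = 0.3
spacing = x[1] - x[0]
grid = np.linspace(-0.9, 0.9, 181)
mhat = np.array([(spacing / h) * np.sum(K4((g - x) / h) * y) for g in grid])
print(mhat.min())    # clearly negative, although all y_i are positive
\end{verbatim}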

A natural question to ask is how ``suboptimal'' nonoptimal kernels are, that is, by how much the minimal MSE is increased for a nonoptimal kernel. Table 4.2 lists some commonly used kernels (for $k=0$, $p=2$). Their deficiencies with respect to the Epanechnikov kernel, measured on the scale of the optimal MSE, are defined as

\begin{displaymath}D(K_{\textrm{opt}}, K)=\left\{[V(K_{\textrm{opt}})B(K_{\textrm{opt}})]^{-1} [V(K)B(K)]\right\}^{2/5}.\end{displaymath}


Table 4.2: Some kernels and their efficiencies

\begin{tabular}{llc}
Kernel & $K(u)$ & $D(K_{\textrm{opt}},K)$ \\
\hline
Epanechnikov & $(3/4)(-u^2+1)\ I(\left \vert u \right \vert \le 1)$ & 1 \\
Quartic & $(15/16)(1-u^2)^2\ I(\left \vert u \right \vert \le 1)$ & 1.005 \\
Triangular & $(1-\vert u \vert)\ I(\left \vert u \right \vert \le 1)$ & 1.011 \\
Gauss & $(2 \pi)^{-1/2}\exp(-u^2/2)$ & 1.041 \\
Uniform & $(1/2)\ I(\left \vert u \right \vert \le 1)$ & 1.060 \\
\end{tabular}

Note: The deficiency is computed as $\{[V(K)B(K)]/[V(K_{\textrm{opt}})B(K_{\textrm{opt}})]\}^{2/5}$ for $k=0$, $p=2$; this is the factor by which the optimal MSE is inflated.
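The deficiencies in Table 4.2 can be reproduced by numerical integration; a sketch:

\begin{verbatim}
import numpy as np
from scipy.integrate import quad

kernels = {
    "Epanechnikov": (lambda u: 0.75 * (1 - u**2),          (-1, 1)),
    "Quartic":      (lambda u: (15/16) * (1 - u**2)**2,    (-1, 1)),
    "Triangular":   (lambda u: 1 - abs(u),                 (-1, 1)),
    "Gauss":        (lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi),
                     (-np.inf, np.inf)),
    "Uniform":      (lambda u: 0.5,                        (-1, 1)),
}

def VB(K, lo, hi):                         # V(K)B(K) = c_K^2 d_K, see 4.5.28
    c = quad(lambda u: K(u)**2, lo, hi)[0]
    d = quad(lambda u: u**2 * K(u), lo, hi)[0]
    return c**2 * d

VB_opt = VB(kernels["Epanechnikov"][0], -1, 1)
for name, (K, (lo, hi)) in kernels.items():
    D = (VB(K, lo, hi) / VB_opt) ** (2 / 5)
    print(f"{name:12s} {D:.3f}")           # 1.000, 1.005, 1.011, 1.041, 1.060
\end{verbatim}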

A picture of these kernels is given in Figure 4.16. The kernels look quite different, but Table 4.2 tells us that their MISE behavior is almost the same.

Figure 4.16: Positive kernels for estimating $m$ (from Table 4.2). Label 1: quartic; label 2: triangular; label 3: Epanechnikov; label 4: Gauss; label 5: uniform. ANRposkernels.xpl
\includegraphics[scale=0.7]{ANRposkernels.ps}

The bottom line of Table 4.2 is that the choice between the various kernels on the basis of the mean squared error is not very important. Missing the MISE-optimal bandwidth (or the optimum of some other measure of accuracy) by 10 percent has a more drastic effect on the precision of the smoother than selecting one of the ``suboptimal'' kernels. It is therefore perfectly legitimate to select a kernel function on the basis of other considerations, such as computational efficiency (Silverman, 1982; Härdle, 1987a).

Exercises
4.5.1 Verify the ``small effect of choosing the wrong kernel'' by a Monte Carlo study. Choose

\begin{displaymath}m(x)=\exp(-x^2/2), \epsilon \sim N(0,1), X \sim U(-1,1), n=100.\end{displaymath}

Select as $h$ the MSE-optimal bandwidth for estimating $m(0)$. Compute the MSE at $x=0$ for the different kernels over 10,000 Monte Carlo experiments.
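A possible skeleton for this study, using the Nadaraya-Watson estimator at $x=0$. The bandwidth $h=0.79$ below is roughly the MSE-optimal value for the Epanechnikov kernel in this setting, obtained from 4.5.27 with $C_V=\sigma^2/f(0)=2$ and $C_B=m''(0)/2=-1/2$ (an assumption the reader should verify, adapting $h$ to the other kernels):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
m = lambda x: np.exp(-x**2 / 2)
epan = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def mse_at_zero(K, h, n=100, nrep=10000):
    """Monte Carlo MSE of the Nadaraya-Watson estimator at x = 0."""
    err = np.empty(nrep)
    for i in range(nrep):
        X = rng.uniform(-1, 1, n)
        Y = m(X) + rng.standard_normal(n)
        w = K(X / h)                   # kernel weights at x = 0
        err[i] = w @ Y / w.sum() - m(0.0)
    return np.mean(err**2)

print(mse_at_zero(epan, h=0.79))       # repeat with the other kernels
\end{verbatim}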

4.5.2 Compute $V(K)B(K)$ for the triweight kernel

\begin{displaymath}K(u)=C_3 (1-u^2)^3\ I(\vert u \vert \le 1).\end{displaymath}

Insert the obtained efficiency loss into Table 4.2.

4.5.3 Prove that $V(K)B(K)$ as defined in 4.5.28 is invariant under the scale transformations 4.5.29.

4.5.4 A colleague has done the Monte Carlo study from Exercise 4.5.1 in the setting of density estimation. His setting was

\begin{displaymath}f=\phi, n=100, x=0\end{displaymath}

with a MSE optimal bandwidth $h$. From the 10,000 Monte Carlo runs he obtained the following table.


Table 4.3: Estimated MSE and $95\%$ confidence intervals for some kernels

\begin{tabular}{lcc}
Kernel & estimated MSE & $95\%$ confidence interval \\
\hline
Epanechnikov & 0.002214 & $\pm 0.000051$ \\
Quartic & 0.002227 & $\pm 0.000051$ \\
Triangular & 0.002244 & $\pm 0.000052$ \\
Gauss & 0.002310 & $\pm 0.000054$ \\
Uniform & 0.002391 & $\pm 0.000055$ \\
\end{tabular}


Do these numbers correspond to the values $D(K_{\textrm{opt}},K)$ from Table 4.2?

4.5.1 Complements

I give a sketch of a proof of the optimality of the Epanechnikov kernel. First, the kernel has to be standardized, since $V(K)B(K)$ is invariant under scale transformations. For reasons that become clear in Section 5.4, I use the standardization $V(K)=B(K)$. The task of minimizing $V(K)B(K)$ then reduces to minimizing

\begin{displaymath}\int K^2(u) d u\end{displaymath}

under the constraints

\begin{displaymath}\begin{array}{lrcl}
\hbox{ (i)}& \int K(u)\, d u&=&1,\cr
\hbox{ (ii)}&K(u)&=&K(-u),\cr
\hbox{ (iii)}&d_K&=&1.\end{array}\end{displaymath}

If $\Delta K$ denotes a small variation of $K$ that preserves the constraints (i)-(iii), then at an extremum the first variation of

\begin{displaymath}\int K^2(u) d u+\lambda_1 \left[\int K(u) d u-1\right]+\lambda_2 \left[\int
u^2 K(u) d u- 1\right]\end{displaymath}

should be zero. This leads to

\begin{displaymath}2 \int K(u)\Delta K(u)\, d u+\lambda_1 \left[\int \Delta K(u)\, d u\right
]+\lambda_2 \left[\int \Delta K(u)\, u^2\, d u\right]=0.\end{displaymath}

Since $\Delta K$ is arbitrary, the integrand must vanish wherever $K(u)>0$, that is,

\begin{displaymath}2 K(u)+\lambda_1+\lambda_2 u^2=0.\end{displaymath}

The kernel $K(u)$ thus vanishes at $u=\pm (-\lambda_1/\lambda_2)^{1/2}$, and determining $\lambda_1, \lambda_2$ from the constraints (i)-(iii) yields the Epanechnikov kernel. Rescaling so that $c_K=d_K^2$ (the standardization used for the canonical kernels in Section 5.4) then gives the rescaled version

\begin{displaymath}K(u)={3 \over 4 \cdot 15^{1/5}} \left(1-(u/15^{1/5})^2\right)\ I(\vert u/15^{1/5}\vert \le
1).\end{displaymath}
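A quick numerical check of this standardization (a sketch; the factor $15^{1/5}$ follows from $s^5=c_K/d_K^2=15$ for the Epanechnikov kernel):

\begin{verbatim}
from scipy.integrate import quad

s = 15 ** (1 / 5)                               # rescaling factor 15^{1/5}
K = lambda u: (3 / (4 * s)) * (1 - (u / s)**2)  # rescaled Epanechnikov kernel

c_K = quad(lambda u: K(u)**2, -s, s)[0]         # = 0.3491...
d_K = quad(lambda u: u**2 * K(u), -s, s)[0]     # = 0.5908...
print(c_K, d_K**2)                              # equal: c_K = d_K^2
\end{verbatim}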