5.1 Cross-validation, penalizing functions and the plug-in method.

The accuracy of kernel smoothers as estimators of $m$ or of derivatives of $m$ is a function of the kernel $K$ and the bandwidth $h$. I have argued that the accuracy depends mainly on the smoothing parameter $h$ (Section 4.5). In this section, several bandwidth selection procedures will be presented that optimize quadratic error measures for the regression curve and its derivatives. In particular, I consider the distances

\begin{eqnarray*}
d_A(h) &=& n^{-1} \sum^n_{j=1} [ \hat m_h(X_j)-m(X_j) ]^2 w(X_j), \cr
d_I(h) &=& \int [ \hat m_h(x)-m(x) ]^2 w(x)f(x)dx, \cr
d_C(h) &=& E [ d_A(h)\vert X_1, \ldots, X_n ],
\end{eqnarray*}



where $w(x)$ is a nonnegative weight function. The form of the above distances is determined by a variance component (decreasing in $h$) and a bias$^2$ component (increasing in $h$). Consider, for instance, $d_C(h)$; here the bias$^2$ component $b^2(h)$ is

\begin{displaymath}b^2(h)=n^{-1} \sum^n_{j=1} \left[ n^{-1} \sum^n_{i=1} W_{hi}(X_j)m(X_i)-
m(X_j) \right]^2 w(X_j)\end{displaymath}

and the variance component is

\begin{displaymath}v(h)=n^{-1} \sum^n_{j=1} \left[ n^{-2} \sum^n_{i=1} W^2_{hi}(X_j) \sigma^2
(X_i) \right] w(X_j),\end{displaymath}

where $W_{hi}(x)=K_h(x-X_i)/ \hat f_h(x)$ are the Nadaraya-Watson weights. The fact that $b^2(h)$ increases as $h$ increases can be seen in Figure 5.1, where for the simulated data set (Table 2, Appendix) the function $d_C(\cdot)$ is presented as a function of $h$. The rate at which $b^2(h)$ increases with $h$ depends on the selected kernel and the degree of differentiability of the regression function; see Section 4.6.

The decreasing curve in Figure 5.1 shows $v(h)$ roughly proportional to $h^{-1}$. The sum of both components is the conditional squared error $d_C(h)$, which is shown in Figure 5.1 as the curve above $b^2(h)$ and $v(h)$.
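When $m$ and $\sigma^2$ are known, as in a simulation, both components can be computed directly from these formulas. The following minimal Python sketch (not part of the original XploRe macros cited with the figures; the function names are mine) evaluates $b^2(h)$ and $v(h)$ for the Nadaraya-Watson weights with a quartic kernel; evaluating it over a grid of bandwidths should reproduce the qualitative picture of Figure 5.1.

\begin{verbatim}
import numpy as np

def quartic_kernel(u):
    # Quartic (biweight) kernel, supported on [-1, 1].
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2)**2, 0.0)

def nw_weights(X, h):
    # Nadaraya-Watson weights W_{hi}(X_j) = K_h(X_j - X_i) / fhat_h(X_j).
    K = quartic_kernel((X[:, None] - X[None, :]) / h) / h
    fhat = K.mean(axis=1)               # kernel density estimate fhat_h(X_j)
    return K / fhat[:, None]            # row j holds W_{hi}(X_j), i = 1..n

def bias2_and_variance(X, m, sigma2, w, h):
    # b^2(h) and v(h) as defined above, for known m(x) and sigma^2(x).
    n = len(X)
    W = nw_weights(X, h)
    cond_mean = (W * m(X)[None, :]).mean(axis=1)   # E[mhat_h(X_j) | X's]
    b2 = np.mean((cond_mean - m(X))**2 * w(X))
    v = np.mean((W**2 * sigma2(X)[None, :]).sum(axis=1) / n**2 * w(X))
    return b2, v
\end{verbatim}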

Figure 5.1: The conditional squared error $d_C(h)$ as the sum of the bias$^2$ component $b^2(h)$ and the variance $v(h)$ for the simulated data set (Table 2, Appendix). The weight function $w(x)=I(\vert x-1/2 \vert < 0.4)$ has been used. The function $b^2(h)$ is indicated as an increasing solid line (label 1). The variance is decreasing (fine dashed line, label 2) and $d_C$ is the sum of both (long dashed line, label 3). ANRsimmbv.xpl
\includegraphics[scale=0.7]{ANRsimmbv.ps}

Figure 5.2: Three measures of accuracy: $d_I, d_A, d_C$ for the weight function $w(x)=I(\vert x-1/2 \vert \leq 0.4)$. The integrated squared error (computed from $300$ grid points) is shown as a solid line (label 1). The averaged squared error is indicated as a fine dashed line (label 2) and the error measure $d_C$ is displayed as a long dashed line (label 3). Computed from the simulated data set (Table 2, Appendix).

\includegraphics[scale=0.15]{ANR5,2.ps}

Theorem 4.1.1 about the asymptotic equivalence of $d_I, d_A$ and $d_C$ states that all three distances should have roughly the same minimum. The approximate identity of the three distances can be seen from Figure 5.2. It is highly desirable to choose a smoothing parameter that balances the systematic bias$^2$ effects versus the stochastic uncertainty expressed by the magnitude of the variance. For such a choice of smoothing parameter the squared bias and the variance are of the same order.

How can we find such a smoothing parameter? We have already seen a theoretical analysis of the MSE properties of kernel smoothers in Section 3.1. We know the asymptotically preferable choice $h_0 \sim n^{-1/5}$, but the MSE, and thus $h_0$, involves complicated unknowns that have to be estimated from the data as well.

The basic idea behind all smoothing parameter selection algorithms is to estimate the ASE or an equivalent measure (up to some constant). The hope is then that the bandwidth minimizing this estimate is also close to the bandwidth minimizing the ASE itself. Expand the ASE as

\begin{eqnarray*}
d_A(h) &= &n^{-1} \sum^n_{j=1} m^2(X_j)w(X_j)+n^{-1} \sum^n_{j=1} \hat m^2_h(X_j)w(X_j) \cr
&& -2n^{-1} \sum^n_{j=1} m(X_j) \hat m_h(X_j)w(X_j).
\end{eqnarray*}



Can we estimate this expression (up to a constant)? At first sight it seems possible. The first term is independent of the smoothing parameter. The second term can be computed entirely from the data. If the third term could be estimated and if it vanished faster than $d_A$ itself tends to zero, then indeed a device for selecting the bandwidth could be established quite easily.

A naive estimate of the third term would be

\begin{displaymath}n^{-1} \sum^n_{j=1} Y_j \hat m_h(X_j) w(X_j),\end{displaymath}

where the unknown mean $m(X_j)$ is replaced by the observation $Y_j$ at $X_j$. This is equivalent to considering the so-called resubstitution estimate of the prediction error

\begin{displaymath}p(h)=n^{-1} \sum^n_{j=1} [ Y_j- \hat m_h(X_j) ]^2 w(X_j)\end{displaymath}

as a device for selecting $h$. Unfortunately, the prediction error is a biased estimate of $d_A$. Figure 5.3 shows $p(h)$ as an increasing function in $h$; the best bandwidth would thus be the smallest bandwidth!
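The behavior of $p(h)$ is easy to check numerically. The sketch below (Python; an illustration of mine, not taken from the book's XploRe code) computes the resubstitution estimate for a Nadaraya-Watson fit with quartic kernel; evaluated over a grid of bandwidths it should, as in Figure 5.3, be smallest at the smallest $h$.

\begin{verbatim}
import numpy as np

def quartic_kernel(u):
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2)**2, 0.0)

def nw_fit(X, Y, h):
    # Full-sample Nadaraya-Watson fit mhat_h(X_j) at the observation points.
    K = quartic_kernel((X[:, None] - X[None, :]) / h) / h
    return (K * Y[None, :]).sum(axis=1) / K.sum(axis=1)

def prediction_error(X, Y, h, w):
    # Resubstitution estimate p(h); biased towards small h because each Y_j
    # enters its own fit mhat_h(X_j).
    return np.mean((Y - nw_fit(X, Y, h))**2 * w(X))
\end{verbatim}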

Figure 5.3: The prediction error $p(h)$ for the simulated data set (Table 2, Appendix 2). The weight function $w$ was set to the indicator function on $[0.1,0.9]$. ANRsimpre.xpl
\includegraphics[scale=0.7]{ANRsimpre.ps}

The intuitive reason for the bias in $p(h)$ is that the observation $Y_j$ is used (in $\hat m_h(X_j)$) to predict itself. To see this in more detail, consider the expansion

\begin{eqnarray*}
p(h) &= &n^{-1} \sum^n_{j=1} \varepsilon_j^2 w(X_j)+d_A(h) \cr
&& -2n^{-1} \sum^n_{j=1} \varepsilon_j (\hat m_h(X_j)-m(X_j))w(X_j).
\end{eqnarray*}



The last term can be rewritten as
\begin{displaymath}
C_{1n}(h)=-2n^{-1} \sum^n_{j=1} \varepsilon_j \left[ n^{-1} \sum^n_{i=1}
W_{hi}(X_j)Y_i-m(X_j) \right] w(X_j),
\end{displaymath} (5.1.1)

which has expectation (given $\{X_1, \ldots, X_n \}$)

\begin{displaymath}-2n^{-1} \sum^n_{j=1} [ n^{-1} W_{hj}(X_j) \sigma^2 (X_j) ] w(X_j).
\end{displaymath}
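To see where this expectation comes from, note that, conditionally on the $X$-variables, $\varepsilon_j$ is correlated only with the $i=j$ term of $\hat m_h(X_j)$:

\begin{eqnarray*}
E [ \varepsilon_j ( \hat m_h(X_j)-m(X_j) ) \vert X_1, \ldots, X_n ]
&=& n^{-1} \sum^n_{i=1} W_{hi}(X_j) E [ \varepsilon_j Y_i \vert X_1, \ldots, X_n ] \cr
&=& n^{-1} W_{hj}(X_j) \sigma^2(X_j),
\end{eqnarray*}

since $E [ \varepsilon_j Y_i \vert X_1, \ldots, X_n ]=0$ for $i \not= j$ and $=\sigma^2(X_j)$ for $i=j$. Multiplying by $-2 n^{-1}$, weighting with $w(X_j)$ and summing over $j$ gives the displayed expectation.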

This quantity tends to zero at the same rate as the variance component of $d_A$, which explains the bias of $p(h)$ as an estimate of $d_A(h)$. There are at least three possible ways to find an unbiased estimate of $d_A$:
  1. a leave-out-technique to obtain zero expectation for 5.1.1;

  2. a modification of $p(h)$ such that bias terms like that of 5.1.1 cancel asymptotically;

  3. a ``plug-in" method, using asymptotics of ``optimal bandwidth" sequences.

5.1.1 Leave-one-out method, cross-validation

The leave-out method is based on regression smoothers in which one, say the $j$th, observation is left out:

\begin{displaymath}
\hat m_{h,j}(X_j)=n^{-1} \sum_{i \not= j} W_{hi}(X_j)Y_i.
\end{displaymath} (5.1.2)

With these modified smoothers, the function
\begin{displaymath}
CV(h)=n^{-1} \sum^n_{j=1} [ Y_j- \hat m_{h,j}(X_j) ]^2 w(X_j)
\end{displaymath} (5.1.3)

is formed.

The function $CV$ is commonly called a cross-validation function since it validates the ability to predict $\{ Y_j \}^n_{j=1}$ across the subsamples $\{(X_i,Y_i) \}_{i \not= j}$ (Stone 1974). In the context of kernel smoothing this score function for finding $h$ was proposed by Clark (1975). The idea is related to variable selection in linear regression. Allen (1974) proposed the related quantity PRESS (prediction sum of squares). Wahba and Wold (1975) proposed a similar technique in the context of spline smoothing. The general structure of smoothing methods in linear regression models is discussed in Hall and Titterington (1986a).

The reason why cross-validation works is simple: The cross-product term from the $CV$ function, similar to 5.1.1, is

\begin{displaymath}
C_{2n}(h)=-2n^{-1} \sum^n_{j=1} \varepsilon_j \left[ n^{-1} \sum^n_{i=1 \atop i \not= j}
W_{hi} (X_j)Y_i-m(X_j) \right] w(X_j)
\end{displaymath} (5.1.4)

and has expectation zero. The $CV$-function for the simulated data set is shown in Figure 5.4.

Figure 5.4: The cross-validation function $CV(h)$ for the simulated data set (Table 2, Appendix 2). The weight function $w(x)=I(\left\vert x-1/2 \right\vert \leq 0.4)$ was used. ANRsimcv.xpl
\includegraphics[scale=0.7]{ANRsimcv.ps}

Note that by itself the fact that 5.1.4 has expectation zero does not guarantee that $\hat h=\arg\min [CV(h)]$ minimizes $d_A$ (or any other of the equivalent error measures). For this procedure it must be required that $C_{2n}(h)$ converges uniformly over $h$ to zero. Note also that the bandwidth suggested here by cross-validation (for the quartic kernel, $h=0.1$) is not exactly equal to the subjectively chosen bandwidth from Section 3.11. The reason may be twofold. First, the two bandwidths could be really different, even on the ``correct scale." Second, they could be different since Figure 3.21 was produced with a Gaussian kernel and the above cross-validation function was computed using a quartic kernel. A ``common scale" for comparing bandwidths from different kernels is derived in Section 5.4.

5.1.2 Penalizing functions

The second proposal, based on adjusting $p(h)$ in a suitable way, aims at an asymptotic cancellation of the bias 5.1.1. For this purpose introduce the penalizing function $\Xi (u)$ with first-order Taylor expansion

\begin{displaymath}\Xi(u)=1+2u+ O(u^2), \ u \to 0.\end{displaymath}

This form of the penalizing function will work out well as will be seen in what follows. The prediction error $p(h)$ is adjusted by $\Xi (n^{-1} W_{hj}(X_j))$, that is, modified to
\begin{displaymath}
G(h)=n^{-1} \sum^n_{j=1} ( Y_j- \hat m_h(X_j) )^2 \Xi (n^{-1} W_{hj}(X_j))
w(X_j).
\end{displaymath} (5.1.5)

The reason for this adjustment is that the correction function

\begin{displaymath}\Xi (n^{-1} W_{hj}(X_j)) =
\Xi (n^{-1}h^{-1} K(0)/\hat f_h(X_j) ) \end{displaymath}

penalizes too small values of $h$. Recall that the naive approach of taking $h = \arg\min [p(h)]$ leads to too small an $h$. By penalizing $p(h)$ as in 5.1.5 one corrects for this tendency. Indeed, using the above Taylor expansion for $\Xi$, the score $G(h)$ is to first order
\begin{eqnarray*}
\lefteqn{n^{-1} \sum^n_{j=1} \{ [ \varepsilon^2_j + (m(X_j)- \hat m_h(X_j))^2 } \cr
&& +2 \varepsilon_j (m(X_j)- \hat m_h(X_j)) ] \cr
&& \times [ 1+2 n^{-1} W_{hj} (X_j) ] \} w(X_j).
\end{eqnarray*}

Multiplying out and disregarding lower order terms leads to

\begin{eqnarray*}
\lefteqn{n^{-1} \sum^n_{j=1} \varepsilon_j^2 w(X_j)+d_A(h)} \cr
&& -2n^{-1} \sum^n_{j=1} \varepsilon_j (\hat m_h(X_j)-m(X_j)) w(X_j) \cr
&& +2n^{-1} \sum^n_{j=1} \varepsilon^2_j n^{-1} W_{hj} (X_j) w(X_j).
\end{eqnarray*}



Note that the first term is independent of $h$ and that the expectation (given
$\{X_1, \ldots, X_n \}$) of the third term equals

\begin{displaymath}-2n^{-1} \sum^n_{j=1} n^{-1} W_{hj}(X_j) \sigma^2(X_j) w(X_j), \end{displaymath}

which is the negative expected value of the last term. The last two terms cancel asymptotically so that $G(h)$ is (up to a shift by $
n^{-1} \sum^n_{j=1} \varepsilon^2_j w(X_j)$) roughly equal to $d_A(h)$. A number of penalizing functions $\Xi$ have been proposed in the literature, the simplest of which is due to Shibata (1981):

\begin{displaymath}\Xi(u)=1+2u.\end{displaymath}

Some of the established correction functions are discussed and compared in their performance in the next section. Figure 5.10 gives an impression of some correction functions $\Xi$.
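The penalized score is computationally as simple as $p(h)$ itself. The following Python sketch (my own illustration, again for a Nadaraya-Watson fit with quartic kernel) evaluates $G(h)$ from 5.1.5 with Shibata's penalty $\Xi(u)=1+2u$ as the default; any other penalizing function can be passed in its place.

\begin{verbatim}
import numpy as np

def quartic_kernel(u):
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2)**2, 0.0)

def penalized_score(X, Y, h, w, xi=lambda u: 1 + 2 * u):
    # G(h) = n^{-1} sum_j (Y_j - mhat_h(X_j))^2 Xi(n^{-1} W_{hj}(X_j)) w(X_j),
    # with Shibata's penalty Xi(u) = 1 + 2u as the default choice.
    n = len(X)
    K = quartic_kernel((X[:, None] - X[None, :]) / h) / h
    fhat = K.mean(axis=1)                       # fhat_h(X_j)
    mhat = (K * Y[None, :]).sum(axis=1) / (n * fhat)
    u = np.diag(K) / (n * fhat)                 # n^{-1} W_{hj}(X_j)
    return np.mean((Y - mhat)**2 * xi(u) * w(X))
\end{verbatim}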

5.1.3 The plug-in method

The third method, the ``plug-in" procedure, is based on the asymptotic expansion of the squared error for kernel smoothers:

\begin{eqnarray*}
MSE &=& n^{-1} h^{-1} \sigma^2(x) c_K/f(x) \cr
&& + h^4 [ d_K (m''(x)+2m'(x) (f'/f)(x) )/2 ]^2.
\end{eqnarray*}



An ``optimal" bandwidth minimizing this expression would be (as pointed out in Section 4.1) proportional to $n^{-1/5}$, with constants depending on the unknowns $\sigma^2(x), m''(x)$ and so on. In practice, these quantities have to be estimated on the basis of some preliminary smoothing step, which raises a second-order bandwidth selection problem. Although the ``plug-in" method achieves the same efficiency as the two other methods (Section 5.2), it is subject to considerable uncertainty about how to choose the bandwidth in that first step. Another disadvantage, from a theoretical point of view, is that one is always restricted to a certain smoothness class (in the above expansion, to twice differentiable regression functions).
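To make the idea concrete, here is a rough Python sketch of a plug-in rule (my own simplified illustration, not the procedure of any particular reference): the error variance is estimated by a difference-based estimator, $m''$ by the second derivative of a global quartic polynomial pilot fit, and, assuming a roughly uniform design (so that $f' \approx 0$ and $f \approx 1$), the resulting asymptotic expression is minimized over a grid of bandwidths. The grid limits and the pilot choices are arbitrary, which is exactly the first-step uncertainty mentioned above.

\begin{verbatim}
import numpy as np

def plug_in_bandwidth(X, Y, w, grid=np.linspace(0.02, 0.5, 100)):
    # Rough plug-in rule for the quartic kernel, assuming homoscedastic
    # errors and an (approximately) uniform design, so f ~ 1 and f' ~ 0.
    n = len(X)
    order = np.argsort(X)
    Xs, Ys = X[order], Y[order]
    sigma2 = np.sum(np.diff(Ys)**2) / (2 * (n - 1))   # difference-based sigma^2
    coef = np.polyfit(Xs, Ys, deg=4)                  # global polynomial pilot
    m2 = np.polyval(np.polyder(coef, 2), Xs)          # pilot estimate of m''
    c_K, d_K = 5 / 7, 1 / 7                           # int K^2, int u^2 K for quartic K
    amise = [np.mean((sigma2 * c_K / (n * h)
                      + h**4 * (d_K * m2 / 2)**2) * w(Xs)) for h in grid]
    return grid[int(np.argmin(amise))]
\end{verbatim}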

The first two methods, the leave-out and the penalty technique, lead to estimates of $d_A$ (up to a shift) and hence to estimates of $d_I, d_C$. The random constant by which the $CV$ function or the $G$ function differs from $d_A$ is roughly $n^{-1} \sum^n_{j=1} \varepsilon^2_j w(X_j)$, which tends to $\int \sigma^2(x)f(x) w(x)dx$. In Figure 5.5, the upper curve is the $CV$-function and the lower curve, with a similar shape, is the averaged squared error $d_A$ for the simulation example (Table 2, Appendix 2).

Figure 5.5: The cross-validation function $CV(h)$ (label 1) and the averaged squared error $d_A(h)$ (label 2) for the simulated data set (Table 2, Appendix). The weight function was $w(x)=I(\left\vert x-1/2 \right\vert \leq 0.4)$. ANRsimcvase.xpl
\includegraphics[scale=0.7]{ANRsimcvase.ps}

The two curves in Figure 5.5 differ by a constant in the range $[ 0.7,0.9 ]$, which is a remarkably accurate estimate of

\begin{displaymath}\displaystyle
0.8= \int \sigma^2(x)f(x)w(x)dx= \int^{0.9}_{0.1} 1 dx.\end{displaymath}

Consider the example of finding the Engel curve of potatoes as a function of net income. Figure 1.1 shows the data in the form of a sunflower plot. The cross-validation curve $CV(h)$ of this data set is displayed in Figure 5.6.

Figure 5.6: The cross-validation function $CV(h)$ for the potato versus net income data set (see Figure 1.1). The quartic kernel leave-one-out smoother was computed on the inner $90$ percent of the net income range (year $1973$). Family Expenditure Survey (1968-1983).
\includegraphics[scale=0.15]{ANR5,6.ps}

The cross-validation function has a clear minimum at $h \approx 0.35$. The corresponding kernel smooth is shown in Figure 5.7.

Figure 5.7: The optimal kernel smooth for the potato versus net income data set with quartic kernel, $h=0.35$, $n=7125$. Family Expenditure Survey (1968-1981). ANRpotqua.xpl
\includegraphics[scale=0.7]{ANRpotqua.ps}

The estimated curve shows the same nonlinearity as Figure 1.2 but is slightly rougher.

In order to make cross-validation or the penalty method a mathematically justifiable device for selecting the smoothing parameter, it must be shown that the score ($CV$ or $G$) approximates (up to a constant) the accuracy measure $d_A(h)$ uniformly over $h$. If this is the case, then for a selected bandwidth $\hat h$ the relative loss satisfies

\begin{displaymath}
{d_\bullet(\hat h) \over \inf_{h \in H_n} d_\bullet(h)}
{\buildrel a.s.\over \to} 1.
\end{displaymath} (5.1.7)

Here $H_n$ is a set of reasonable bandwidths and $d_\bullet(\cdot)$ is any of the discussed squared deviation measures. A data-driven bandwidth $\hat h$ that satisfies (5.1.7) is said to be asymptotically optimal. The following theorem by Härdle and Marron (1985b,c) says that $CV$ and $G$ yield asymptotically optimal bandwidth selections.

Theorem 5.1.1   Suppose that

(A1) for $n=1,2,\ldots $, $H_n= [ \underline h, \overline h ] $, where

\begin{displaymath}\underline h \geq C^{-1}n^{\delta -1/d}, \ \ \ \overline h
\leq Cn^{- \delta}\end{displaymath}

for some constants $C > 0$ and $\delta \in (0, 1/(2d))$;

(A2) $K$ is Hölder continuous, that is, for some $L >0$ and $\xi \in (0,1)$,

\begin{displaymath}\left\vert K(u)-K(v) \right\vert \leq L \left\Vert u-v \right\Vert ^{\xi},\end{displaymath}

where $\left\Vert\cdot \right\Vert$ denotes Euclidean norm on $\mathbb{R}^d$ and also

\begin{displaymath}\int \left\Vert u \right\Vert^{\xi} \left\vert K(u) \right\vert du < \infty;\end{displaymath}

(A3) the regression function $m$ and the marginal density $f$ are Hölder continuous;

(A4) the conditional moments of $Y$ given $X=x$ are bounded in the sense that there are positive constants $C_1,C_2,\ldots $ such that for $k=1,2,\ldots$, $E( Y^k \vert X=x) \leq C_k$ for all $x$;

(A5) the marginal density $f(x)$ of $X$ is bounded from below on the support of $w$;

(A6) the marginal density $f(x)$ of $X$ is compactly supported.

Then the bandwidth selection rule, ``Choose $\hat h$ to minimize $CV$($h$) (or $G$($h$))" is asymptotically optimal.

Asymptotic optimality of the kernel smoother with weights $W_{hi}(x)=K_h(x-X_i)/f(x)$ was shown by Härdle and Kelly (1987) for a slightly larger range of bandwidths. Rice (1984a) proved a related theorem using penalizing functions in the fixed design setting. These penalizing functions do not yield asymptotically optimal smoothing parameters in the stochastic design setting, as was shown by Härdle and Marron (1985a).

It is remarkable that the above devices yield optimal smoothing parameters without reference to a specific smoothness class to which either $m$ or $f$ belongs. Minimization is performed over a wide range $H_n$ of possible bandwidths; the method is not restricted to a specific range, for example, $[ an^{-1/5}, bn^{-1/5} ], \ 0<a<b,$ containing the optimal smoothing parameters for twice differentiable regression functions. In this sense, the cross-validation and the penalty method yield optimal smoothing parameters uniformly over smoothness classes (see the remarks of Section 4.1). This, in turn, has the effect that the data-driven kernel smoothers achieve ``their" optimal rate, independently of the smoothness of the underlying regression model (Härdle and Marron 1985b, section 3). From a practical point of view, this last theoretical property of cross-validated bandwidth sequences is welcome. The user of the cross-validation method need not worry about the roughness of the underlying curve: the cross-validated bandwidth will automatically give the right amount of smoothing, independently of how smooth (in terms of degree of differentiability) the true regression curve is. This feature is not shared by the ``plug-in" method.

The cross-validation procedure is formally described in the following algorithm.

Algorithm 5.1.1

DO OVER (a dense grid $H_n$ of $h$ values)

STEP 1.

Compute the leave-out estimate

\begin{displaymath}\hat m_{h,j}(X_j)\end{displaymath}

at the observation points.

STEP 2.

Construct the cross-validation function

\begin{displaymath}CV(h) = n^{-1} \sum_{j=1}^n (Y_j - \hat m_{h,j}(X_j) )^2 w(X_j),\end{displaymath}

where $w$ denotes a weight function.

END OVER.

STEP 3.

Define the automatic bandwidth as

\begin{displaymath}\hat h = \arg\min_{h \in H_n } [CV(h)]\end{displaymath}
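For readers who want to try this out, the following Python sketch (my own rendering of Algorithm 5.1.1, not the XploRe macro referenced with Figure 5.4) implements the leave-one-out smoother 5.1.2 and the score 5.1.3 for the Nadaraya-Watson estimator with quartic kernel. The simulated curve at the end is only a stand-in for the data set of Table 2 in the Appendix.

\begin{verbatim}
import numpy as np

def quartic_kernel(u):
    # Quartic (biweight) kernel, supported on [-1, 1].
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2)**2, 0.0)

def cv_function(X, Y, h, w):
    # CV(h) with the leave-one-out smoother (5.1.2): the j-th observation is
    # dropped from the numerator sum, the weights W_{hi}(X_j) being defined
    # from the full sample as before.
    n = len(X)
    K = quartic_kernel((X[:, None] - X[None, :]) / h) / h
    fhat = K.mean(axis=1)                   # fhat_h(X_j)
    W = K / fhat[:, None]                   # W_{hi}(X_j)
    m_loo = ((W * Y[None, :]).sum(axis=1) - np.diag(W) * Y) / n
    return np.mean((Y - m_loo)**2 * w(X))

def select_bandwidth(X, Y, w, grid):
    # Steps 1-3 of Algorithm 5.1.1: minimize CV over a dense grid of h values.
    scores = np.array([cv_function(X, Y, h, w) for h in grid])
    return grid[int(np.argmin(scores))]

# Example usage on stand-in simulated data (uniform design, unit variance).
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X) + rng.normal(0, 1, n)
w = lambda x: (np.abs(x - 0.5) <= 0.4).astype(float)
h_hat = select_bandwidth(X, Y, w, np.linspace(0.05, 0.5, 46))
\end{verbatim}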

5.1.4 Bandwidth choice for derivative estimation

The principal idea for smoothing parameter selection in the setting of derivative estimation is similar to that of finding a bandwidth for estimating $m$ itself. As in Rice (1985), consider the setting of fixed, equidistant predictor variables. The leave-out estimators for estimating $m'$ are defined by leaving out the observations $(X_j,Y_j)$ and $(X_{j-1}, Y_{j-1}),$

\begin{displaymath}\hat m^{(1)}_{h,j} (x)=n^{-1} \sum^n_{i=1 \atop i\not= j,j-1} W_{hi}^
{(1)} (x) Y_i,\end{displaymath}

where $\{ W^{(1)}_{hi}(x) \}$ are kernel derivative weights (see Section 3.1). Instead of comparing the leave-out estimators with the original response variables (at $n$ points) one evaluates the prediction error (at $n_2=n/2$ points)

\begin{displaymath}CV^{(1)} (h)=n^{-1}_2 \sum^{n_2}_{j=1} [ Y^{(1)}_{(2j)} - \hat
m^{(1)}_{h,(2j)} (X_{(2j)}) ]^2 w(X_{(2j)} ),\end{displaymath}

where $\{(X_{(j)},Y_{(j)}) \}$ denote the input data, sorted by $X$, and
$Y^{(1)}_{(j)}={Y_{(j)}-Y_{(j-1)} \over X_{(j)}-X_{(j-1)}}$ is the first difference of the $Y$-variables (sorted by $X$). Note that

\begin{displaymath}E \{Y^{(1)}_{(j)} \vert X_1, \ldots, X_n \} =
{m(X_{(j)})-m(X_{(j-1)}) \over X_{(j)}-X_{(j-1)}}= m' (\xi_{(j)})\end{displaymath}

with $\xi_{(j)} \in [ X_{(j-1)},X_{(j)} ]$. Expanding the square in $CV^{(1)}(h)$ leads to

\begin{eqnarray*}
\lefteqn{n^{-1}_2 \sum^{n_2}_{j=1} [ Y^{(1)}_{(j)}-m'(\xi_{(j)}) ]^2 w(X_{(j)})
+n^{-1}_2 \sum^{n_2}_{j=1} [ m'(\xi_{(j)})-\hat m^{(1)}_{h,(j)} (X_{(j)}) ]^2 w(X_{(j)})} \cr
&& +2n^{-1}_2 \sum^{n_2}_{j=1} [ Y^{(1)}_{(j)}-m'(\xi_{(j)}) ] [ m'
(\xi_{(j)}) -\hat m^{(1)}_{h,(j)} (X_{(j)}) ] w(X_{(j)}).
\end{eqnarray*}



As in ordinary cross-validation (for estimating $m$) the cross-product term asymptotically vanishes, so that the function $CV^{(1)}(h)$ behaves (up to a constant) like

\begin{eqnarray*}
\lefteqn {n^{-1}_2
\sum^{n_2}_{j=1} [ m'(\xi_{(j)})- \hat m_{h,(j)}^{(1)} (X_{(j)}) ]^2 w(X_{(j)})} \cr
&& \approx n^{-1}_2 \sum^{n_2}_{j=1} [
m'(X_{(j)})-\hat m^{(1)}_h (X_{(j)}) ]^2 w(X_{(j)}) \end{eqnarray*}



plus a constant (independent of $h$). This approach was considered in the random design setting by Härdle and Carroll (1989). Müller, Stadtmüller and Schmidt (1987) propose a so-called factor method, which is also based on the plug-in approach. It compares the bandwidth $h_0$ for estimating $m$ with the bandwidth $h_0^{(k)}$ for estimating $m^{(k)}$; these bandwidths are the same up to a scale factor depending only on $k$, $p$ and the kernel function. More precisely, for a $p$-times differentiable function $m$ the MSE for estimating $m^{(k)}(x)$ is, as seen before,

\begin{eqnarray*}
MSE &\approx & n^{-1}h^{-(2k+1)} \sigma^2
\int {K^{(k)}}^2(u) du \cr
&& + \left[ h^{p-k} m^{(p)}(x) (-1)^p \int u^p K^{(k)}(u)
du/p! \right]^2.
\end{eqnarray*}



The bandwidth that minimizes MISE is, as described in Section 4.1,

\begin{displaymath}h_0^{(k)} = \left[ { 2k+1 \over 2(p-k) }
{ \sigma^2 \int {K^{(k)}}^2(u)du \over
[ (-1)^p \int u^p K^{(k)}(u) du/p! ]^2 \int {m^{(p)}}^2(x)dx }
\right]^{ 1/(2p+1) } n^{ -1/(2p+1) }\cdotp \end{displaymath}

If we compare the optimum $h_0$ with $h_0^{(k)}$ then we see that they differ by the factor

\begin{displaymath}C_{0,k} = \left[ { (2k+1)p \over p-k }
{ \int {K^{(k)}}^2(u)du \, [ \int u^p K(u) du/p! ]^2 \over
\int K^2(u)du \, [ (-1)^p \int u^p K^{(k)}(u) du/p! ]^2 }
\right]^{1/(2p+1)}\cdotp \end{displaymath}

Thus the MISE optimal bandwidth $h_0^{(k)}$ is given by

\begin{displaymath}h_0^{(k)}= C_{0,k} h_0. \end{displaymath}
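A minimal Python sketch of the difference-based score $CV^{(1)}$ from above is given below (my own illustration; it assumes a sorted, roughly equidistant design and uses a simple Priestley-Chao type choice of kernel derivative weights, which need not coincide with the weights of Section 3.1).

\begin{verbatim}
import numpy as np

def dquartic(u):
    # Derivative of the quartic kernel 15/16 (1 - u^2)^2 on [-1, 1].
    return np.where(np.abs(u) <= 1, -15 / 4 * u * (1 - u**2), 0.0)

def cv1(X, Y, h, w):
    # Difference-based cross-validation score CV^(1)(h) for estimating m'.
    n = len(X)
    order = np.argsort(X)
    Xs, Ys = X[order], Y[order]
    Y1 = np.diff(Ys) / np.diff(Xs)       # first differences Y^(1)
    score, count = 0.0, 0
    for j in range(1, n, 2):             # every second point, about n/2 terms
        keep = np.ones(n, dtype=bool)
        keep[j] = keep[j - 1] = False    # leave out (X_j, Y_j) and (X_{j-1}, Y_{j-1})
        u = (Xs[j] - Xs[keep]) / h
        # Priestley-Chao type derivative smoother (n h^2)^{-1} sum K'((x - X_i)/h) Y_i
        m1 = np.sum(dquartic(u) * Ys[keep]) / (n * h**2)
        score += (Y1[j - 1] - m1)**2 * w(Xs[j])
        count += 1
    return score / count
\end{verbatim}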

5.1.5 The dependence of the smoothing parameter on the weight function

The weight function $w$ was introduced to reduce boundary effects. If one had not introduced the weight function and just formed the bandwidth selection scores over the whole data range one would have obtained a bandwidth sequence optimized with respect to the ``boundary behavior" of the kernel smoother. As pointed out in Section 4.4, the convergence rate is slightly slower at the boundary points. Since the cross-validation method, for example, is still asymptotically optimal (in the sense of Theorem 5.1.1) one would artificially select a slower rate of convergence in the center of the data range, where the majority of the data lie.

However, cutting the range of interest down to, say, 90 percent of the data range does not solve the problem, since typically the kernel weights cover more than 10 percent of the data range (see Figure 5.6). This raises the question of how variable the $CV$-function is as the weight function $w$ is varied. Figure 5.8 shows an optimal kernel smooth estimating liver weights as a function of age. The $CV$-function was computed disregarding the outer 5 percent of the data on each side. What happens if the weight function $w$ is varied?

Figure 5.9 shows cross-validation curves as the weights cut off 2, 4, 6, 8 and 10 percent of the data at each end of the data interval. The location of the minimum, the selected optimal bandwidth, is remarkably stable except for the case where only 80 percent of the data interval is cross-validated. I did similar comparisons for the simulated data set (Table 2, Appendix 2) and found qualitatively the same behavior: The weight function does not influence the selected bandwidth to a large extent.

Figure 5.8: Liver weights versus age of $300$ female corpses (smoothed with cross-validated bandwidth $h=22$ years). From Härdle and Marron (1985b) with permission of the Institute of Mathematical Statistics.
\includegraphics[scale=0.15]{ANR5,8.ps}

Figure 5.9: Cross-validation curves as the weight function $w$ is varied. The uppermost $CV$ curve was computed leaving out $2$ percent of the observations at each end. The curves below leave out $4, 6, 8, 10$ percent at each end. From Härdle and Marron (1985b) with permission of the Institute of Mathematical Statistics.
\includegraphics[scale=0.15]{ANR5,9.ps}

Exercises
5.1.1 Try cross-validation and some of the penalizing functions to find a smoothing parameter for the simulated data set given in the Appendix.

5.1.2 Recall the asymptotic equivalence of $k$-$NN$ and kernel smoothing. How would you choose a good $k$ with the cross-validation method?

5.1.3 How would you modify the penalizing functions in the setting of $k$-$NN$ smoothing?

5.1.4 Write an efficient algorithm for computing the cross-validation function.

[Hint: Use the WARPing technique or the FFT method.]

5.1.5 One could argue that an asymptotically optimal smoothing parameter for $m$ is also good for estimating $m'$: a good estimate for $m$ should give a good estimate for $m'$! Can you find an argument against this?

5.1.6 Find $\hat h=\arg\min [CV(h)]$ and $\hat h^{(1)} = \arg\min [CV^{(1)}(h)]$. Compare $\hat h$ with $\hat h^{(1)}$. Do you find that $\hat h < \hat h^{(1)}$? [Hint: Study the factor method.]

5.1.6 Complements

Proof of Theorem 5.1.1

The proof of this theorem is based on the uniform approximation (over $H_n$) of the distances $d_A, d_I$, and so on; see Theorem 4.1.1. It suffices to prove the asymptotic optimality for $d_A(\cdot)$. The Hölder continuity of $K, m, f$ ensures that it suffices to consider a discrete subset $H'_n$ of $H_n$. The existence of all conditional moments of order $k$ gives, over this sufficiently dense subset $H'_n$ of $H_n$,

\begin{displaymath}
\sup_{h,h' \in H'_n}\left\vert {{d_A(h)-d_A(h')-(CV(h)-CV(h'))} \over {d_M(h)+
d_M(h')}} \right\vert {\buildrel a.s. \over \to} 0 \cdotp
\end{displaymath} (5.1.8)

A key step in proving 5.1.8 is Whittle's inequality (Whittle 1960) on bounding higher moments of quadratic forms of independent random variables. Using the Hölder continuity of $K, m$ and $f$ and Theorem 4.1.1 gives
\begin{displaymath}
\sup_{h,h' \in H_n} \left\vert {{d_A(h)-d_A(h')-(CV(h)-CV(h'))} \over {d_A(h)+
d_A(h')}}\right\vert {\buildrel a.s. \over \to} 0\cdotp
\end{displaymath} (5.1.9)

Now let $\varepsilon > 0$ be given and let

\begin{displaymath}
\hat h_0 = \mathop {\arg\min}_{h \in H_n} [d_A(h)], \qquad
\hat h= \mathop {\arg\min}_{h \in H_n} [CV(h)].
\end{displaymath}

From 5.1.9 we have, with probability 1 for all sufficiently large $n$,

\begin{displaymath}{{d_A(\hat h)-d_A(\hat h_0)-(CV(\hat h)-CV(\hat h_0))} \over {d_A(\hat
h)+
d_A(\hat h_0)}} \leq \varepsilon.\end{displaymath}

This implies

\begin{displaymath}0 \geq CV(\hat h)-CV(\hat h_0) \geq (1- \varepsilon)d_A(\hat h)-(1+
\varepsilon)d_A(\hat h_0), \end{displaymath}

which entails

\begin{displaymath}1\leq {{d_A(\hat h)} \over {d_A(\hat h_0)}} \leq {{1+ \varepsilon}
\over {1- \varepsilon}}\cdotp\end{displaymath}

Since $\varepsilon$ was arbitrary, it follows that

\begin{displaymath}P \left\{ \lim_{n \to \infty}\left\vert {{d_A(\hat h)} \over {d_A(\hat h_0)}}
-1\right\vert <
\delta \right\}=1 \ \forall \delta > 0,\end{displaymath}

which means that $\hat h$ is asymptotically optimal.