5.3 Local adaptation of the smoothing parameter

A smoothing parameter that is selected by one of the previously described methods optimizes a global error criterion. Such a ``global'' choice need not be optimal for the estimation of the regression curve at one particular point, as the trivial inequality

\begin{displaymath}\inf_h \int E(\hat m_h - m)^2 \ge \int \inf_h E(\hat m_h - m)^2\end{displaymath}

shows. In this section I present two methods for locally adapting the choice of the smoothing parameter. The first one is based on the idea of approximating the distribution of $\sqrt{nh} (\hat m_h - m)$ by bootstrapping. The second one, the supersmoother developed by Friedman (1984), is constructed via a ``local cross-validation'' method for $k$-$NN$ smoothers.

5.3.1 Improving the smooth locally by bootstrapping

We have already seen that the so-called wild bootstrap method (Section 4.2) allows us to approximate the distribution of $\sqrt{nh} (\hat m_h - m)$. In the following, though, I would like to present a slightly different bootstrap method in the simpler setting of i.i.d. error terms. This simpler setting has the advantage that resampling can be done from the whole set of observed residuals. Let $X_i=i/n$ and $\textrm{var}(\varepsilon_i) = \sigma^2$. The stochastic part of the observations is completely determined by the observation errors. Resampling should therefore be performed with the estimated residuals,

\begin{eqnarray*}
\hat \varepsilon_i &=&Y_i- \hat m_g(X_i) \cr
&=&Y_i-n^{-1} \sum^n_{j=1}K_g(X_i-X_j)Y_j, \ \ i=1, \ldots, n,
\end{eqnarray*}



where $g$ denotes a pilot bandwidth. Since the estimate is more biased near the boundary, it is advisable to use only residuals from an interior subinterval $[ \eta, 1-\eta ]$, $0< \eta<1/2$. In order to let the resampled residuals reflect the behavior of the true observation errors, they are recentered by their mean:

\begin{displaymath}\tilde \varepsilon_i = \hat \varepsilon_i- \hbox{mean} \{ \hat \varepsilon_i
\}.\end{displaymath}

Bootstrap residuals $\{ \varepsilon_i^* \}$ are then created by sampling with replacement from $\{ \tilde \varepsilon_i \},$ producing bootstrap response variables

\begin{displaymath}Y_i^*= \hat m_g(X_i)+ \varepsilon_i^*.\end{displaymath}
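Before turning to the bootstrap smoother itself, here is a minimal Python sketch of this resampling step (assuming NumPy). The Gaussian kernel, the function names m_hat and bootstrap_responses, and all tuning constants are illustrative choices of mine, not part of the method's definition.

\begin{verbatim}
import numpy as np

def m_hat(x, X, Y, h):
    """Fixed-design kernel smooth  n^{-1} sum_j K_h(x - X_j) Y_j,
    here with a Gaussian kernel K (an illustrative choice with d_K = 1)."""
    u = (np.atleast_1d(x).astype(float)[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return (K / h * Y[None, :]).mean(axis=1)

def bootstrap_responses(X, Y, g, eta=0.1, rng=None):
    """Estimate residuals with pilot bandwidth g, keep those from the
    interior [eta, 1 - eta], recenter them, resample with replacement
    and return the bootstrap responses Y_i* = m_g(X_i) + eps_i*."""
    rng = rng or np.random.default_rng()
    fitted = m_hat(X, X, Y, g)
    resid = Y - fitted
    resid = resid[(X >= eta) & (X <= 1.0 - eta)]   # avoid boundary bias
    resid = resid - resid.mean()                   # recentred residuals
    eps_star = rng.choice(resid, size=len(X), replace=True)
    return fitted + eps_star
\end{verbatim}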

A bootstrap estimator $\hat m^*$ of $m$ is obtained by smoothing $\{ (X_i, Y_i^*) \}$ rather than $\{ (X_i,Y_i) \}$. It is commonly said that the bootstrap principle holds if the distributions of $\hat m^*(x)$ and $\hat m(x)$, when suitably normalized, become close as the sample size $n$ increases. If convergence of these distributions is examined in the Mallows metric (Bickel and Freedman 1981), then the second moments of these distributions also become close. Since at a fixed point $x$ the MSE,

\begin{displaymath}E ( \hat m_h(x)-m(x) )^2,\end{displaymath}

is the quantity we are interested in, the bootstrap approximation in terms of the Mallows metric will give us a method for estimating the local mean squared error. To simplify the following calculations assume that the kernel is standardized to have $d_K=1$.

In the bootstrap, any occurrence of $\varepsilon_i$ is replaced by $\varepsilon^*_i$, and therefore

\begin{displaymath}\hat m^*_{h,g}(x)=n^{-1} \sum^n_{i=1} K_h(x-X_i) ( \hat m_g(X_i) +
\varepsilon^*_i )\end{displaymath}

is the bootstrap smoother. It is the aim of the bootstrap to approximate the distribution of $\sqrt{nh} (\hat m_h(x)-m(x))$, where

\begin{eqnarray*}
\hat m_h(x)-m(x)& \approx
& n^{-1} \sum^n_{i=1} K_h(x-X_i)\, \varepsilon_i \cr
&&\ +(h^2/2)\,m''(x), \ h \to 0, \ nh \to \infty.
\end{eqnarray*}



If this expansion is mirrored by the bootstrap estimator $\hat m^*_{h,g}$, one should center first around the expectation under the bootstrap distribution, which is approximately
\begin{displaymath}
\hat m_{C,h,g}(x)= n^{-1} \sum_i K_1 (x-X_i;h,g) Y_i,
\end{displaymath} (5.3.16)

where

\begin{displaymath}K_1(v;h,g)= (K_h * K_g)(v) = \int K_h(u)K_g(v-u)\,du \end{displaymath}

is the convolution kernel of $K_h$ and $K_g$. The bias component $(h^2/2)m''(x)$ may be estimated by employing a consistent estimator of $m''(x)$. (In Section 3.1 I defined kernel estimators of derivatives.) This results in the bootstrap approximation

\begin{displaymath}\sqrt{nh} (\hat m^*_{h,g}(x)- \hat m_{C,h,g}(x) + (h^2/2) \hat m''(x)),\end{displaymath}

where $\hat m''(x)$ denotes any consistent estimate of the second derivative $m''(x)$. Härdle and Bowman (1988) proved that the bootstrap principle holds.
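As a concrete illustration of the convolution kernel (a choice not made in the text), take $K$ to be the standard normal density, which satisfies $d_K=1$; then $K_h$ and $K_g$ are the $N(0,h^2)$ and $N(0,g^2)$ densities and their convolution is again normal,

\begin{displaymath}K_1(v;h,g)={1 \over \sqrt{2\pi (h^2+g^2)}}
\exp\left( -{v^2 \over 2(h^2+g^2)} \right), \end{displaymath}

so that $\hat m_{C,h,g}$ is simply a kernel smooth with the larger effective bandwidth $\sqrt{h^2+g^2}$.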

Theorem 5.3.1   If $h$ and $g$ tend to zero at the rate $n^{-1/5}$, the kernel function $K$ is Lipschitz continuous, and $m$ is twice differentiable, then the bootstrap principle holds, that is,

\begin{eqnarray*}
\lefteqn{d_2(\sqrt{nh} ( \hat m_h(x)-m(x) ),} \cr
&& \ \sqrt{nh} ( \hat m^*_{h,g}(x)- \hat m_{C,h,g}(x)+(h^2/2) \hat m''(x) )) {\buildrel p \over \ \to \ 0}, \end{eqnarray*}



where

\begin{displaymath}d_2(F,G)=\inf_{{X \sim F}\atop{Y \sim G}} [ E_{(X,Y)} (X-Y)^2 ]^{1/2}\end{displaymath}

denotes the Mallows metric.

The MSE $d_M(x;h)=E ( \hat m_h(x)-m(x) )^2$ can then be estimated by

\begin{displaymath}\hat d_M(x;h)= \int ( \hat m^*_{h,g}(x)- \hat m_{C,h,g}(x) + (h^2/2)
\hat m''(x) )^2 d F_n^*, \end{displaymath}

where $ F_n^*$ denotes the empirical distribution function of $\{ \tilde
\varepsilon_i \}$. Denote by $\hat h(x)$ the bandwidth that minimizes $\hat d_M(x;h)$ over a set of smoothing parameters $H_n$.
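In practice the integral with respect to $F_n^*$ is approximated by averaging over a moderate number of bootstrap resamples. The following sketch (continuing the Python fragment above, and taking some consistent estimate of $m''$ as given through a user-supplied function m2_hat) is one way this could be organized; it is an illustration, not the authors' implementation.

\begin{verbatim}
def d_M_hat(x, X, Y, h, g, m2_hat, B=200, eta=0.1, rng=None):
    """Monte Carlo version of the bootstrap MSE estimate at a point x."""
    rng = rng or np.random.default_rng()
    xe = np.array([x], dtype=float)
    m_g = m_hat(X, X, Y, g)
    # exact bootstrap expectation n^{-1} sum_i K_h(x - X_i) m_g(X_i),
    # which approximates the convolution-kernel smooth of (5.3.16)
    center = m_hat(xe, X, m_g, h)[0]
    bias = 0.5 * h ** 2 * m2_hat(x)        # (h^2/2) m''(x),  d_K = 1
    sq = np.empty(B)
    for b in range(B):
        Y_star = bootstrap_responses(X, Y, g, eta=eta, rng=rng)
        sq[b] = (m_hat(xe, X, Y_star, h)[0] - center + bias) ** 2
    return sq.mean()

def local_bandwidth(x, X, Y, g, m2_hat, H):
    """Pick h(x) by minimizing the bootstrap MSE estimate over a grid H."""
    return min(H, key=lambda h: d_M_hat(x, X, Y, h, g, m2_hat))
\end{verbatim}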

This choice of local adaptive bandwidth is asymptotically optimal in the sense of Theorem 5.1.1 as Härdle and Bowman (1988) show; that is,

\begin{displaymath}
{d_M(x; \hat h(x)) \over \inf_{h \in H_n} d_M(x;h)} {\buildrel p \over
\to} 1.
\end{displaymath} (5.3.17)

This adaptive choice of $h = h(x)$ is illustrated in Figure 5.15, which displays some data simulated by adding a normally distributed error, with standard deviation 0.1, to the curve $m(x)=\sin(4 \pi x)$ evaluated at $X_i={(i-1/2)\over 100}$, $i=1, \ldots, n=100$. Cross-validation was used to select a good global smoothing parameter $(g=0.03)$, and the resulting estimate of the regression function shows the problems caused by bias at the peaks and troughs, where $\left\vert m''(x) \right\vert$ is high.
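This simulated design is easy to reproduce. The lines below (reusing the hypothetical m_hat from above) generate such a data set together with the global fit; the seed is arbitrary.

\begin{verbatim}
rng = np.random.default_rng(0)              # arbitrary seed
n = 100
X = (np.arange(1, n + 1) - 0.5) / n         # X_i = (i - 1/2)/100
Y = np.sin(4 * np.pi * X) + 0.1 * rng.standard_normal(n)
global_fit = m_hat(X, X, Y, 0.03)           # global bandwidth g = 0.03
\end{verbatim}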

Figure 5.15: Data simulated from the curve $\scriptstyle m(x)=\sin(4\pi x)$, with $\scriptstyle N(0,(0.1)^2)$ error distribution. True curve (solid line, label 1); global smoothing (dashed line, label 2); local adaptive smoothing (fine dashed line, label 3). From Härdle and Bowman (1988) with permission of the American Statistical Association.
\includegraphics[scale=0.12]{ANR5,13.ps}

To see which local smoothing parameters were actually used, consider Figure 5.16. This figure plots the local smoothing parameters obtained by minimizing the bootstrap estimate $\hat d_M(x;h)$ as a function of $x$.

Figure 5.16: Local smoothing parameters for the simulated data of Figure 5.15. Asymptotically optimal (solid line, label 1); direct estimation (dashed line, label 2); bootstrap (fine dashed line, label 3). From Härdle and Bowman (1988) with permission of the American Statistical Association.
\includegraphics[scale=0.12]{ANR5,14.ps}

For comparison, the asymptotically optimal local smoothing parameters

\begin{displaymath}h_0^*(x)=C_0(x)\ n^{-1/5},\end{displaymath}

with

\begin{displaymath}C_0(x)=\left[ {\sigma^2 \ c_K \over
(d_K \ m''(x))^2 } \right]^{1/5}, \end{displaymath}

are also plotted. It can be seen that an appropriate pattern of local smoothing has been achieved. Comparison with the ``plug-in'' local smoothing parameters (based on estimating $C_0$) revealed little difference for this example. The advantage of the above bootstrap method, though, lies in the fact that it is insensitive to irregularities introduced by estimation of $m''(x)$; see Härdle and Bowman (1988). Also, the plug-in method requires an estimate of the bias; see Müller and Stadtmüller (1987). The above idea of bootstrapping from estimated residuals has been applied to spectral density estimation by Franke and Härdle (1988).
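For the simulated curve $m(x)=\sin(4 \pi x)$ of Figure 5.15 the constant $C_0(x)$ can be written out explicitly. Since $m''(x)=-16\pi^2 \sin(4 \pi x)$,

\begin{displaymath}C_0(x)=\left[ {\sigma^2 \ c_K \over
(16 \pi^2 d_K \sin(4 \pi x))^2 } \right]^{1/5}
\propto \left\vert \sin(4 \pi x) \right\vert^{-2/5}, \end{displaymath}

so the asymptotically optimal local bandwidth is smallest at the peaks and troughs, where $\left\vert m''(x) \right\vert$ is largest, and grows large (formally without bound) near the inflection points where $m''(x)=0$; this is the qualitative pattern one expects to see in Figure 5.16.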

5.3.2 The supersmoother

The so-called supersmoother proposed by Friedman (1984) is based on local linear $k$-$NN$ fits in a variable neighborhood of the estimation point $x$. ``Local cross-validation'' is applied to estimate the optimal span as a function of the predictor variable. The algorithm is based on the $k$-$NN$ updating formulas as described in Section 3.4. It is therefore highly computationally efficient.

The name ``supersmoother'' stems from the fact that it uses optimizing resampling techniques at a minimum of computational effort. The basic idea of the supersmoother is the same as that for the bootstrap smoother. Both methods attempt to minimize the local mean squared error. The supersmoother is constructed from three initial smooths, the tweeter, midrange and woofer. They are intended to reproduce the three main parts of the frequency spectrum of $m(x)$ and are defined by $k$-$NN$ smooths with $k=0.05n$, $0.2n$ and $0.5n$, respectively. Next, the cross-validated residuals

\begin{displaymath}
r_{(i)}(k)= \left[ Y_i- \hat m_k(X_i) \right] \left(1-1/k- {(X_i- \overline
\mu_{X_i})^2 \over V_{X_i}}\right)^{-1}
\end{displaymath} (5.3.18)

are computed, where $\overline \mu_{X_i}$ and $V_{X_i}$ denote the local mean and variance from the $k$ nearest neighbors of $X_i$ as in Section 3.4. Then the best span values $\hat k(X_i)$ are determined by minimizing $\left\vert r_{(i)}(k) \right\vert$ at each $X_i$ over the tweeter, midrange and woofer values of $k$.

Since a smooth based on this span sequence would, in practice, have an unnecessarily high variance, it is recommended to smooth the values $ \left\vert r_{(i)}(k) \right\vert $ against $X_i$ and to use the resulting smooths to select the best span values $\hat k(X_i)$. In a further step the span values $\hat k(X_i)$ are smoothed against $X_i$ (with a midrange smoother). The result is an estimated span for each observation with a value between the tweeter and the woofer values; a schematic rendering of these steps is sketched below.
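The following Python sketch renders the span-selection steps schematically. It uses a brute-force local linear $k$-$NN$ fit rather than the updating formulas of Section 3.4, takes $V_{X_i}$ as the local sum of squared deviations, and uses the midrange span to smooth the absolute residuals; these and all function names are simplifying choices of mine, meant only to illustrate the logic.

\begin{verbatim}
def local_linear_knn(X, Y, k):
    """Local linear k-NN fit m_k(X_i) together with the correction factor
    1 - 1/k - (X_i - mean)^2 / V of (5.3.18); V is taken here as the
    sum of squared deviations within the neighborhood."""
    n = len(X)
    order = np.argsort(X)
    Xs, Ys = X[order], Y[order]
    fit, corr = np.empty(n), np.empty(n)
    for j in range(n):
        lo = max(0, min(j - k // 2, n - k))     # symmetric window, simplified
        xb, yb = Xs[lo:lo + k], Ys[lo:lo + k]
        mu = xb.mean()
        V = max(((xb - mu) ** 2).sum(), 1e-12)  # guard against ties
        beta = ((xb - mu) * (yb - yb.mean())).sum() / V
        fit[j] = yb.mean() + beta * (Xs[j] - mu)
        corr[j] = 1.0 - 1.0 / k - (Xs[j] - mu) ** 2 / V
    out_fit, out_corr = np.empty(n), np.empty(n)
    out_fit[order], out_corr[order] = fit, corr
    return out_fit, out_corr

def supersmoother_spans(X, Y):
    """Pick the best of the tweeter/midrange/woofer spans at each X_i by
    smoothed absolute cross-validated residuals, then smooth the spans."""
    n = len(X)
    spans = [max(2, int(0.05 * n)), max(2, int(0.2 * n)), max(2, int(0.5 * n))]
    abs_r = []
    for k in spans:
        fit, corr = local_linear_knn(X, Y, k)
        r = (Y - fit) / np.clip(corr, 1e-6, None)   # residuals as in (5.3.18)
        abs_r.append(local_linear_knn(X, np.abs(r), spans[1])[0])
    best = np.argmin(np.vstack(abs_r), axis=0)      # best span index per X_i
    k_hat = np.array([spans[b] for b in best], dtype=float)
    return local_linear_knn(X, k_hat, spans[1])[0]  # smoothed span sequence
\end{verbatim}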

The resulting curve estimate, the supersmoother, is obtained by interpolating between the two (out of the three) smoothers with closest span values. Figure 5.17 shows $n=200$ pairs $\{(X_i, Y_i) \}^n_{i=1}$ with $\{X_i \}$ uniform on $[0,1]$,

\begin{displaymath}Y_i=\sin(2 \pi (1-X_i)^2)+X_i \varepsilon_i, \end{displaymath}

where the $ \{ \varepsilon_i \}$ are i.i.d. standard normal variates. The resulting supersmoother is shown as a solid line.

Figure 5.17: A scatter plot of $\scriptstyle n=200$ data points $\scriptstyle
\{ (X_i,Y_i) \}^n_{i=1}$. $\scriptstyle X_i$ is uniformly distributed over $\scriptstyle [ 0,1 ]$ and $\scriptstyle Y_i=\sin(2 \pi (1-X_i)^2)+X_i \varepsilon_i$, $\scriptstyle \varepsilon_i \sim N(0,1)$. The solid line indicates the supersmoother. From Friedman (1984) with permission of the author.
\includegraphics[scale=0.7]{ANRsupsmo.ps}

Figure 5.18 shows the estimated optimal span $\hat k(X_i)$ as a function of $X_i$. In the ``low-noise high-curvature'' region $(x<0.2)$ the tweeter span is proposed. In the remaining regions a span value about the midrange is suggested.

Figure 5.18: The selected span sequence $\scriptstyle \hat k(X_i)$ for the data from Figure 5.17. From Friedman (1984) with permission of the author.
\includegraphics[scale=0.2]{ANR5,16.ps}

When $m(x)$ is very smooth, more accurate curve estimates can be obtained by biasing the smoothing parameter toward larger span values. One way of doing this would be to use a smoothing parameter selection criterion that penalizes small spans more heavily near the ``no smoothing'' point $k=1$. For example, Rice's $T$ (Figure 5.10) would bias the estimator toward smoother curves. Friedman (1984) proposed parameterizing this ``selection bias'' for enhancing the bass component of the smoother output. For this purpose, introduce the span

\begin{displaymath}\tilde k(X_i) = \hat k(X_i)+(k_W- \hat k(X_i)) R_i^{10- \alpha} \end{displaymath}

with

\begin{displaymath}R_i = \left[ {\hat e (X_i, \hat k (X_i)) \over \hat e (X_i, k_W)} \right],
\end{displaymath}

where $\hat e (x,k)$ denotes the estimated residual at $x$ with smoothing parameter $k$, and $k_W =0.5n$ is the woofer span. The parameter $0 \le \alpha \le 10$ is called the tone control. The value $\alpha=0$ corresponds to very little bass enhancement, whereas $\alpha=10$ corresponds to the woofer (maximum bass). A choice of $\alpha$ between these two extremes biases the selection procedure toward larger span values.
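In code the bass-enhancement step is a single vectorized line once the two residual curves are available. The sketch below (continuing the earlier NumPy fragments) assumes arrays holding $\hat k(X_i)$ and the estimated residuals $\hat e(X_i, \hat k(X_i))$ and $\hat e(X_i, k_W)$, taken here to be nonnegative (e.g. smoothed absolute residuals); the names are hypothetical.

\begin{verbatim}
def bass_enhance(k_hat, e_hat_k, e_hat_woofer, n, alpha=5.0):
    """Bias the selected spans toward the woofer span k_W = 0.5 n.
    alpha = 0 leaves the spans nearly unchanged; alpha = 10 returns k_W."""
    k_W = 0.5 * n
    R = e_hat_k / e_hat_woofer        # residual ratio R_i (assumed nonnegative)
    return k_hat + (k_W - k_hat) * R ** (10.0 - alpha)
\end{verbatim}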

Exercises

5.3.1 Prove that the term $ \hat m_{C,h,g}(x)$ approximates $E_{F_n^*} \hat m_{h,g}^* (x)$ to an order smaller than $(nh)^{-1/2}$,

\begin{displaymath}\hat m_{C,h,g}(x) = E_{F_n^*} \hat m_{h,g}^* (x)
+ o_p(n^{-1/2}h^{-1/2}). \end{displaymath}

5.3.2 What is the difference between the method here in Section 5.3 and the wild bootstrap? Can you prove Theorem 5.3.1 without the bias estimate?

[Hint: Use an oversmoothed resampling mean $\hat m_g(x)$ to construct the bootstrap observations, $Y_i^* = \hat m_g(X_i) + \varepsilon_i^*$. The difference

\begin{displaymath}E_{F_n^*} \hat m_{h,g}^*(x) - \hat m_g(x)\end{displaymath}

will reflect, as in the wild bootstrap, the bias of $\hat m_h(x)$.]

5.3.3 Show that the cross-validated residuals (5.3.18) stem from the leave-out technique applied to $k$-$NN$ smoothing.

5.3.4 Try the woofer, midrange and tweeter on the simulated data set from Table 2, Appendix. Compare them with the supersmoother. Can you comment on where and why the supersmoother changed the smoothing parameter?

[Hint: Use XploRe (1989) or a similar interactive package.]