4.3 Smoothing Parameter Selection

As we pointed out in the preceding sections, for some nonparametric estimators at least an asymptotic connection can be made to kernel regression estimators. Hence, in this section we focus on finding a good way of choosing the smoothing parameter of kernel regression estimators, namely the bandwidth $h$.

What conditions do we require for a bandwidth selection rule to be ``good''? First of all, it should have desirable theoretical properties. Secondly, it has to be applicable in practice. Regarding the first condition, a number of criteria have been proposed that measure, in one way or another, how close the estimate is to the true curve: the averaged squared error $\ase$, which averages the squared deviations $\{\widehat{m}_{h}(X_{i})-m(X_{i})\}^{2}$ over the observations (possibly with a weight function $w$), its continuous counterpart, the integrated squared error $\ise$, and the mean integrated squared error $\mise=E(\ise)$.

Which discrepancy measure should be used to derive a rule for choosing $h$? A natural choice would be $\mise$ or its asymptotic version $\amise$, since we have some experience with its optimization from the density case. The $\amise$ in the regression case, however, involves more unknown quantities than the $\amise$ in density estimation. As a result, plug-in approaches are mainly used for the local linear estimator, due to its simpler bias formula. See for instance Wand & Jones (1995, pp. 138-139) for some examples.

We will discuss two approaches of rather general applicability: cross-validation and penalizing functions. For the sake of simplicity, we restrict ourselves to bandwidth selection for the Nadaraya-Watson estimator here. For that estimator it has been shown (Marron & Härdle, 1986) that $\ase$, $\ise$ and $\mise$ lead asymptotically to the same level of smoothing. Hence, we can use the criterion which is easiest to calculate and manipulate: the discrete $\ase(h)$.

4.3.1 A Closer Look at the Averaged Squared Error

We want to find the bandwidth $ h$ that minimizes $ \ase(h)$. For easy reference, let us write down $ \ase(h)$ in more detail:

$\displaystyle \index{ASE!regression}\ase(h) = \frac{1}{n}\sum_{i=1}^{n}m^{2}(X_{i})\,w(X_{i}) +\frac{1}{n}\sum_{i=1}^{n}\widehat{m}^{2}_{h}(X_{i})\,w(X_{i}) -\frac{2}{n}\sum_{i=1}^{n}m(X_{i})\,\widehat{m}_{h}(X_{i})\,w(X_{i}).$ (4.42)

We already pointed out that $ \ase$ is a random variable. Its conditional expectation, $ \mase$, is given by
$\displaystyle \index{MASE!regression}\mase(h) = E\{\ase(h)\,\vert\, X_{1}=x_{1},\ldots,X_{n}=x_{n}\}$ (4.43)
$\displaystyle \phantom{\mase(h)} = \frac{1}{n}\sum_{i=1}^{n} E\left[\left\{\widehat{m}_{h}(X_{i})-m(X_{i})\right\}^{2} \,\big\vert\, X_{1}=x_{1},\ldots,X_{n}=x_{n}\right] w(X_{i})$
$\displaystyle \phantom{\mase(h)} = \underbrace{\frac{1}{n}\sum_{i=1}^{n}Var\{\widehat{m}_{h}(X_{i})\,\vert\, X_{1}=x_{1},\ldots,X_{n}=x_{n}\}\,w(X_{i})}_{v(h)} + \underbrace{\frac{1}{n}\sum_{i=1}^{n}{\bias}^{2}\{\widehat{m}_{h}(X_{i})\,\vert\, X_{1}=x_{1},\ldots,X_{n}=x_{n}\}\,w(X_{i})}_{b^{2}(h)},$

with squared bias

$\displaystyle b^{2}(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{1}{n}\sum_{j=1}^{n}\frac{K_{h}(X_{i}-X_{j})}{\widehat{f}_{h}(X_{i})}\,m(X_{j})-m(X_{i})\right\}^{2}\,w(X_{i})$ (4.44)

and variance

$\displaystyle v(h)=\frac{1}{n}\sum_{i=1}^{n}\left[\frac{1}{n^{2}}\sum_{j=1}^{n}\left\{\frac{K_{h}(X_{i}-X_{j})}{\widehat{f}_{h}(X_{i})}\right\}^{2}\sigma^{2}(X_{j})\right]w(X_{i}).$ (4.45)

The following example shows how the squared bias, the variance and their sum $\mase$ depend on the bandwidth $h$.

Figure: $ \mase$ (thick line), squared bias (thin solid line) and variance part (thin dashed line) for simulated data, weights $ w(x)=\Ind(x\in[0.05,0.95])$
\includegraphics[width=0.03\defepswidth]{quantlet.ps}SPMsimulmase
\includegraphics[width=1.2\defpicwidth]{SPMsimulmaseA.ps}

EXAMPLE 4.12  
The squared bias is increasing in $ h$ as can be seen in Figure 4.10 where $ b^{2}(h)$ is plotted along with the decreasing $ v(h)$ and their sum $ \mase$ (thick line). Apparently, there is the familiar trade-off that increasing $ h$ will reduce the variance but increase the squared bias. The minimum $ \mase$ is achieved at $ h=0.085$.

You may wonder how we are able to compute these quantities since they involve the unknown $ m(\bullet)$. The answer is simple: We have generated the data ourselves, determining the regression function

$\displaystyle m(x)=\{\sin(2\pi x^3)\}^3$

beforehand. The data have been generated according to

$\displaystyle Y_{i}=m(X_{i})+\varepsilon_{i}, \quad X_{i}\sim U[0,1], \quad
\varepsilon_{i}\sim N(0,0.01),$

see Figure 4.11. $\Box$

Figure: Simulated data with true and estimated curve
\includegraphics[width=0.03\defepswidth]{quantlet.ps}SPMsimulmase
\includegraphics[width=1.2\defpicwidth]{SPMsimulmaseB.ps}
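
Since the data in this example are simulated and $m(\bullet)$ and $\sigma^{2}(\bullet)$ are known, the quantities $b^{2}(h)$, $v(h)$ and $\mase(h)$ in (4.43)-(4.45) can be computed directly. The following minimal sketch does this in Python/NumPy; it is not the quantlet SPMsimulmase used for the figures, and the sample size, the quartic kernel and the bandwidth grid are assumptions made for illustration only.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(42)
n = 200                                     # sample size: an assumption, not given in the text
X = rng.uniform(0.0, 1.0, n)                # X_i ~ U[0, 1]
m = lambda x: np.sin(2 * np.pi * x**3) ** 3
sigma2 = 0.01                               # Var(eps_i) = 0.01
w = ((X >= 0.05) & (X <= 0.95)).astype(float)   # w(x) = I(x in [0.05, 0.95])
# responses Y_i are not needed: MASE is conditional on the design points X_i

def quartic(u):
    """Quartic (biweight) kernel."""
    return np.where(np.abs(u) <= 1, 15.0 / 16.0 * (1 - u**2) ** 2, 0.0)

def bias2_var(h):
    """Squared bias b^2(h) and variance v(h) as in (4.44) and (4.45)."""
    K = quartic((X[:, None] - X[None, :]) / h) / h   # K_h(X_i - X_j)
    fhat = K.mean(axis=1)                            # kernel density estimate f_h(X_i)
    W = K / fhat[:, None]                            # K_h(X_i - X_j) / f_h(X_i)
    b2 = np.mean(((W @ m(X)) / n - m(X)) ** 2 * w)
    v = np.mean((W**2).sum(axis=1) * sigma2 / n**2 * w)
    return b2, v

grid = np.linspace(0.03, 0.25, 45)
mase = [sum(bias2_var(h)) for h in grid]
print("MASE-minimizing bandwidth on the grid:", round(grid[int(np.argmin(mase))], 3))
\end{verbatim}

The reported minimizer depends on the simulated design points; the value $h=0.085$ quoted above refers to the particular sample underlying Figure 4.10.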

What is true for $ \mase$ is also true for $ \ase(h)$: it involves $ m(\bullet)$, the function we want to estimate. Therefore, we have to replace $ \ase(h)$ with an approximation that can be computed from the data. A naive way of replacing $ m(\bullet)$ would be to use the observations of $ Y$ instead, i.e.

$\displaystyle \index{resubstitution estimate} p(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{Y_{i}-\widehat{m}_h(X_{i})\right\}^{2}w(X_{i}),$ (4.46)

which is called the resubstitution estimate and is essentially a weighted residual sum of squares ($\rss$). However, there is a problem with this approach: $Y_{i}$ is used in $\widehat{m}_h(X_{i})$ to predict itself. As a consequence, $p(h)$ can be made arbitrarily small by letting $h\rightarrow 0$, in which case $\widehat{m}_{h}(\bullet)$ interpolates the $Y_{i}$.

To gain more insight into this matter let us expand $ p(h)$ by adding and subtracting $ m(X_{i})$:

$\displaystyle p(h) = \frac{1}{n}\sum_{i=1}^{n}\left[ \left\{ Y_{i}-m(X_{i}) \right\} +\left\{ m(X_{i})-\widehat{m}_{h}(X_{i}) \right\}\right]^2 w(X_{i})$
$\displaystyle \phantom{p(h)} = \frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}^{2}\,w(X_{i})+\ase(h) -\frac{2}{n}\sum_{i=1}^{n}\varepsilon_{i} \{\widehat{m}_{h}(X_{i})-m(X_{i})\}\, w(X_{i}),$ (4.47)

where $\varepsilon_{i}=Y_{i}-m(X_{i})$. Note that the first term $\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}^{2}w(X_{i})$ of (4.47) does not depend on $h$, and the second term is $\ase(h)$. Hence, minimizing $p(h)$ would lead to the same result as minimizing $\ase(h)$ if it were not for the third term $-\frac{2}{n}\sum_{i=1}^{n}\varepsilon_{i}\{\widehat{m}_{h}(X_{i})-m(X_{i})\}w(X_{i})$. In fact, if we calculate the conditional expectation of $p(h)$
$\displaystyle E\left\{ p(h)\,\vert\, X_{1}=x_{1},\ldots,X_{n}=x_{n}\right\} = \frac{1}{n}\sum_{i=1}^{n}\sigma^{2}(x_{i})w(x_{i}) +E\{\ase(h)\,\vert\, X_{1}=x_{1},\ldots,X_{n}=x_{n}\} -\frac{2}{n^2}\sum_{i=1}^{n}W_{hi}(x_{i})\sigma^2(x_{i})w(x_{i})$ (4.48)

we observe that the third term (recall the definition of $ W_{hi}$ in (4.7)), which is the conditional expectation of

$\displaystyle -\frac{2}{n}\sum_{i=1}^{n}\varepsilon_{i}
\{\widehat{m}_{h}(X_{i})-m(X_{i})\}w(X_{i}),$

tends to zero at the same rate as the variance $ v(h)$ in (4.45) and has a negative sign. Therefore, $ p(h)$ is downwardly biased as an estimate of $ \ase(h)$, just as the bandwidth minimizing $ p(h)$ is downwardly biased as an estimate of the bandwidth minimizing $ \ase(h)$.
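
This downward bias is easy to observe numerically when $m(\bullet)$ is known. The sketch below (Python/NumPy, reusing the data-generating process of Example 4.12; sample size, noise level and bandwidth grid are assumptions) computes $p(h)$ and $\ase(h)$ side by side: the resubstitution criterion keeps shrinking as $h$ decreases, while $\ase(h)$ passes through a minimum.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 150                                     # assumption
X = rng.uniform(0, 1, n)
m = lambda x: np.sin(2 * np.pi * x**3) ** 3
Y = m(X) + rng.normal(0, 0.1, n)            # sigma^2 = 0.01
w = ((X >= 0.05) & (X <= 0.95)).astype(float)
quartic = lambda u: np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

def nw(h):
    """Nadaraya-Watson fit evaluated at the observation points X_i."""
    K = quartic((X[:, None] - X[None, :]) / h) / h
    return K @ Y / K.sum(axis=1)

for h in [0.20, 0.10, 0.05, 0.02, 0.01]:
    mh = nw(h)
    p = np.mean((Y - mh) ** 2 * w)          # resubstitution estimate p(h), eq. (4.46)
    ase = np.mean((mh - m(X)) ** 2 * w)     # ASE(h), computable here because m is known
    print(f"h={h:4.2f}   p(h)={p:.4f}   ASE(h)={ase:.4f}")
# p(h) shrinks towards zero as h -> 0, whereas ASE(h) does not.
\end{verbatim}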

In the following two sections we will examine two ways out of this dilemma. The method of cross-validation replaces $ \widehat{m}_{h}(X_{i})$ in (4.46) with the leave-one-out-estimator $ \widehat{m}_{h,-i}(X_{i})$. In a different approach $ p(h)$ is multiplied by a penalizing function which corrects for the downward bias of the resubstitution estimate.

4.3.2 Cross-Validation

We have already encountered cross-validation in the context of bandwidth selection in kernel density estimation. This time around, we will use it as a remedy for the problem that in

$\displaystyle p(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{Y_{i}-\widehat{m}_h(X_{i})\right\}^{2} w(X_{i})$ (4.49)

$ Y_{i}$ is used in $ \widehat{m}_h(X_{i})$ to predict itself. Cross-validation solves this problem by employing the leave-one-out-estimator

$\displaystyle \widehat{m}_{h,-i}(X_{i})=\frac{\sum_{j\neq i}K_h(X_{i}-X_{j})Y_{j}} {\sum_{j\neq i}K_h(X_{i}-X_{j})}.$ (4.50)

That is, in estimating $ \widehat{m}_{h}(\bullet)$ at $ X_{i}$ the $ i$th observation is left out (as reflected in the subscript ``$ -i$''). This leads to the cross-validation function

$\displaystyle CV(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{Y_{i}- \widehat{m}_{h,-i}(X_{i})\right\}^2 w(X_{i}).$ (4.51)

In terms of the analysis of the previous section, it can be shown that the conditional expectation of the third term of (4.47) is equal to zero if we use $\widehat{m}_{h,-i}(X_{i})$ instead of $\widehat{m}_{h}(X_{i})$, i.e.

$\displaystyle E\left[-\frac{2}{n}\sum_{i=1}^{n}\varepsilon_{i} \{\widehat{m}_{h,-i}(X_{i})-m(X_{i})\}\, w(X_{i})\,\Big\vert\, X_{1}=x_{1},\ldots,X_{n}=x_{n}\right] = 0.$

This means minimizing $ CV(h)$ is (on average) equivalent to minimizing $ \ase(h)$ since the first term in (4.47) is independent of $ h$. We can conclude that with the bandwidth selection rule ``choose $ \widehat{h}$ to minimize $ CV(h)$'' we have found a rule that is both theoretically desirable and applicable in practice.
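
In practice, $CV(h)$ is simply evaluated over a grid of candidate bandwidths and the minimizer is taken as $\widehat{h}_{CV}$. The following minimal sketch implements this rule for the Nadaraya-Watson estimator in Python/NumPy; it is not the quantlet used for the figures, and the quartic kernel, the simulated data and the grid are assumptions for illustration.

\begin{verbatim}
import numpy as np

def quartic(u):
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

def cv_nw(X, Y, h, w=None):
    """Cross-validation criterion CV(h) of (4.51) for the Nadaraya-Watson estimator."""
    n = len(X)
    w = np.ones(n) if w is None else w
    K = quartic((X[:, None] - X[None, :]) / h) / h   # K_h(X_i - X_j)
    np.fill_diagonal(K, 0.0)                         # leave the i-th observation out
    m_loo = K @ Y / K.sum(axis=1)                    # leave-one-out estimator, eq. (4.50)
    return np.mean((Y - m_loo) ** 2 * w)

def cv_bandwidth(X, Y, grid, w=None):
    """Bandwidth on `grid` that minimizes CV(h)."""
    return grid[int(np.argmin([cv_nw(X, Y, h, w) for h in grid]))]

# toy usage with simulated data (a stand-in for the Family Expenditure data);
# the grid should start at a bandwidth for which every X_i still has neighbors
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X**3) ** 3 + rng.normal(0, 0.1, 200)
print("CV-selected bandwidth:", round(cv_bandwidth(X, Y, np.linspace(0.03, 0.30, 28)), 3))
\end{verbatim}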

Figure: Nadaraya-Watson kernel regression with cross-validated bandwidth $ \widehat h_{CV}=0.15$, U.K. Family Expenditure Survey 1973
\includegraphics[width=0.03\defepswidth]{quantlet.ps}SPMnadwaest
\includegraphics[width=1.2\defpicwidth]{SPMnadwaest.ps}

EXAMPLE 4.13  
Let us apply the cross-validation method to the Engel curve example now. Figure 4.12 shows the Nadaraya-Watson kernel regression curve (recall that we always used the Quartic kernel for the figures) with the bandwidth chosen by minimizing the cross-validation criterion $ CV(h)$.

Figure: Local polynomial regression ($ p=1$) with cross-validated bandwidth $ \widehat h_{CV}=0.56$, U.K. Family Expenditure Survey 1973
\includegraphics[width=0.03\defepswidth]{quantlet.ps}SPMlocpolyest
\includegraphics[width=1.2\defpicwidth]{SPMlocpolyest.ps}

For comparison purposes, let us consider bandwidth selection for a different nonparametric smoothing method. You can easily see that applying the cross-validation approach to local polynomial regression presents no problem. This is what we have done in Figure 4.13, which shows the local linear estimate with cross-validated bandwidth for the same data. As we already pointed out in Subsection 4.1.3, the estimate shows more stable behavior in the high net-income region (a region with few observations) and outperforms the Nadaraya-Watson estimate at the boundaries. $\Box$
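
For the local linear estimator ($p=1$), only the fitting step changes: at each point a weighted least squares line is fitted, and for the leave-one-out criterion the $i$th observation is simply dropped before refitting. A rough Python/NumPy sketch is given below; it is not the quantlet SPMlocpolyest, and kernel, data and grid are again placeholders.

\begin{verbatim}
import numpy as np

quartic = lambda u: np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

def local_linear(x0, X, Y, h):
    """Local linear fit at x0: weighted least squares of Y on (1, X - x0)."""
    k = np.sqrt(quartic((X - x0) / h) / h)           # square roots of the kernel weights
    Z = np.column_stack([np.ones_like(X), X - x0])
    beta = np.linalg.lstsq(Z * k[:, None], Y * k, rcond=None)[0]
    return beta[0]                                   # the intercept is the fitted value at x0

def cv_loclin(X, Y, h):
    """CV(h) based on the leave-one-out local linear estimator."""
    res = [Y[i] - local_linear(X[i], np.delete(X, i), np.delete(Y, i), h)
           for i in range(len(X))]
    return np.mean(np.square(res))

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, 150)
Y = np.sin(2 * np.pi * X**3) ** 3 + rng.normal(0, 0.1, 150)
grid = np.linspace(0.05, 0.40, 15)
h_cv = grid[int(np.argmin([cv_loclin(X, Y, h) for h in grid]))]
print("CV-selected bandwidth (local linear):", round(h_cv, 3))
\end{verbatim}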

4.3.3 Penalizing Functions

Recall the formula (4.48) for the conditional expectation of $p(h)$. That is,

$\displaystyle E\left\{p(h)\vert X_{1}=x_{1},\ldots,X_{n}=x_{n}\right\}\neq
E\{\ase(h)\vert X_{1}=x_{1},\ldots,X_{n}=x_{n}\}.$

You might argue that this inequality is not all that important as long as the bandwidth minimizing $E\{p(h)\vert X_{1}=x_{1},\ldots,X_{n}=x_{n}\}$ is equal to the bandwidth minimizing $E\{\ase(h)\vert X_{1}=x_{1},\ldots,X_{n}=x_{n}\}$. Unfortunately, one of the two terms causing the inequality, the last term of (4.48), depends on $h$ and causes the downward bias. The penalizing function approach corrects for this downward bias by multiplying $p(h)$ by a correction factor that penalizes bandwidths that are too small. The ``corrected version'' of $p(h)$ can be written as

$\displaystyle G(h)=\frac{1}{n}\sum_{i=1}^{n} \left\{Y_{i}-\widehat{m}_{h}(X_{i})\right\}^2 \,\Xi \left(\frac{1}{n} W_{hi}(X_{i})\right)\, w(X_{i}),$ (4.52)

with a correction function $\Xi$. As we will see in a moment, a penalizing function $\Xi(u)$ with first-order Taylor expansion $\Xi(u)=1+2u+O(u^2)$ for $u\to 0$ will work well. Using this Taylor expansion we can write (4.52) as
$\displaystyle G(h) = \frac{1}{n}\sum_{i=1}^{n}\left[\varepsilon_{i}^{2} +\left\{m(X_{i})-\widehat{m}_{h}(X_{i})\right\}^{2} -2\varepsilon_{i}\left\{\widehat{m}_{h}(X_{i})-m(X_{i})\right\}\right] \left\{1+\frac{2}{n}W_{hi}(X_{i})\right\}w(X_{i}) +O\left((nh)^{-2}\right).$ (4.53)

Multiplying out and ignoring terms of higher order, we get
$\displaystyle G(h) \approx \frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}^{2}w(X_{i})+\ase(h) -\frac{2}{n}\sum_{i=1}^{n}\varepsilon_{i} \left\{\widehat{m}_{h}(X_{i})-m(X_{i})\right\}w(X_{i}) +\frac{2}{n^2}\sum_{i=1}^{n}\varepsilon_{i}^{2}W_{hi}(X_{i})w(X_{i}).$ (4.54)

The first term in (4.54) does not depend on $h$. The conditional expectation of the third term, given $X_{1}=x_{1},\ldots,X_{n}=x_{n}$, is exactly the last term of (4.48), namely $-\frac{2}{n^{2}}\sum_{i=1}^{n}W_{hi}(x_{i})\sigma^{2}(x_{i})w(x_{i})$, while the conditional expectation of the last term in (4.54) is the same quantity with a positive sign. Hence, the last two terms cancel each other out asymptotically and $G(h)$ is roughly equal to $\ase(h)$.

The following list presents a number of penalizing functions that satisfy the expansion $ \Xi (u)=1+2u+O(u^2),\quad u\to 0$:

(1)
Shibata's model selector (Shibata, 1981),

$\displaystyle \Xi _S(u)=1+2u; $

(2)
Generalized cross-validation (Craven and Wahba, 1979; Li, 1985),

$\displaystyle \Xi _{GCV}(u)=(1-u)^{-2}; $

(3)
Akaike's Information Criterion (Akaike, 1974),

$\displaystyle \Xi _{AIC}(u)=\exp(2u); $

(4)
Finite Prediction Error (Akaike, 1970),

$\displaystyle \Xi _{FPE}(u)=(1+u)/(1-u); $

(5)
Rice's $ T$ (Rice, 1984),

$\displaystyle \Xi _T(u)=(1-2u)^{-1}. $

To see how these various functions differ in the degree of penalizing small values of $ h$, consider Figure 4.14.

Figure: Penalizing functions $ \Xi(h^{-1})$ as a function of $ h$ (from left to right: $ S$, $ AIC$, $ FPE$, $ GCV$, $ T$)
\includegraphics[width=0.03\defepswidth]{quantlet.ps}SPMpenalize
\includegraphics[width=1.2\defpicwidth]{SPMpenalize.ps}

The functions differ in the relative weight they give to variance and bias of $ \widehat{m}_{h}(x)$. Rice's $ T$ gives the most weight to variance reduction while Shibata's model selector stresses bias reduction the most. The differences displayed in the graph are not substantial, however. If we denote the bandwidth minimizing $ G(h)$ with $ \widehat{h}$ and the minimizer of $ \ase(h)$ with $ \widehat{h}_{0}$ then for $ n\to\infty$

$\displaystyle \frac{\ase(\widehat{h})}{\ase(\widehat{h}_{0})} \mathrel{\mathop{\longrightarrow}\limits_{}^{P}} 1 \quad\textrm{and}\quad \frac{\widehat{h}}{\widehat{h}_{0}} \mathrel{\mathop{\longrightarrow}\limits_{}^{P}} 1.$

Thus, regardless of which specific penalizing function we use, $\widehat{h}$ approximates the $\ase$-minimizing bandwidth $\widehat{h}_{0}$ as the number of observations increases. Hence, choosing the bandwidth that minimizes $G(h)$ is another ``good'' rule for bandwidth selection in kernel regression estimation.
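
For the Nadaraya-Watson estimator the quantity $n^{-1}W_{hi}(X_{i})$ appearing in (4.52) equals $K_{h}(0)/\sum_{j}K_{h}(X_{i}-X_{j})$ (this identity is derived below), so $G(h)$ is cheap to evaluate. The following sketch (Python/NumPy; kernel, simulated data and grid are assumptions) computes $G(h)$ for the five penalizing functions listed above and reports the selected bandwidths.

\begin{verbatim}
import numpy as np

quartic = lambda u: np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

# the five penalizing functions; each satisfies Xi(u) = 1 + 2u + O(u^2)
XI = {
    "Shibata": lambda u: 1 + 2 * u,
    "GCV":     lambda u: (1 - u) ** -2,
    "AIC":     lambda u: np.exp(2 * u),
    "FPE":     lambda u: (1 + u) / (1 - u),
    "Rice":    lambda u: (1 - 2 * u) ** -1,
}

def G(X, Y, h, xi, w=None):
    """Penalized criterion G(h) of (4.52) for the Nadaraya-Watson estimator."""
    n = len(X)
    w = np.ones(n) if w is None else w
    K = quartic((X[:, None] - X[None, :]) / h) / h
    mh = K @ Y / K.sum(axis=1)                    # Nadaraya-Watson fit at the X_i
    u = np.diag(K) / K.sum(axis=1)                # n^{-1} W_hi(X_i) = K_h(0)/sum_j K_h(X_i-X_j)
    return np.mean((Y - mh) ** 2 * xi(u) * w)

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X**3) ** 3 + rng.normal(0, 0.1, 200)
grid = np.linspace(0.04, 0.30, 27)
for name, xi in XI.items():
    h_hat = grid[int(np.argmin([G(X, Y, h, xi) for h in grid]))]
    print(f"{name:8s} selected h = {h_hat:.3f}")
\end{verbatim}

In line with the asymptotic equivalence above, the five selected bandwidths typically differ only little for moderate sample sizes.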

Note that

$\displaystyle CV(h) = \frac{1}{n}\sum_{i=1}^n \{Y_{i}-\widehat{m}_{h,-i}(X_{i})\}^2 w(X_{i}) = \frac{1}{n}\sum_{i=1}^n \{Y_{i}-\widehat{m}_{h}(X_{i})\}^2 \left\{\frac{Y_{i}-\widehat{m}_{h,-i}(X_{i})}{Y_{i}-\widehat{m}_{h}(X_{i})}\right\}^2 w(X_{i})$

and
$\displaystyle \frac{Y_{i}-\widehat{m}_{h}(X_{i})}{Y_{i}-\widehat{m}_{h,-i}(X_{i})} = \frac{\sum_{j} K_{h}(X_{i}-X_{j})Y_{j}-Y_{i}\sum_{j} K_{h}(X_{i}-X_{j})}{\sum_{j\ne i} K_{h}(X_{i}-X_{j})Y_{j}-Y_{i}\sum_{j\ne i} K_{h}(X_{i}-X_{j})} \cdotp\frac{\sum_{j\ne i} K_{h}(X_{i}-X_{j})}{\sum_{j} K_{h}(X_{i}-X_{j})}$
$\displaystyle \phantom{\frac{Y_{i}-\widehat{m}_{h}(X_{i})}{Y_{i}-\widehat{m}_{h,-i}(X_{i})}} = 1\cdotp \left\{ 1- \frac{K_{h}(0)}{\sum_{j} K_{h}(X_{i}-X_{j})}\right\}= 1-\frac{1}{n} W_{hi}(X_{i}).$

Hence
$\displaystyle CV(h) = \frac{1}{n}\sum_{i=1}^n \{Y_{i}-\widehat{m}_{h}(X_{i})\}^2 \left\{1-\frac{1}{n} W_{hi}(X_{i})\right\}^{-2}w(X_{i}) = \frac{1}{n}\sum_{i=1}^n \{Y_{i}-\widehat{m}_{h}(X_{i})\}^2 \,\Xi_{GCV}\left(\frac{1}{n} W_{hi}(X_{i})\right)w(X_{i}),$

i.e. $CV(h)=G(h)$ with the penalizing function $\Xi_{GCV}$. An analogous result holds for local polynomial regression, see Härdle & Müller (2000). Therefore the cross-validation approach is equivalent to the penalizing functions concept and has the same asymptotic properties. (Note that this equivalence does not hold in general for other smoothing approaches.)
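
This identity can be checked numerically by computing $CV(h)$ once via the leave-one-out estimator and once as $G(h)$ with $\Xi_{GCV}$; both evaluations agree up to floating-point error. A short Python/NumPy sketch (kernel and simulated data are assumptions, with $w\equiv 1$ for brevity) follows.

\begin{verbatim}
import numpy as np

quartic = lambda u: np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

rng = np.random.default_rng(4)
n, h = 100, 0.12
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X**3) ** 3 + rng.normal(0, 0.1, n)

K = quartic((X[:, None] - X[None, :]) / h) / h     # K_h(X_i - X_j)

# CV(h) via the leave-one-out estimator, eq. (4.51)
K_loo = K.copy()
np.fill_diagonal(K_loo, 0.0)
cv = np.mean((Y - K_loo @ Y / K_loo.sum(axis=1)) ** 2)

# G(h) with Xi_GCV(u) = (1 - u)^{-2}, where u_i = n^{-1} W_hi(X_i) = K_h(0)/sum_j K_h(X_i - X_j)
mh = K @ Y / K.sum(axis=1)
u = np.diag(K) / K.sum(axis=1)
g_gcv = np.mean((Y - mh) ** 2 * (1 - u) ** -2)

print(cv, g_gcv)                                   # identical up to rounding
\end{verbatim}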