5.4 Statistics for Linear Smoothers: Bandwidth Selection and Inference

We also want to perform statistical inference based on the smoothers. As in parametric regression, we want to construct confidence bands and prediction intervals based on the smooth curve. Given a new car that weighs $ 2800$ pounds, what is its fuel economy? Tests of hypotheses can also be posed: for example, is the curvature observed in Fig. 5.2 significant, or would a linear regression be adequate? Given different classifications of car (compact, sporty, minivan, etc.), are there differences among the categories that cannot be explained by weight alone?

5.4.1 Choosing Smoothing Parameters

All smoothing methods have one or more smoothing parameters: parameters that control the `amount' of smoothing being performed; one example is the bandwidth $ h$ in the kernel and local regression estimates. Typically, bandwidth selection methods are based on an estimate of some goodness-of-fit criterion. Bandwidth selection is a special case of model selection, discussed more deeply in Chap. III.1.

How should smoothing parameters be chosen? At one extreme lies full automation: optimization of the goodness-of-fit criterion produces a single `best' bandwidth. At the other extreme lie purely exploratory and graphical methods, which use goodness-of-fit as a guide to help choose the best fit.

Automation has the advantage that it requires much less work; a computer can be programmed to perform the optimization. But the price is a lack of reliability: fits with very different bandwidths can produce similar values of the goodness-of-fit criterion. The result is either high variability (producing fits that look undersmoothed) or high bias (producing fits that miss obvious features in the data).

5.4.1.1 Cross Validation.

Cross validation (CV) focuses on the prediction problem: if the fitted regression curve is used to predict new observations, how good will the prediction be? If a new observation is made at $ x=x_0$, and the response $ Y_0$ is predicted by $ \widehat{Y}_0 = \hat{\mu}(x_0)$, what is the prediction error? One measure is

$\displaystyle E( (Y_0 - \widehat{Y}_0)^2 ){}.$    

The method of CV can be used to estimate this quantity. In turn, each observation $ (x_i,Y_i)$ is omitted from the dataset, and is `predicted' by smoothing the remaining $ n-1$ observations. This leads to the CV score

$\displaystyle {\mathrm{CV}}(\hat{\mu}) = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{\mu}_{-i}(x_i))^2{},$ (5.17)

where $ \hat{\mu}_{-i}(\cdot)$ denotes the smoothed estimate when the single data point $ (x_i,Y_i)$ is omitted from the dataset; only the remaining $ n-1$ data points are used to compute the estimate.

Formally computing each of the leave-one-out regression estimates $ \hat{\mu}_{-i}(\cdot)$ would require $ n$ separate fits, so at first glance computation of the CV score (5.17) looks prohibitively expensive. But there is a remarkable simplification, valid for nearly all common linear smoothers (and all those discussed in Sect. 5.2):

$\displaystyle \hat{\mu}_{-i}(x_i) = \frac{\hat{\mu}(x_i) - l_i(x_i) Y_i}{1-l_i(x_i)}{}.$    

With this simplification, the CV criterion becomes

$\displaystyle {\mathrm{CV}}(\hat{\mu}) = \frac{1}{n} \sum_{i=1}^n \frac{ (Y_i - \hat{\mu}(x_i))^2 }{ (1-l_i(x_i))^2 }{}.$    
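
Because the shortcut only requires the diagonal influence values $ l_i(x_i)$, the CV score can be obtained from a single fit. The following Python sketch (illustrative only: the simulated data and the Nadaraya-Watson smoother with a Gaussian kernel are assumptions, not the chapter's fuel economy example) verifies the leave-one-out identity numerically and evaluates (5.17) via the shortcut.

\begin{verbatim}
import numpy as np

def kernel_weights(x0, x, h):
    # Row l(x0) of the smoother matrix for a kernel (Nadaraya-Watson)
    # estimate with a Gaussian kernel of bandwidth h.
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return w / w.sum()

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(0, 10, n))
y = np.sin(x) + rng.normal(scale=0.3, size=n)
h = 0.8

L = np.array([kernel_weights(xi, x, h) for xi in x])   # smoother matrix
fit = L @ y
lii = np.diag(L)

# Direct leave-one-out fit at x_i, omitting observation i.
loo_direct = np.array([
    kernel_weights(x[i], np.delete(x, i), h) @ np.delete(y, i)
    for i in range(n)
])
loo_shortcut = (fit - lii * y) / (1 - lii)
print(np.allclose(loo_direct, loo_shortcut))   # True for this smoother

cv = np.mean(((y - fit) / (1 - lii)) ** 2)     # CV score (5.17)
print(cv)
\end{verbatim}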

Generalized cross validation (GCV) replaces each of the influence values $ l_i(x_i)$ by the average, $ \nu_1 / n$. This leads to

$\displaystyle {\mathrm{GCV}}(\hat{\mu}) = n \frac{ \sum_{i=1}^n (Y_i - \hat{\mu}(x_i))^2 }{ (n - \nu_1)^2 }{}.$    

Figure 5.4: GCV scores for the fuel economy dataset. Points marked 0 are for kernel smoothers with a range of bandwidths $ h$, and points marked $ 1$ are for a local linear smoother
\includegraphics[width=8cm]{text/3-5/lrfig3.eps}

Figure 5.4 shows the GCV scores for the fuel economy dataset, using kernel and local linear smoothers with a range of bandwidths. Note the construction of the plot: the fitted degrees of freedom $ \nu_1$ are used as the $ x$-axis. This allows us to meaningfully superimpose and compare the GCV curves arising from different smoothing methods. From right to left, the points marked `0' represent a kernel smoother with $ h=300, 400, 500, 600, 800$ and $ 1000$, and the points marked `$ 1$' represent a local linear smoother with $ h=400, 500, 700, 1000, 1500, 2000$ and $ \infty $.
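
The construction of such a plot can be mimicked as follows. This sketch uses simulated data and simple kernel and local linear smoother matrices (assumptions for illustration, not the fuel economy data), tabulating GCV against the fitted degrees of freedom $ \nu_1$ so that the two smoothers can be compared on a common axis.

\begin{verbatim}
import numpy as np

def nw_matrix(x, h):
    # Kernel (Nadaraya-Watson) smoother matrix, Gaussian kernel.
    d = (x[:, None] - x[None, :]) / h
    W = np.exp(-0.5 * d ** 2)
    return W / W.sum(axis=1, keepdims=True)

def local_linear_matrix(x, h):
    # Local linear smoother matrix with Gaussian kernel weights.
    n = len(x)
    L = np.zeros((n, n))
    for i, x0 in enumerate(x):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)
        X = np.column_stack([np.ones(n), x - x0])
        XtW = X.T * w
        L[i] = np.linalg.solve(XtW @ X, XtW)[0]
    return L

def gcv_df(y, L):
    n = len(y)
    nu1 = np.trace(L)                       # fitted degrees of freedom
    gcv = n * np.sum((y - L @ y) ** 2) / (n - nu1) ** 2
    return nu1, gcv

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 60))
y = np.sin(x) + rng.normal(scale=0.3, size=len(x))

# Tabulate (df, GCV) pairs for both smoothers over a bandwidth grid.
for h in (0.4, 0.6, 0.9, 1.4, 2.0):
    for name, builder in (("kernel", nw_matrix),
                          ("local linear", local_linear_matrix)):
        nu1, gcv = gcv_df(y, builder(x, h))
        print(f"{name:12s} h={h:3.1f}  df={nu1:5.2f}  GCV={gcv:7.4f}")
\end{verbatim}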

The interpretation of Fig. 5.4 is that for any fixed degrees of freedom, the local linear fit outperforms the kernel fit. The best fits obtained are the local linear fits with $ 3$ to $ 3.5$ degrees of freedom, or $ h$ between $ 1000$ and $ 1500$.

5.4.1.2 Unbiased Risk Estimation.

A risk function measures the distance between the true regression function and the estimate; for example,

$\displaystyle R(\mu,\hat{\mu}) = \frac{1}{\sigma^2} \sum_{i=1}^n E \left( (\hat{\mu}(x_i) - \mu(x_i))^2 \right){}.$ (5.18)

Ideally, a good estimate would be one with low risk. But since $ \mu$ is unknown, $ R(\mu,\hat{\mu})$ cannot be evaluated directly.

Instead, the risk must be estimated. An unbiased estimate is

$\displaystyle \hat{R}(\mu,\hat{\mu}) = \frac{1}{\sigma^2} \sum_{i=1}^n (Y_i - \hat{\mu}(x_i))^2 - n + 2 \nu_1$    

([22,5]). The unbiased risk estimate is equivalent to Akaike's Information Criterion ([1], [2]). To implement the unbiased risk estimate one needs to substitute an estimate for $ \sigma ^2$; Cleveland and Devlin recommend using (5.16) with a small bandwidth.

The unbiased risk estimate can be used similarly to GCV. One computes $ \hat{R}(\mu,\hat{\mu})$ for a range of different fits $ \hat{\mu}$, and plots the resulting risk estimates versus the degrees of freedom. Fits producing a small risk estimate are considered best.
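
A sketch of this recipe in Python (illustrative only: the simulated data and the kernel smoother are assumptions, and the variance estimate is taken to be a residual sum of squares normalized by $ n - 2\nu_1 + \nu_2$ from a small-bandwidth pilot fit, in the spirit of (5.16)):

\begin{verbatim}
import numpy as np

def smoother_matrix(x, h):
    # Nadaraya-Watson smoother matrix, Gaussian kernel of bandwidth h.
    d = (x[:, None] - x[None, :]) / h
    W = np.exp(-0.5 * d ** 2)
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 60))
y = np.sin(x) + rng.normal(scale=0.3, size=len(x))
n = len(y)

# Variance estimate from a small-bandwidth pilot fit.
Lp = smoother_matrix(x, 0.2)
rp = y - Lp @ y
sigma2 = np.sum(rp ** 2) / (n - 2 * np.trace(Lp) + np.trace(Lp.T @ Lp))

# Unbiased risk estimate for a range of bandwidths.
for h in (0.3, 0.5, 0.8, 1.2, 2.0):
    L = smoother_matrix(x, h)
    resid = y - L @ y
    nu1 = np.trace(L)
    risk = np.sum(resid ** 2) / sigma2 - n + 2 * nu1
    print(f"h={h:4.1f}  df={nu1:5.2f}  risk estimate={risk:7.2f}")
\end{verbatim}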

5.4.1.3 Bias Estimation and Plug-in Methods.

An entirely different class of bandwidth selection methods, often termed plug-in methods, attempts to estimate a risk measure directly by estimating the bias and variance. These methods have been developed mostly in the context of kernel density estimation, but adaptations to kernel regression and local polynomial regression can be found in [7] and [24].

Again focusing on the squared-error risk, we have the bias-variance decomposition

$\displaystyle \sigma^2 R(\mu,\hat{\mu}) = \sum_{i=1}^n {\mathrm{bias}}(\hat{\mu}(x_i))^2 + \sum_{i=1}^n {\mathrm{Var}}(\hat{\mu}(x_i)) = \sum_{i=1}^n \left( \sum_{j=1}^n l_j(x_i) \mu(x_j) - \mu(x_i)\right)^2 + \sigma^2 \sum_{i=1}^n \Vert l(x_i)\Vert^2{}.$ (5.19)

A plug-in estimate begins by constructing a preliminary pilot estimate of the mean function $ \mu(\cdot)$. This pilot estimate is then substituted into (5.19), and the resulting risk estimate is minimized over the bandwidth $ h$.
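
The following Python sketch illustrates this recipe directly on (5.19) (the simulated data, the kernel smoother, and the pilot bandwidth are assumptions made here for illustration): the pilot fit stands in for $ \mu$, and the plug-in risk is scanned over a bandwidth grid.

\begin{verbatim}
import numpy as np

def smoother_matrix(x, h):
    # Nadaraya-Watson smoother matrix, Gaussian kernel of bandwidth h.
    d = (x[:, None] - x[None, :]) / h
    W = np.exp(-0.5 * d ** 2)
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 60))
y = np.sin(x) + rng.normal(scale=0.3, size=len(x))
n = len(y)

# Pilot estimates of mu and sigma^2 from a small bandwidth.
Lp = smoother_matrix(x, 0.3)
mu_pilot = Lp @ y
rp = y - mu_pilot
sigma2 = np.sum(rp ** 2) / (n - 2 * np.trace(Lp) + np.trace(Lp.T @ Lp))

def plugin_risk(h):
    # Risk (5.19) with the pilot estimate plugged in for mu.
    L = smoother_matrix(x, h)
    bias2 = np.sum((L @ mu_pilot - mu_pilot) ** 2)
    var = sigma2 * np.sum(L ** 2)        # sigma^2 * sum_i ||l(x_i)||^2
    return bias2 + var

grid = np.linspace(0.2, 2.0, 50)
risks = [plugin_risk(h) for h in grid]
print("plug-in bandwidth:", grid[int(np.argmin(risks))])
\end{verbatim}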

There are many variants of the plug-in idea in the statistics literature. Most simplify the risk function using asymptotic approximations such as (5.13) and (5.15) for the bias and variance; making these substitutions in (5.19) gives

$\displaystyle \sigma^2 R(\mu,\hat{\mu}) \approx h^4 \left( \frac{ \int v^2 W(v){\mathrm{d}}v}{2\int W(v){\mathrm{d}}v} \right)^2 \sum_{i=1}^n \mu''(x_i)^2 + \frac{\sigma^2}{nh} \frac{ \int W(v)^2 {\mathrm{d}}v}{ \left(\int W(v){\mathrm{d}}v\right)^2 } \sum_{i=1}^n \frac{1}{f(x_i)}{}.$

If the design points are uniformly distributed on an interval $ [a,b]$ say, then approximating the sums by integrals gives

$\displaystyle \sigma^2 R(\mu,\hat{\mu}) \approx n h^4 \left( \frac{ \int v^2 W(v){\mathrm{d}}v}{2\int W(v){\mathrm{d}}v} \right)^2 \frac{1}{b-a} \int_a^b \mu''(x)^2 {\mathrm{d}}x + \frac{\sigma^2 (b-a)}{h} \frac{ \int W(v)^2 {\mathrm{d}}v}{ \left(\int W(v){\mathrm{d}}v\right)^2 }{}.$

Minimizing this expression over $ h$ yields an asymptotically optimal bandwidth:

$\displaystyle h_{{\mathrm{opt}}}^5 = \frac{\sigma^2 (b-a)^2 \int W(v)^2 {\mathrm{d}}v }{ n \left(\int v^2 W(v){\mathrm{d}}v\right)^2 \int_a^b \mu''(x)^2 {\mathrm{d}}x }{}.$

Evaluation of $ h_{{\text{opt}}}$ requires substitution of estimates for $ \int_a^b \mu''(x)^2 {\mathrm{d}}x$ and for $ \sigma ^2$. The estimate (5.16) can be used to estimate $ \sigma ^2$, but estimating $ \int_a^b \mu''(x)^2 {\mathrm{d}}x$ is more problematic. One technique is to estimate the second derivative using a `pilot' estimate of the smooth, and then use the estimate

$\displaystyle \int_a^b \hat \mu''(x)^2 {\mathrm{d}}x{}.$    

If a local quadratic estimate is used at the pilot stage, the curvature coefficient $ \hat{a}_2$ can be used as an estimate of $ \mu''(x)$.
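
As a rough illustration of this recipe, the following Python sketch evaluates the asymptotically optimal bandwidth from the formula above on simulated data. The Gaussian kernel, the pilot bandwidth, and the crude numerical second derivative (used here in place of the local quadratic curvature coefficient) are all assumptions made for illustration.

\begin{verbatim}
import numpy as np

def smoother_matrix(x, h):
    # Nadaraya-Watson smoother matrix, Gaussian kernel of bandwidth h.
    d = (x[:, None] - x[None, :]) / h
    W = np.exp(-0.5 * d ** 2)
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(4)
a, b, n = 0.0, 10.0, 200
x = np.linspace(a, b, n)                 # roughly uniform design
y = np.sin(x) + rng.normal(scale=0.3, size=n)

# Pilot fit: estimate mu, sigma^2 and the second derivative numerically.
Lp = smoother_matrix(x, 0.6)
mu_pilot = Lp @ y
rp = y - mu_pilot
sigma2 = np.sum(rp ** 2) / (n - 2 * np.trace(Lp) + np.trace(Lp.T @ Lp))
mu2 = np.gradient(np.gradient(mu_pilot, x), x)    # crude mu''(x) estimate
int_mu2_sq = np.sum(mu2 ** 2) * (b - a) / n       # approx. integral of mu''^2

# Kernel constants for W(v) = exp(-v^2 / 2).
int_W2 = np.sqrt(np.pi)          # integral of W(v)^2
int_v2W = np.sqrt(2 * np.pi)     # integral of v^2 W(v)

h_opt = (sigma2 * (b - a) ** 2 * int_W2 /
         (n * int_v2W ** 2 * int_mu2_sq)) ** 0.2
print("asymptotically optimal plug-in bandwidth:", h_opt)
\end{verbatim}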

But the use of a pilot estimate of the second derivative is itself problematic. The pilot estimate has its own bandwidth that must be selected, and the estimated optimal bandwidth $ \hat{h}_{{\text{opt}}}$ is highly sensitive to this choice. Roughly, if the pilot estimate smooths out important features of $ \mu$, so will the estimate $ \hat{\mu}$ with bandwidth $ \hat{h}_{{\text{opt}}}$. More discussion of this point may be found in [20].

5.4.2 Normal-based Inference

Inferential procedures for smoothers include the construction of confidence bands for the true mean function, and procedures to test the adequacy of simpler models. In this section, some of the main ideas are briefly introduced; more extensive discussion can be found in the books [3], [11], [12] and [21].

5.4.2.1 Confidence Intervals.

If the errors $ \epsilon_i$ are normally distributed, then confidence intervals for the true mean can be constructed as

$\displaystyle \hat{\mu}(x) \pm c \hat{\sigma} \Vert l(x)\Vert{}.$    

The constant $ c$ can be chosen from the Student's $ t$ distribution with degrees of freedom equal to $ n- 2\nu_1 + \nu_2$ (alternative choices are discussed below in the context of testing). These confidence intervals are pointwise intervals for $ E(\hat{\mu}(x))$:

$\displaystyle P\left( \vert \hat{\mu}(x) - E(\hat{\mu}(x)) \vert < c \hat{\sigma} \Vert l(x)\Vert \right) = 1-\alpha{}.$    

To construct confidence intervals for $ \mu(x)$, one must either choose the bandwidth sufficiently small so that the bias can be ignored, or explicitly estimate the bias. The latter approach suffers from the same weaknesses observed in plug-in bandwidth selection.
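
A minimal sketch of the interval computation, assuming simulated data and a kernel smoother in place of the local fits used in the chapter (scipy supplies the $ t$ quantile):

\begin{verbatim}
import numpy as np
from scipy import stats

def smoother_matrix(x, h):
    # Nadaraya-Watson smoother matrix, Gaussian kernel of bandwidth h.
    d = (x[:, None] - x[None, :]) / h
    W = np.exp(-0.5 * d ** 2)
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, 80))
y = np.sin(x) + rng.normal(scale=0.3, size=len(x))
n = len(y)

L = smoother_matrix(x, 0.8)
fit = L @ y
nu1, nu2 = np.trace(L), np.trace(L.T @ L)
sigma_hat = np.sqrt(np.sum((y - fit) ** 2) / (n - 2 * nu1 + nu2))

# Pointwise 95% intervals for E(mu_hat(x_i)).
c = stats.t.ppf(0.975, df=n - 2 * nu1 + nu2)
half_width = c * sigma_hat * np.linalg.norm(L, axis=1)   # c * sigma * ||l(x_i)||
lower, upper = fit - half_width, fit + half_width
\end{verbatim}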

5.4.2.2 Tests of Hypothesis.

Consider the problem of testing the adequacy of a linear model. For example, in the fuel economy dataset of Figs. 5.1 and 5.2, one may be interested in knowing whether a linear regression, $ \mu(x) = a+bx$, is adequate, or alternatively whether the departure from linearity indicated by the smooth is significant. This hypothesis testing problem can be stated as

$\displaystyle H_0 : \mu(x) = a+bx\quad \mathrm{for\;some}\quad a,b$    
$\displaystyle H_1 : \mathrm{otherwise}{}.$    

In analogy with the theory of linear models, an $ F$ ratio can be formed by fitting both the null and alternative models, and considering the difference between the fits. Under the null model, parametric least squares is used; the corresponding fitted values are $ \boldsymbol{M} Y$ where $ \boldsymbol{M}$ is the hat matrix for the least squares fit. Under the alternative model, the fitted values are $ \boldsymbol{H}Y$, where $ \boldsymbol{H}$ is the hat matrix for a local linear regression. An $ F$ ratio can then be formed as

$\displaystyle F = \frac{ \Vert \boldsymbol{H}Y - \boldsymbol{M} Y\Vert^2 / \nu }{ \hat{\sigma}^2 }{},$    

where $ \nu = {\mathrm{trace}}{ (\boldsymbol{H}-\boldsymbol{M})^{\top}(\boldsymbol{H}-\boldsymbol{M}) }$.

What is the distribution of $ F$ when $ H_0$ is true? Since $ \boldsymbol{H}$ is not a perpendicular projection operator, the numerator does not have a $ \chi^2$ distribution, and $ F$ does not have an exact $ F$ distribution. Nonetheless, we can use an approximating $ F$ distribution. Based on a one-moment approximation, the degrees of freedom are $ \nu $ and $ n- 2\nu_1 + \nu_2$.

Better approximations are obtained using the two-moment Satterthwaite approximation, as described in [5]. This method matches both the mean and variance of chi-square approximations to the numerator and denominator. Letting $ \Lambda = (\boldsymbol{H}-\boldsymbol{M})^{\top}(\boldsymbol{H}-\boldsymbol{M})$, the numerator degrees of freedom for the $ F$ distribution are $ ({\mathrm{trace}}\,\Lambda)^2 / {\mathrm{trace}}(\Lambda^2)$. A similar adjustment is made to the denominator degrees of freedom. Simulations reported in [5] suggest the two-moment approximation is adequate for setting critical values.
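
A sketch of the whole test on simulated data follows (the data, bandwidth and Gaussian kernel are illustrative assumptions; scipy's $ F$ distribution supplies the $ p$-values). It forms the hat matrices $ \boldsymbol{M}$ and $ \boldsymbol{H}$, computes the $ F$ ratio, and evaluates both the one-moment and two-moment approximations.

\begin{verbatim}
import numpy as np
from scipy import stats

def local_linear_matrix(x, h):
    # Hat matrix H for a local linear fit with Gaussian kernel weights.
    n = len(x)
    H = np.zeros((n, n))
    for i, x0 in enumerate(x):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)
        X = np.column_stack([np.ones(n), x - x0])
        XtW = X.T * w
        H[i] = np.linalg.solve(XtW @ X, XtW)[0]
    return H

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, 60))
y = 0.5 * x + 0.8 * np.sin(x) + rng.normal(scale=0.3, size=len(x))
n = len(y)

# Hat matrices under H0 (linear least squares) and H1 (local linear).
Xl = np.column_stack([np.ones(n), x])
M = Xl @ np.linalg.solve(Xl.T @ Xl, Xl.T)
H = local_linear_matrix(x, 1.0)
R = np.eye(n) - H                         # residual operator under H1

nu1, nu2 = np.trace(H), np.trace(H.T @ H)
sigma2 = np.sum((R @ y) ** 2) / (n - 2 * nu1 + nu2)

Lam = (H - M).T @ (H - M)
nu = np.trace(Lam)
F = np.sum((H @ y - M @ y) ** 2) / nu / sigma2

# One-moment approximation: F(nu, n - 2*nu1 + nu2).
p1 = stats.f.sf(F, nu, n - 2 * nu1 + nu2)

# Two-moment (Satterthwaite) approximation: adjust both degrees of freedom.
Delta = R.T @ R
df1 = np.trace(Lam) ** 2 / np.trace(Lam @ Lam)
df2 = np.trace(Delta) ** 2 / np.trace(Delta @ Delta)
p2 = stats.f.sf(F, df1, df2)
print(F, p1, p2)
\end{verbatim}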

For the fuel economy dataset, we obtain $ F=7.247$, $ \nu=1.0866$ and $ n-2\nu_1+\nu_2 = 55.997$. Using the one-moment approximation, the $ p$-value is $ 0.0079$; the two-moment approximation gives a $ p$-value of $ 0.0019$. Both methods indicate that the nonlinearity is significant, although there is some discrepancy between the $ p$-values.

5.4.3 Bootstrapping

The $ F$-tests in the previous section are approximate, even when the errors $ \epsilon_i$ are normally distributed. Additionally, the degrees-of-freedom computations (particularly for the two-moment approximation) require $ O(n^3)$ operations, which is prohibitively expensive when $ n$ is more than a few hundred.

An alternative to the $ F$ approximations is to simulate the null distribution of the $ F$ ratio. A bootstrap method (Chap. III.2) performs the simulations using the empirical residuals to approximate the true error distribution: bootstrap responses $ Y^{\ast}$ are generated by adding resampled residuals to the fitted values under the null model, and the $ F$ ratio is recomputed for each bootstrap sample to give a replicate $ F^{\ast}$.

This procedure is repeated a large number of times (say $ B=1000$) and tabulation of the resulting $ F^{\ast}$ values provides an estimate of the true distribution of the $ F$  ratio.

Remark. Since the degrees of freedom do not change with the replication, there is no need to actually compute the normalizing constant. Instead, one can simply work with the modified $ F$ ratio,

$\displaystyle F_B = \frac{ \Vert\boldsymbol{H}Y^{\ast} - \boldsymbol{M}Y^{\ast}\Vert^2 }{ \Vert(I-\boldsymbol{H})Y^{\ast}\Vert^2 }{}.$    
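
The following Python sketch implements one common residual-bootstrap scheme for the modified ratio $ F_B$ on simulated data. The choice of resampling centred residuals from the local fit, as well as the data and bandwidth, are assumptions made here for illustration rather than details taken from the chapter.

\begin{verbatim}
import numpy as np

def local_linear_matrix(x, h):
    # Hat matrix H for a local linear fit with Gaussian kernel weights.
    n = len(x)
    H = np.zeros((n, n))
    for i, x0 in enumerate(x):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)
        X = np.column_stack([np.ones(n), x - x0])
        XtW = X.T * w
        H[i] = np.linalg.solve(XtW @ X, XtW)[0]
    return H

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 10, 60))
y = 0.5 * x + 0.8 * np.sin(x) + rng.normal(scale=0.3, size=len(x))
n = len(y)

Xl = np.column_stack([np.ones(n), x])
M = Xl @ np.linalg.solve(Xl.T @ Xl, Xl.T)
H = local_linear_matrix(x, 1.0)
I = np.eye(n)

def f_ratio(v):
    # Modified F ratio F_B; the normalizing constant cancels across replicates.
    return np.sum((H @ v - M @ v) ** 2) / np.sum(((I - H) @ v) ** 2)

F_obs = f_ratio(y)

# Residual bootstrap under H0: resample centred residuals from the local
# fit and add them to the null-model fitted values.
resid = y - H @ y
resid = resid - resid.mean()
null_fit = M @ y
B = 1000
F_star = np.empty(B)
for b in range(B):
    y_star = null_fit + rng.choice(resid, size=n, replace=True)
    F_star[b] = f_ratio(y_star)

p_boot = np.mean(F_star >= F_obs)
print("bootstrap p-value:", p_boot)
\end{verbatim}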

Figure 5.5: Estimated density of the $ F$ ratio, based on the bootstrap method (solid line); 1-moment $ F$ approximation (short dashed line) and 2-moment $ F$ approximation (long dashed line)
\includegraphics[width=7.7cm]{text/3-5/lrfigb.eps}

Figure 5.5 compares the bootstrap distribution of the $ F$ ratio with the one- and two-moment $ F$ approximations for the fuel economy dataset. The bootstrap method uses $ 10{,}000$ bootstrap replications, and the density is estimated using the local likelihood method (Sect. 5.5.2 below). Except at the left end-point, there is generally good agreement between the bootstrap density and the two-moment density. The upper $ 5\,{\%}$ quantiles are 3.21 based on the two-moment approximation and 3.30 based on the bootstrap sample; the one-moment approximation has a critical value of 3.90. Based on the observed $ F=7.248$, the bootstrap $ p$-value is 0.0023, again in close agreement with the two-moment method.

