5.4 Statistics for Linear Smoothers: Bandwidth Selection and Inference

We also want to perform statistical inference based on the smoothers. As in parametric regression, we want to construct confidence bands and prediction intervals based on the smooth curve. Given a new car that weighs $ 2800$ pounds, what is its fuel economy? Tests of hypotheses can also be posed: for example, is the curvature observed in Fig. 5.2 significant, or would a linear regression be adequate? Given different classifications of car (compact, sporty, minivan, etc.), are there differences among the categories that cannot be explained by weight alone?

5.4.1 Choosing Smoothing Parameters

All smoothing methods have one or more smoothing parameters: parameters that control the `amount' of smoothing being performed; one example is the bandwidth $ h$ in the kernel and local regression estimates. Typically, bandwidth selection methods are based on an estimate of some goodness-of-fit criterion. Bandwidth selection is a special case of model selection, discussed more deeply in Chap. III.1.

How should smoothing parameters be chosen? At one extreme lies full automation: optimization of the goodness-of-fit criterion produces a single `best' bandwidth. At the other extreme lie purely exploratory and graphical methods, which use goodness-of-fit as a guide to help choose the best fit.

Automation has the advantage that it requires much less work; a computer can be programmed to perform the optimization. But the price is a lack of reliability: fits with very different bandwidths can produce similar values of the goodness-of-fit criterion. The result is either high variability (producing fits that look undersmoothed) or high bias (producing fits that miss obvious features in the data).

5.4.1.1 Cross Validation.

Cross validation (CV) focuses on the prediction problem: if the fitted regression curve is used to predict new observations, how good will the prediction be? If a new observation is made at $ x=x_0$, and the response $ Y_0$ is predicted by $ \widehat{Y}_0 = \hat{\mu}(x_0)$, what is the prediction error? One measure is

$\displaystyle E( (Y_0 - \widehat{Y}_0)^2 ){}.$    

The method of CV can be used to estimate this quantity. In turn, each observation $ (x_i,Y_i)$ is omitted from the dataset, and is `predicted' by smoothing the remaining $ n-1$ observations. This leads to the CV score

$\displaystyle {\mathrm{CV}}(\hat{\mu}) = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{\mu}_{-i}(x_i))^2{},$ (5.17)

where $ \hat{\mu}_{-i}(\cdot)$ denotes the smoothed estimate when the single data point $ (x_i,Y_i)$ is omitted from the dataset; only the remaining $ n-1$ data points are used to compute the estimate.

Formally computing each of the leave-one-out regression estimates $ \hat{\mu}_{-i}(\cdot)$ would require $ n$ separate fits, so at first glance computation of the CV score (5.17) looks prohibitively expensive. But there is a remarkable simplification, valid for nearly all common linear smoothers (and all those discussed in Sect. 5.2):

$\displaystyle \hat{\mu}_{-i}(x_i) = \frac{\hat{\mu}(x_i) - l_i(x_i) Y_i}{1-l_i(x_i)}{}.$    

With this simplification, the CV criterion becomes

$\displaystyle {\mathrm{CV}}(\hat{\mu}) = \frac{1}{n} \sum_{i=1}^n \frac{ (Y_i - \hat{\mu}(x_i))^2 }{ (1-l_i(x_i))^2 }{}.$    
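
Because the shortcut only requires the diagonal influence values $ l_i(x_i)$, the CV score can be obtained from a single fit. The following Python sketch (illustrative only: the simulated data and the Nadaraya-Watson smoother with a Gaussian kernel are assumptions, not the chapter's fuel economy example) verifies the leave-one-out identity numerically and evaluates (5.17) via the shortcut.

\begin{verbatim}
import numpy as np

def kernel_weights(x0, x, h):
    # Row l(x0) of the smoother matrix for a kernel (Nadaraya-Watson)
    # estimate with a Gaussian kernel of bandwidth h.
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return w / w.sum()

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(0, 10, n))
y = np.sin(x) + rng.normal(scale=0.3, size=n)
h = 0.8

L = np.array([kernel_weights(xi, x, h) for xi in x])   # smoother matrix
fit = L @ y
lii = np.diag(L)

# Direct leave-one-out fit at x_i, omitting observation i.
loo_direct = np.array([
    kernel_weights(x[i], np.delete(x, i), h) @ np.delete(y, i)
    for i in range(n)
])
loo_shortcut = (fit - lii * y) / (1 - lii)
print(np.allclose(loo_direct, loo_shortcut))   # True for this smoother

cv = np.mean(((y - fit) / (1 - lii)) ** 2)     # CV score (5.17)
print(cv)
\end{verbatim}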

Generalized cross validation (GCV) replaces each of the influence values $ l_i(x_i)$ by the average, $ \nu_1 / n$. This leads to

$\displaystyle {\mathrm{GCV}}(\hat{\mu}) = n \frac{ \sum_{i=1}^n (Y_i - \hat{\mu}(x_i))^2 }{ (n - \nu_1)^2 }{}.$    

Figure 5.4: GCV scores for the fuel economy dataset. Points marked 0 are for kernel smoothers with a range of bandwidths $ h$, and points marked $ 1$ are for a local linear smoother
\includegraphics[width=8cm]{text/3-5/lrfig3.eps}

Figure 5.4 shows the GCV scores for the fuel economy dataset, using kernel and local linear smoothers with a range of bandwidths. Note the construction of the plot: the fitted degrees of freedom $ \nu_1$ are used as the $ x$-axis. This allows us to meaningfully superimpose and compare the GCV curves arising from different smoothing methods. From right to left, the points marked `0' represent a kernel smoother with $ h=300, 400, 500, 600, 800$ and $ 1000$, and the points marked `$ 1$' represent a local linear smoother with $ h=400, 500, 700, 1000, 1500, 2000$ and $ \infty $.
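
The construction of such a plot can be mimicked as follows. This sketch uses simulated data and simple kernel and local linear smoother matrices (assumptions for illustration, not the fuel economy data), tabulating GCV against the fitted degrees of freedom $ \nu_1$ so that the two smoothers can be compared on a common axis.

\begin{verbatim}
import numpy as np

def nw_matrix(x, h):
    # Kernel (Nadaraya-Watson) smoother matrix, Gaussian kernel.
    d = (x[:, None] - x[None, :]) / h
    W = np.exp(-0.5 * d ** 2)
    return W / W.sum(axis=1, keepdims=True)

def local_linear_matrix(x, h):
    # Local linear smoother matrix with Gaussian kernel weights.
    n = len(x)
    L = np.zeros((n, n))
    for i, x0 in enumerate(x):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)
        X = np.column_stack([np.ones(n), x - x0])
        XtW = X.T * w
        L[i] = np.linalg.solve(XtW @ X, XtW)[0]
    return L

def gcv_df(y, L):
    n = len(y)
    nu1 = np.trace(L)                       # fitted degrees of freedom
    gcv = n * np.sum((y - L @ y) ** 2) / (n - nu1) ** 2
    return nu1, gcv

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 60))
y = np.sin(x) + rng.normal(scale=0.3, size=len(x))

# Tabulate (df, GCV) pairs for both smoothers over a bandwidth grid.
for h in (0.4, 0.6, 0.9, 1.4, 2.0):
    for name, builder in (("kernel", nw_matrix),
                          ("local linear", local_linear_matrix)):
        nu1, gcv = gcv_df(y, builder(x, h))
        print(f"{name:12s} h={h:3.1f}  df={nu1:5.2f}  GCV={gcv:7.4f}")
\end{verbatim}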

The interpretation of Fig. 5.4 is that for any fixed degrees of freedom, the local linear fit outperforms the kernel fit. The best fits obtained are the local linear fits with $ 3$ to $ 3.5$ degrees of freedom, or $ h$ between $ 1000$ and $ 1500$.

5.4.1.2 Unbiased Risk Estimation.

A risk function measures the distance between the true regression function and the estimate; for example,

$\displaystyle R(\mu,\hat{\mu}) = \frac{1}{\sigma^2} \sum_{i=1}^n E \left( (\hat{\mu}(x_i) - \mu(x_i))^2 \right){}.$ (5.18)

Ideally, a good estimate would be one with low risk. But since $ \mu$ is unknown, $ R(\mu,\hat{\mu})$ cannot be evaluated directly.

Instead, the risk must be estimated. An unbiased estimate is

$\displaystyle \hat{R}(\mu,\hat{\mu}) = \frac{1}{\sigma^2} \sum_{i=1}^n (Y_i - \hat{\mu}(x_i))^2 - n + 2 \nu_1$    

([22,5]). The unbiased risk estimate is equivalent to Akaike's Information Criterion ([1], [2]). To implement the unbiased risk estimate one needs to substitute an estimate for $ \sigma ^2$; Cleveland and Devlin recommend using (5.16) with a small bandwidth.

The unbiased risk estimate can be used similarly to GCV. One computes $ \hat{R}(\mu,\hat{\mu})$ for a range of different fits $ \hat{\mu}$, and plots the resulting risk estimates versus the degrees of freedom. Fits producing a small risk estimate are considered best.
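
A sketch of this recipe in Python (illustrative only: the simulated data and the kernel smoother are assumptions, and the variance estimate is taken to be a residual sum of squares normalized by $ n - 2\nu_1 + \nu_2$ from a small-bandwidth pilot fit, in the spirit of (5.16)):

\begin{verbatim}
import numpy as np

def smoother_matrix(x, h):
    # Nadaraya-Watson smoother matrix, Gaussian kernel of bandwidth h.
    d = (x[:, None] - x[None, :]) / h
    W = np.exp(-0.5 * d ** 2)
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 60))
y = np.sin(x) + rng.normal(scale=0.3, size=len(x))
n = len(y)

# Variance estimate from a small-bandwidth pilot fit.
Lp = smoother_matrix(x, 0.2)
rp = y - Lp @ y
sigma2 = np.sum(rp ** 2) / (n - 2 * np.trace(Lp) + np.trace(Lp.T @ Lp))

# Unbiased risk estimate for a range of bandwidths.
for h in (0.3, 0.5, 0.8, 1.2, 2.0):
    L = smoother_matrix(x, h)
    resid = y - L @ y
    nu1 = np.trace(L)
    risk = np.sum(resid ** 2) / sigma2 - n + 2 * nu1
    print(f"h={h:4.1f}  df={nu1:5.2f}  risk estimate={risk:7.2f}")
\end{verbatim}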

5.4.1.3 Bias Estimation and Plug-in Methods.

An entirely different class of bandwidth selection methods, often termed plug-in methods, attempts to estimate a risk measure directly by estimating the bias and variance. These methods have been developed mostly in the context of kernel density estimation, but adaptations to kernel regression and local polynomial regression can be found in [7] and [24].

Again focusing on the squared-error risk, we have the bias-variance decomposition

$\displaystyle \sigma^2 R(\mu,\hat{\mu}) = \sum_{i=1}^n {\mathrm{bias}}(\hat{\mu}(x_i))^2 + \sum_{i=1}^n {\mathrm{Var}}(\hat{\mu}(x_i)) = \sum_{i=1}^n \left( \sum_{j=1}^n l_j(x_i) \mu(x_j) - \mu(x_i)\right)^2 + \sigma^2 \sum_{i=1}^n \Vert l(x_i)\Vert^2{}.$ (5.19)

A plug-in estimate begins by constructing a preliminary pilot estimate of the mean function $ \mu(\cdot)$. This pilot estimate is then substituted into (5.19), and the resulting risk estimate is minimized over the bandwidth $ h$.
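
The following Python sketch illustrates this recipe directly on (5.19) (the simulated data, the kernel smoother, and the pilot bandwidth are assumptions made here for illustration): the pilot fit stands in for $ \mu$, and the plug-in risk is scanned over a bandwidth grid.

\begin{verbatim}
import numpy as np

def smoother_matrix(x, h):
    # Nadaraya-Watson smoother matrix, Gaussian kernel of bandwidth h.
    d = (x[:, None] - x[None, :]) / h
    W = np.exp(-0.5 * d ** 2)
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 60))
y = np.sin(x) + rng.normal(scale=0.3, size=len(x))
n = len(y)

# Pilot estimates of mu and sigma^2 from a small bandwidth.
Lp = smoother_matrix(x, 0.3)
mu_pilot = Lp @ y
rp = y - mu_pilot
sigma2 = np.sum(rp ** 2) / (n - 2 * np.trace(Lp) + np.trace(Lp.T @ Lp))

def plugin_risk(h):
    # Risk (5.19) with the pilot estimate plugged in for mu.
    L = smoother_matrix(x, h)
    bias2 = np.sum((L @ mu_pilot - mu_pilot) ** 2)
    var = sigma2 * np.sum(L ** 2)        # sigma^2 * sum_i ||l(x_i)||^2
    return bias2 + var

grid = np.linspace(0.2, 2.0, 50)
risks = [plugin_risk(h) for h in grid]
print("plug-in bandwidth:", grid[int(np.argmin(risks))])
\end{verbatim}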

There are many variants of the plug-in idea in the statistics literature. Most simplify the risk function using asymptotic approximations such as (5.13) and (5.15) for the bias and variance; making these substitutions in (5.19) gives

$\displaystyle \sigma^2 R(\mu,\hat{\mu}) \approx h^4 \left( \frac{ \int v^2 W(v){\mathrm{d}}v}{2\int W(v){\mathrm{d}}v} \right)^2 \sum_{i=1}^n \mu''(x_i)^2 + \frac{\sigma^2}{nh} \frac{ \int W(v)^2 {\mathrm{d}}v}{ \left(\int W(v){\mathrm{d}}v\right)^2 } \sum_{i=1}^n \frac{1}{f(x_i)}{}.$

If the design points are uniformly distributed on an interval $ [a,b]$ say, then approximating the sums by integrals gives

$\displaystyle \sigma^2 R(\mu,\hat{\mu}) \approx n h^4 \left( \frac{ \int v^2 W(v){\mathrm{d}}v}{2\int W(v){\mathrm{d}}v} \right)^2 \frac{1}{b-a} \int_a^b \mu''(x)^2 {\mathrm{d}}x + \frac{\sigma^2 (b-a)}{h} \frac{ \int W(v)^2 {\mathrm{d}}v}{ \left(\int W(v){\mathrm{d}}v\right)^2 }{}.$

Minimizing this expression over $ h$ yields an asymptotically optimal bandwidth:

$\displaystyle h_{{\mathrm{opt}}}^5 = \frac{\sigma^2 (b-a)^2 \int W(v)^2 {\mathrm{d}}v }{ n \left(\int v^2 W(v){\mathrm{d}}v\right)^2 \int_a^b \mu''(x)^2 {\mathrm{d}}x }{}.$

Evaluation of $ h_{{\text{opt}}}$ requires substitution of estimates for $ \int_a^b \mu''(x)^2 {\mathrm{d}}x$ and for $ \sigma ^2$. The estimate (5.16) can be used to estimate $ \sigma ^2$, but estimating $ \int_a^b \mu''(x)^2 {\mathrm{d}}x$ is more problematic. One technique is to estimate the second derivative using a `pilot' estimate of the smooth, and then use the estimate

$\displaystyle \int_a^b \hat \mu''(x)^2 {\mathrm{d}}x{}.$    

If a local quadratic estimate is used at the pilot stage, the curvature coefficient $ \hat{a}_2$ can be used as an estimate of $ \mu''(x)$.
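
As a rough illustration of this recipe, the following Python sketch evaluates the asymptotically optimal bandwidth from the formula above on simulated data. The Gaussian kernel, the pilot bandwidth, and the crude numerical second derivative (used here in place of the local quadratic curvature coefficient) are all assumptions made for illustration.

\begin{verbatim}
import numpy as np

def smoother_matrix(x, h):
    # Nadaraya-Watson smoother matrix, Gaussian kernel of bandwidth h.
    d = (x[:, None] - x[None, :]) / h
    W = np.exp(-0.5 * d ** 2)
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(4)
a, b, n = 0.0, 10.0, 200
x = np.linspace(a, b, n)                 # roughly uniform design
y = np.sin(x) + rng.normal(scale=0.3, size=n)

# Pilot fit: estimate mu, sigma^2 and the second derivative numerically.
Lp = smoother_matrix(x, 0.6)
mu_pilot = Lp @ y
rp = y - mu_pilot
sigma2 = np.sum(rp ** 2) / (n - 2 * np.trace(Lp) + np.trace(Lp.T @ Lp))
mu2 = np.gradient(np.gradient(mu_pilot, x), x)    # crude mu''(x) estimate
int_mu2_sq = np.sum(mu2 ** 2) * (b - a) / n       # approx. integral of mu''^2

# Kernel constants for W(v) = exp(-v^2 / 2).
int_W2 = np.sqrt(np.pi)          # integral of W(v)^2
int_v2W = np.sqrt(2 * np.pi)     # integral of v^2 W(v)

h_opt = (sigma2 * (b - a) ** 2 * int_W2 /
         (n * int_v2W ** 2 * int_mu2_sq)) ** 0.2
print("asymptotically optimal plug-in bandwidth:", h_opt)
\end{verbatim}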

But the use of a pilot estimate of the second derivative is itself problematic. The pilot estimate has its own bandwidth that must be selected, and the estimated optimal bandwidth $ \hat{h}_{{\text{opt}}}$ is highly sensitive to this choice. Roughly, if the pilot estimate smooths out important features of $ \mu$, so will the estimate $ \hat{\mu}$ with bandwidth $ \hat{h}_{{\text{opt}}}$. More discussion of this point may be found in [20].

5.4.2 Normal-based Inference

Inferential procedures for smoothers include the construction of confidence bands for the true mean function, and procedures to test the adequacy of simpler models. In this section, some of the main ideas are briefly introduced; more extensive discussion can be found in the books [3], [11], [12] and [21].

5.4.2.1 Confidence Intervals.

If the errors $ \epsilon_i$ are normally distributed, then confidence intervals for the true mean can be constructed as

$\displaystyle \hat{\mu}(x) \pm c \hat{\sigma} \Vert l(x)\Vert{}.$    

The constant $ c$ can be chosen from the Student's $ t$ distribution with degrees of freedom equal to $ n- 2\nu_1 + \nu_2$ (alternative choices are discussed below in the context of testing). These confidence intervals are pointwise intervals for $ E(\hat{\mu}(x))$:

$\displaystyle P\left( \vert \hat{\mu}(x) - E(\hat{\mu}(x)) \vert < c \hat{\sigma} \Vert l(x)\Vert \right) = 1-\alpha{}.$    

To construct confidence intervals for $ \mu(x)$, one must either choose the bandwidth sufficiently small so that the bias can be ignored, or explicitly estimate the bias. The latter approach suffers from the same weaknesses observed in plug-in bandwidth selection.
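
A minimal sketch of the interval computation, assuming simulated data and a kernel smoother in place of the local fits used in the chapter (scipy supplies the $ t$ quantile):

\begin{verbatim}
import numpy as np
from scipy import stats

def smoother_matrix(x, h):
    # Nadaraya-Watson smoother matrix, Gaussian kernel of bandwidth h.
    d = (x[:, None] - x[None, :]) / h
    W = np.exp(-0.5 * d ** 2)
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, 80))
y = np.sin(x) + rng.normal(scale=0.3, size=len(x))
n = len(y)

L = smoother_matrix(x, 0.8)
fit = L @ y
nu1, nu2 = np.trace(L), np.trace(L.T @ L)
sigma_hat = np.sqrt(np.sum((y - fit) ** 2) / (n - 2 * nu1 + nu2))

# Pointwise 95% intervals for E(mu_hat(x_i)).
c = stats.t.ppf(0.975, df=n - 2 * nu1 + nu2)
half_width = c * sigma_hat * np.linalg.norm(L, axis=1)   # c * sigma * ||l(x_i)||
lower, upper = fit - half_width, fit + half_width
\end{verbatim}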

5.4.2.2 Tests of Hypothesis.

Consider the problem of testing the adequacy of a linear model. For example, in the fuel economy dataset of Figs. 5.1 and 5.2, one may be interested in knowing whether a linear regression, $ \mu(x) = a+bx$, is adequate, or alternatively whether the departure from linearity indicated by the smooth is significant. This hypothesis testing problem can be stated as

$\displaystyle H_0 : \mu(x) = a+bx\quad \mathrm{for\;some}\quad a,b$    
$\displaystyle H_1 : \mathrm{otherwise}{}.$    

In analogy with the theory of linear models, an $ F$ ratio can be formed by fitting both the null and alternative models, and considering the difference between the fits. Under the null model, parametric least squares is used; the corresponding fitted values are $ \boldsymbol{M} Y$ where $ \boldsymbol{M}$ is the hat matrix for the least squares fit. Under the alternative model, the fitted values are $ \boldsymbol{H}Y$, where $ \boldsymbol{H}$ is the hat matrix for a local linear regression. An $ F$ ratio can then be formed as

$\displaystyle F = \frac{ \Vert \boldsymbol{H}Y - \boldsymbol{M} Y\Vert^2 / \nu }{ \hat{\sigma}^2 }{},$    

where $ \nu = {\mathrm{trace}}{ (\boldsymbol{H}-\boldsymbol{M})^{\top}(\boldsymbol{H}-\boldsymbol{M}) }$.

What is the distribution of $ F$ when $ H_0$ is true? Since $ \boldsymbol{H}$ is not a perpendicular projection operator, the numerator does not have a $ \chi^2$ distribution, and $ F$ does not have an exact $ F$ distribution. Nonetheless, we can use an approximating $ F$ distribution. Based on a one-moment approximation, the degrees of freedom are $ \nu $ and $ n- 2\nu_1 + \nu_2$.

Better approximations are obtained using the two-moment Satterthwaite approximation, as described in [5]. This method matches both the mean and variance of chi-square approximations to the numerator and denominator. Letting $ \Lambda = (\boldsymbol{H}-\boldsymbol{M})^{\top}(\boldsymbol{H}-\boldsymbol{M})$, the numerator degrees of freedom for the $ F$ distribution are $ ({\mathrm{trace}}\,\Lambda)^2 / {\mathrm{trace}}(\Lambda^2)$. A similar adjustment is made to the denominator degrees of freedom. Simulations reported in [5] suggest the two-moment approximation is adequate for setting critical values.
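
A sketch of the whole test on simulated data follows (the data, bandwidth and Gaussian kernel are illustrative assumptions; scipy's $ F$ distribution supplies the $ p$-values). It forms the hat matrices $ \boldsymbol{M}$ and $ \boldsymbol{H}$, computes the $ F$ ratio, and evaluates both the one-moment and two-moment approximations.

\begin{verbatim}
import numpy as np
from scipy import stats

def local_linear_matrix(x, h):
    # Hat matrix H for a local linear fit with Gaussian kernel weights.
    n = len(x)
    H = np.zeros((n, n))
    for i, x0 in enumerate(x):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)
        X = np.column_stack([np.ones(n), x - x0])
        XtW = X.T * w
        H[i] = np.linalg.solve(XtW @ X, XtW)[0]
    return H

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, 60))
y = 0.5 * x + 0.8 * np.sin(x) + rng.normal(scale=0.3, size=len(x))
n = len(y)

# Hat matrices under H0 (linear least squares) and H1 (local linear).
Xl = np.column_stack([np.ones(n), x])
M = Xl @ np.linalg.solve(Xl.T @ Xl, Xl.T)
H = local_linear_matrix(x, 1.0)
R = np.eye(n) - H                         # residual operator under H1

nu1, nu2 = np.trace(H), np.trace(H.T @ H)
sigma2 = np.sum((R @ y) ** 2) / (n - 2 * nu1 + nu2)

Lam = (H - M).T @ (H - M)
nu = np.trace(Lam)
F = np.sum((H @ y - M @ y) ** 2) / nu / sigma2

# One-moment approximation: F(nu, n - 2*nu1 + nu2).
p1 = stats.f.sf(F, nu, n - 2 * nu1 + nu2)

# Two-moment (Satterthwaite) approximation: adjust both degrees of freedom.
Delta = R.T @ R
df1 = np.trace(Lam) ** 2 / np.trace(Lam @ Lam)
df2 = np.trace(Delta) ** 2 / np.trace(Delta @ Delta)
p2 = stats.f.sf(F, df1, df2)
print(F, p1, p2)
\end{verbatim}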

For the fuel economy dataset, we obtain $ F=7.247$, $ \nu=1.0866$ and $ n-2\nu_1+\nu_2 = 55.997$. Using the one-moment approximation, the $ p$-value is $ 0.0079$; the two-moment approximation gives a $ p$-value of $ 0.0019$. Both methods indicate that the nonlinearity is significant, although there is some discrepancy between the $ p$-values.

5.4.3 Bootstrapping

The $ F$-tests in the previous section are approximate, even when the errors $ \epsilon_i$ are normally distributed. Additionally, the degrees-of-freedom computations (particularly for the two-moment approximation) require $ O(n^3)$ operations, which is prohibitively expensive when $ n$ is more than a few hundred.

An alternative to the $ F$ approximations is to simulate the null distribution of the $ F$ ratio. A bootstrap method (Chap. III.2) performs the simulations using the empirical residuals to approximate the true error distribution: bootstrap responses $ Y^{\ast}$ are generated by adding resampled residuals to the fitted values under the null model, and the $ F$ ratio is recomputed for each bootstrap sample to give a replicate $ F^{\ast}$.

This procedure is repeated a large number of times (say $ B=1000$) and tabulation of the resulting $ F^{\ast}$ values provides an estimate of the true distribution of the $ F$  ratio.

Remark. Since the degrees of freedom do not change with the replication, there is no need to actually compute the normalizing constant. Instead, one can simply work with the modified $ F$ ratio,

$\displaystyle F_B = \frac{ \Vert\boldsymbol{H}Y^{\ast} - \boldsymbol{M}Y^{\ast}\Vert^2 }{ \Vert(I-\boldsymbol{H})Y^{\ast}\Vert^2 }{}.$    
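
The following Python sketch implements one common residual-bootstrap scheme for the modified ratio $ F_B$ on simulated data. The choice of resampling centred residuals from the local fit, as well as the data and bandwidth, are assumptions made here for illustration rather than details taken from the chapter.

\begin{verbatim}
import numpy as np

def local_linear_matrix(x, h):
    # Hat matrix H for a local linear fit with Gaussian kernel weights.
    n = len(x)
    H = np.zeros((n, n))
    for i, x0 in enumerate(x):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)
        X = np.column_stack([np.ones(n), x - x0])
        XtW = X.T * w
        H[i] = np.linalg.solve(XtW @ X, XtW)[0]
    return H

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 10, 60))
y = 0.5 * x + 0.8 * np.sin(x) + rng.normal(scale=0.3, size=len(x))
n = len(y)

Xl = np.column_stack([np.ones(n), x])
M = Xl @ np.linalg.solve(Xl.T @ Xl, Xl.T)
H = local_linear_matrix(x, 1.0)
I = np.eye(n)

def f_ratio(v):
    # Modified F ratio F_B; the normalizing constant cancels across replicates.
    return np.sum((H @ v - M @ v) ** 2) / np.sum(((I - H) @ v) ** 2)

F_obs = f_ratio(y)

# Residual bootstrap under H0: resample centred residuals from the local
# fit and add them to the null-model fitted values.
resid = y - H @ y
resid = resid - resid.mean()
null_fit = M @ y
B = 1000
F_star = np.empty(B)
for b in range(B):
    y_star = null_fit + rng.choice(resid, size=n, replace=True)
    F_star[b] = f_ratio(y_star)

p_boot = np.mean(F_star >= F_obs)
print("bootstrap p-value:", p_boot)
\end{verbatim}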

Figure 5.5: Estimated density of the $ F$ ratio, based on the bootstrap method (solid line); 1-moment $ F$ approximation (short dashed line) and 2-moment $ F$ approximation (long dashed line)
\includegraphics[width=7.7cm]{text/3-5/lrfigb.eps}

Figure 5.5 compares the bootstrap distribution of the $ F$ ratio with the one- and two-moment $ F$ approximations for the fuel economy dataset. The bootstrap method uses $ 10{,}000$ bootstrap replications, and the density is estimated using the local likelihood method (Sect. 5.5.2 below). Except at the left end-point, there is generally good agreement between the bootstrap density and the two-moment density. The upper $ 5\,{\%}$ quantiles are 3.21 based on the two-moment approximation and 3.30 based on the bootstrap sample; the one-moment approximation has a critical value of 3.90. Based on the observed $ F=7.248$, the bootstrap $ p$-value is 0.0023, again in close agreement with the two-moment method.

