We also want to perform statistical inference based on the smoothers.
As in parametric regression, we want to construct confidence bands
and prediction intervals based on the smooth curve. Given a new car
of a given weight, what is its fuel economy? Tests of
hypotheses can also be posed: for example, is the curvature observed
in Fig. 5.2 significant, or would a linear regression be
adequate? Given different classifications of car (compact, sporty,
minivan, etc.), are there differences among the categories that cannot
be explained by weight alone?
All smoothing methods have one or more smoothing parameters:
parameters that control the `amount' of smoothing being performed.
An example is the bandwidth $h$ in the kernel and local regression
estimates. Typically, bandwidth selection methods are based on an
estimate of some goodness-of-fit criterion. Bandwidth selection is
a special case of model selection, discussed in more depth in
Chap. III.1.
How should smoothing parameters be used? At one extreme, there is full automation: optimization of the goodness-of-fit criterion produces a single `best' bandwidth. At the other extreme are purely exploratory and graphical methods, using goodness-of-fit as a guide to help choose the best method.
Automation has the advantage that it requires much less work; a computer can be programmed to perform the optimization. But the price is a lack of reliability: fits with very different bandwidths can produce similar values of the goodness-of-fit criterion. The result is either high variability (producing fits that look undersmoothed) or high bias (producing fits that miss obvious features in the data).
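Cross validation (CV) focuses on the prediction problem:
if the fitted regression curve is used to predict new observations,
how good will the prediction be? If a new observation is made at
$x = x_{\mathrm{new}}$, and the response $Y_{\mathrm{new}}$
is predicted by
$\hat Y_{\mathrm{new}} = \hat\mu(x_{\mathrm{new}})$, what is the prediction error? One measure is
$$E\left((Y_{\mathrm{new}} - \hat Y_{\mathrm{new}})^2\right).$$
The method of CV can be used to estimate this quantity. In turn, each
observation $(x_i, Y_i)$ is omitted from the dataset, and is
`predicted' by smoothing the remaining $n-1$
observations. This leads
to the CV score
$$\mathrm{CV}(\hat\mu) = \frac{1}{n}\sum_{i=1}^n \left(Y_i - \hat\mu_{-i}(x_i)\right)^2, \qquad (5.17)$$
where $\hat\mu_{-i}(x_i)$ denotes the estimate at $x_i$ computed with the $i$th observation omitted.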
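Formally computing each of the leave-one-out regression estimates
$\hat\mu_{-i}(\cdot)$ would be computationally intensive, and so at
first glance computation of the CV score (5.17) looks
prohibitively expensive. But there is a remarkable simplification,
valid for nearly all common linear smoothers (and all those discussed
in Sect. 5.2):
$$\hat\mu_{-i}(x_i) = \frac{\hat\mu(x_i) - \mathrm{infl}(x_i)\,Y_i}{1 - \mathrm{infl}(x_i)},$$
where $\mathrm{infl}(x_i)$ is the influence of $Y_i$ on its own fitted value, that is, the $i$th diagonal element of the smoother's hat matrix. With this simplification, the CV score becomes
$$\mathrm{CV}(\hat\mu) = \frac{1}{n}\sum_{i=1}^n \frac{\left(Y_i - \hat\mu(x_i)\right)^2}{\left(1 - \mathrm{infl}(x_i)\right)^2}.$$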
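The leave-one-out simplification can be checked numerically. The following is a minimal sketch, not the chapter's own implementation: it assumes a local linear smoother with Gaussian kernel weights, uses synthetic data rather than the fuel economy dataset, and the function names are hypothetical. It computes the CV score once from a single fit using the influence values, and once by refitting with each observation left out.

```python
import numpy as np

def local_linear_hat_matrix(x, h):
    """Hat matrix L of a local linear smoother, so fitted values are L @ y.

    Gaussian kernel weights are an assumption made for this sketch.
    """
    n = len(x)
    L = np.empty((n, n))
    for i in range(n):
        w = np.exp(-0.5 * ((x - x[i]) / h) ** 2)      # kernel weights around x[i]
        X = np.column_stack([np.ones(n), x - x[i]])   # local design centred at x[i]
        XtW = X.T * w
        L[i] = np.linalg.solve(XtW @ X, XtW)[0]       # intercept row = weights l(x[i])
    return L

def cv_shortcut(x, y, h):
    """CV score from one fit, using influence values infl(x_i) = L[i, i]."""
    L = local_linear_hat_matrix(x, h)
    resid = y - L @ y
    return np.mean((resid / (1.0 - np.diag(L))) ** 2)

def cv_brute_force(x, y, h):
    """CV score obtained by refitting with each observation omitted in turn."""
    n = len(x)
    sq_err = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        xk, yk = x[keep], y[keep]
        w = np.exp(-0.5 * ((xk - x[i]) / h) ** 2)
        X = np.column_stack([np.ones(n - 1), xk - x[i]])
        XtW = X.T * w
        pred = np.linalg.solve(XtW @ X, XtW @ yk)[0]  # leave-one-out prediction at x[i]
        sq_err[i] = (y[i] - pred) ** 2
    return sq_err.mean()

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 60))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 60)

h = 0.1
cv_fast = cv_shortcut(x, y, h)
cv_slow = cv_brute_force(x, y, h)
print(cv_fast, cv_slow)
```

The two values should agree up to rounding error, while the shortcut needs only one fit instead of one fit per observation.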
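Generalized cross validation
(GCV) replaces each of the influence values by the average,
$\nu_1/n$, where $\nu_1 = \sum_{i=1}^n \mathrm{infl}(x_i)$ is the fitted degrees of freedom. This leads to
$$\mathrm{GCV}(\hat\mu) = n\,\frac{\sum_{i=1}^n \left(Y_i - \hat\mu(x_i)\right)^2}{(n - \nu_1)^2}.$$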
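To illustrate how GCV scores might be tabulated against the fitted degrees of freedom, here is a small sketch under stated assumptions: a local linear smoother with Gaussian kernel weights, synthetic data, and an arbitrary grid of bandwidths. The function names and the bandwidth values are hypothetical and are not those used for Fig. 5.4.

```python
import numpy as np

def local_linear_hat_matrix(x, h):
    """Hat matrix of a local linear smoother (Gaussian kernel assumed)."""
    n = len(x)
    L = np.empty((n, n))
    for i in range(n):
        w = np.exp(-0.5 * ((x - x[i]) / h) ** 2)
        X = np.column_stack([np.ones(n), x - x[i]])
        XtW = X.T * w
        L[i] = np.linalg.solve(XtW @ X, XtW)[0]
    return L

def gcv_curve(x, y, bandwidths):
    """Return one row (degrees of freedom, GCV score) per bandwidth."""
    n = len(x)
    rows = []
    for h in bandwidths:
        L = local_linear_hat_matrix(x, h)
        nu1 = np.trace(L)                      # fitted degrees of freedom
        rss = np.sum((y - L @ y) ** 2)         # residual sum of squares
        rows.append((nu1, n * rss / (n - nu1) ** 2))
    return np.array(rows)

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 60)

bandwidths = [0.05, 0.1, 0.2, 0.4, 0.8]       # hypothetical grid
curve = gcv_curve(x, y, bandwidths)
for h, (nu1, gcv) in zip(bandwidths, curve):
    print(f"h={h:4.2f}  df={nu1:6.2f}  GCV={gcv:.4f}")
```

Plotting the second column against the first gives a curve of the kind described for Fig. 5.4; using degrees of freedom rather than bandwidth on the horizontal axis is what makes curves from different smoothers comparable.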
Figure 5.4 shows the GCV scores for the fuel economy
dataset, using kernel and local linear smoothers with a range of
bandwidths. Note the construction of the plot: the fitted degrees of
freedom are used as the horizontal axis. This allows us to
meaningfully superimpose and compare the GCV curves arising from
different smoothing methods. From right to left, the points marked
`0' represent kernel smoothers with increasing bandwidth $h$, and the
points marked with the other plotting symbol represent local linear
smoothers, again with increasing $h$.
The interpretation of Fig. 5.4 is that for any fixed degrees
of freedom, the local linear fit outperforms the kernel fit. The best
fits obtained are the local linear fits, with degrees of freedom
ranging up to 3.5 and correspondingly large bandwidths.
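A risk function measures the distance between the true regression function and the estimate; for example,
$$R(\mu,\hat\mu) = \frac{1}{\sigma^2}\sum_{i=1}^n E\left[\left(\hat\mu(x_i) - \mu(x_i)\right)^2\right]. \qquad (5.18)$$
Since $\mu$ is unknown, the risk cannot be evaluated directly. Instead, the risk must be estimated. An unbiased estimate is
$$\hat R(\mu,\hat\mu) = \frac{1}{\sigma^2}\sum_{i=1}^n \left(Y_i - \hat\mu(x_i)\right)^2 - n + 2\nu_1,$$
where $\nu_1$ denotes the fitted degrees of freedom (the trace of the smoother's hat matrix).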
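The unbiasedness of such a risk estimate can be illustrated by simulation. The sketch below is only a sketch: it assumes the scaled squared-error risk and the estimate "residual sum of squares over $\sigma^2$, minus $n$, plus twice the fitted degrees of freedom", treats $\sigma^2$ as known (in practice it would have to be estimated), and uses a Gaussian-kernel local linear smoother on synthetic data. All function names are hypothetical.

```python
import numpy as np

def local_linear_hat_matrix(x, h):
    """Hat matrix of a local linear smoother (Gaussian kernel assumed)."""
    n = len(x)
    L = np.empty((n, n))
    for i in range(n):
        w = np.exp(-0.5 * ((x - x[i]) / h) ** 2)
        X = np.column_stack([np.ones(n), x - x[i]])
        XtW = X.T * w
        L[i] = np.linalg.solve(XtW @ X, XtW)[0]
    return L

def risk_estimate(y, L, sigma2):
    """Risk estimate RSS / sigma^2 - n + 2 * trace(L); sigma2 assumed known."""
    rss = np.sum((y - L @ y) ** 2)
    return rss / sigma2 - len(y) + 2.0 * np.trace(L)

def true_risk(mu, L, sigma2):
    """Scaled risk: squared bias / sigma^2 plus trace(L^T L).

    Computable only in a simulation, where the true mean mu is known.
    """
    bias = L @ mu - mu
    return np.sum(bias ** 2) / sigma2 + np.trace(L.T @ L)

rng = np.random.default_rng(2)
n, sigma = 50, 0.3
x = np.linspace(0.0, 1.0, n)
mu = np.sin(2 * np.pi * x)                    # synthetic true mean
L = local_linear_hat_matrix(x, h=0.1)

# Average the estimate over many simulated datasets.
estimates = [risk_estimate(mu + rng.normal(0.0, sigma, n), L, sigma ** 2)
             for _ in range(4000)]
mean_estimate = float(np.mean(estimates))
target = true_risk(mu, L, sigma ** 2)
print(mean_estimate, target)
```

Individual estimates are quite variable, which is one reason to inspect the whole curve of risk estimates over a range of fits rather than to trust a single minimizer.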
The unbiased risk estimate can be used similarly to GCV. One computes
the risk estimate for a range of different fits $\hat\mu$,
and plots the resulting risk estimates versus the degrees of
freedom. Fits producing a small risk estimate are considered best.
An entirely different class of bandwidth selection methods, often termed plug-in methods, attempts to directly estimate a risk measure by estimating the bias and variance. These methods have been developed mostly in the context of kernel density estimation, but adaptations to kernel regression and local polynomial regression can be found in [7] and [24].
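Again focusing on the squared-error risk, we have the bias-variance decomposition
$$\sigma^2 R(\mu,\hat\mu) = \sum_{i=1}^n \mathrm{bias}\left(\hat\mu(x_i)\right)^2 + \sum_{i=1}^n \mathrm{var}\left(\hat\mu(x_i)\right). \qquad (5.19)$$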
There are many variants of the plug-in idea in the statistics literature. Most simplify the risk function using asymptotic approximations such as (5.13) and (5.15) for the bias and variance; making these substitutions in (5.19) gives
![]() |
![]() |
![]() |
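Evaluating the resulting optimal bandwidth requires substitution of
estimates for the error variance $\sigma^2$ and for the unknown second
derivative $\mu''$, which enters through the bias term. The
estimate (5.16) can be used to estimate $\sigma^2$, but
estimating the second derivative is more problematic. One
technique is to estimate the second derivative using a `pilot'
estimate of the smooth, and then to substitute the result into the
expression for the optimal bandwidth.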
But the use of a pilot estimate to estimate the second derivative is
problematic. The pilot estimate itself has a bandwidth that has to be
selected, and the estimated optimal bandwidth is
highly sensitive to the choice of pilot bandwidth. Roughly, if the
pilot estimate smooths out important features of $\mu$, so will the
final estimate $\hat\mu$ computed with the selected bandwidth. More discussion
of this point may be found in [20].
Inferential procedures for smoothers include the construction of confidence bands for the true mean function, and procedures to test the adequacy of simpler models. In this section, some of the main ideas are briefly introduced; more extensive discussion can be found in the books [3], [11], [12] and [21].
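If the errors $\epsilon_i$ are normally distributed, then confidence
intervals for the true mean can be constructed as
$$\hat\mu(x) \pm c\,\hat\sigma\,\|l(x)\|,$$
where $l(x)$ is the vector of smoothing weights, so that $\hat\mu(x) = l(x)^T Y$, and $c$ is a critical value chosen to give the desired coverage.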
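As an illustration of how pointwise intervals of the form "estimate plus or minus a critical value times $\hat\sigma\,\|l(x)\|$" might be computed, consider the following sketch. Several choices here are assumptions rather than the chapter's method: Gaussian kernel weights, a residual variance estimate normalized by $n - 2\,\mathrm{trace}(L) + \mathrm{trace}(L^T L)$, a standard normal quantile for the critical value, and synthetic data. The function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def local_linear_weights(x, x0, h):
    """Weight vector l(x0), so that muhat(x0) = l(x0) @ y (Gaussian kernel assumed)."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    X = np.column_stack([np.ones(len(x)), x - x0])
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW)[0]

def pointwise_bands(x, y, grid, h, level=0.95):
    """Fitted curve with lower and upper pointwise limits on a grid."""
    L = np.array([local_linear_weights(x, xi, h) for xi in x])  # hat matrix
    resid = y - L @ y
    nu1, nu2 = np.trace(L), np.trace(L.T @ L)
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (len(x) - 2 * nu1 + nu2))
    c = NormalDist().inv_cdf(0.5 + level / 2)   # normal quantile: an assumption
    W = np.array([local_linear_weights(x, g, h) for g in grid])
    fit = W @ y
    half_width = c * sigma_hat * np.linalg.norm(W, axis=1)      # c * sigma * ||l(x)||
    return fit, fit - half_width, fit + half_width

# Synthetic data for illustration only.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 60)

grid = np.linspace(0.0, 1.0, 11)
fit, lower, upper = pointwise_bands(x, y, grid, h=0.15)
for g, f, lo, up in zip(grid, fit, lower, upper):
    print(f"x={g:4.2f}  fit={f:6.3f}  [{lo:6.3f}, {up:6.3f}]")
```

The intervals widen near the ends of the data, where $\|l(x)\|$ is larger. They are pointwise intervals centred on the estimate and do not correct for smoothing bias.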
Consider the problem of testing for the adequacy of a linear model.
For example, in the fuel economy dataset of Figs. 5.1
and 5.2, one may be interested in knowing whether a linear
regression, $\mu(x) = a + bx$, is adequate, or alternatively whether the
departure from linearity indicated by the smooth is significant. This
hypothesis testing problem can be stated as
$$H_0:\ \mu(x) = a + bx \ \text{ for some } a, b; \qquad H_1:\ \text{otherwise}.$$
In analogy with the theory of linear models, an $F$ ratio can be formed
by fitting both the null and alternative models, and considering the
difference between the fits. Under the null model, parametric least
squares is used; the corresponding fitted values are $HY$,
where $H$ is the hat matrix for the least squares fit. Under
the alternative model, the fitted values are $LY$, where
$L$ is the hat matrix for a local linear regression. An
$F$ ratio can then be formed as
$$F = \frac{\|LY - HY\|^2/\nu}{\hat\sigma^2},$$
where $\nu = \mathrm{trace}\left((H-L)^T(H-L)\right)$ and $\hat\sigma^2$ is the variance estimate (5.16).
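A small numerical sketch of such an $F$ ratio follows. It is written under assumptions: the numerator is the squared distance between the two sets of fitted values divided by $\mathrm{trace}((H-L)^T(H-L))$, the denominator is the residual variance of the smooth normalized by $n - 2\,\mathrm{trace}(L) + \mathrm{trace}(L^T L)$, the kernel is Gaussian, and the data are synthetic. The function names are hypothetical.

```python
import numpy as np

def local_linear_hat_matrix(x, h):
    """Hat matrix L of a local linear smoother (Gaussian kernel assumed)."""
    n = len(x)
    L = np.empty((n, n))
    for i in range(n):
        w = np.exp(-0.5 * ((x - x[i]) / h) ** 2)
        X = np.column_stack([np.ones(n), x - x[i]])
        XtW = X.T * w
        L[i] = np.linalg.solve(XtW @ X, XtW)[0]
    return L

def linear_hat_matrix(x):
    """Hat matrix H of the least squares straight-line fit."""
    X = np.column_stack([np.ones(len(x)), x])
    return X @ np.linalg.solve(X.T @ X, X.T)

def f_ratio(y, H, L):
    """F ratio comparing the linear fit H @ y with the smooth L @ y."""
    n = len(y)
    D = L - H
    nu = np.trace(D.T @ D)                              # numerator normalizing constant
    resid_df = n - 2 * np.trace(L) + np.trace(L.T @ L)  # residual degrees of freedom
    sigma2_hat = np.sum((y - L @ y) ** 2) / resid_df
    return (np.sum((D @ y) ** 2) / nu) / sigma2_hat

# Synthetic data: one truly linear mean, one curved mean, same noise.
rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 60)
noise = rng.normal(0.0, 0.3, 60)
H = linear_hat_matrix(x)
L = local_linear_hat_matrix(x, h=0.15)

f_linear = f_ratio(1.0 + 2.0 * x + noise, H, L)
f_curved = f_ratio(np.sin(2 * np.pi * x) + noise, H, L)
print(f_linear, f_curved)
```

When the true mean is linear, both fits estimate it without bias and the ratio stays moderate; a curved mean inflates the numerator.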
What is the distribution of $F$ when $H_0$ is true? Since $H - L$
is not a perpendicular projection operator, the numerator does not
have a $\chi^2$ distribution, and $F$ does not have an exact $F$
distribution. Nonetheless, we can use an approximating $F$
distribution. Based on a one-moment approximation, the degrees of
freedom are $\nu$ (the normalizing constant in the numerator of the
$F$ ratio) and $n - 2\nu_1 + \nu_2$ (the residual degrees of freedom).
Better approximations are obtained using the two-moment Satterthwaite
approximation,
as described in [5]. This method matches both the mean and
variance of chi-square approximations to the numerator and
denominator. Letting $\Lambda = (H-L)^T(H-L)$, the numerator degrees
of freedom for the $F$ distribution are given by
$\mathrm{trace}(\Lambda)^2/\mathrm{trace}(\Lambda^2)$. A similar adjustment is made to the denominator
degrees of freedom. Simulations reported in [5] suggest
the two-moment approximation is adequate for setting critical values.
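The two sets of degrees of freedom can be computed directly from the hat matrices. The sketch below rests on assumptions: the numerator quadratic form is $(H-L)^T(H-L)$, the denominator quadratic form is taken to be $(I-L)^T(I-L)$ with the same trace-squared over trace-of-square adjustment, and the smoother uses Gaussian kernel weights. The function names are hypothetical.

```python
import numpy as np

def local_linear_hat_matrix(x, h):
    """Hat matrix L of a local linear smoother (Gaussian kernel assumed)."""
    n = len(x)
    L = np.empty((n, n))
    for i in range(n):
        w = np.exp(-0.5 * ((x - x[i]) / h) ** 2)
        X = np.column_stack([np.ones(n), x - x[i]])
        XtW = X.T * w
        L[i] = np.linalg.solve(XtW @ X, XtW)[0]
    return L

def polynomial_hat_matrix(x, degree):
    """Hat matrix of a least squares polynomial fit of the given degree."""
    X = np.vander(x, degree + 1, increasing=True)
    return X @ np.linalg.solve(X.T @ X, X.T)

def f_degrees_of_freedom(H, L):
    """One-moment and two-moment (numerator, denominator) degrees of freedom."""
    n = H.shape[0]
    Lam = (H - L).T @ (H - L)                 # numerator quadratic form
    Gam = (np.eye(n) - L).T @ (np.eye(n) - L) # denominator quadratic form (assumed)
    one_moment = (np.trace(Lam), np.trace(Gam))
    two_moment = (np.trace(Lam) ** 2 / np.trace(Lam @ Lam),
                  np.trace(Gam) ** 2 / np.trace(Gam @ Gam))
    return one_moment, two_moment

x = np.linspace(0.0, 1.0, 60)
H = polynomial_hat_matrix(x, 1)               # null model: straight line
L = local_linear_hat_matrix(x, h=0.15)        # alternative: local linear smooth
one_moment, two_moment = f_degrees_of_freedom(H, L)
print("one-moment:", one_moment)
print("two-moment:", two_moment)

# Sanity check with a parametric alternative (quadratic regression).
Q = polynomial_hat_matrix(x, 2)
one_q, two_q = f_degrees_of_freedom(H, Q)
print("quadratic alternative:", one_q, two_q)
```

When the alternative is itself a perpendicular projection containing the null model, as in the quadratic check, both approximations reduce to the classical degrees of freedom, here 1 and $n-3$. Forming the products of $n \times n$ matrices is what makes the two-moment version costly for large $n$.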
For the fuel economy dataset, the $F$ ratio can be referred to both
approximating distributions. The one-moment and two-moment
approximations each give a small $p$-value. Both
methods indicate that the nonlinearity is significant, although there
is some discrepancy between the
$p$-values.
The $F$-tests in the previous section are approximate, even when the
errors $\epsilon_i$ are normally distributed. Additionally, the
degrees-of-freedom computations (particularly for the two-moment
approximation) require $O(n^3)$ computations, which is prohibitively
expensive for $n$ more than a few hundred.
An alternative to the $F$ approximations is to simulate the null
distribution of the $F$ ratio. A bootstrap method
(Chap. III.2) performs the
simulations using the empirical residuals to approximate the true
error distribution: residuals are resampled to form new responses
around the fitted null model, and the $F$ ratio is recomputed for the
simulated dataset.
This procedure is repeated a large number of times, and
tabulation of the resulting $F$ values provides an estimate of
the true distribution of the $F$ ratio.
Remark. Since the degrees of freedom do not change with the
replication, there is no need to actually compute the normalizing
constants. Instead, one can simply work with the modified ratio,
$$\frac{\|LY - HY\|^2}{\|Y - LY\|^2}.$$
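A residual bootstrap of this kind might be sketched as follows. The details are assumptions rather than the chapter's exact algorithm: residuals are taken from the smooth fit and centred, bootstrap responses are built around the least squares (null) fit, the unnormalized ratio of squared norms is used as the statistic, the kernel is Gaussian, and the data are synthetic. The function names and the number of replications are hypothetical.

```python
import numpy as np

def local_linear_hat_matrix(x, h):
    """Hat matrix L of a local linear smoother (Gaussian kernel assumed)."""
    n = len(x)
    L = np.empty((n, n))
    for i in range(n):
        w = np.exp(-0.5 * ((x - x[i]) / h) ** 2)
        X = np.column_stack([np.ones(n), x - x[i]])
        XtW = X.T * w
        L[i] = np.linalg.solve(XtW @ X, XtW)[0]
    return L

def linear_hat_matrix(x):
    """Hat matrix H of the least squares straight-line fit."""
    X = np.column_stack([np.ones(len(x)), x])
    return X @ np.linalg.solve(X.T @ X, X.T)

def modified_ratio(y, H, L):
    """Unnormalized ratio ||Ly - Hy||^2 / ||y - Ly||^2."""
    return np.sum((L @ y - H @ y) ** 2) / np.sum((y - L @ y) ** 2)

def bootstrap_p_value(y, H, L, n_boot=500, seed=0):
    """Bootstrap p-value for the linear null model (a sketch)."""
    rng = np.random.default_rng(seed)
    null_fit = H @ y
    resid = y - L @ y                          # residuals from the smooth (assumed)
    resid = resid - resid.mean()               # centre before resampling
    observed = modified_ratio(y, H, L)
    sims = np.empty(n_boot)
    for b in range(n_boot):
        y_star = null_fit + rng.choice(resid, size=len(y), replace=True)
        sims[b] = modified_ratio(y_star, H, L)
    return observed, float(np.mean(sims >= observed))

# Synthetic data with a clearly nonlinear mean.
rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 60)

H = linear_hat_matrix(x)
L = local_linear_hat_matrix(x, h=0.15)
observed, p_value = bootstrap_p_value(y, H, L)
print(observed, p_value)
```

Because the hat matrices are computed once and reused, each replication costs only two matrix-vector products, which avoids the trace computations needed for the moment approximations.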
Figure 5.5 compares the bootstrap distribution of the $F$
ratio and the one- and two-moment $F$
approximations for the fuel economy
dataset. The density of the bootstrap
replications is estimated using the Local Likelihood
method (Sect. 5.5.2 below). Except at the left end-point,
there is generally good agreement between the bootstrap density and
the two-moment density. The upper
quantiles are 3.21
based on the two-moment approximation, and 3.30 based on the
bootstrap sample. The one-moment approximation has a critical value of
3.90. Based on the observed $F$ ratio, the bootstrap
$p$-value is 0.0023, again in close agreement with the
two-moment method.