Each of the smoothing methods discussed in the previous section has
one or more 'smoothing parameters'
that control the amount of smoothing being performed. Examples are the
bandwidth $h$
in the kernel smoother or local regression methods, and the
parameter $\lambda$
in the penalized likelihood criterion. In
implementing the smoothers, the first question to be asked is how
should the smoothing parameters be chosen? More generally, how can the
performance of a smoother with given smoothing parameters be assessed?
A deeper question is how to compare fits from different smoothers. For
example, we have seen for the fuel economy dataset that a local linear
fit with a suitable bandwidth $h$
(Fig. 5.2) produces a fit similar to
a smoothing spline with a suitable parameter $\lambda$
(Fig. 5.3). Somehow, we want to be able to say that these two
smoothing parameters are equivalent.
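As a prelude to studying methods for bandwidth selection and other statistical inference procedures, we must first study some of the properties of linear smoothers. We can consider measures of goodness-of-fit, such as the mean squared error,
$$\mathrm{MSE}(x) = E\left( (\hat\mu(x) - \mu(x))^2 \right) = \mathrm{var}(\hat\mu(x)) + \mathrm{bias}(\hat\mu(x))^2,$$
where $\mathrm{bias}(\hat\mu(x)) = E(\hat\mu(x)) - \mu(x)$.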
Intuitively, as the bandwidth $h$ increases, more data is used to
construct the estimate $\hat\mu(x)$, and so the variance
$\mathrm{var}(\hat\mu(x))$ decreases. On the other hand, the local
polynomial approximation
is best over small intervals, so we expect the bias to increase as the
bandwidth increases. Choosing $h$
is a tradeoff between small bias and
small variance,
but we need more precise characterizations to derive and study
selection procedures.
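The tradeoff can be seen numerically in a small sketch. Everything specific here is an assumption made for illustration: the weight function $W(v) = 1 - v^2$ on $[-1, 1]$, the equally spaced design on $[0, 1]$, the mean function $\mu(t) = \sin(2\pi t)$, and independent errors with common variance $\sigma^2$. The helper `local_linear_weights` is a hypothetical name; it returns the weights $l_i(x)$ of a local linear fit from the usual weighted least squares solution.

```python
import math

def local_linear_weights(x, xs, h):
    """Return the weights l_i(x), so that mu_hat(x) = sum_i l_i(x) * y_i."""
    d = [xi - x for xi in xs]
    # Assumed weight function: W(v) = 1 - v^2 on [-1, 1], zero elsewhere.
    w = [max(0.0, 1.0 - (di / h) ** 2) for di in d]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    det = s0 * s2 - s1 * s1  # positive if at least two distinct points have weight
    return [wi * (s2 - di * s1) / det for wi, di in zip(w, d)]

def bias_and_variance_factor(x, xs, h, mu):
    """Exact bias of mu_hat(x) and the factor var(mu_hat(x)) / sigma^2."""
    l = local_linear_weights(x, xs, h)
    bias = sum(li * mu(xi) for li, xi in zip(l, xs)) - mu(x)
    variance_factor = sum(li * li for li in l)
    return bias, variance_factor

xs = [i / 100 for i in range(101)]          # hypothetical design on [0, 1]
mu = lambda t: math.sin(2 * math.pi * t)    # hypothetical true mean function

for h in (0.05, 0.1, 0.2):
    bias, vf = bias_and_variance_factor(0.25, xs, h, mu)
    print(f"h={h:4.2f}  bias={bias:+.5f}  var/sigma^2={vf:.5f}")
```

In this setting the absolute bias grows and the variance factor shrinks as $h$ increases, which is the pattern described above.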
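The bias of a linear smoother is given by
$$E(\hat\mu(x)) - \mu(x) = \sum_{i=1}^n l_i(x) E(Y_i) - \mu(x) = \sum_{i=1}^n l_i(x)\mu(x_i) - \mu(x).$$
For illustration, consider the bias of the local linear regression estimate defined by (5.6). A three-term Taylor series gives
$$\mu(x_i) = \mu(x) + (x_i - x)\mu'(x) + \frac{(x_i - x)^2}{2}\mu''(x) + o(h^2)$$
for $|x_i - x| \le h$. For local linear regression the weights satisfy $\sum_{i=1}^n l_i(x) = 1$ and $\sum_{i=1}^n (x_i - x) l_i(x) = 0$, so the constant and linear terms cancel, leaving
$$E(\hat\mu(x)) - \mu(x) = \frac{\mu''(x)}{2}\sum_{i=1}^n (x_i - x)^2 l_i(x) + o(h^2). \tag{5.12}$$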
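The Taylor argument rests on two moment identities for the local linear weights, $\sum_i l_i(x) = 1$ and $\sum_i (x_i - x) l_i(x) = 0$. The following sketch checks them numerically, and also checks that for a quadratic mean function the bias equals $\frac{1}{2}\mu''(x)\sum_i (x_i - x)^2 l_i(x)$ with no remainder. The weight function $W(v) = 1 - v^2$, the design, the fitting point and the quadratic are all assumptions chosen for illustration.

```python
def local_linear_weights(x, xs, h):
    """Return the weights l_i(x), so that mu_hat(x) = sum_i l_i(x) * y_i."""
    d = [xi - x for xi in xs]
    # Assumed weight function: W(v) = 1 - v^2 on [-1, 1], zero elsewhere.
    w = [max(0.0, 1.0 - (di / h) ** 2) for di in d]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    det = s0 * s2 - s1 * s1
    return [wi * (s2 - di * s1) / det for wi, di in zip(w, d)]

def weight_moments(x, xs, h):
    """Return sum l_i, sum (x_i - x) l_i and sum (x_i - x)^2 l_i."""
    l = local_linear_weights(x, xs, h)
    m0 = sum(l)
    m1 = sum(li * (xi - x) for li, xi in zip(l, xs))
    m2 = sum(li * (xi - x) ** 2 for li, xi in zip(l, xs))
    return m0, m1, m2

xs = [i / 100 for i in range(101)]   # hypothetical design on [0, 1]
x, h = 0.03, 0.1                     # near the boundary, so the window is asymmetric
m0, m1, m2 = weight_moments(x, xs, h)

# Hypothetical quadratic mean with mu''(t) = 8 everywhere.
mu = lambda t: 2.0 + 3.0 * t + 4.0 * t * t
l = local_linear_weights(x, xs, h)
bias = sum(li * mu(xi) for li, xi in zip(l, xs)) - mu(x)

print("sum l_i            =", m0)
print("sum (x_i - x) l_i  =", m1)
print("bias               =", bias)
print("mu''/2 * sum d^2 l =", 0.5 * 8.0 * m2)
```

The identities hold at boundary points as well as interior points, which is why the fitting point is deliberately placed close to the edge of the design.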
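The next step is to approximate summations by integrals, both
in (5.12) and in the matrix equation (5.9) defining
$l(x)$. This leads to
$$E(\hat\mu(x)) - \mu(x) \approx \mu''(x) h^2 \frac{\int v^2 W(v)\,dv}{2\int W(v)\,dv}, \tag{5.13}$$
where $W$ is the weight function of the local regression.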
Bias expansions like (5.13) are derived much more generally
in [25]; the results there cover local polynomials of arbitrary
degree as well as multidimensional fits. They imply that
when $p$, the degree of the local polynomial, is odd, the dominant
term of the bias is proportional to
$h^{p+1}\mu^{(p+1)}(x)$. When $p$
is even, the first-order term can disappear, leading to bias of
order
$h^{p+2}$.
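As a rough check on the integral approximation for local linear regression, the exact bias $\sum_i l_i(x)\mu(x_i) - \mu(x)$ can be compared with the leading term $\mu''(x) h^2 \int v^2 W(v)\,dv / (2\int W(v)\,dv)$. The sketch below assumes the weight function $W(v) = 1 - v^2$ on $[-1, 1]$, for which $\int W = 4/3$ and $\int v^2 W = 4/15$, a dense equally spaced design, and the mean function $\mu(t) = e^t$; none of these choices come from the text.

```python
import math

def local_linear_weights(x, xs, h):
    """Return the weights l_i(x), so that mu_hat(x) = sum_i l_i(x) * y_i."""
    d = [xi - x for xi in xs]
    # Assumed weight function: W(v) = 1 - v^2 on [-1, 1], zero elsewhere.
    w = [max(0.0, 1.0 - (di / h) ** 2) for di in d]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    det = s0 * s2 - s1 * s1
    return [wi * (s2 - di * s1) / det for wi, di in zip(w, d)]

INT_W = 4.0 / 3.0       # integral of W(v) = 1 - v^2 over [-1, 1]
INT_V2_W = 4.0 / 15.0   # integral of v^2 * W(v) over [-1, 1]

def exact_bias(x, xs, h, mu):
    """Bias of the linear smoother: sum_i l_i(x) mu(x_i) - mu(x)."""
    l = local_linear_weights(x, xs, h)
    return sum(li * mu(xi) for li, xi in zip(l, xs)) - mu(x)

def approx_bias(h, mu2_at_x):
    """Leading term mu''(x) * h^2 * int(v^2 W) / (2 * int(W))."""
    return mu2_at_x * h * h * INT_V2_W / (2.0 * INT_W)

xs = [i / 1000 for i in range(1001)]     # hypothetical dense design on [0, 1]
x, h = 0.5, 0.1
exact = exact_bias(x, xs, h, math.exp)   # hypothetical mean mu(t) = exp(t)
approx = approx_bias(h, math.exp(x))     # mu''(t) = exp(t) as well

print("exact bias :", exact)
print("approx bias:", approx)
```

With many points inside the smoothing window and a small bandwidth the two values agree closely; the agreement should be expected to degrade for sparse designs or near the boundary, where the symmetric-window integrals no longer apply.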
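To derive the variance of a linear smoother, we need to make
assumptions about the random errors $\epsilon_i$
in (5.1).
The most common assumption is that the errors are independent and
identically distributed, with variance
$\mathrm{var}(\epsilon_i) = \sigma^2$.
The variance of a linear smoother (5.3) is then
$$\mathrm{var}(\hat\mu(x)) = \sigma^2 \sum_{i=1}^n l_i(x)^2 = \sigma^2 \|l(x)\|^2.$$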
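As with bias, informative approximations to the variance can be derived by replacing sums by integrals. For local linear regression, this leads to
$$\mathrm{var}(\hat\mu(x)) \approx \frac{\sigma^2}{n h f(x)} \frac{\int W(v)^2\,dv}{\left(\int W(v)\,dv\right)^2},$$
where $f(x)$ is the density of the design points $x_i$.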
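Under the model (5.1) the observation $Y_i$ has
variance
$\sigma^2$, while the estimate $\hat\mu(x_i)$
has variance
$\sigma^2\|l(x_i)\|^2$. The quantity $\|l(x_i)\|^2$
measures the
variance reduction of the smoother at a data point
$x_i$. At one
extreme, if the 'smoother' interpolates the data, then
$\hat\mu(x_i) = Y_i$ and
$\|l(x_i)\|^2 = 1$. At the other extreme, if
$\hat\mu(x_i) = \bar{Y}$,
$\|l(x_i)\|^2 = 1/n$. Under mild conditions on the weight
function, a local polynomial smoother satisfies
$$\frac{1}{n} \le \|l(x_i)\|^2 \le 1.$$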
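Both the variance approximation and the bounds on $\|l(x_i)\|^2$ can be checked numerically. The sketch below again assumes the weight function $W(v) = 1 - v^2$ on $[-1, 1]$, for which $\int W^2 / (\int W)^2 = (16/15)/(4/3)^2 = 0.6$, and an equally spaced design on $[0, 1]$ with design density $f(x) = 1$. These are illustrative assumptions, not choices made in the text.

```python
def local_linear_weights(x, xs, h):
    """Return the weights l_i(x), so that mu_hat(x) = sum_i l_i(x) * y_i."""
    d = [xi - x for xi in xs]
    # Assumed weight function: W(v) = 1 - v^2 on [-1, 1], zero elsewhere.
    w = [max(0.0, 1.0 - (di / h) ** 2) for di in d]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    det = s0 * s2 - s1 * s1
    return [wi * (s2 - di * s1) / det for wi, di in zip(w, d)]

def variance_factor(x, xs, h):
    """||l(x)||^2, i.e. var(mu_hat(x)) / sigma^2 under i.i.d. errors."""
    return sum(li * li for li in local_linear_weights(x, xs, h))

n = 201
xs = [i / (n - 1) for i in range(n)]   # hypothetical equally spaced design, f(x) = 1
h = 0.1

# Integral approximation at an interior point, with the constant 0.6
# computed above for the assumed weight function.
exact = variance_factor(0.5, xs, h)
approx = 0.6 / (n * h * 1.0)

# Variance reduction factor at every data point, including the boundaries.
factors = [variance_factor(xi, xs, h) for xi in xs]

print("interior ||l(x)||^2:", exact, " approximation:", approx)
print("min over data points:", min(factors), " 1/n:", 1.0 / n)
print("max over data points:", max(factors))
```

The largest values occur at the ends of the design, where the smoothing window is one-sided and fewer observations contribute to the fit.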
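A global measure of the amount of smoothing is provided by
$$\nu_2 = \sum_{i=1}^n \|l(x_i)\|^2.$$
This is one definition of the 'degrees of freedom' of the smoother. Since each term lies between $1/n$ and $1$, it satisfies
$$1 \le \nu_2 \le n.$$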
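An alternative representation of $\nu_2$ is as follows. Let $H$
be the 'hat matrix',
which maps the data to fitted values:
$$\begin{pmatrix} \hat\mu(x_1) \\ \vdots \\ \hat\mu(x_n) \end{pmatrix} = HY.$$
For a linear smoother, $H$ has rows $l(x_i)^T$, and $\nu_2 = \mathrm{trace}(H^T H)$.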
The diagonal elements of
$H$, $l_i(x_i)$,
provide another
measure of the amount of smoothing at
$x_i$. If the smooth
interpolates the data, then
$l(x_i)$ is the corresponding unit vector
with
$l_i(x_i) = 1$. If the smooth is simply the global average,
$l_i(x_i) = 1/n$. The corresponding definition of degrees of freedom
is
$$\nu_1 = \sum_{i=1}^n l_i(x_i) = \mathrm{trace}(H).$$
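Both definitions of degrees of freedom are straightforward to compute once the hat matrix is available. The following is a minimal sketch for a local linear smoother, assuming the weight function $W(v) = 1 - v^2$ on $[-1, 1]$ and an equally spaced design; the function names are hypothetical and the bandwidths are arbitrary.

```python
def local_linear_weights(x, xs, h):
    """Return the weights l_i(x), so that mu_hat(x) = sum_i l_i(x) * y_i."""
    d = [xi - x for xi in xs]
    # Assumed weight function: W(v) = 1 - v^2 on [-1, 1], zero elsewhere.
    w = [max(0.0, 1.0 - (di / h) ** 2) for di in d]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    det = s0 * s2 - s1 * s1
    return [wi * (s2 - di * s1) / det for wi, di in zip(w, d)]

def hat_matrix(xs, h):
    """Row i is l(x_i)^T, so the fitted values are H applied to the data."""
    return [local_linear_weights(xi, xs, h) for xi in xs]

def degrees_of_freedom(xs, h):
    """Return (nu_1, nu_2) = (trace(H), trace(H^T H))."""
    H = hat_matrix(xs, h)
    n = len(xs)
    nu1 = sum(H[i][i] for i in range(n))
    # trace(H^T H) is the sum of all squared entries, i.e. sum_i ||l(x_i)||^2.
    nu2 = sum(hij * hij for row in H for hij in row)
    return nu1, nu2

xs = [i / 50 for i in range(51)]   # hypothetical design on [0, 1]

# A very large bandwidth approaches a global straight-line fit,
# which has two parameters.
for h in (0.1, 0.3, 5.0):
    nu1, nu2 = degrees_of_freedom(xs, h)
    print(f"h={h:4.1f}  nu1={nu1:.3f}  nu2={nu2:.3f}")
```

In this example both measures decrease as the bandwidth grows, so either can serve as a scale on which the amount of smoothing is reported.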
For the local linear regression in Fig. 5.2 and the
smoothing spline smoother in Fig. 5.3, both
$\nu_1$ and $\nu_2$ can be computed. By either measure the degrees of
freedom are similar for the two fits. The degrees of freedom thus provide
a mechanism by which different smoothers, with different smoothing
parameters, can be compared: we simply choose smoothing parameters
producing the same number of degrees of freedom. More extensive
discussion of the degrees of freedom of a smoother can be found in
[5] and [14].
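Choosing a smoothing parameter to hit a given number of degrees of freedom can be done by a one-dimensional search. The sketch below covers only the local linear side of such a comparison: it uses bisection to find a bandwidth $h$ with $\nu_1 = \mathrm{trace}(H)$ equal to a target value. The weight function $W(v) = 1 - v^2$, the design, the target of 4 degrees of freedom, the search bracket and the function names are all assumptions made for illustration.

```python
def local_linear_weights(x, xs, h):
    """Return the weights l_i(x), so that mu_hat(x) = sum_i l_i(x) * y_i."""
    d = [xi - x for xi in xs]
    # Assumed weight function: W(v) = 1 - v^2 on [-1, 1], zero elsewhere.
    w = [max(0.0, 1.0 - (di / h) ** 2) for di in d]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    det = s0 * s2 - s1 * s1
    return [wi * (s2 - di * s1) / det for wi, di in zip(w, d)]

def nu1(xs, h):
    """Degrees of freedom nu_1 = trace(H) = sum_i l_i(x_i)."""
    return sum(local_linear_weights(xi, xs, h)[i] for i, xi in enumerate(xs))

def bandwidth_for_df(xs, target, lo, hi, iterations=60):
    """Bisection for a bandwidth h with nu1(xs, h) close to target.

    Assumes the bracket straddles the target: nu1(lo) > target > nu1(hi).
    """
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if nu1(xs, mid) > target:
            lo = mid   # too many degrees of freedom: smooth more
        else:
            hi = mid   # too few degrees of freedom: smooth less
    return 0.5 * (lo + hi)

xs = [i / 50 for i in range(51)]   # hypothetical design on [0, 1]
h = bandwidth_for_df(xs, target=4.0, lo=0.05, hi=2.0)
print("bandwidth:", h, " nu1:", nu1(xs, h))
```

The same search could be run over the parameter of a second smoother with the same target, after which the two fits would be matched in the sense described above.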
The final component needed for many statistical procedures is an
estimate of the error variance $\sigma^2$. One such estimate is