5.3 Statistical Properties of Linear Smoothers

Each of the smoothing methods discussed in the previous section has one or more `smoothing parameters' that control the amount of smoothing being performed: for example, the bandwidth $h$ in the kernel smoother or local regression methods, and the parameter $\lambda$ in the penalized likelihood criterion. In implementing the smoothers, the first question to be asked is how the smoothing parameters should be chosen. More generally, how can the performance of a smoother with given smoothing parameters be assessed? A deeper question is how to compare fits from different smoothers. For example, we have seen for the fuel economy dataset that a local linear fit (Fig. 5.2) produces a fit similar to a smoothing spline (Fig. 5.3). Somehow, we want to be able to say that the two smoothing parameters used there are equivalent.

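As a prelude to studying methods for bandwidth selection and other statistical inference procedures, we must first study some of the properties of linear smoothers. We can consider measures of goodness-of-fit, such as the mean squared error,

$$\mathrm{MSE}(x) = E\left( (\hat\mu(x) - \mu(x))^2 \right) = \mathrm{var}(\hat\mu(x)) + \mathrm{bias}(\hat\mu(x))^2,$$

where $\mathrm{bias}(\hat\mu(x)) = E(\hat\mu(x)) - \mu(x)$.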

Intuitively, as the bandwidth $h$ increases, more data is used to construct the estimate $\hat\mu(x)$, and so the variance decreases. On the other hand, the local polynomial approximation is best over small intervals, so we expect the bias to increase as the bandwidth increases. Choosing $h$ is a tradeoff between small bias and small variance, but we need more precise characterizations to derive and study selection procedures.
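The tradeoff can be made concrete with a small numerical sketch. Everything specific here is an assumption made for illustration only: the tricube weight function, the mean function $\sin(2\pi x)$, the equally spaced design, the error standard deviation, and the helper names `tricube`, `local_linear_weights` and `mu`. For a linear smoother with weights $l_i(x)$, the code evaluates the exact bias $\sum_i l_i(x)\mu(x_i) - \mu(x)$ and variance $\sigma^2\sum_i l_i(x)^2$ at one fitting point for several bandwidths.

```python
import numpy as np

def tricube(v):
    """Tricube weight W(v) = (1 - |v|^3)^3 on [-1, 1], zero elsewhere (assumed choice)."""
    a = np.abs(v)
    return np.where(a < 1, (1 - a**3)**3, 0.0)

def local_linear_weights(x0, x, h):
    """Weights l(x0) such that mu_hat(x0) = l(x0) @ y for a local linear fit at x0."""
    w = tricube((x - x0) / h)
    X = np.column_stack([np.ones_like(x), x - x0])   # local design matrix
    XtW = X.T * w                                    # X^T W, shape (2, n)
    return np.linalg.solve(XtW @ X, XtW)[0]          # first row of (X^T W X)^{-1} X^T W

def mu(x):
    """Hypothetical true mean function, used only for this illustration."""
    return np.sin(2 * np.pi * x)

n, sigma, x0 = 400, 0.3, 0.3
x = (np.arange(n) + 0.5) / n            # equally spaced design on (0, 1)
bandwidths = [0.05, 0.1, 0.2]

rows = []
for h in bandwidths:
    l = local_linear_weights(x0, x, h)
    bias = l @ mu(x) - mu(x0)           # E(mu_hat(x0)) - mu(x0)
    var = sigma**2 * np.sum(l**2)       # sigma^2 * ||l(x0)||^2
    rows.append((h, bias, var, var + bias**2))

for h, bias, var, mse in rows:
    print(f"h={h:.2f}  bias={bias:+.5f}  var={var:.5f}  mse={mse:.5f}")
```

In this setup the absolute bias grows and the variance shrinks as $h$ increases, so the mean squared error is governed by whichever term dominates at a given bandwidth.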

The bias of a linear smoother is given by

$$E(\hat\mu(x)) - \mu(x) = \sum_{i=1}^n l_i(x) E(Y_i) - \mu(x) = \sum_{i=1}^n l_i(x)\mu(x_i) - \mu(x). \tag{5.11}$$

As this depends on the unknown mean function $\mu(x)$, it is not very useful by itself, although it may be possible to estimate the bias by substituting an estimate for $\mu(x)$. To gain more insight, approximations to the bias are derived. The basic tools are:

- A low order Taylor series expansion of $\mu(\cdot)$ around the fitting point $x$.
- Approximation of the sums by integrals.

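For illustration, consider the bias of the local linear regression estimate defined by (5.6). A three-term Taylor series gives

$$\mu(x_i) = \mu(x) + (x_i - x)\mu'(x) + \frac{(x_i - x)^2}{2}\mu''(x) + o(h^2)$$

for $|x_i - x| \le h$. Substituting this into (5.11) gives

$$E(\hat\mu(x)) - \mu(x) = \mu(x)\sum_{i=1}^n l_i(x) - \mu(x) + \mu'(x)\sum_{i=1}^n (x_i - x)\, l_i(x) + \frac{\mu''(x)}{2}\sum_{i=1}^n (x_i - x)^2\, l_i(x) + o(h^2).$$

For local linear regression, it can be shown that

$$\sum_{i=1}^n l_i(x) = 1, \qquad \sum_{i=1}^n (x_i - x)\, l_i(x) = 0.$$

This is a mathematical statement of the heuristically obvious property of the local linear regression: if data fall on a straight line, the local linear regression will reproduce that line. See [21], p. 37, for a formal proof. With this simplification, the bias reduces to

$$E(\hat\mu(x)) - \mu(x) = \frac{\mu''(x)}{2}\sum_{i=1}^n (x_i - x)^2\, l_i(x) + o(h^2). \tag{5.12}$$

This expression characterizes the dependence of the bias on the mean function: the dominant term of the bias is proportional to the second derivative of the mean function.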
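The two identities used in this simplification, that the local linear weights sum to one and have zero first moment about the fitting point, are easy to check numerically. The sketch below is illustrative only: the tricube weight function, the random design, the bandwidth, and the helper names `tricube` and `local_linear_weights` are assumptions, not part of the text.

```python
import numpy as np

def tricube(v):
    """Tricube weight W(v) = (1 - |v|^3)^3 on [-1, 1], zero elsewhere (assumed choice)."""
    a = np.abs(v)
    return np.where(a < 1, (1 - a**3)**3, 0.0)

def local_linear_weights(x0, x, h):
    """Weights l(x0) such that mu_hat(x0) = l(x0) @ y for a local linear fit at x0."""
    w = tricube((x - x0) / h)
    X = np.column_stack([np.ones_like(x), x - x0])   # local design matrix
    XtW = X.T * w                                    # X^T W, shape (2, n)
    return np.linalg.solve(XtW @ X, XtW)[0]          # first row of (X^T W X)^{-1} X^T W

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))   # irregular design points
x0, h = 0.5, 0.2
l = local_linear_weights(x0, x, h)

sum_l = l.sum()                           # should be 1 up to rounding
first_moment = np.sum((x - x0) * l)       # should be 0 up to rounding

# Data lying exactly on a straight line are reproduced at the fitting point.
line = 2.0 + 3.0 * x
fit_at_x0 = l @ line                      # should equal 2 + 3 * x0 up to rounding

print(sum_l, first_moment, fit_at_x0)
```

Both identities follow from the fact that the weight vector, multiplied by the local design matrix with columns $1$ and $x_i - x$, returns the first unit vector, which is why they hold for any design and any bandwidth for which the local fit is defined.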

The next step is to approximate summations by integrals, both in (5.12) and in the matrix equation (5.9) defining $l(x)$. This leads to

$$E(\hat\mu(x)) - \mu(x) \approx \mu''(x)\, h^2\, \frac{\int v^2 W(v)\,dv}{2\int W(v)\,dv}. \tag{5.13}$$

In addition to the dependence on $\mu''(x)$, we now see the dependence on $h$: the bias increases quadratically with the bandwidth.
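As a rough check of this approximation, the exact bias $\sum_i l_i(x)\mu(x_i) - \mu(x)$ can be compared with $\mu''(x)\,h^2 \int v^2 W(v)\,dv \,/\, (2\int W(v)\,dv)$ on a dense, equally spaced design. The mean function $\sin(2\pi x)$, the tricube weight function, the design, and all helper names below are assumptions chosen for illustration.

```python
import numpy as np

def tricube(v):
    """Tricube weight W(v) = (1 - |v|^3)^3 on [-1, 1], zero elsewhere (assumed choice)."""
    a = np.abs(v)
    return np.where(a < 1, (1 - a**3)**3, 0.0)

def local_linear_weights(x0, x, h):
    """Weights l(x0) such that mu_hat(x0) = l(x0) @ y for a local linear fit at x0."""
    w = tricube((x - x0) / h)
    X = np.column_stack([np.ones_like(x), x - x0])
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW)[0]

def kernel_moment(k, m=20001):
    """Integral of v^k W(v) over [-1, 1], approximated by a Riemann sum."""
    v = np.linspace(-1.0, 1.0, m)
    return np.sum(v**k * tricube(v)) * (v[1] - v[0])

def mu(x):
    """Hypothetical mean function for the illustration."""
    return np.sin(2 * np.pi * x)

def mu2(x):
    """Second derivative of the hypothetical mean function."""
    return -(2 * np.pi) ** 2 * np.sin(2 * np.pi * x)

n, x0 = 2000, 0.3
x = (np.arange(n) + 0.5) / n                    # dense equally spaced design
ratio = kernel_moment(2) / (2.0 * kernel_moment(0))

results = {}
for h in (0.05, 0.1):
    l = local_linear_weights(x0, x, h)
    exact = l @ mu(x) - mu(x0)                  # exact bias of the linear smoother
    approx = mu2(x0) * h**2 * ratio             # integral approximation of the bias
    results[h] = (exact, approx)
    print(f"h={h:.2f}  exact={exact:+.6f}  approx={approx:+.6f}")
```

Doubling the bandwidth multiplies the exact bias by a factor close to four in this setup, which is the quadratic dependence on $h$ described above.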

Bias expansions like (5.13) are derived much more generally by [25]; their results cover arbitrary degree local polynomials and multidimensional fits also. Their results imply that when $p$, the degree of the local polynomial, is odd, the dominant term of the bias is proportional to $h^{p+1}\mu^{(p+1)}(x)$. When $p$ is even, the first-order term can disappear, leading to bias of order $h^{p+2}$.

To derive the variance of a linear smoother, we need to make assumptions about the random errors $\epsilon_i$ in (5.1). The most common assumption is that the errors are independent and identically distributed, with variance $\mathrm{var}(\epsilon_i) = \sigma^2$. The variance of a linear smoother (5.3) is

$$\mathrm{var}(\hat\mu(x)) = \sum_{i=1}^n l_i(x)^2\, \mathrm{var}(Y_i) = \sigma^2 \|l(x)\|^2.$$

As with bias, informative approximations to the variance can be derived by replacing sums by integrals. For local linear regression, this leads to

$$\mathrm{var}(\hat\mu(x)) \approx \frac{\sigma^2}{n h f(x)}\, \frac{\int W(v)^2\,dv}{\left(\int W(v)\,dv\right)^2},$$

where $f(x)$ is the density of the design points $x_i$. The dependence on the sample size, bandwidth and design density through $1/(n h f(x))$ is universal, holding for any degree of local polynomial. The term depending on the weight function varies according to the degree of local polynomial, but generally increases as the degree of the polynomials increases. See [25] for details.
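The variance approximation can be checked in the same spirit by comparing $\|l(x)\|^2$ with $\int W(v)^2\,dv \,/\, \bigl(n h f(x) (\int W(v)\,dv)^2\bigr)$. The sketch below assumes an equally spaced design on $(0, 1)$, treated as having design density $f(x) = 1$, together with the tricube weight function; these choices and the helper names are illustrative assumptions.

```python
import numpy as np

def tricube(v):
    """Tricube weight W(v) = (1 - |v|^3)^3 on [-1, 1], zero elsewhere (assumed choice)."""
    a = np.abs(v)
    return np.where(a < 1, (1 - a**3)**3, 0.0)

def local_linear_weights(x0, x, h):
    """Weights l(x0) such that mu_hat(x0) = l(x0) @ y for a local linear fit at x0."""
    w = tricube((x - x0) / h)
    X = np.column_stack([np.ones_like(x), x - x0])
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW)[0]

# Riemann-sum approximations of the two kernel integrals.
v = np.linspace(-1.0, 1.0, 20001)
dv = v[1] - v[0]
int_W = np.sum(tricube(v)) * dv
int_W2 = np.sum(tricube(v) ** 2) * dv

n, x0, f_x0 = 2000, 0.5, 1.0
x = (np.arange(n) + 0.5) / n        # equally spaced design, density taken as 1

checks = []
for h in (0.05, 0.1, 0.2):
    l = local_linear_weights(x0, x, h)
    exact = np.sum(l**2)                              # ||l(x0)||^2 = var / sigma^2
    approx = int_W2 / (n * h * f_x0 * int_W**2)       # integral approximation
    checks.append((h, exact, approx))
    print(f"h={h:.2f}  exact={exact:.6f}  approx={approx:.6f}")
```

Multiplying either column by $\sigma^2$ gives the variance itself; the product $n h \|l(x)\|^2$ is roughly constant across bandwidths in this setup.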

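Under the model (5.1) the observation $Y_i$ has variance $\sigma^2$, while the estimate $\hat\mu(x_i)$ has variance $\sigma^2\|l(x_i)\|^2$. The quantity $\|l(x_i)\|^2$ measures the variance reduction of the smoother at a data point $x_i$. At one extreme, if the `smoother' interpolates the data, then $\hat\mu(x_i) = Y_i$ and $\|l(x_i)\|^2 = 1$. At the other extreme, if $\hat\mu(x_i) = \bar{Y}$, then $\|l(x_i)\|^2 = 1/n$. Under mild conditions on the weight function, a local polynomial smoother satisfies

$$\frac{1}{n} \le \|l(x_i)\|^2 \le 1,$$

and $\|l(x_i)\|^2$ is usually a decreasing function of the bandwidth $h$.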

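A global measure of the amount of smoothing is provided by

$$\nu_2 = \sum_{i=1}^n \|l(x_i)\|^2.$$

This is one definition of the `degrees of freedom' or `effective number of parameters' of the smoother. It satisfies the inequalities

$$1 \le \nu_2 \le n.$$

An alternative representation of $\nu_2$ is as follows. Let $H$ be the `hat matrix', which maps the data to fitted values:

$$\begin{pmatrix} \hat\mu(x_1) \\ \vdots \\ \hat\mu(x_n) \end{pmatrix} = HY.$$

For a linear smoother, $H$ has rows $l(x_i)^T$, and $\nu_2 = \mathrm{trace}(H^T H)$.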

The diagonal elements of $H$, $l_i(x_i)$, provide another measure of the amount of smoothing at $x_i$. If the smooth interpolates the data, then $l(x_i)$ is the corresponding unit vector with $l_i(x_i) = 1$. If the smooth is simply the global average, $l_i(x_i) = 1/n$. The corresponding definition of degrees of freedom is

$$\nu_1 = \sum_{i=1}^n l_i(x_i) = \mathrm{trace}(H).$$

For a least-squares fit, the hat matrix is a perpendicular projection operator, which is symmetric and idempotent. In this case, $H = H^T H$, and $\nu_1 = \nu_2$. For linear smoothers, the two definitions of degrees-of-freedom are usually not equal, but they are often of similar magnitude.
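Both definitions of degrees of freedom can be computed directly from the hat matrix. The following sketch builds $H$ row by row for a local linear smoother and compares it with the projection matrix of a straight-line least-squares fit; the tricube weight function, the design, the bandwidth, and the helper names (`hat_matrix` in particular) are assumptions made for illustration.

```python
import numpy as np

def tricube(v):
    """Tricube weight W(v) = (1 - |v|^3)^3 on [-1, 1], zero elsewhere (assumed choice)."""
    a = np.abs(v)
    return np.where(a < 1, (1 - a**3)**3, 0.0)

def local_linear_weights(x0, x, h):
    """Weights l(x0) such that mu_hat(x0) = l(x0) @ y for a local linear fit at x0."""
    w = tricube((x - x0) / h)
    X = np.column_stack([np.ones_like(x), x - x0])
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW)[0]

def hat_matrix(x, h):
    """Hat matrix H with rows l(x_i)^T, so that the fitted values are H @ y."""
    return np.vstack([local_linear_weights(xi, x, h) for xi in x])

n, h = 100, 0.3
x = (np.arange(n) + 0.5) / n

H = hat_matrix(x, h)
row_norms = np.sum(H**2, axis=1)     # ||l(x_i)||^2, variance reduction at each x_i
nu1 = np.trace(H)                    # sum of the diagonal elements l_i(x_i)
nu2 = np.trace(H.T @ H)              # sum of ||l(x_i)||^2
print(f"local linear: nu1={nu1:.3f}  nu2={nu2:.3f}")

# Straight-line least squares: the hat matrix is a projection, so nu1 = nu2 = 2.
X = np.column_stack([np.ones(n), x])
P = X @ np.linalg.solve(X.T @ X, X.T)
print(f"least squares: trace(P)={np.trace(P):.3f}  trace(P^T P)={np.trace(P.T @ P):.3f}")
```

Forming the full $n \times n$ matrix is convenient for small examples like this one; only the diagonal and the row norms are actually needed for the two traces.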

For the local linear regression in Fig. 5.2 and the smoothing spline smoother in Fig. 5.3, the degrees of freedom $\nu_1$ and $\nu_2$ are similar by either measure. The degrees of freedom thus provide a mechanism by which different smoothers, with different smoothing parameters, can be compared: we simply choose smoothing parameters producing the same number of degrees of freedom. More extensive discussion of the degrees of freedom of a smoother can be found in [5] and [14].

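The final component needed for many statistical procedures is an estimate of the error variance $\sigma^2$. One such estimate is

$$\hat\sigma^2 = \frac{1}{n - 2\nu_1 + \nu_2} \sum_{i=1}^n \left( Y_i - \hat\mu(x_i) \right)^2.$$

The normalizing constant is chosen so that if the bias of $\hat\mu(x_i)$ is neglected, $\hat\sigma^2$ is unbiased. See [5].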
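The normalizing constant comes from the identity $\mathrm{trace}\bigl((I - H)^T (I - H)\bigr) = n - 2\nu_1 + \nu_2$, which gives the expected residual sum of squares divided by $\sigma^2$ when bias is ignored. A minimal simulation sketch follows; the simulated model (mean $\sin(2\pi x)$, normal errors with standard deviation 0.5), the random seed, the bandwidth, the tricube weight function, and the helper names are all assumptions for illustration.

```python
import numpy as np

def tricube(v):
    """Tricube weight W(v) = (1 - |v|^3)^3 on [-1, 1], zero elsewhere (assumed choice)."""
    a = np.abs(v)
    return np.where(a < 1, (1 - a**3)**3, 0.0)

def local_linear_weights(x0, x, h):
    """Weights l(x0) such that mu_hat(x0) = l(x0) @ y for a local linear fit at x0."""
    w = tricube((x - x0) / h)
    X = np.column_stack([np.ones_like(x), x - x0])
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW)[0]

def hat_matrix(x, h):
    """Hat matrix H with rows l(x_i)^T, so that the fitted values are H @ y."""
    return np.vstack([local_linear_weights(xi, x, h) for xi in x])

# Simulated data from a hypothetical model with known error variance.
rng = np.random.default_rng(1)
n, h, sigma = 500, 0.15, 0.5
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=n)

H = hat_matrix(x, h)
resid = y - H @ y                       # Y_i - mu_hat(x_i)
nu1 = np.trace(H)
nu2 = np.sum(H**2)                      # equals trace(H^T H)
df_resid = n - 2.0 * nu1 + nu2          # normalizing constant

# The same constant obtained directly from the residual operator I - H.
R = np.eye(n) - H
df_check = np.trace(R.T @ R)

sigma2_hat = resid @ resid / df_resid
print(f"nu1={nu1:.2f}  nu2={nu2:.2f}  sigma2_hat={sigma2_hat:.4f}  true={sigma**2:.4f}")
```

Any bias in the fitted values inflates the residual sum of squares, so the estimate tends to err upward when the bandwidth is too large for the curvature of the mean function.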