next up previous contents index
Next: 5.4 Statistics for Linear Up: 5. Smoothing: Local Regression Previous: 5.2 Linear Smoothing


5.3 Statistical Properties of Linear Smoothers

Each of the smoothing methods discussed in the previous section has one or more `smoothing parameters' that control the amount of smoothing being performed: for example, the bandwidth $ h$ in the kernel smoother and local regression methods, and the parameter $ \lambda $ in the penalized likelihood criterion. In implementing the smoothers, the first question is how the smoothing parameters should be chosen. More generally, how can the performance of a smoother with given smoothing parameters be assessed? A deeper question arises when comparing fits from different smoothers. For example, we have seen for the fuel economy dataset that a local linear fit with $ h=1000$ (Fig. 5.2) produces a fit similar to a smoothing spline with $ \lambda = 1.5 \times 10^8$ (Fig. 5.3). We would like a precise sense in which these two smoothing parameters are equivalent.

As a prelude to studying methods for bandwidth selection and other statistical inference procedures, we must first study some of the properties of linear smoothers. We can consider measures of goodness-of-fit, such as the mean squared error,

$\displaystyle {\mathrm{MSE}}(x) = E \left((\hat{\mu}(x) - \mu(x))^2\right) = {\mathrm{var}}\left(\hat{\mu}(x)\right) + {\mathrm{bias}}\left(\hat{\mu}(x)\right)^2{},$    

where $ {\mathrm{bias}}(\hat{\mu}(x)) = E(\hat{\mu}(x)) - \mu(x)$.

Intuitively, as the bandwidth $ h$ increases, more data is used to construct the estimate $ \hat{\mu}(x)$, and so the variance $ {\mathrm{var}}(\hat{\mu}(x))$ decreases. On the other hand, the local polynomial approximation is best over small intervals, so we expect the bias to increase as the bandwidth increases. Choosing $ h$ is a tradeoff between small bias and small variance, but we need more precise characterizations to derive and study selection procedures.
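This tradeoff can be examined numerically. For a linear smoother $ \hat{\mu}(x) = \sum_i l_i(x) Y_i$, the mean $ \sum_i l_i(x)\mu(x_i)$ and variance $ \sigma^2 \Vert l(x)\Vert^2$ are available in closed form once the weights are known, so no simulation is needed. The sketch below (NumPy, with a Gaussian kernel; the `local_linear_weights` helper, test function, and parameter values are illustrative choices, not code from this chapter) evaluates the exact bias and variance of a local linear fit at a peak of $ \mu$ for a small and a large bandwidth:

```python
import numpy as np

def local_linear_weights(x0, xs, h):
    """Weight vector l(x0) of a local linear fit with a Gaussian kernel
    (illustrative helper, not part of the chapter)."""
    W = np.exp(-0.5 * ((xs - x0) / h) ** 2)           # kernel weights
    X = np.column_stack([np.ones_like(xs), xs - x0])  # local design: [1, x_i - x0]
    # first row of (X^T W X)^{-1} X^T W gives the equivalent kernel l(x0)
    return np.linalg.solve(X.T @ (W[:, None] * X), X.T * W)[0]

xs = np.linspace(0.0, 1.0, 101)       # uniform design
mu = np.sin(2 * np.pi * xs)           # assumed true mean function
sigma2 = 0.25                         # assumed error variance
x0 = 0.25                             # a peak of mu, where curvature is large

results = {}
for h in (0.05, 0.2):
    l = local_linear_weights(x0, xs, h)
    bias = l @ mu - 1.0               # exact bias: sum_i l_i(x0) mu(x_i) - mu(x0)
    var = sigma2 * np.sum(l ** 2)     # exact variance: sigma^2 ||l(x0)||^2
    results[h] = (bias, var)
    print(f"h={h}: bias={bias:+.4f}, var={var:.5f}")
```

The small bandwidth gives small bias and larger variance; the large bandwidth flattens the peak, producing a large negative bias while shrinking the variance.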

5.3.1 Bias

The bias of a linear smoother is given by

$\displaystyle E (\hat{\mu}(x)) - \mu(x) = \sum_{i=1}^n l_i(x) E(Y_i) - \mu(x) = \sum_{i=1}^n l_i(x) \mu(x_i) - \mu(x){}.$ (5.11)

As this depends on the unknown mean function $ \mu(x)$, it is not very useful by itself, although it may be possible to estimate the bias by substituting an estimate for $ \mu(x)$. To gain more insight, approximations to the bias are derived. The basic tools are
  1. A low order Taylor series expansion of $ \mu(\,\cdot\,)$ around the fitting point $ x$.
  2. Approximation of the sums by integrals.

For illustration, consider the bias of the local linear regression estimate defined by (5.6). A three-term Taylor series gives

$\displaystyle \mu(x_i) = \mu(x) + (x_i-x) \mu'(x) + \frac{(x_i-x)^2}{2} \mu''(x) + o\left(h^2\right)$    

for $ \vert x_i-x\vert \le h$. Substituting this into (5.11) gives

$\displaystyle E(\hat{\mu}(x))-\mu(x)$ $\displaystyle = \mu(x) \sum_{i=1}^n l_i(x) + \mu'(x) \sum_{i=1}^n (x_i-x) l_i(x)$    
  $\displaystyle \quad{} + \frac{\mu''(x)}{2} \sum_{i=1}^n (x_i-x)^2 l_i(x) - \mu(x) + o\left(h^2\right){}.$    

For local linear regression, it can be shown that

$\displaystyle \sum_{i=1}^n l_i(x) = 1$    
$\displaystyle \sum_{i=1}^n (x_i-x) l_i(x) = 0{}.$    

This is a mathematical statement of a heuristically obvious property of local linear regression: if the data $ Y_i$ fall exactly on a straight line, the local linear fit reproduces that line. See [21], p. 37, for a formal proof. With this simplification, the bias reduces to

$\displaystyle E(\hat{\mu}(x))-\mu(x) = \frac{\mu''(x)}{2} \sum_{i=1}^n (x_i - x)^2 l_i(x) + o\left(h^2\right){}.$ (5.12)

This expression characterizes the dependence of the bias on the mean function: the dominant term of the bias is proportional to the second derivative of the mean function.
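The two moment conditions, and the line-reproducing property they imply, are easy to verify numerically. The sketch below (NumPy, Gaussian kernel; the `local_linear_weights` helper is an illustrative construction of the weights, not code from this chapter) checks both conditions on an irregular design and confirms that data lying exactly on a line are reproduced:

```python
import numpy as np

def local_linear_weights(x0, xs, h):
    """Illustrative construction of the local linear weight vector l(x0)."""
    W = np.exp(-0.5 * ((xs - x0) / h) ** 2)           # Gaussian kernel weights
    X = np.column_stack([np.ones_like(xs), xs - x0])  # local design matrix
    return np.linalg.solve(X.T @ (W[:, None] * X), X.T * W)[0]

rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(0.0, 10.0, 80))   # irregular design
x0, h = 4.0, 1.5
l = local_linear_weights(x0, xs, h)

s0 = l.sum()                   # sum_i l_i(x0): should equal 1
s1 = ((xs - x0) * l).sum()     # sum_i (x_i - x0) l_i(x0): should equal 0
fit = l @ (2.0 + 3.0 * xs)     # data on the line 2 + 3x: fit should be 2 + 3*x0 = 14
print(s0, s1, fit)
```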

The next step is to approximate summations by integrals, both in (5.12) and in the matrix equation (5.9) defining $ l_i(x)$. This leads to

$\displaystyle E(\hat{\mu}(x))-\mu(x) \approx \mu''(x) h^2 \frac{ \int v^2 W(v) {\mathrm{d}}v }{ 2\int W(v) {\mathrm{d}}v }{}.$ (5.13)

In addition to the dependence on $ \mu''(x)$, we now see the dependence on $ h$: as the bandwidth $ h$ increases, the bias increases quadratically with the bandwidth.
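The quadratic growth can be checked directly, since the bias of a linear smoother is exactly $ \sum_i l_i(x)\mu(x_i) - \mu(x)$. The sketch below uses NumPy with a Gaussian kernel, for which $ \int v^2 W(v){\mathrm{d}}v / \int W(v){\mathrm{d}}v = 1$, so (5.13) predicts bias $ \approx \mu''(x)h^2/2$; the helper and test function are illustrative choices. Doubling the bandwidth should multiply the bias by about four:

```python
import numpy as np

def local_linear_weights(x0, xs, h):
    """Illustrative construction of the local linear weight vector l(x0)."""
    W = np.exp(-0.5 * ((xs - x0) / h) ** 2)           # Gaussian kernel weights
    X = np.column_stack([np.ones_like(xs), xs - x0])  # local design matrix
    return np.linalg.solve(X.T @ (W[:, None] * X), X.T * W)[0]

xs = np.linspace(-1.0, 1.0, 401)
mu = xs ** 3 - xs              # mu''(x) = 6x, so mu''(0.5) = 3
x0 = 0.5

biases = {}
for h in (0.05, 0.1):
    l = local_linear_weights(x0, xs, h)
    biases[h] = l @ mu - (x0 ** 3 - x0)   # exact bias at x0
    # (5.13) predicts mu''(x0) h^2 / 2 = 1.5 h^2 for the Gaussian kernel
    print(f"h={h}: exact bias={biases[h]:.5f}, approx {1.5 * h * h:.5f}")
```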

Bias expansions like (5.13) are derived much more generally by [25]; their results cover local polynomials of arbitrary degree, as well as multidimensional fits. Their results imply that when $ p$, the degree of the local polynomial, is odd, the dominant term of the bias is proportional to $ h^{p+1} \mu^{(p+1)}(x)$. When $ p$ is even, the first-order term can vanish, leading to bias of order $ h^{p+2}$.

5.3.2 Variance

To derive the variance of a linear smoother, we need to make assumptions about the random errors $ \epsilon_i$ in (5.1). The most common assumption is that the errors are independent and identically distributed, with variance $ {\mathrm{Var}}(\epsilon_i) = \sigma^2$. The variance of a linear smoother (5.3) is

$\displaystyle {\mathrm{Var}}(\hat{\mu}(x)) = \sum_{i=1}^n l_i(x)^2 {\mathrm{Var}}(Y_i) = \sigma^2 \Vert l(x)\Vert^2{}.$ (5.14)

As with bias, informative approximations to the variance can be derived by replacing sums by integrals. For local linear regression, this leads to

$\displaystyle {\mathrm{Var}}(\hat{\mu}(x)) \approx \frac{\sigma^2}{nh f(x)} \frac{ \int W(v)^2 {\mathrm{d}}v}{ \left(\int W(v){\mathrm{d}}v\right)^2 }{},$ (5.15)

where $ f(x)$ is the density of the design points $ x_i$. The dependence on the sample size, bandwidth and design density through $ 1/(nhf(x))$ is universal, holding for any degree of local polynomial. The term depending on the weight function varies according to the degree of local polynomial, but generally increases as the degree of the polynomials increases. See [25] for details.
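The $ 1/(nhf(x))$ factor is easy to confirm numerically: with a uniform design on $ [0,1]$ (so $ f(x)=1$), doubling the sample size at fixed bandwidth should roughly halve the exact variance $ \sigma^2\Vert l(x)\Vert^2$. The sketch below uses NumPy with a Gaussian kernel, for which $ \int W(v)^2{\mathrm{d}}v/(\int W(v){\mathrm{d}}v)^2 = 1/(2\sqrt{\pi}) \approx 0.282$; the helper and parameter values are illustrative:

```python
import numpy as np

def local_linear_weights(x0, xs, h):
    """Illustrative construction of the local linear weight vector l(x0)."""
    W = np.exp(-0.5 * ((xs - x0) / h) ** 2)           # Gaussian kernel weights
    X = np.column_stack([np.ones_like(xs), xs - x0])  # local design matrix
    return np.linalg.solve(X.T @ (W[:, None] * X), X.T * W)[0]

sigma2, x0, h = 1.0, 0.5, 0.1
variances = {}
for n in (100, 200):
    xs = np.linspace(0.0, 1.0, n)             # uniform design, f(x) = 1
    l = local_linear_weights(x0, xs, h)
    variances[n] = sigma2 * np.sum(l ** 2)    # exact variance (5.14)
    # (5.15) predicts roughly sigma2 * 0.282 / (n h)
    print(n, variances[n])
```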

5.3.3 Degrees of Freedom

Under the model (5.1) the observation $ Y_i$ has variance $ \sigma^2$, while the estimate $ \hat{\mu}(x_i)$ has variance $ \sigma^2 \Vert l(x_i)\Vert^2$. The quantity $ \Vert l(x_i)\Vert^2$ measures the variance reduction of the smoother at a data point $ x_i$. At one extreme, if the `smoother' interpolates the data, then $ \hat{\mu}(x_i) = Y_i$ and $ \Vert l(x_i)\Vert^2=1$. At the other extreme, if $ \hat{\mu}(x_i) = \bar{Y}$, then $ \Vert l(x_i)\Vert^2 = 1/n$. Under mild conditions on the weight function, a local polynomial smoother satisfies

$\displaystyle \frac{1}{n} \le \Vert l(x_i)\Vert^2 \le 1{},$    

and $ \Vert l(x_i)\Vert^2$ is usually a decreasing function of the bandwidth $ h$.

A global measure of the amount of smoothing is provided by

$\displaystyle \nu_2 = \sum_{i=1}^n \Vert l(x_i)\Vert^2{}.$    

This is one definition of the `degrees of freedom' or `effective number of parameters' of the smoother. It satisfies the inequalities

$\displaystyle 1 \le \nu_2 \le n{}.$    

An alternative representation of $ \nu_2$ is as follows. Let $ \boldsymbol{H}$ be the `hat matrix', which maps the data to fitted values:

$\displaystyle \begin{pmatrix}\hat{\mu}(x_1) \\ \vdots \\ \hat{\mu}(x_n) \end{pmatrix} = \boldsymbol{H} Y{}.$    

For a linear smoother, $ \boldsymbol{H}$ has rows $ l(x_i)^{\top}$, and $ \nu_2 = {\mathrm{trace}}\left(\boldsymbol{H}^{\top}\boldsymbol{H}\right)$.
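Building $ \boldsymbol{H}$ row by row makes $ \nu_2$ straightforward to compute. The sketch below (NumPy, Gaussian kernel; the `local_linear_weights` helper and bandwidth are illustrative) forms the hat matrix of a local linear smoother and evaluates $ \nu_2 = {\mathrm{trace}}(\boldsymbol{H}^{\top}\boldsymbol{H}) = \sum_i \Vert l(x_i)\Vert^2$:

```python
import numpy as np

def local_linear_weights(x0, xs, h):
    """Illustrative construction of the local linear weight vector l(x0)."""
    W = np.exp(-0.5 * ((xs - x0) / h) ** 2)           # Gaussian kernel weights
    X = np.column_stack([np.ones_like(xs), xs - x0])  # local design matrix
    return np.linalg.solve(X.T @ (W[:, None] * X), X.T * W)[0]

n, h = 50, 0.15
xs = np.linspace(0.0, 1.0, n)
H = np.array([local_linear_weights(xi, xs, h) for xi in xs])  # row i is l(x_i)^T

nu2 = np.trace(H.T @ H)          # = sum_i ||l(x_i)||^2
print(nu2)
```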

The diagonal elements $ l_i(x_i)$ of $ \boldsymbol{H}$ provide another measure of the amount of smoothing at $ x_i$. If the smooth interpolates the data, then $ l(x_i)$ is the corresponding unit vector, with $ l_i(x_i)=1$. If the smooth is simply the global average, then $ l_i(x_i) = 1/n$. The corresponding definition of degrees of freedom is

$\displaystyle \nu_1 = \sum_{i=1}^n l_i(x_i) = {\mathrm{trace}}\left(\boldsymbol{H}\right){}.$    

For a least-squares fit, the hat matrix is a perpendicular projection operator, which is symmetric and idempotent. In this case, $ \boldsymbol{H} = \boldsymbol{H}^{\top}\boldsymbol{H}$, and $ \nu_1 = \nu_2$. For linear smoothers, the two definitions of degrees of freedom are usually not equal, but they are often of similar magnitude.
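The projection case is a useful sanity check: for a global polynomial fit by least squares, both definitions reduce to the number of fitted parameters. A short sketch (NumPy; the quadratic design is an arbitrary illustration):

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 30)
X = np.column_stack([np.ones_like(xs), xs, xs ** 2])  # global quadratic fit, 3 parameters
H = X @ np.linalg.solve(X.T @ X, X.T)                 # projection hat matrix

nu1 = np.trace(H)          # trace H
nu2 = np.trace(H.T @ H)    # trace H^T H; equals nu1 since H is idempotent
print(nu1, nu2)
```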

For the local linear regression in Fig. 5.2, the degrees of freedom are $ \nu_1 = 3.54$ and $ \nu_2 = 3.09$. For the smoothing spline in Fig. 5.3, $ \nu_1 = 3.66$ and $ \nu_2 = 2.98$. By either measure, the degrees of freedom of the two fits are similar. The degrees of freedom provide a mechanism by which different smoothers, with different smoothing parameters, can be compared: we simply choose smoothing parameters producing the same number of degrees of freedom. More extensive discussion of the degrees of freedom of a smoother can be found in [5] and [14].

Variance Estimation

The final component needed for many statistical procedures is an estimate of the error variance $ \sigma ^2$. One such estimate is

$\displaystyle \hat{\sigma}^2 = \frac{1}{n-2\nu_1+\nu_2} \sum_{i=1}^n (Y_i - \hat{\mu}(x_i))^2{}.$ (5.16)

The normalizing constant is chosen so that if the bias of $ \hat{\mu}(x_i)$ is neglected, $ \hat{\sigma}^2$ is unbiased. See [5].
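A sketch of (5.16) follows (NumPy, Gaussian kernel; the bandwidth, sample size, and `local_linear_weights` helper are illustrative choices). The true mean is taken to be linear, so the local linear fit has no smoothing bias and $ \hat{\sigma}^2$ should land close to the true $ \sigma^2 = 0.25$:

```python
import numpy as np

def local_linear_weights(x0, xs, h):
    """Illustrative construction of the local linear weight vector l(x0)."""
    W = np.exp(-0.5 * ((xs - x0) / h) ** 2)           # Gaussian kernel weights
    X = np.column_stack([np.ones_like(xs), xs - x0])  # local design matrix
    return np.linalg.solve(X.T @ (W[:, None] * X), X.T * W)[0]

rng = np.random.default_rng(1)
n, h, sigma = 200, 0.15, 0.5
xs = np.linspace(0.0, 1.0, n)
y = 1.0 + 0.5 * xs + sigma * rng.standard_normal(n)   # linear mean: no smoothing bias

H = np.array([local_linear_weights(xi, xs, h) for xi in xs])  # hat matrix
nu1, nu2 = np.trace(H), np.trace(H.T @ H)
resid = y - H @ y
sigma2_hat = np.sum(resid ** 2) / (n - 2 * nu1 + nu2)  # normalized as in (5.16)
print(sigma2_hat)
```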
