5.2 Linear Smoothing

In this section, some of the most common smoothing methods are introduced and discussed.

5.2.1 Kernel Smoothers

The simplest of smoothing methods is a kernel smoother. A point $ x$ is fixed in the domain of the mean function $ \mu(\cdot)$, and a smoothing window is defined around that point. Most often, the smoothing window is simply an interval $ (x-h,x+h)$, where $ h$ is a fixed parameter known as the bandwidth.

The kernel estimate is a weighted average of the observations within the smoothing window:

$\displaystyle \hat\mu(x) = \frac{ \sum_{i=1}^n W \left( \frac{x_i-x}{h}\right) Y_i} {\sum_{j=1}^n W \left( \frac{x_j-x}{h} \right)}{},$ (5.2)

where $ W(\cdot)$ is a weight function. The weight function is chosen so that most weight is given to those observations close to the fitting point $ x$. One common choice is the bisquare function,

$\displaystyle W(x) = \begin{cases}\left(1-x^2\right)^2 & \vert x\vert \le 1\\ 0 & \vert x\vert > 1 \end{cases}{}.$    

The kernel smoother can be represented as

$\displaystyle \hat\mu(x) = \sum_{i=1}^n l_i(x) Y_i{},$ (5.3)

where the coefficients $ l_i(x)$ are given by

$\displaystyle l_i(x) = \frac{ W \left( \frac{x_i-x}{h}\right) }{ \sum_{j=1}^n W \left( \frac{x_j-x}{h}\right) }{}.$    

A linear smoother is a smoother that can be represented in the form (5.3) for appropriately defined weights $ l_i(x)$. This linear representation leads to many nice statistical and computational properties, which will be discussed later.
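
To make the construction concrete, the following is a minimal Python/NumPy sketch of the kernel estimate (5.2) with the bisquare weight function. The names bisquare, kernel_smooth and x_eval are illustrative, the data x, y are assumed to be NumPy arrays, and no special handling of empty smoothing windows is attempted.

import numpy as np

def bisquare(u):
    # Bisquare weight: (1 - u^2)^2 on [-1, 1], zero outside.
    return np.where(np.abs(u) <= 1, (1 - u**2)**2, 0.0)

def kernel_smooth(x_eval, x, y, h):
    # Nadaraya-Watson estimate (5.2) at each point of x_eval.
    # x, y: observed data arrays; h: bandwidth.
    fits = np.empty(len(x_eval))
    for k, x0 in enumerate(x_eval):
        w = bisquare((x - x0) / h)
        fits[k] = np.sum(w * y) / np.sum(w)   # weighted average of the Y_i
    return fits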

The kernel estimate (5.2) is sometimes called the Nadaraya-Watson estimate ([23,33]). Its simplicity makes it easy to understand and implement, and it is available in many statistical software packages. But its simplicity leads to a number of weaknesses, the most obvious of which is boundary bias. This can be illustrated through an example.

Figure 5.1: Kernel smooth of the fuel economy dataset. The bisquare kernel is used, with bandwidth $ h=600$ pounds

The fuel economy dataset consists of measurements of fuel usage (in miles per gallon) for sixty different vehicles. The predictor variable is the weight (in pounds) of the vehicle. Figure 5.1 shows a scatterplot of the sixty data points, together with a kernel smooth. The smooth is constructed using the bisquare kernel and bandwidth $ h=600$ pounds.

Over much of the domain of Fig. 5.1, the smooth fit captures the main trend of the data, as required. But consider the left boundary region; in particular, vehicles weighing less than $ 2200$ pounds. All these data points lie above the fitted curve, so the fit underestimates the fuel economy of vehicles in this weight range. When the kernel estimate is applied at the left boundary (say, at Weight$ =1800$), all the data points used to form the average have Weight$ >1800$, and correspondingly the slope of the true relation induces boundary bias into the estimate.

More discussion of this and other weaknesses of the kernel smoother can be found in [13]. Many modified kernel estimates have been proposed, but one obtains more parsimonious solutions by considering alternative estimation procedures.

5.2.2 Local Regression

Local regression estimation was independently introduced in several different fields in the late nineteenth and early twentieth century ([15,27]). In the statistical literature, the method was independently introduced from different viewpoints in the late 1970's ([4,18,29]). Books on the topic include [8] and [21].

The underlying principle is that a smooth function can be well approximated by a low degree polynomial in the neighborhood of any point $ x$. For example, a local linear approximation is

$\displaystyle \mu(x_i) \approx a_0 + a_1 (x_i-x)$ (5.4)

for $ x-h \le x_i \le x+h$. A local quadratic approximation is

$\displaystyle \mu(x_i) \approx a_0 + a_1 (x_i-x) + \frac{a_2}{2} (x_i-x)^2{}.$    

The local approximation can be fitted by locally weighted least squares. A weight function and bandwidth are defined as for kernel regression. In the case of local linear regression, coefficient estimates $ \hat{a}_0,\hat{a}_1$ are chosen to minimize

$\displaystyle \sum_{i=1}^n W \left( \frac{x_i-x}{h}\right) \left(Y_i - (a_0 + a_1 (x_i-x))\right)^2{}.$ (5.5)

The local linear regression estimate is defined as

$\displaystyle \hat{\mu}(x) = \hat{a}_0{}.$ (5.6)

Each local least squares problem defines $ \hat{\mu}(x)$ at one point $ x$; if $ x$ is changed, the smoothing weights $ W \left( \frac{x_i-x}{h}\right)$ change, and so the estimates $ \hat{a}_0$ and $ \hat{a}_1$ change.

Since (5.5) is a weighted least squares problem, one can obtain the coefficient estimates by solving the normal equations

$\displaystyle \boldsymbol{X}^{\top}\boldsymbol{W} \left( Y - \boldsymbol{X} \begin{pmatrix}\hat{a}_0 \\ \hat{a}_1 \end{pmatrix}\right)=0{},$ (5.7)

where $ \boldsymbol {X}$ is the design matrix:

$\displaystyle \boldsymbol{X} = \begin{pmatrix}1 & x_1-x \\ \vdots & \vdots \\ 1 & x_n-x\end{pmatrix}$    

for local linear regression, $ \boldsymbol{W}$ is a diagonal matrix with entries $ W \left( \frac{x_i-x}{h}\right)$, and $ Y = \begin{pmatrix}Y_1 & \ldots & Y_n\end{pmatrix}^{\top}$.

When $ \boldsymbol{X}^{\top}\boldsymbol{W} \boldsymbol{X}$ is invertible, one has the explicit representation

$\displaystyle \begin{pmatrix}\hat{a}_0 \\ \hat{a}_1 \end{pmatrix} = \left( \boldsymbol{X}^{\top}\boldsymbol{W} \boldsymbol{X} \right)^{-1} \boldsymbol{X}^{\top}\boldsymbol{W} Y{}.$ (5.8)

This shows that the local regression estimate is a linear estimate, as defined by (5.3). Explicitly, the coefficients $ l_i(x)$ are given by

$\displaystyle l(x)^{\top} = \begin{pmatrix}l_1(x) & \ldots & l_n(x) \end{pmatrix} = e_1^{\top} \left( \boldsymbol{X}^{\top}\boldsymbol{W} \boldsymbol{X} \right)^{-1} \boldsymbol{X}^{\top}\boldsymbol{W}{},$ (5.9)

where $ e_1^{\top}$ is the unit vector $ \begin{pmatrix}1 & 0 \end{pmatrix}$.

For local quadratic regression and higher order fits, one simply adds columns to the design matrix $ \boldsymbol {X}$ and corresponding entries to the vector $ e_1^{\top}$.
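
A minimal Python/NumPy sketch of local linear regression via (5.8) follows; the names bisquare and local_linear are illustrative, and it assumes every smoothing window contains at least two distinct $ x_i$, so that $ \boldsymbol{X}^{\top}\boldsymbol{W} \boldsymbol{X}$ is invertible.

import numpy as np

def bisquare(u):
    return np.where(np.abs(u) <= 1, (1 - u**2)**2, 0.0)

def local_linear(x_eval, x, y, h):
    # Local linear regression: at each fitting point x0, solve the weighted
    # normal equations (5.7) and return mu_hat(x0) = a0_hat, as in (5.6).
    fits = np.empty(len(x_eval))
    for k, x0 in enumerate(x_eval):
        w = bisquare((x - x0) / h)                        # diagonal of W
        X = np.column_stack([np.ones(len(x)), x - x0])    # design matrix
        XtW = X.T * w                                     # X^T W
        a_hat = np.linalg.solve(XtW @ X, XtW @ y)         # (5.8)
        fits[k] = a_hat[0]                                # mu_hat(x0) = a0_hat
    return fits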

Figure 5.2: Local Linear Regression fitted to the fuel economy dataset. A bandwidth $ h=1000$ pounds is used

Figure 5.2 shows a local linear regression fit to the fuel economy dataset. This has clearly fixed the boundary bias problem observed in Fig. 5.1. With the reduction in boundary bias, it is also possible to increase the bandwidth substantially, from $ h=600$ pounds to $ h=1000$ pounds. As a result, the local linear fit uses much more data, so the estimate has less noise.

5.2.3 Penalized Least Squares (Smoothing Splines)

An entirely different approach to smoothing is through optimization of a penalized least squares criterion, such as

$\displaystyle \sum_{i=1}^n \left( Y_i - \mu(x_i)\right)^2 + \lambda \int \mu''(x)^2 {\text{d}}x{},$ (5.10)

where $ \lambda $ is a specified constant. This criterion trades off fidelity to the data (measured by the residual sum of squares) against roughness of the mean function (measured by the penalty term). The penalized least squares method chooses $ \hat{\mu}$, from the class of twice differentiable functions, to minimize (5.10).

The solution to this optimization problem is a piecewise polynomial, or spline function, and so penalized least squares methods are also known as smoothing splines. The idea was first considered in the early twentieth century ([34]). Modern statistical literature on smoothing splines began with work including [32] and [28]. Books devoted to spline smoothing include [10] and [31].

Suppose the data are ordered; $ x_i \le x_{i+1}$ for all $ i$. Let $ \hat{a}_i = \hat{\mu}(x_i)$, and $ \hat{b}_i = \hat{\mu}'(x_i)$, for $ i=1,\ldots,n$. Given these values, it is easy to show that between successive data points, $ \hat{\mu}(x)$ must be the unique cubic polynomial interpolating these values:

$\displaystyle \hat{\mu}(x) = \hat{a}_i \phi_0(u) + \hat{b}_i \Delta_i \psi_0(u) + \hat{a}_{i+1} \phi_1(u) + \hat{b}_{i+1} \Delta_i \psi_1(u){},$    

where $ \Delta_i = x_{i+1}-x_i$; $ u = (x-x_i)/\Delta_i$ and

$\displaystyle \phi_0(u) = 1- u^2(3-2u)$    
$\displaystyle \psi_0(u) = u(1-u(2-u))$    
$\displaystyle \phi_1(u) = u^2(3-2u)$    
$\displaystyle \psi_1(u) = u^2(u-1){}.$    
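
As a small illustration, the following Python sketch evaluates this piecewise cubic interpolant at a single point, given the ordered points and the fitted values $ \hat{a}_i$ and derivatives $ \hat{b}_i$; the helper name hermite_eval and its arguments are illustrative only.

import numpy as np

def hermite_eval(x0, x_knots, a, b):
    # Evaluate the piecewise cubic interpolant at a single point x0, given
    # a[i] = mu_hat(x_i) and b[i] = mu_hat'(x_i) at the ordered points x_knots.
    i = int(np.searchsorted(x_knots, x0, side="right")) - 1
    i = min(max(i, 0), len(x_knots) - 2)       # interval containing x0
    delta = x_knots[i + 1] - x_knots[i]
    u = (x0 - x_knots[i]) / delta
    phi0 = 1 - u**2 * (3 - 2 * u)
    psi0 = u * (1 - u * (2 - u))
    phi1 = u**2 * (3 - 2 * u)
    psi1 = u**2 * (u - 1)
    return (a[i] * phi0 + b[i] * delta * psi0
            + a[i + 1] * phi1 + b[i + 1] * delta * psi1)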

Letting $ \alpha^{\top} = \begin{pmatrix}a_1 & b_1 & \ldots & a_n & b_n \end{pmatrix}$, the penalty term $ \int \mu''(x)^2{\text{d}}x$ is a quadratic function of the parameters, and so (5.10) can be written as

$\displaystyle \Vert Y - \boldsymbol{X} \alpha\Vert^2 + \lambda \alpha^{\top}\boldsymbol{M} \alpha{},$    

for appropriate matrices $ \boldsymbol{M}$ and  $ \boldsymbol {X}$. The parameter estimates are given by

$\displaystyle \widehat{\alpha} = \left( \boldsymbol{X}^{\top}\boldsymbol{X} + \lambda \boldsymbol{M} \right)^{-1} \boldsymbol{X}^{\top}Y{}.$    
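
The following Python sketch illustrates this computation. Constructing the exact penalty matrix $ \boldsymbol{M}$ for the cubic spline basis requires integrating products of second derivatives of the basis functions, so as a stand-in the example uses one coefficient per ordered observation and a squared second-difference penalty, a discrete analogue of the roughness penalty in (5.10); the data and the value of $ \lambda $ are synthetic and purely illustrative.

import numpy as np

def penalized_fit(X, y, M, lam):
    # alpha_hat = (X^T X + lam * M)^{-1} X^T y, the minimizer of
    # ||y - X alpha||^2 + lam * alpha^T M alpha.
    return np.linalg.solve(X.T @ X + lam * M, X.T @ y)

# Stand-in illustration on synthetic data: one coefficient per ordered
# observation (X = identity) and a squared second-difference penalty.
n_obs = 50
x = np.linspace(0.0, 1.0, n_obs)
y = np.sin(2 * np.pi * x) + 0.3 * np.random.standard_normal(n_obs)
D = np.diff(np.eye(n_obs), n=2, axis=0)     # second-difference operator
mu_hat = penalized_fit(np.eye(n_obs), y, D.T @ D, lam=5.0)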

Figure 5.3 shows a smoothing spline fitted to the fuel economy dataset. Clearly, the fit is very similar to the local regression fit in Fig. 5.2. This situation is common for smoothing problems with a single predictor variable; with comparably chosen smoothing parameters, local regression and smoothing spline methods produce similar results. On the other hand, kernel methods can struggle to produce acceptable results, even on relatively simple datasets.

Figure 5.3: Smoothing Spline fitted to the fuel economy dataset. The penalty is $ \lambda = 1.5 \times 10^8 {\text {pounds}}^3$

5.2.4 Regression Splines

Regression splines begin by choosing a set of knots (typically, much smaller than the number of data points), and a set of basis functions spanning a set of piecewise polynomials satisfying continuity and smoothness constraints.

Let the knots be $ v_1 < \ldots < v_k$ with $ v_1 = \min(x_i)$ and $ v_k = \max(x_i)$. A linear spline basis is

$\displaystyle f_j(x) = \begin{cases}\displaystyle \frac{x-v_{j-1}}{v_j-v_{j-1}} & v_{j-1} < x \le v_j\\[3mm] \displaystyle \frac{v_{j+1}-x}{v_{j+1}-v_j} & v_j < x \le v_{j+1}\\[3mm] 0 & {\text{otherwise}} \end{cases}{};$    

note that these functions span the space of piecewise linear functions with knots at $ v_1,\ldots,v_k$. The piecewise linear spline function is constructed by regressing the data onto these basis functions.
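
A minimal Python sketch of this construction follows. The name linear_spline_basis is illustrative, and the convention used at the left boundary (assigning the minimum data point to the first basis function) is an assumption, not part of the definition above.

import numpy as np

def linear_spline_basis(x, knots):
    # Evaluate the piecewise linear ("tent") basis functions f_j at the
    # points x; one column per knot v_j.
    B = np.zeros((len(x), len(knots)))
    for j, v in enumerate(knots):
        if j > 0:                                   # rising piece on (v_{j-1}, v_j]
            left = knots[j - 1]
            idx = (x > left) & (x <= v)
            B[idx, j] = (x[idx] - left) / (v - left)
        if j < len(knots) - 1:                      # falling piece on (v_j, v_{j+1}]
            right = knots[j + 1]
            idx = (x > v) & (x <= right)
            B[idx, j] = (right - x[idx]) / (right - v)
    B[np.isclose(x, knots[0]), 0] = 1.0             # left-boundary convention
    return B

# The regression spline fit: ordinary least squares onto the basis, e.g.
# coef, *_ = np.linalg.lstsq(linear_spline_basis(x, knots), y, rcond=None)
# fit = linear_spline_basis(x, knots) @ coef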

The linear spline basis functions have discontinuous derivatives, and so the resulting fit may have a jagged appearance. It is more common to use piecewise cubic splines, with the basis functions having two continuous derivatives. See Chap. 3 of [26] for a more detailed discussion of regression splines and basis functions.

5.2.5 Orthogonal Series

Orthogonal series methods represent the data with respect to a series of orthogonal basis functions, such as sines and cosines. Only the low frequency terms are retained. The book [6] provides a detailed discussion of this approach to smoothing.

Suppose the $ x_i$ are equally spaced; $ x_i = i/n$. Consider the basis functions

$\displaystyle f_{\omega}(x) = a_{\omega} \cos( 2 \pi \omega x){}; \quad \omega = 0,1,\ldots,\left\lfloor n/2 \right\rfloor$    
$\displaystyle g_{\omega}(x) = b_{\omega} \sin( 2 \pi \omega x){}; \quad\omega = 1,\ldots,\left\lfloor (n-1)/2 \right\rfloor{},$    

where the constants $ a_{\omega},b_{\omega}$ are chosen so that $ \sum_{i=1}^n f_{\omega}(x_i)^2 = \sum_{i=1}^n g_{\omega}(x_i)^2 = 1$. Then the regression coefficients are

$\displaystyle c_{\omega} = \sum_{i=1}^n f_{\omega}(x_i) Y_i$    
$\displaystyle s_{\omega} = \sum_{i=1}^n g_{\omega}(x_i) Y_i$    

and the corresponding smooth estimate is

$\displaystyle \hat{\mu}(x) = \sum_{\omega} h(\omega) \left( c_{\omega} f_{\omega}(x) + s_{\omega} g_{\omega}(x)\right){}.$    

Here, $ h(\omega)$ is chosen to `damp' high frequencies in the observations; for example,

$\displaystyle h(\omega) = \begin{cases}1 & \omega \le \omega_0 \\ 0 & \omega > \omega_0 \end{cases}$    

is a low-pass filter, passing all frequencies less than or equal to $ \omega_0$.
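
A minimal Python sketch of this estimator, evaluated at the equally spaced design points $ x_i = i/n$ and using the low-pass filter above, is given below; the name fourier_smooth is illustrative, and the normalization is done by explicit summation rather than in closed form.

import numpy as np

def fourier_smooth(y, omega0):
    # Orthogonal series estimate at the design points x_i = i/n, with the
    # low-pass filter h(omega): frequencies above omega0 are dropped.
    n = len(y)
    x = np.arange(1, n + 1) / n
    mu_hat = np.zeros(n)
    for omega in range(0, min(omega0, n // 2) + 1):
        f = np.cos(2 * np.pi * omega * x)
        f /= np.sqrt(np.sum(f**2))            # normalize: sum_i f(x_i)^2 = 1
        mu_hat += np.sum(f * y) * f           # c_omega * f_omega(x_i)
        if 1 <= omega <= (n - 1) // 2:
            g = np.sin(2 * np.pi * omega * x)
            g /= np.sqrt(np.sum(g**2))        # normalize: sum_i g(x_i)^2 = 1
            mu_hat += np.sum(g * y) * g       # s_omega * g_omega(x_i)
    return mu_hat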

Orthogonal series are widely used to model time series, where the coefficients $ c_{\omega}$ and $ s_{\omega}$ may have a physical interpretation: non-zero coefficients indicate the presence of cycles in the data. A limitation of orthogonal series approaches is that they are more difficult to apply when the $ x_i$ are not equally spaced.

