In this section, some of the most common smoothing methods are introduced and discussed.
The simplest smoothing method is the kernel smoother. A point $x$ is fixed in the domain of the mean function $\mu(x)$, and a smoothing window is defined around that point. Most often, the smoothing window is simply an interval $(x-h, x+h)$, where $h$ is a fixed parameter known as the bandwidth.
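The kernel estimate is a weighted average of the observations within the smoothing window:
$$\hat\mu(x) = \frac{\sum_{i=1}^{n} W\left(\frac{x_i - x}{h}\right) Y_i}{\sum_{j=1}^{n} W\left(\frac{x_j - x}{h}\right)}, \tag{5.2}$$
where $(x_i, Y_i)$, $i = 1, \ldots, n$, are the observations and $W$ is a weight function chosen to give most weight to observations close to the fitting point $x$. One common choice is the bisquare function, $W(u) = (1 - u^2)^2$ for $-1 \le u \le 1$ and $W(u) = 0$ otherwise.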
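The kernel smoother can be represented as
$$\hat\mu(x) = \sum_{i=1}^{n} l_i(x) Y_i,$$
where the coefficients are the normalized weights $l_i(x) = W\left(\frac{x_i - x}{h}\right) \Big/ \sum_{j=1}^{n} W\left(\frac{x_j - x}{h}\right)$. The estimate is therefore linear in the observations $Y_i$.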
The kernel estimate (5.2) is sometimes called the Nadaraya-Watson estimate ([23,33]). Its simplicity makes it easy to understand and implement, and it is available in many statistical software packages. But its simplicity leads to a number of weaknesses, the most obvious of which is boundary bias. This can be illustrated through an example.
The fuel economy dataset consists of measurements of fuel usage (in miles per gallon) for sixty different vehicles. The predictor variable is the weight (in pounds) of the vehicle. Figure 5.1 shows a scatterplot of the sixty data points, together with a kernel smooth. The smooth is constructed using the bisquare kernel and a fixed bandwidth $h$, measured in pounds.
Over much of the domain of Fig. 5.1, the smooth fit captures the main trend of the data, as required. But consider the left boundary region; in particular, the lightest vehicles. All these data points lie above the fitted curve, so the fitted curve underestimates the economy of vehicles in this weight range. When the kernel estimate is applied at a point on the left boundary, every data point used to form the average has Weight at least as large as that of the fitting point. Because fuel economy decreases with weight, the average is pulled below the true value there: the slope of the true relation induces boundary bias into the estimate.
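The boundary bias described above can be reproduced on synthetic data. The following is a minimal sketch, assuming the bisquare weight function and a noise-free decreasing line in place of the fuel economy data (which is not reproduced here); the function names `bisquare` and `kernel_smooth` are illustrative, not taken from any package.

```python
import numpy as np

def bisquare(u):
    """Bisquare weight: (1 - u^2)^2 on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (1.0 - u**2) ** 2, 0.0)

def kernel_smooth(x0, x, y, h):
    """Kernel estimate at x0: weighted average of y with weights W((x_i - x0) / h)."""
    w = bisquare((np.asarray(x, dtype=float) - x0) / h)
    total = w.sum()
    if total == 0.0:
        return float("nan")  # no observations inside the smoothing window
    return float(np.dot(w, y) / total)

# Synthetic, noise-free data on a decreasing line (an assumption for illustration).
x = np.linspace(0.0, 1.0, 21)
y = 2.0 - x

left = kernel_smooth(0.0, x, y, h=0.3)    # fitting point on the left boundary
middle = kernel_smooth(0.5, x, y, h=0.3)  # fitting point in the interior

# At the boundary the window contains only points with x >= 0, so the average
# falls below the true value 2.0; in the interior the window is symmetric and
# the estimate matches the true value 1.5 up to rounding.
print(left, middle)
```

Even without noise, the boundary estimate is biased downward, which mirrors the behavior seen for the lightest vehicles in Fig. 5.1.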
More discussion of this and other weaknesses of the kernel smoother can be found in [13]. Many modified kernel estimates have been proposed, but one obtains more parsimonious solutions by considering alternative estimation procedures.
Local regression estimation was independently introduced in several different fields in the late nineteenth and early twentieth centuries ([15,27]). In the statistical literature, the method was independently introduced from different viewpoints in the late 1970s ([4,18,29]). Books on the topic include [8] and [21].
The underlying principle is that a smooth function $\mu(x)$ can be well approximated by a low degree polynomial in the neighborhood of any point $x$. For example, a local linear approximation is
$$\mu(x_i) \approx a_0 + a_1 (x_i - x)$$
for $x - h \le x_i \le x + h$.
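The local approximation can be fitted by locally weighted least squares. A weight function $W$ and bandwidth $h$ are defined as for kernel regression. In the case of local linear regression, coefficient estimates $\hat a_0, \hat a_1$ are chosen to minimize
$$\sum_{i=1}^{n} W\left(\frac{x_i - x}{h}\right) \bigl(Y_i - (a_0 + a_1 (x_i - x))\bigr)^2 \tag{5.5}$$
over the observations $(x_i, Y_i)$. The local linear regression estimate at the fitting point is the fitted intercept, $\hat\mu(x) = \hat a_0$.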
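Since (5.5) is a weighted least squares problem, one can obtain the coefficient estimates by solving the normal equations
$$X^\top \mathbf{W} \left( X \begin{pmatrix} \hat a_0 \\ \hat a_1 \end{pmatrix} - Y \right) = 0,$$
where $X$ is the $n \times 2$ design matrix with rows $(1, \; x_i - x)$, $\mathbf{W}$ is the diagonal matrix with entries $W\left(\frac{x_i - x}{h}\right)$, and $Y = (Y_1, \ldots, Y_n)^\top$.
When $X^\top \mathbf{W} X$ is invertible, one has the explicit representation
$$\hat\mu(x) = e_1^\top \left( X^\top \mathbf{W} X \right)^{-1} X^\top \mathbf{W} Y,$$
where $e_1 = (1, 0)^\top$ selects the intercept $\hat a_0$.
For local quadratic regression and higher order fits, one simply adds additional columns to the design matrix $X$ and additional zero entries to the vector $e_1$.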
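The weighted least squares computation above can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the method of any particular package: it uses the bisquare weight function, builds the design matrix with columns $1$ and $x_i - x$, and returns the fitted intercept as the estimate. The names `bisquare` and `local_linear` are hypothetical.

```python
import numpy as np

def bisquare(u):
    """Bisquare weight: (1 - u^2)^2 on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (1.0 - u**2) ** 2, 0.0)

def local_linear(x0, x, y, h):
    """Local linear estimate at x0 by solving the weighted normal equations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = bisquare((x - x0) / h)                      # diagonal of the weight matrix
    X = np.column_stack([np.ones_like(x), x - x0])  # design matrix, rows (1, x_i - x0)
    XtW = X.T * w                                   # X^T W without forming diag(w)
    # Raises numpy.linalg.LinAlgError when X^T W X is singular, for example
    # when fewer than two distinct points receive positive weight.
    coef = np.linalg.solve(XtW @ X, XtW @ y)
    return float(coef[0])                           # intercept = estimate at x0

# Synthetic, noise-free data on a decreasing line (an assumption for illustration).
x = np.linspace(0.0, 1.0, 21)
y = 2.0 - x
boundary_fit = local_linear(0.0, x, y, h=0.3)
print(boundary_fit)  # agrees with the true value 2.0 up to rounding
```

Because a straight line is fitted inside each window, a linear trend is reproduced without bias even at the boundary. A local quadratic fit would add a column `(x - x0) ** 2` to `X`.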
Figure 5.2 shows a local linear regression fit to the fuel economy dataset. This has clearly fixed the boundary bias problem observed in Fig. 5.1. With the reduction in boundary bias, it is also possible to substantially increase the bandwidth beyond that used for the kernel smooth in Fig. 5.1. As a result, the local linear fit uses much more data, so the estimate has less noise.
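An entirely different approach to smoothing is through optimization of a penalized least squares criterion, such as
$$\sum_{i=1}^{n} \bigl(Y_i - \mu(x_i)\bigr)^2 + \lambda \int \mu''(x)^2 \, dx,$$
where $\lambda \ge 0$ is a smoothing parameter. The first term is the residual sum of squares, which rewards fidelity to the data; the second term penalizes roughness of the fitted function.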
The solution to this optimization problem is a piecewise polynomial, or spline function, and so penalized least squares methods are also known as smoothing splines. The idea was first considered in the early twentieth century ([34]). Modern statistical literature on smoothing splines began with work including [32] and [28]. Books devoted to spline smoothing include [10] and [31].
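Suppose the data are ordered; $x_i < x_{i+1}$ for all $i$. Let $\hat a_i = \hat\mu(x_i)$ and $\hat b_i = \hat\mu'(x_i)$, for $i = 1, \ldots, n$. Given these values, it is easy to show that between successive data points, $\hat\mu(x)$ must be the unique cubic polynomial interpolating these values:
$$\hat\mu(x) = \hat a_i \phi_0(u) + \hat b_i \Delta_i \psi_0(u) + \hat a_{i+1} \phi_1(u) + \hat b_{i+1} \Delta_i \psi_1(u), \qquad x_i \le x \le x_{i+1},$$
where $\Delta_i = x_{i+1} - x_i$, $u = (x - x_i)/\Delta_i$, and
$$\phi_0(u) = 1 - u^2(3 - 2u), \quad \psi_0(u) = u\bigl(1 - u(2 - u)\bigr), \quad \phi_1(u) = u^2(3 - 2u), \quad \psi_1(u) = u^2(u - 1).$$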
Figure 5.3 shows a smoothing spline fitted to the fuel economy dataset. Clearly, the fit is very similar to the local regression fit in Fig. 5.2. This situation is common for smoothing problems with a single predictor variable; with comparably chosen smoothing parameters, local regression and smoothing spline methods produce similar results. On the other hand, kernel methods can struggle to produce acceptable results, even on relatively simple datasets.
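The piecewise cubic representation of a smoothing spline, in which the fit between successive data points is the cubic determined by the fitted values and fitted derivatives at the two endpoints, can be sketched as follows. This is only the evaluation step, assuming the fitted values `a` and derivatives `b` at the data points are already available; computing them from the penalized criterion is not shown. The function name `hermite_cubic` and the argument names are hypothetical.

```python
import numpy as np

def hermite_cubic(x_new, x, a, b):
    """Evaluate the piecewise cubic with values a[i] and slopes b[i] at x[i].

    x must be strictly increasing. Points outside [x[0], x[-1]] are evaluated
    by extending the first or last cubic piece.
    """
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    # Index i of the interval [x[i], x[i+1]] used for each evaluation point.
    i = np.clip(np.searchsorted(x, x_new, side="right") - 1, 0, len(x) - 2)
    delta = x[i + 1] - x[i]
    u = (x_new - x[i]) / delta
    # Cubic Hermite basis functions on the unit interval.
    phi0 = 1.0 - u**2 * (3.0 - 2.0 * u)
    psi0 = u * (1.0 - u * (2.0 - u))
    phi1 = u**2 * (3.0 - 2.0 * u)
    psi1 = u**2 * (u - 1.0)
    return a[i] * phi0 + b[i] * delta * psi0 + a[i + 1] * phi1 + b[i + 1] * delta * psi1

# Check on a function that is itself a cubic: values and slopes of f(x) = x^3.
knots = np.array([0.0, 1.0, 3.0])
values = knots**3
slopes = 3.0 * knots**2
print(hermite_cubic([0.5, 2.0], knots, values, slopes))  # reproduces 0.5**3 and 2.0**3
```

Since a cubic is determined by its values and slopes at two points, the interpolant reproduces any cubic exactly, which gives a simple correctness check.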
Regression splines begin by choosing a set of knots (typically far fewer than the number of data points), and a set of basis functions spanning a set of piecewise polynomials satisfying continuity and smoothness constraints.
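Let the knots be $v_1 < v_2 < \cdots < v_k$ with $v_1 = \min_i x_i$ and $v_k = \max_i x_i$. A linear spline basis is
$$f_j(x) = \begin{cases} \dfrac{x - v_{j-1}}{v_j - v_{j-1}}, & v_{j-1} < x \le v_j, \\[1ex] \dfrac{v_{j+1} - x}{v_{j+1} - v_j}, & v_j \le x < v_{j+1}, \\[1ex] 0, & \text{otherwise}, \end{cases} \qquad j = 1, \ldots, k,$$
where the branch referring to $v_0$ or $v_{k+1}$ is omitted for the two end knots. These functions span the piecewise linear functions with knots at the $v_j$, and the fit is obtained by regressing the data onto them.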
The linear spline basis functions have discontinuous derivatives, and so the resulting fit may have a jagged appearance. It is more common to use piecewise cubic splines, with the basis functions having two continuous derivatives. See Chap. 3 of [26] for a more detailed discussion of regression splines and basis functions.
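A linear regression spline can be sketched directly from this description: build one "hat" function per knot and regress the responses on the resulting basis matrix. The code below is a minimal illustration, assuming ordinary (unweighted) least squares and knots that span the data; the names `linear_spline_basis` and `fit_linear_spline` are hypothetical.

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Return the n-by-k matrix whose j-th column is the hat function at knot j.

    knots must be strictly increasing. Outside the knot range the end basis
    functions are held constant, because np.interp clamps at the endpoints.
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    eye = np.eye(len(knots))
    # Linearly interpolating the j-th unit vector over the knots gives a
    # function equal to 1 at knot j and 0 at every other knot.
    return np.column_stack([np.interp(x, knots, eye[j]) for j in range(len(knots))])

def fit_linear_spline(x, y, knots):
    """Least squares coefficients; each one is the fitted value at its knot."""
    B = linear_spline_basis(x, knots)
    coef, *_ = np.linalg.lstsq(B, np.asarray(y, dtype=float), rcond=None)
    return coef

# Synthetic data that is exactly piecewise linear with a kink at 0.5.
x = np.linspace(0.0, 1.0, 11)
y = np.abs(x - 0.5)
coef = fit_linear_spline(x, y, knots=[0.0, 0.5, 1.0])
print(coef)  # fitted values at the three knots
```

With this basis the coefficients are simply the heights of the fitted function at the knots, which makes the jagged appearance of a linear spline easy to see. A cubic basis would replace `linear_spline_basis` while leaving the regression step unchanged.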
Orthogonal series methods represent the data with respect to a series of orthogonal basis functions, such as sines and cosines. Only the low frequency terms are retained. The book [6] provides a detailed discussion of this approach to smoothing.
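Suppose the $x_i$ are equally spaced; $x_i = i/n$. Consider the basis functions
$$f_\omega(x) = \cos(2\pi\omega x), \quad \omega = 0, 1, \ldots, \lfloor n/2 \rfloor,$$
$$g_\omega(x) = \sin(2\pi\omega x), \quad \omega = 1, \ldots, \lfloor (n-1)/2 \rfloor,$$
and consider regressing $Y_i$ on these basis functions, with coefficients $a_\omega$ on the cosine terms and $b_\omega$ on the sine terms. Because the basis functions are orthogonal over the equally spaced design, each coefficient can be estimated separately, and the smooth is obtained by retaining only the terms with small $\omega$.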
Orthogonal series are widely used to model time series, where the coefficients of the cosine and sine terms may have a physical interpretation: non-zero coefficients indicate the presence of cycles in the data. A limitation of orthogonal series approaches is that they are more difficult to apply when the $x_i$ are not equally spaced.
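The low frequency truncation can be sketched as a regression on a small number of cosine and sine terms. This is a minimal illustration assuming an equally spaced design $x_i = i/n$ and a user-chosen cutoff; the function name `fourier_smooth` and the parameter `n_freq` are hypothetical, and the cutoff is assumed to be below $n/2$ so that the retained terms are orthogonal.

```python
import numpy as np

def fourier_smooth(y, n_freq):
    """Regress y on a constant plus cos/sin terms of frequency 1..n_freq.

    Assumes the observations are equally spaced, x_i = i / n, and that
    n_freq < n / 2. Returns the fitted values at the design points.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    x = np.arange(1, n + 1) / n
    cols = [np.ones(n)]  # frequency zero (constant) term
    for w in range(1, n_freq + 1):
        cols.append(np.cos(2.0 * np.pi * w * x))
        cols.append(np.sin(2.0 * np.pi * w * x))
    F = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)
    return F @ coef

# Synthetic series: a slow cycle plus a fast cycle (an assumption for illustration).
n = 32
x = np.arange(1, n + 1) / n
slow = 1.0 + 2.0 * np.cos(2.0 * np.pi * x)
fast = 0.5 * np.sin(2.0 * np.pi * 10 * x)
smooth = fourier_smooth(slow + fast, n_freq=3)
# The frequency-10 term is orthogonal to the retained terms, so it is removed
# and the slow cycle is recovered up to rounding.
print(np.max(np.abs(smooth - slow)))
```

The same regression with unequally spaced $x_i$ would no longer have orthogonal columns, which is the practical difficulty noted above.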