4.2 Other Smoothers

This section gives a short overview of nonparametric smoothing techniques other than the kernel method. For further references on these and other nonparametric smoothers, see the bibliographic notes at the end of this chapter.

4.2.1 Nearest-Neighbor Estimator

As we have seen above, kernel regression estimation can be viewed as a method of computing weighted averages of the response variables in a fixed neighborhood around $ x$, the width of this neighborhood being governed by the bandwidth $ h$. The $ k$-nearest-neighbor ($ k$-NN) estimator can also be viewed as a weighted average of the response variables in a neighborhood around $ x$, with the important difference that the neighborhood width is not fixed but variable. To be more specific, the values of $ Y$ used in computing the average are those belonging to the $ k$ observed values of $ X$ that are nearest to the point $ x$ at which we want to estimate $ m(x)$. Formally, the $ k$-NN estimator can be written as

$\displaystyle \widehat{m}_k(x)=\frac{1}{n}\sum_{i=1}^{n}W_{ki}(x)Y_{i},$ (4.23)

where the weights $ \{W_{ki}(x)\}_{i=1}^{n}$ are defined as

$\displaystyle W_{ki}(x)=\left\{ \begin{array}{ll} n/k & \textrm{if}\quad i \in J_{x} \\ 0 & \textrm{otherwise} \end{array} \right.$ (4.24)

with the set of indices

$\displaystyle J_{x}=\{i : X_{i} \textrm{ is one of the $k$ nearest observations to } x\}.$

If we estimate $ m(\bullet)$ at a point $ x$ where the data are sparse, it might happen that the $ k$ nearest neighbors are rather far away from $ x$ (and from each other); consequently, we end up with a wide neighborhood around $ x$ over which the average of the corresponding values of $ Y$ is computed. Note that $ k$ is the smoothing parameter of this estimator. Increasing $ k$ makes the estimate smoother.
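To make (4.23) and (4.24) concrete, here is a minimal NumPy sketch of the $ k$-NN estimator. It is purely illustrative (the book's own computations use the quantlet SPMknnreg); the function name and the toy data are our own choices.

\begin{verbatim}
import numpy as np

def knn_regression(x_grid, X, Y, k):
    """k-NN estimator (4.23)-(4.24): average the Y-values of the k
    observations whose X-values are closest to each grid point."""
    X, Y = np.asarray(X), np.asarray(Y)
    m_hat = np.empty(len(x_grid))
    for j, x in enumerate(x_grid):
        J_x = np.argsort(np.abs(X - x))[:k]   # index set J_x
        m_hat[j] = Y[J_x].mean()              # = (1/n) sum_i W_ki(x) Y_i
    return m_hat

# toy usage with simulated data
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, 200)
Y = np.sin(2 * X) + 0.3 * rng.standard_normal(200)
m_hat = knn_regression(np.linspace(0, 3, 101), X, Y, k=21)
\end{verbatim}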

EXAMPLE 4.5  
A $ k$-NN estimate of the Engel curve (net income vs. food expenditure) is shown in Figure 4.5. $ \Box$

Figure: $ k$-Nearest-neighbor regression, $ k=101$, U.K. Family Expenditure Survey 1973
\includegraphics[width=0.03\defepswidth]{quantlet.ps}SPMknnreg
\includegraphics[width=1.2\defpicwidth]{SPMknnreg.ps}

The $ k$-NN estimator can be viewed as a kernel estimator with uniform kernel $ K(u)=\frac{1}{2}\,\boldsymbol{I}(\vert u\vert\le 1)$ and variable bandwidth $ R=R(k)$, with $ R(k)$ being the distance between $ x$ and the farthest of its $ k$ nearest neighbors:

$\displaystyle \widehat{m}_k(x)=\frac{\sum_{i=1}^nK_{R}(x-X_i) Y_{i}}{\sum_{i=1}^nK_{R}(x-X_i)}.$ (4.25)

The $ k$-NN estimator can be generalized in this sense by considering kernels other than the uniform kernel. Bias and variance of this more general $ k$-NN estimator are given in the following theorem by Mack (1981).
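Before stating the theorem, the following NumPy sketch illustrates the generalized estimator (4.25): any kernel can be combined with the variable bandwidth $ R(k)$, here the Epanechnikov kernel (again an illustrative sketch, not the book's implementation; since $ K_R(u)=K(u/R)/R$, the factor $ 1/R$ cancels in the ratio and is omitted).

\begin{verbatim}
import numpy as np

def epanechnikov(u):
    # Epanechnikov kernel K(u) = 0.75 (1 - u^2) I(|u| <= 1)
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u**2), 0.0)

def knn_kernel_regression(x_grid, X, Y, k, kernel=epanechnikov):
    """Generalized k-NN estimator (4.25): a kernel estimator whose
    bandwidth R(k) at x is the distance to the k-th nearest neighbor."""
    X, Y = np.asarray(X), np.asarray(Y)
    m_hat = np.empty(len(x_grid))
    for j, x in enumerate(x_grid):
        dist = np.abs(X - x)
        R = np.sort(dist)[k - 1]          # variable bandwidth R(k)
        w = kernel(dist / R)              # proportional to K_R(x - X_i)
        m_hat[j] = np.sum(w * Y) / np.sum(w)
    return m_hat
\end{verbatim}

With the uniform kernel $ K(u)=\frac{1}{2}\,\boldsymbol{I}(\vert u\vert\le 1)$ this reduces (up to ties in the distances) to the plain $ k$-NN average (4.23).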

THEOREM 4.4  
Let $ k\to \infty $, $ k/n\to 0$ and $ n\rightarrow \infty$. Then
$\displaystyle E\{\widehat{m}_{k}(x)\}-m(x) \approx \frac{\mu_2(K)}{8 f_X(x)^2}\,\left\{m''(x)+2\,\frac{m'(x)f_X'(x)}{f_X(x)}\right\}\,\left(\frac{k}{n}\right)^{2}$ (4.26)

$\displaystyle Var\{\widehat{m}_{k}(x)\} \approx 2\,\Vert K\Vert^{2}_{2}\,\frac{\sigma^2(x)}{k}.$ (4.27)

Obviously, unlike the variance of the Nadaraya-Watson kernel regression estimator, the variance of the $ k$-NN regression estimator does not depend on $ f_X(x)$. This makes sense since the $ k$-NN estimator always averages over $ k$ observations, regardless of how dense the data are in the neighborhood of the point $ x$ where we estimate $ m(\bullet)$. Consequently, $ Var\{\widehat{m}_{k}(x)\}\sim\frac{1}{k}$. By choosing

$\displaystyle k=2nhf_X(x),$ (4.28)

we obtain a $ k$-NN estimator that is approximately identical to a kernel estimator with bandwidth $ h$ in the leading terms of the $ \mse$.
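To see where (4.28) comes from, substitute $ k/n = 2hf_X(x)$ into (4.26) and (4.27):

$\displaystyle E\{\widehat{m}_{k}(x)\}-m(x) \approx \frac{\mu_2(K)}{8 f_X(x)^2}\,\left\{m''(x)+2\,\frac{m'(x)f_X'(x)}{f_X(x)}\right\}\left\{2hf_X(x)\right\}^{2} = \frac{h^{2}}{2}\,\mu_2(K)\left\{m''(x)+2\,\frac{m'(x)f_X'(x)}{f_X(x)}\right\},$

$\displaystyle Var\{\widehat{m}_{k}(x)\} \approx \frac{2\,\Vert K\Vert^{2}_{2}\,\sigma^{2}(x)}{2nhf_X(x)} = \frac{1}{nh}\,\Vert K\Vert^{2}_{2}\,\frac{\sigma^{2}(x)}{f_X(x)},$

which are precisely the leading bias and variance terms obtained for the Nadaraya-Watson estimator with bandwidth $ h$.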

4.2.2 Median Smoothing

Median smoothing may be described as the nearest-neighbor technique for estimating the conditional median function, rather than the conditional expectation function, which has been our target so far. The conditional median $ \med(Y\vert X=x)$ is more robust to outliers than the conditional expectation $ E(Y\vert X=x)$. Moreover, median smoothing allows us to model discontinuities in the regression curve $ \med(Y\vert X)$. Formally, the median smoother is defined as

$\displaystyle \index{median smoothing} \widehat{m}(x)=\med\{Y_{i}:i \in J_{x}\},$ (4.29)

where

$\displaystyle J_{x}=\{i:X_{i} \textrm{ is one of the $k$ nearest neighbors of } x\}.$

That is, we compute the median of those $ Y_{i}$s whose corresponding $ X_{i}$ is one of the $ k$ nearest neighbors of $ x$.
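As a minimal illustration (not the book's code, which uses the quantlet SPMmesmooreg), the median smoother differs from the $ k$-NN sketch above only in replacing the mean by the median:

\begin{verbatim}
import numpy as np

def median_smoother(x_grid, X, Y, k):
    """Median smoother (4.29): the median of the Y-values belonging
    to the k nearest neighbors of each grid point."""
    X, Y = np.asarray(X), np.asarray(Y)
    m_hat = np.empty(len(x_grid))
    for j, x in enumerate(x_grid):
        J_x = np.argsort(np.abs(X - x))[:k]   # index set J_x
        m_hat[j] = np.median(Y[J_x])
    return m_hat
\end{verbatim}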

EXAMPLE 4.6  
We display such a median smoother for our running Engel curve example in Figure 4.6. Note that in contrast to the $ k$-NN estimator, extreme values of food expenditures no longer affect the estimator. $ \Box$

Figure: Median smoothing regression, $ k=101$, U.K. Family Expenditure Survey 1973
\includegraphics[width=0.03\defepswidth]{quantlet.ps}SPMmesmooreg
\includegraphics[width=1.2\defpicwidth]{SPMmesmooreg.ps}

4.2.3 Spline Smoothing

Spline smoothing can be motivated by considering the residual sum of squares ($ \rss$) as a criterion for the goodness of fit of a function $ {m}(\bullet)$ to the data. The residual sum of squares is defined as

$\displaystyle \rss=\sum_{i=1}^n\left\{ Y_i-{m}(X_i)\right\}^2.$

However, the function defined by $ {m}(X_{i})=Y_{i}$, $ i=1,\ldots,n$, minimizes the $ \rss$ but merely interpolates the data, without exploiting any structure that might be present in the data. Spline smoothing solves this problem by adding a stabilizer that penalizes non-smoothness of $ {m}(\bullet)$. One possible stabilizer is given by

$\displaystyle \Vert{m}''\Vert^{2}_{2}=\int \left\{{m}''(x)\right\}^{2}\;dx.$ (4.30)

The use of $ m''$ can be motivated by the fact that the curvature of $ m(x)$ increases with $ \vert m''(x)\vert$. Using the penalty term (4.30) we may restate the minimization problem as

$\displaystyle \widehat{m}_\lambda= \arg\min_{{m}}\, S_{\lambda}({m})$ (4.31)

with

$\displaystyle S_{\lambda}({m})=\sum_{i=1}^n \left\{Y_{i}-{m}(X_{i}) \right\}^2+\lambda \Vert{m}''\Vert^{2}_{2}.$ (4.32)

If we consider the class of all twice differentiable functions on the interval $ [a,b]=[X_{(1)},X_{(n)}]$ (where $ X_{(i)}$ denotes the $ i$th order statistic) then the (unique) minimizer of (4.32) is given by the cubic spline estimator $ \widehat{m}_\lambda(x)$, which consists of cubic polynomials

$\displaystyle p_{i}(x)=\alpha_i+\beta_i x + \gamma_i x^2 + \delta_i x^3, \quad i=1,\ldots, n-1,$

on the intervals between adjacent values $ X_{(i)}$ and $ X_{(i+1)}$.

The parameter $ \lambda$ controls the weight given to the stabilizer in the minimization. The higher $ \lambda$ is, the more weight is given to $ \Vert{m}''\Vert^{2}_{2}$ and the smoother the estimate. As $ \lambda \rightarrow 0$, $ \widehat{m}_\lambda(\bullet)$ approaches an interpolation of the observed values of $ Y$. As $ \lambda \rightarrow \infty$, $ \widehat{m}_{\lambda}(\bullet)$ tends to a linear function.

Let us now consider the spline estimator in more detail. For the estimator to be twice continuously differentiable we have to make sure that there are no jumps in the function, nor in its first and second derivatives, when evaluated at the points $ X_{(i)}$. Formally, we require

$\displaystyle p_i\left(X_{(i)}\right) = p_{i-1}\left(X_{(i)}\right), \quad p'_i\left(X_{(i)}\right) = p'_{i-1}\left(X_{(i)}\right), \quad p''_i\left(X_{(i)}\right) = p''_{i-1}\left(X_{(i)}\right).$

Additionally a boundary condition has to be fulfilled. Typically this is

$\displaystyle p''_1\left(X_{(1)}\right) = p''_{n-1}\left(X_{(n)}\right) = 0.$

These restrictions, along with the conditions for minimizing $ S_{\lambda}({m})$ w.r.t. the coefficients of $ p_{i}$, define a system of linear equations which can be solved in only $ O(n)$ calculations.

To illustrate this, we present some details of the computational algorithm introduced by Reinsch; see Green & Silverman (1994). Observe that the residual sum of squares can be written as

$\displaystyle \rss=\sum_{i=1}^n \{Y_{(i)} - {m}(X_{(i)})\}^2 = ({\boldsymbol{Y}}- {\boldsymbol{m}})^\top({\boldsymbol{Y}}-{\boldsymbol{m}})$ (4.33)

where $ {\boldsymbol{Y}}=(Y_{(1)},\ldots,Y_{(n)})^\top$, with $ Y_{(i)}$ denoting the value of $ Y$ corresponding to $ X_{(i)}$, and

$\displaystyle {\boldsymbol{m}}=\left({m}(X_{(1)}),\ldots,{m}(X_{(n)})\right)^\top.$

If $ {m}(\bullet)$ were indeed a piecewise cubic polynomial on intervals $ [X_{(i)},X_{(i+1)}]$ then the penalty term could be expressed as a quadratic form in $ {\boldsymbol{m}}$

$\displaystyle \int \{{m}''(x)\}^2\, dx = {\boldsymbol{m}}^\top{\mathbf{K}} {\boldsymbol{m}}$ (4.34)

with a matrix $ {\mathbf{K}}$ that can be decomposed as

$\displaystyle {\mathbf{K}}= {\mathbf{Q}}\MR^{-1}{\mathbf{Q}}^\top.$

Here, $ {\mathbf{Q}}$ and $ \MR$ are band matrices and functions of $ h_i = X_{(i+1)} -X_{(i)}$. More precisely, $ {\mathbf{Q}}$ is an $ n \times (n-2)$ matrix with columns indexed by $ j=2,\ldots,n-1$ and elements

$\displaystyle q_{j-1,j}=\frac{1}{h_{j-1}},\quad q_{jj}=-\,\frac{1}{h_{j-1}}-\,\frac{1}{h_{j}},\quad q_{j+1,j}=\frac{1}{h_{j}},$

and $ q_{ij}=0$ for $ \vert i-j\vert>1$. $ \MR$ is a symmetric $ (n-2) \times (n-2)$ matrix with elements

$\displaystyle r_{jj}=\frac{1}{3}\left(h_{j-1}+h_j\right),\quad r_{j,j+1}=r_{j+1,j}=\frac{1}{6}\, h_j,$

and $ r_{ij}=0$ for $ \vert i-j\vert>1$. From (4.33) and (4.34) it follows that the smoothing spline is obtained by

$\displaystyle {\widehat{{\boldsymbol{m}}}_\lambda}=({\mathbf{I}}+\lambda {\mathbf{K}})^{-1} {\boldsymbol{Y}},$ (4.35)

with $ {\mathbf{I}}$ denoting the $ n$-dimensional identity matrix. Because of the band structure of $ {\mathbf{Q}}$ and $ \MR$, (4.35) can indeed be solved in $ O(n)$ steps using a Cholesky decomposition.
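The following NumPy sketch solves (4.35) directly at the (sorted) design points. It is purely illustrative: it builds $ {\mathbf{Q}}$ and $ \MR$ in the Green & Silverman (1994) convention used above and works with dense matrices, so it costs $ O(n^3)$ operations rather than the $ O(n)$ band-matrix implementation described in the text.

\begin{verbatim}
import numpy as np

def smoothing_spline(x, y, lam):
    """Cubic smoothing spline at the design points via (4.35):
    m_hat = (I + lam*K)^{-1} y,  with  K = Q R^{-1} Q^T  as in (4.34)."""
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    n = len(x)
    h = np.diff(x)                            # h_i = X_(i+1) - X_(i)

    Q = np.zeros((n, n - 2))                  # columns for j = 2,...,n-1
    R = np.zeros((n - 2, n - 2))
    for c in range(n - 2):                    # c = j - 2 (0-based column)
        Q[c, c]     = 1.0 / h[c]
        Q[c + 1, c] = -1.0 / h[c] - 1.0 / h[c + 1]
        Q[c + 2, c] = 1.0 / h[c + 1]
        R[c, c] = (h[c] + h[c + 1]) / 3.0
        if c + 1 < n - 2:
            R[c, c + 1] = R[c + 1, c] = h[c + 1] / 6.0

    K = Q @ np.linalg.solve(R, Q.T)           # K = Q R^{-1} Q^T
    return x, np.linalg.solve(np.eye(n) + lam * K, y)   # (4.35)
\end{verbatim}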

EXAMPLE 4.7  
In Figure 4.7 we illustrate the resulting cubic spline estimate for our running Engel curve example. $ \Box$

Figure: Spline regression, $ \lambda=0.005$, U.K. Family Expenditure Survey 1973
\includegraphics[width=0.03\defepswidth]{quantlet.ps}SPMspline
\includegraphics[width=1.2\defpicwidth]{SPMspline.ps}

From (4.35) we see that the spline smoother is a linear estimator in $ Y_i$, i.e. there exist weights $ W_{\lambda i}(x)$ such that

$\displaystyle \widehat{m}_{\lambda}(x)=\frac{1}{n}\sum_{i=1}^{n}W_{\lambda i}(x)Y_{i}.$

It can be shown that under certain conditions the spline smoother is asymptotically equivalent to a kernel smoother that employs the so-called spline kernel

$\displaystyle K_{S}(u)=\frac{1}{2}\exp\left(-\frac{\vert u\vert}{\sqrt{2}}\right)\sin\left( \frac{\vert u\vert}{\sqrt{2}}+\frac{\pi}{4}\right),$

with local bandwidth $ h(X_{i})=\lambda^{1/4}n^{-1/4}\ f(X_{i})^{-1/4}$ (Silverman, 1984).

4.2.4 Orthogonal Series

Under regularity conditions, functions can be represented as a series of basis functions (e.g. a Fourier series). Suppose that $ m(\bullet)$ can be represented by such a Fourier series. That is, suppose that

$\displaystyle m(x)=\sum_{j=0}^{\infty} \beta_{j}\varphi_{j}(x),$ (4.36)

where $ \left\{\varphi_{j}\right\}_{j=0}^{\infty}$ is a known basis of functions and $ \left\{\beta_{j}\right\}_{j=0}^{\infty}$ are the unknown Fourier coefficients. Our goal is to estimate the unknown Fourier coefficients. Note that we indeed have an infinite sum in (4.36) if there are infinitely many non-zero $ \beta_{j}$s.

Obviously, an infinite number of coefficients cannot be estimated from a finite number of observations. Hence, one has to choose the number of terms $ N$ (which, in practice, will be a function of the sample size $ n$) that are included in the Fourier series representation. Thus, in principle, series estimation proceeds in three steps:

(a)
select a basis of functions,
(b)
select $ N$, where $ N$ is an integer less than $ n$, and
(c)
estimate the $ N$ unknown coefficients by a suitable method.
$ N$ is the smoothing parameter of series estimation. The larger $ N$ is, the more terms are included in the Fourier series and the more the estimate tends toward interpolating the data. In contrast, small values of $ N$ will produce relatively smooth estimates.

Regarding the estimation of the coefficients $ \left\{\beta_{j}\right\}_{j=0}^{N}$ there are basically two methods. One method involves viewing the finite version of (4.36) (i.e. the sum up to $ j=N$) as a regression equation and estimating the coefficients by regressing the $ Y_{i}$ on $ \varphi_{0}(X_{i}),\ldots,\varphi_{N}(X_{i})$.
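A minimal NumPy sketch of this regression approach (illustrative only; the book's Figure 4.8 is computed with the quantlet SPMorthogon). The argument basis is assumed to be a function evaluating $ \varphi_{j}(x)$, for instance the Legendre polynomials of Example 4.9 below after mapping $ X$ linearly into $ [-1,1]$.

\begin{verbatim}
import numpy as np

def series_regression(X, Y, basis, N):
    """Regress Y on phi_0(X),...,phi_N(X) by ordinary least squares;
    returns the estimated coefficients and the fitted values."""
    X, Y = np.asarray(X), np.asarray(Y)
    Phi = np.column_stack([basis(j, X) for j in range(N + 1)])  # n x (N+1)
    beta_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return beta_hat, Phi @ beta_hat
\end{verbatim}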

EXAMPLE 4.8  
We have applied this method to the Engel curve again. As basis functions $ \varphi_{j}$ we used the Legendre polynomials (orthogonalized polynomials, see Example 4.9 below); the resulting fit is shown in Figure 4.8. $ \Box$

Figure: Orthogonal series regression using Legendre polynomials, $ N=9$, U.K. Family Expenditure Survey 1973
\includegraphics[width=0.03\defepswidth]{quantlet.ps}SPMorthogon
\includegraphics[width=1.2\defpicwidth]{SPMorthogon.ps}

An alternative approach requires the basis of functions $ \left\{\varphi_{j}\right\}_{j=0}^{\infty}$ to be orthonormal. The orthonormality requirement can be formalized as

$\displaystyle \int\varphi_{j}(x)\varphi_{k}(x)\,dx=\delta_{jk}=\left\{\begin{array}{ll} 0, & \textrm{if} \quad j \neq k, \\ 1, & \textrm{if} \quad j=k. \end{array}\right.$

The following two examples show such orthonormal bases.

EXAMPLE 4.9  
Consider the Legendre polynomials on $ [-1,1]$

$\displaystyle p_0(x) = \frac{1}{\sqrt{2}},\quad p_1(x) = \frac{\sqrt{3}\,x}{\sqrt{2}},\quad p_2(x) = \frac{\sqrt{5}}{2\sqrt{2}}\,(3x^2-1),\ \ldots$

Higher order Legendre polynomials can be computed via the recurrence relation $ (m+1)\,P_{m+1}(x)=(2m+1)\,x\,P_m(x) - m\,P_{m-1}(x)$, which holds for the classical (unnormalized) Legendre polynomials $ P_m$; rescaling $ P_m$ by $ \sqrt{(2m+1)/2}$ yields the orthonormal versions $ p_m$ listed above.  $ \Box$
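In code, the recurrence of Example 4.9 can be implemented as follows (an illustrative sketch; the normalization by $ \sqrt{(2m+1)/2}$ reproduces $ p_0, p_1, p_2$ above). The resulting function can serve as the basis argument of the series_regression sketch above, once $ X$ has been rescaled to $ [-1,1]$.

\begin{verbatim}
import numpy as np

def legendre_orthonormal(m, x):
    """Orthonormal Legendre polynomial p_m on [-1, 1]: compute the
    classical P_m via (j+1) P_{j+1} = (2j+1) x P_j - j P_{j-1},
    then rescale so that the L2 norm on [-1, 1] equals one."""
    x = np.asarray(x, dtype=float)
    P_prev, P = np.ones_like(x), x            # P_0 and P_1
    if m == 0:
        P = P_prev
    else:
        for j in range(1, m):
            P_prev, P = P, ((2 * j + 1) * x * P - j * P_prev) / (j + 1)
    return np.sqrt((2 * m + 1) / 2.0) * P
\end{verbatim}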

EXAMPLE 4.10  
Consider the wavelet basis $ \{\psi_{jk}\}$ on $ \mathbb{R}$ generated from a mother wavelet $ \psi(\bullet)$. A wavelet $ \psi_{jk}$ can be computed by

$\displaystyle \psi_{jk}(x)=2^{j/2}\psi(2^jx-k),$

where $ 2^j$ is a scale factor and $ k$ a location shift. A simple example of a mother wavelet is the Haar wavelet

\begin{displaymath}
\psi(x)=\left\{
\begin{array}{rl}
-1 \quad & x\in[0,1/2]\\
1 \quad & x\in(1/2,1]\\
0 \quad & \textrm{otherwise},
\end{array}\right.
\end{displaymath}

which is a simple step function. $ \Box$
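For illustration, the Haar mother wavelet and the dilation/translation scheme $ \psi_{jk}(x)=2^{j/2}\psi(2^jx-k)$ can be coded directly (a sketch only; the wavelet fit in Figure 4.9 is computed with the fast wavelet transform, not by evaluating $ \psi_{jk}$ pointwise):

\begin{verbatim}
import numpy as np

def haar_mother(x):
    """Haar mother wavelet: -1 on [0, 1/2], 1 on (1/2, 1], 0 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= 0) & (x <= 0.5), -1.0,
                    np.where((x > 0.5) & (x <= 1.0), 1.0, 0.0))

def psi_jk(x, j, k):
    """Wavelet psi_{jk}(x) = 2^{j/2} psi(2^j x - k)."""
    return 2.0 ** (j / 2.0) * haar_mother(2.0 ** j * np.asarray(x) - k)
\end{verbatim}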

The coefficients $ \beta_{j}$ can be calculated from

$\displaystyle \beta_{j} = \sum_{k=0}^{\infty}\beta_{k}\delta_{jk} = \sum_{k=0}^{\infty}\beta_{k}\int \varphi_{k}(x)\varphi_{j}(x)\,dx = \int m(x)\varphi_{j}(x)\,dx.$ (4.37)

If we find a way to estimate the unknown function $ m(x)$ in (4.37) then we will end up with an estimate of $ \beta_{j}$. For Fourier series expansions and wavelets the $ \beta_{j}$ can be approximated using the fast Fourier and fast wavelet transform (FFT and FWT), respectively. See Härdle, Kerkyacharian, Picard & Tsybakov (1998) for more details.

EXAMPLE 4.11  
Wavelets are particularly suited to fit regression functions that feature varying frequencies and jumps. Figure 4.9 shows the wavelet fit (Haar basis) for simulated data from a regression curve that combines a sine part with varying frequency and a constant part. To apply the fast wavelet transform, $ n=2^8=256$ data points were generated. $ \Box$

Figure: Wavelet regression and original curve for simulated data, $ n=256$
\includegraphics[width=0.03\defepswidth]{quantlet.ps}SPMwavereg
\includegraphics[width=1.2\defpicwidth]{SPMwavereg.ps}