3.6 Spline smoothing

A common measure of ``fidelity to the data'' for a curve $g$ is the residual sum of squares

\begin{displaymath}\sum_{i=1}^n (Y_i-g(X_i))^2.\end{displaymath}

If $g$ is allowed to be any curve - unrestricted in functional form - then this distance measure can be reduced to zero by any $g$ that interpolates the data. Such a curve would not be acceptable on the grounds that it is not unique and that it is too wiggly for a structure-oriented interpretation. The spline smoothing approach avoids this implausible interpolation of the data by quantifying the competition between the aim to produce a good fit to the data and the aim to produce a curve without too much rapid local variation.

There are several ways to quantify local variation. One could define measures of roughness based, for instance, on the first, second or higher derivatives. To explicate the main ideas, the integrated squared second derivative is most convenient; that is, the roughness penalty

\begin{displaymath}\int (g''(x))^2\ dx \end{displaymath}

is used here to quantify local variation. Using this measure, define the weighted sum
\begin{displaymath}
S_{\lambda }(g)=\sum_{i=1}^n(Y_i-g(X_i))^2+\lambda \int (g''(x))^2\ dx,
\end{displaymath} (3.6.33)

where $\lambda $ denotes a smoothing parameter. The smoothing parameter $\lambda $ represents the rate of exchange between residual error and roughness of the curve $g$. The problem of minimizing $S_{\lambda } (\cdot)$ over the class of all twice differentiable functions on the interval $[a,b]=[X_{(1)},X_{(n)}]$ has a unique solution $\hat m_{\lambda }(x)$, which is a cubic spline; see Schoenberg (1964), Reinsch (1967), Good and Gaskins (1971) and Boneva et al. (1970). The basic idea dates back at least to Whittaker (1923), who called this smoothing process a graduation or adjustment of the observations. The estimated curve $\hat m_{\lambda }(\cdot)$ has the following properties:

\begin{itemize}
\item $\hat m_{\lambda }(x)$ is a cubic polynomial between two successive $X$-values;
\item at the observation points $X_i$, the curve $\hat m_{\lambda }(\cdot)$ and its first two derivatives are continuous, but there may be a discontinuity in the third derivative;
\item at the boundary points $X_{(1)}$ and $X_{(n)}$ the second derivative of $\hat m_{\lambda }(x)$ is zero.
\end{itemize}

It should be noted that these properties follow from the particular choice of the roughness penalty. It is possible to define, say, quintic splines by considering a roughness penalty that involves higher-order derivatives.
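
For readers who want to experiment with the criterion (3.6.33), the following small sketch fits smoothing splines for several values of $\lambda $ to synthetic data. It assumes that SciPy (version 1.10 or later) with the routine \texttt{make\_smoothing\_spline} is available, which computes a natural cubic smoothing spline of the type described above; the data and the values of $\lambda $ are illustrative only.

\begin{verbatim}
import numpy as np
from scipy.integrate import trapezoid
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(0)
n = 100
x = np.sort(rng.uniform(0.0, 1.0, n))                # design points
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)  # noisy responses Y_i

for lam in (1e-6, 1e-3, 1e0):      # small lambda: rough fit; large: smooth fit
    g = make_smoothing_spline(x, y, lam=lam)
    rss = np.sum((y - g(x)) ** 2)                    # fidelity term of (3.6.33)
    pen = trapezoid(g.derivative(2)(x) ** 2, x)      # roughness penalty term
    print(f"lambda={lam:g}  RSS={rss:6.2f}  int (g'')^2 dx ~ {pen:10.2f}")
\end{verbatim}

As $\lambda $ grows, the residual sum of squares increases while the roughness term shrinks, which is exactly the trade-off quantified by $S_{\lambda }$.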

An example of a spline smooth is given in Figure 3.9, which shows a spline smooth of the so-called motorcycle data set; see Table 1 in Appendix 2 for a complete listing of this data set.

Figure 3.9: A spline smooth of the Motorcycle data set. From Silverman (1985) with permission of the Royal Statistical Society. ANRmotspline.xpl
\includegraphics[scale=0.7]{ANRmotspline.ps}

Recall that the spline is a cubic polynomial between the knot points. The ``local cubic polynomial'' property is illustrated in Figure 3.10, where at three selected points the (local) cubic polynomial fit is superimposed on the spline smooth computed from the motorcycle data set. The curve shown is computed at the points $\{X_i\}^n_{i=1}$ themselves using the IMSL routine ICSSCU. The motorcycle data set is listed in Table 1 in Appendix 2.

Figure 3.10: Spline smooth with cubic polynomial fit. The motorcycle data (Table 1 in Appendix 2) with a spline smooth $ (\scriptstyle \lambda =2)$ and three (local) cubic polynomial approximations. The raw data $\scriptstyle \{ (X_i,Y_i) \}^n_{i=1}$, $\scriptstyle n=150$, are shown as squares; the smooth as a solid line; local fits at $\scriptstyle x=21,40,55$. Units are $\scriptstyle g$ (earth acceleration) for $\scriptstyle Y$ and ms (milliseconds after impact in a simulated experiment) for $\scriptstyle X$. ANRmotcubic.xpl
\includegraphics[scale=0.7]{ANRmotcubic.ps}

A conceptual difficulty in spline smoothing is the fact that $\hat m_{\lambda }$ is defined implicitly as the solution to a functional minimization problem. This makes it hard to judge the behavior of the estimate and to see what $\hat m_{\lambda }$ is actually doing to the data values. The following argument shows that $\hat m_{\lambda }$ is, in fact, a weighted average of the $Y$-observations.

Since $\hat m_{\lambda}$ is the unique minimizer of $S_{\lambda}$, we must have $S_{\lambda}(\hat m_{\lambda}+ \alpha g) \geq S_{\lambda}(\hat m_{\lambda})$ for any $g \in C^2$ and $\alpha \in R$. This means that the real function $T(\alpha)=S_{\lambda}(\hat m_{\lambda}+ \alpha g)$ has a minimum at $\alpha=0$. In particular, the Euler--Lagrange condition (Hadley and Kemp 1971, pp. 30--31),

\begin{displaymath}T'(0)=-2\sum_{i=1}^n (Y_i-\hat m_{\lambda } (X_i))\ g(X_i) +
2 \lambda \int \hat m_{\lambda }^{\prime \prime } (x)\
g^{\prime \prime}(x)\ dx =0,\end{displaymath}

(for all twice differentiable $g$) must be satisfied. Consider now two splines $\hat m_{\lambda }^{(1)}$ and $\hat m_{\lambda }^{(2)} $ for the data sets $\{ (X_i,Y_i^{(1)}) \}^n_{i=1}$ and $\{ (X_i, Y_i^{(2)} ) \}^n_{i=1} $, respectively. Using the above Euler--Lagrange condition one sees that $\hat m_{\lambda }^{(1+2)}=\hat m_{\lambda }^{(1)} +
\hat m_{\lambda }^{(2)} $ is the spline for the data set $\{ (X_i,Y_i^{(1)} + Y_i^{(2)} )\}^n_{i=1}$. If the data vector $\{ Y_i\}^n_{i=1} $ is written as a linear combination of the $n$ coordinate vectors, it is easily seen that there exist weights $W_{\lambda i} (x)$ such that
\begin{displaymath}
\hat m_{\lambda }(x)=n^{-1} \sum_{i=1}^n W_{\lambda i}(x)Y_i.
\end{displaymath} (3.6.34)

The spline is thus linear in the $Y$-observations. The weight function can be plotted by applying the spline smoothing method to $Y$-values which are all zero, except one response variable which is set equal to the sample size $n$. However, the functional form of $\{ W_{\lambda i}(x)\}^n_{i=1} $ is hard to write down explicitly, and its dependence on the smoothing parameter $\lambda $ and on the design points is extremely complicated. Recall that $W_{\lambda i}(\cdot)$ is the spline for the data $(X_1,0),\ldots ,(X_i,n),\ldots ,(X_n,0)$. More generally, let $W_{\lambda}(\cdot ,t)$ denote this effective weight function for an arbitrary point $t$. Silverman (1984, theorem A) showed that the effective weight function $W_{\lambda}(\cdot,t)$ looks like a kernel $K_s$, where the kernel function $K_s$ is given by
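
The construction just described is easy to carry out numerically. The sketch below, again assuming SciPy's \texttt{make\_smoothing\_spline}, smooths data that are zero everywhere except for one response set equal to $n$, which yields one effective weight function $W_{\lambda i}(\cdot)$ of (3.6.34); the design, the value of $\lambda $ and the index $i$ are arbitrary choices for illustration.

\begin{verbatim}
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(1)
n, lam = 100, 1e-3
x = np.sort(rng.uniform(0.0, 1.0, n))
grid = np.linspace(x[0], x[-1], 400)

i = n // 2                     # an interior design point X_i
e = np.zeros(n)
e[i] = n                       # Y-values: all zero except Y_i = n
W_i = make_smoothing_spline(x, e, lam=lam)(grid)   # W_{lambda i}(.) on a grid

# Since the smoother is linear in Y (3.6.34), W_i shows the kernel-like
# shape of the effective weights, including small negative side lobes.
print("weight at X_i:", W_i[np.argmin(np.abs(grid - x[i]))])
\end{verbatim}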

\begin{displaymath}K_s(u) =1/2\ \exp (-\left\vert u\right\vert /\sqrt 2)
\sin (\left\vert u\right\vert /\sqrt 2 +\pi /4).\end{displaymath}

Theorem 3.6.1   (Silverman; 1984) Assume the fixed design model with a design density $f$ as in Section 2.1. Suppose that, apart from some technical conditions, the smoothing parameter $\lambda $ depends on $n$ in such a way that $\lambda n^{1- \varepsilon} \to \infty$ for some $\varepsilon > 0$. Choose any fixed $t$ such that $a < t < b$. Then

\begin{displaymath}\lambda^{1/4}n^{-1/4}f(t)^{-1/4}
W_{\lambda }(t+ \lambda^{1/4}n^{-1/4}f(t)^{-1/4}x,t) \to K_s(x)/f(t)\end{displaymath}

as $n \to \infty,$ uniformly over all $x$ for which $t+ \lambda^{1/4}
n^{-1/4}f(t)^{-1/4}x$ lies in $[a,b].$

The approximation of $W_{\lambda}(\cdot ,X_i)=W_{\lambda i}(\cdot)$ in the above theorem says that for large $n$, small $\lambda $ and $X_i$ not too close to the boundary,

\begin{displaymath}
W_{\lambda i}(\cdot) \approx f(X_i)^{-1}h(X_i)^{-1}K_s\left({{\cdot-X_i} \over
{h( X_i)}}\right)
\end{displaymath} (3.6.35)

and the local bandwidth $h(X_i) $ satisfies

\begin{displaymath}h(X_i) =\lambda ^{1/4}n^{-1/4}f(X_i)^{-1/4}. \end{displaymath}

Indeed, setting $t=X_i$ and using the variable substitution $u=X_i + x h(X_i)$ in Theorem 3.6.1 yields formula (3.6.35). A plot of the effective kernel function $K_s$ is given in Figure 3.11.
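
As a small illustration of the bandwidth formula, the following lines evaluate $h(X_i)$ and the right-hand side of (3.6.35) for a uniform design (so that $f\equiv 1$); the particular values of $n$, $\lambda $ and $X_i$ are arbitrary, and the code is only a sketch of the approximation, not of the exact spline weights.

\begin{verbatim}
import numpy as np

def K_s(u):
    # asymptotic spline kernel of Silverman (1984)
    u = np.abs(u)
    return 0.5 * np.exp(-u / np.sqrt(2)) * np.sin(u / np.sqrt(2) + np.pi / 4)

n, lam, f = 100, 1e-3, 1.0     # sample size, smoothing parameter, f(X_i) = 1
h = lam ** 0.25 * n ** (-0.25) * f ** (-0.25)      # local bandwidth h(X_i)
print("local bandwidth h(X_i) =", h)

X_i = 0.5
grid = np.linspace(0.0, 1.0, 401)
w_approx = K_s((grid - X_i) / h) / (f * h)         # right-hand side of (3.6.35)
\end{verbatim}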

Figure 3.11: Silverman's approximation to the effective spline kernel. The asymptotic spline kernel function, $\scriptstyle K_s(u)=1/2 \exp(-\vert u\vert/\sqrt2 ) \sin(\left\vert u \right\vert / \sqrt2
+ \pi /4).$ ANRsplinekernel.xpl
\includegraphics[scale=0.7]{ANRsplinekernel.ps}

One sees that $K_s$ is a symmetric kernel function with negative side lobes and that $K_s$ has vanishing second moment, that is, $\int u^2 \ K_s(u)\ du=0$. A graphical comparison of the exact weight function $W_{\lambda i} {(x)}$ and its asymptotic form $K_s$ is given in Silverman (1984, figure 2, p. 902).
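
These moment properties can be checked numerically. The following sketch integrates $u^j K_s(u)$ for $j=0,\ldots ,4$ with SciPy's \texttt{quad}; the integration range is truncated to a large finite interval, which is harmless because of the exponential decay of $K_s$.

\begin{verbatim}
import numpy as np
from scipy.integrate import quad

def K_s(u):
    u = abs(u)
    return 0.5 * np.exp(-u / np.sqrt(2)) * np.sin(u / np.sqrt(2) + np.pi / 4)

for j in range(5):
    val, _ = quad(lambda u: u ** j * K_s(u), -100, 100, limit=200)
    print(f"integral of u^{j} K_s(u) du = {val: .5f}")
# j = 0 gives 1; j = 1, 2, 3 give (numerically) 0; j = 4 gives a negative value
\end{verbatim}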

The kernel form of this weight function changes as $x$ approaches the boundary of the observation interval. Engle et al. (1986) computed the effective spline weight function for the temperature response example (see Figure 1.6). Figure 3.12 is a reproduction from their article where they show the equivalent kernel function for an $x$ roughly in the middle of the observation interval.

Figure 3.12: Equivalent kernel function for the temperature range $40^\circ $ to $45^\circ $ Fahrenheit. From Engle et al. (1986, figure 6) with permission of the American Statistical Association.
\includegraphics[scale=0.2]{ANR3,12.ps}

As the observation point moves to the right the weight function becomes more asymmetric as Figure 3.13 indicates.

Figure 3.13: Equivalent kernel function for the temperature range $60^\circ $ to $65^\circ $ Fahrenheit. From Engle et al. (1986, figure 7) with permission of the American Statistical Association.
\includegraphics[scale=0.6]{ANR3,13.ps}

The question of how much to smooth must, of course, also be addressed for spline smoothing. A survey of the literature on the mean squared error properties of spline smoothing can be found in Eubank (1988). The smoothing parameter selection problem for this class of estimators has been investigated mainly by Grace Wahba. From her rich collection of results I would like to mention those that are directly connected with the optimization of $\lambda.$ Convergence rates of splines are considered in Wahba (1975) and Wahba (1979). The pioneering article on cross-validation in this context is Wahba (1975), which was later extended to the smoothing of the log periodogram (see Wahba; 1980). The term generalized cross-validation (GCV) was coined by Wahba (1977). A minimax-type approach to the question of rates of convergence was taken by Nussbaum (1985), who obtained exact bounds for the integrated squared error under a normality assumption on the error variables.
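
To indicate how GCV can be carried out for a linear smoother such as the spline, the following sketch evaluates the GCV score $n^{-1}\sum_{i=1}^n(Y_i-\hat m_{\lambda}(X_i))^2 \big/ (1-n^{-1}{\rm tr}\, S_{\lambda})^2$ over a grid of $\lambda $-values, building the hat matrix $S_{\lambda}$ column by column from unit vectors. It is a generic sketch (not Wahba's implementation) and again assumes SciPy's \texttt{make\_smoothing\_spline}; the data and the grid are illustrative.

\begin{verbatim}
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(2)
n = 80
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)

def gcv(lam):
    rss = np.sum((y - make_smoothing_spline(x, y, lam=lam)(x)) ** 2)
    # trace of the hat matrix S_lambda, obtained by smoothing unit vectors
    tr = sum(make_smoothing_spline(x, np.eye(n)[i], lam=lam)(x[i])
             for i in range(n))
    return (rss / n) / (1.0 - tr / n) ** 2

lams = np.logspace(-7, 0, 15)
print("GCV-optimal lambda ~", lams[int(np.argmin([gcv(l) for l in lams]))])
\end{verbatim}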

Some statistical software packages that compute the spline coefficients of the local cubic polynomials require a bound $\Lambda $ on the residual sum of squares $\displaystyle \sum^n_{i=1} (Y_i - g(X_i))^2.$ These programs solve the equivalent problem

\begin{displaymath}\int \vert g''(x) \vert^2 dx = \min\end{displaymath}

under the constraint $\displaystyle \sum^n_{i=1} (Y_i -g(X_i))^2 \le
\Lambda.$ The parameters $\lambda $ and $\Lambda $ have similar meanings. A very small $\lambda $ gives a very rough curve, as does a small $\Lambda $ since the spline curve is allowed to stay very close to the data. On the other hand, a large value of $\Lambda $ allows $g''$ to be nearly zero and the same holds true for the parameter $\lambda $. A connection between the two parameters can be derived as follows.

Proposition 3.6.1   Let $G(\Lambda) = \int (\hat m''_{\Lambda} (x))^2 dx$, where $\hat
m_{\Lambda}(x)$ solves the above minimization problem. Then the equivalent $\lambda $ is given by

\begin{displaymath}\lambda = -\left(G'(\Lambda)\right)^{-1}.\end{displaymath}

(Note that $G$ is decreasing in $\Lambda $, so $G'(\Lambda) < 0$ and the equivalent $\lambda $ is positive.)

This correspondence is derived explicitly for the example given in Section 3.11. The same methods can be used to derive an equivalent $\Lambda $ for a given smoothing parameter $\lambda.$
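
The correspondence between $\Lambda $ and $\lambda $ can also be explored numerically: since the residual sum of squares of the penalized fit increases monotonically in $\lambda $, the $\lambda $ equivalent to a prescribed bound $\Lambda $ can be found by a simple bisection search. The sketch below does this with SciPy's \texttt{make\_smoothing\_spline} on synthetic data; the value of $\Lambda $ and the search bracket are illustrative assumptions.

\begin{verbatim}
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(3)
n = 100
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)

def rss(lam):
    return np.sum((y - make_smoothing_spline(x, y, lam=lam)(x)) ** 2)

Lam = 10.0              # prescribed bound on the residual sum of squares
lo, hi = 1e-10, 1e4     # bracket for lambda; rss() is increasing in lambda
for _ in range(60):     # bisection on a log scale
    mid = np.sqrt(lo * hi)
    lo, hi = (mid, hi) if rss(mid) < Lam else (lo, mid)
lam_eq = np.sqrt(lo * hi)
print(f"equivalent lambda for Lambda={Lam}: {lam_eq:.4g}"
      f" (RSS={rss(lam_eq):.3f})")
\end{verbatim}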

The effective weights of spline smoothing can be computed more easily for the equispaced uniform design $X_i=i/n$. In this situation the $k$-$NN$ estimator $\hat m_k$ and the kernel estimator $\hat m_h$ coincide for $k$ roughly equal to $2nh$. Huber (1979) showed, under a periodicity assumption on $m$, that the spline smoother is exactly equivalent to a weighted kernel-type average of the $Y$-observations. He considered the following function of $Z=(Z_1,\ldots ,Z_n)$

\begin{displaymath}\tilde S_{\delta }(Z)
= \sum_{i=1}^n \{ (Y_i-Z_i)^2 + \delta \ ( n^2 \Delta^2 Z_i)^2 \}, \end{displaymath}

where $\Delta^2 Z_i=Z_{i+1}-2Z_i+Z_{i-1}$ denotes the second difference. Note that $\tilde S_{\delta }$ differs from $S_{\lambda }$ in two respects. First, the roughness penalty is defined by a sum of squared second difference quotients rather than an integral of the squared second derivative. Second, the smoothing parameters are scaled differently: since $n^{-1}\sum_{i=1}^n ( n^2 \Delta^2 Z_i)^2$ is analogous to $\int ^1_0 \vert g''(x)\vert^2 dx$, the parameter $\delta $ corresponds to $\lambda $ only up to a power of $n$. Huber showed that the minimization of $\tilde S_{\delta } (Z)$ for circular data $\{ Y_i\} $ yields the weighted average
\begin{displaymath}
\hat Z_i = \sum_{j} \tilde W_{\delta ,\, i-j}\, Y_j, \qquad
\tilde W_{\delta j} \approx \pi^{-1} \int^{\pi}_0 {{\cos(\omega j)} \over
{1 + \delta \, (2\sin (\omega /2))^4}}\, d\omega .
\end{displaymath} (3.6.36)

This asymptotic form holds for fixed $j$ and $\delta$ as $n \to \infty$. It is interesting to compare this form of the weights with the approximation given by Silverman (1984). Figure 3.14 gives the form of the weights $\tilde W_{\delta j}$ for different values of $\delta$. The weights $\tilde W_{\delta j}$ are more concentrated around zero if $\delta$ is small.

Figure 3.14: Huber's approximation to the effective spline weight function,
$\scriptstyle \tilde
W_{\delta j}= \pi^{-1} \int^{\pi}_0 {{\cos(\omega j)} \over
{1 + \delta (2\sin \omega /2)^4}}d\omega .$ Shown are the weights $\scriptstyle \tilde W_{\delta j}$ for $\scriptstyle
\delta =0.5, 8, 40.5, 128, 312.5.$
\includegraphics[scale=0.14]{ANR3,14.ps}

The curves in Figure 3.14 indeed look very similar to the effective kernel shown in Figure 3.11.
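
The weights shown in Figure 3.14 are easy to recompute. The following sketch evaluates the integral in the caption with SciPy's \texttt{quad} for the same values of $\delta $ used there.

\begin{verbatim}
import numpy as np
from scipy.integrate import quad

def huber_weight(j, delta):
    # pi^{-1} int_0^pi cos(omega j) / (1 + delta (2 sin(omega/2))^4) d omega
    num = lambda w: np.cos(w * j) / (1.0 + delta * (2.0 * np.sin(w / 2.0)) ** 4)
    return quad(num, 0.0, np.pi)[0] / np.pi

for delta in (0.5, 8.0, 40.5, 128.0, 312.5):  # the values used in Figure 3.14
    print(f"delta={delta:6.1f}:",
          np.round([huber_weight(j, delta) for j in range(8)], 3))
# the weights are concentrated near j = 0 for small delta and spread out
# for large delta
\end{verbatim}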

Exercises
3.6.1 Use the spline smoothing algorithm to plot the exact weight function $\{ W_{\lambda i} (x) \}.$ Compare with the approximate function given by $K_s$. How does the approximation change when $x$ moves to the boundary?

3.6.2 Show that the kernel $K_s$ is a kernel in the sense of Section 3.1 and prove that

\begin{eqnarray*}
\int u^j K_s(u) du &=& 0 , \ \ 1 \le j \le 3, \\
\int u^4 K_s(u) du &=& -24.
\end{eqnarray*}