3.11 A comparison of kernel, k-NN and spline smoothers

The purpose of this section is to present a comparative study of the three most commonly used and easy-to-implement smoothers: the kernel, the $k$-nearest neighbor and the cubic spline smoothers. The comparison is performed theoretically and empirically. The practical comparison is based on a simulated data set, which is presented in Table 2 in Appendix 2.

The theoretical comparison is presented for the two design models - fixed equidistant and random design. Kernel estimators $\hat m_h$ with weight functions $\{ W_{hi}^{(3)}\}$ and $\{ W_{hi} \} $ (see Section 3.1) and the $k$-$NN$ smoother $\hat m_R$ are compared. Although the spline estimator also makes sense for stochastic predictor variables, its statistical properties have been studied mostly for the fixed design case. We found in Section 3.6 that the fixed design spline acts almost like a kernel smoother with design dependent bandwidth. Recall the kernel-spline equivalence theorem, Theorem 3.6.1. It says that the spline estimator $\hat m_\lambda$ for regularly distributed $X_i = F^{-1}((i-1/2)/n)$ behaves like a Priestley-Chao kernel smoother with effective local bandwidth

\begin{displaymath}h(x) = \lambda^{1/4} f(x)^{-1/4}.\end{displaymath}

The effective local bandwidth is thus a power of the density of the $X$ variables, $f^{-\alpha}(x)$, here with $\alpha = 1/4$.

From Table 3.1 we saw that the $k$-$NN$ estimator $\hat m_R$ has mean squared error properties comparable to the kernel estimator if the bandwidth $R(x) \sim f(x)^{-1}.$ It therefore makes sense to consider kernel smoothers with bandwidths proportional to $f(x)^{- \alpha}, \ \alpha \in [0,1]. $

Bias and variance behaviors of kernel smoothers with weights $\{ W_{hi}^{(3)}\}$ and bandwidth $h \sim f^{-\alpha}$ have been studied by Jennen-Steinmetz and Gasser (1988). The bias and variance of the above three estimators for the fixed equidistant design $(d=1, X_i = i/n, \alpha = 0 )$ are listed in Table 3.2. For the correct interpretation of this table recall the definition of $c_K, d_K$ as given in Section 3.1. Table 3.2 shows clearly that the two kernel sequences $\{ W_{hi}^{(3)}\}$ and $\{ W_{hi} \} $ have the same mean squared error properties for the fixed equidistant design case. As noted before, the $k$-$NN$ weight sequence can be seen as a kernel sequence if we make the identification $h = { k \over 2n} $.


Table 3.2: Pointwise bias and variance of $k$-$NN$ and kernel smoothers for the fixed design case

\begin{tabular}{lcc}
 & bias & variance \\
kernel weights $\{ W_{hi} \} $ & $h^2 \frac{m''(x)}{2} d_K$ & $\frac{\sigma^2(x)}{nh}c_K$ \\
kernel weights $\{ W_{hi}^{(3)}\}$ & $h^2 \frac{m''(x)}{2} d_K$ & $\frac{\sigma^2(x)}{nh}c_K$ \\
$k$-$NN$ weights & $\left(\frac{k}{n}\right)^2\frac{m''(x)}{8}d_K$ & $\frac{2\sigma^2(x)}{k}c_K$ \\
\end{tabular}

Source: Jennen-Steinmetz and Gasser (1988, Table 1).
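
As a quick check of the identification $h = {k \over 2n}$, substituting $k = 2nh$ into the $k$-$NN$ row of Table 3.2 recovers exactly the two kernel rows:

\begin{displaymath}\Bigl(\frac{k}{n}\Bigr)^2\frac{m''(x)}{8}\,d_K=(2h)^2\frac{m''(x)}{8}\,d_K=h^2\frac{m''(x)}{2}\,d_K,\qquad \frac{2\sigma^2(x)}{k}\,c_K=\frac{2\sigma^2(x)}{2nh}\,c_K=\frac{\sigma^2(x)}{nh}\,c_K.\end{displaymath}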

The bias and variance for the random design case are drastically different, as Table 3.3 shows. Pointwise bias and variance are complicated functionals not only of the regression curve $m$ but also of the marginal density $f$.


Table 3.3: Pointwise bias and variance of $k$-$NN$ and kernel smoothers for the random design case

\begin{tabular}{lcc}
 & bias & variance \\
kernel weights $\{ W_{hi} \} $ & $h^2\frac{(m''f+2m'f')(x)}{2f(x)}d_K$ & $\frac{\sigma^2(x)}{nhf(x)}c_K$ \\
kernel weights $\{ W_{hi}^{(3)}\}$ & $h^2\frac{m''(x)}{2f(x)^{2\alpha}}d_K$ & $\frac{2\sigma^2(x)}{nhf(x)^{1-\alpha}}c_K$ \\
$k$-$NN$ weights & $\left(\frac{k}{n}\right)^2\frac{(m''f+2m'f')(x)}{8f^3(x)}d_K$ & $\frac{2\sigma^2(x)}{k}c_K$ \\
\end{tabular}

Source: Jennen-Steinmetz and Gasser (1988, Table 1).

Note that essentially all three estimators coincide for the fixed design case when choosing $k=2nh$, as indicated already in Section 3.4. This is not true for the random design. For $\alpha=0$ the variance of the kernel with weights $\{ W_{hi}^{(3)}\}$ is twice as big as for the Nadaraya-Watson estimator. By contrast, the bias of the Nadaraya-Watson smoother is a complicated expression of $m$ and $f$. The same is true when comparing the bias of the $k$-$NN$ smoother for $\alpha =1$.

The empirical study is based on the simulated data from Figure 3.20. It shows $n = 100$ data points simulated from the regression curve

\begin{displaymath}m(x)=1-x+e^{-200\ (x-1/2)^2}.\end{displaymath}

The $X$-variable is random, uniformly distributed over the unit interval, and the observation errors $\varepsilon_i$ have standard normal distributions. (Pseudo-random number generators based on Marsaglia's shuffling method for uniforms and the polar-Marsaglia method for normals were used for this plot; see Morgan 1984.)
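
A comparable data set can be regenerated with a few lines of code. The following minimal Python sketch uses numpy's default generators rather than the Marsaglia methods cited above; the variable names are illustrative.

\begin{verbatim}
import numpy as np

def m(x):
    # Regression curve from the text: m(x) = 1 - x + exp(-200 (x - 1/2)^2)
    return 1 - x + np.exp(-200 * (x - 0.5) ** 2)

rng = np.random.default_rng(0)     # any seed; the book's generators differ
n = 100
X = rng.uniform(0, 1, size=n)      # X_i ~ U(0, 1)
eps = rng.standard_normal(n)       # eps_i ~ N(0, 1)
Y = m(X) + eps                     # Y_i = m(X_i) + eps_i
\end{verbatim}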

Figure 3.20: A simulated data set. The raw data $\scriptstyle \{ (X_i,Y_i) \}^n_{i=1}$, $\scriptstyle n=100 $, were constructed from $\scriptstyle Y_i = m(X_i) + \varepsilon _i$, $\scriptstyle \varepsilon _i \sim N(0,1)$, $\scriptstyle X_i \sim U(0,1)$ and $\scriptstyle m(x) = 1-x+e^{-200\ (x-1/2)^2}$. ANRsimul.xpl
\includegraphics[scale=0.7]{ANRsimul.ps}

A list of the values $\{(X_i, Y_i) \}^n_{i=1}$ is given in Table 2 in Appendix 2. After looking at Figure 3.20 one might say that this example is somewhat extreme, since most of ``the signal seems to be buried in the noise." By contrast, I consider it realistic since it shows what can happen in practice. The sunflower plot of the food versus net income example in Chapter 2 gave an impression of such a ``realistic data set." Note that the pattern of observation errors in the region around $x \approx 0.25$ seems to be skewed toward positive errors. Any smoother must therefore have a tendency to lie slightly above the true regression curve in this region. This can be seen immediately from the plot of the kernel smoother (Figure 3.21).

Figure 3.21: A kernel smooth of the simulated data set. The solid line (label 1) denotes the underlying regression curve $\scriptstyle m(x) = 1-x+e^{-200\ (x-1/2)^2}$. The dashed line (label 2) is the kernel smooth $\scriptstyle \hat m_h(x)$, $\scriptstyle h= 0.05 $, with Gaussian kernel $\scriptstyle K(u) = \exp(-u^2/2)/(\sqrt{2 \pi}) $, computed from the simulated data given in Table 2 in Appendix 2. ANRsimkernel.xpl
\includegraphics[scale=0.7]{ANRsimkernel.ps}

The kernel weights

\begin{displaymath}W_{hi}(x)=K_h(x-X_i)/ \hat f_h(x)\end{displaymath}

of the Nadaraya-Watson estimator have been used here with a Gaussian kernel $K(u)=(2 \pi)^{-1/2} \exp(-u^2/2)$ and bandwidth $h=0.05$. Note that this bandwidth does not mean that the observations are averaged over an interval of length 0.1. Rather, the kernel weights based on the above $K$ extend over the whole interval but downweight observations that are far from the center of the kernel. In this example, $\hat f_h \approx 1$ since the $X$-distribution is uniform. Observations near $x$ receive weight 2.65, whereas an observation $2h$ away is given weight 0.35.
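
A minimal Python sketch of this Nadaraya-Watson smooth follows (the function name and evaluation grid are illustrative; this is not the XploRe routine used for the figure):

\begin{verbatim}
import numpy as np

def nadaraya_watson(x, X, Y, h=0.05):
    """Nadaraya-Watson smooth with Gaussian kernel K(u) = exp(-u^2/2)/sqrt(2 pi)."""
    x = np.atleast_1d(x)
    u = (x[:, None] - X[None, :]) / h              # scaled distances to all X_i
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return (K * Y).sum(axis=1) / K.sum(axis=1)     # the 1/h factors of K_h cancel

# Example (X, Y from the simulation sketch above):
# grid = np.linspace(0, 1, 201); mhat = nadaraya_watson(grid, X, Y, h=0.05)
\end{verbatim}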

Special attention must be paid to the boundary points 0 and 1 of the observation interval. Estimation points $x$ that are close to the boundary have only a one-sided neighborhood over which to average the $Y$-values. The kernel smoother must therefore be less accurate at the boundary (see Section 4.4 for a mathematical formulation of this problem). This inaccuracy can be seen from Figure 3.21 near the left boundary: most of the observations there lie below the true regression curve, so the asymmetric average considerably underestimates the true $m(x)$ near $x \approx 0$.

The bandwidth $h=0.05$ was chosen completely subjectively. I asked several colleagues and they felt that a higher amount of smoothing would ``wash out" too much structure and a smaller bandwidth would give too rough a curve. Had another kernel been used, for instance a kernel with compact support, the picture would have been different for this specific bandwidth of $h=0.05$. The reason is that different kernels are in general differently scaled. A way of adjusting the bandwidths and kernels to the same scale is discussed in Section 5.4.

The $k$-$NN$ estimate produced a slightly rougher curve. In Figure 3.22 the graph of the $k$-$NN$ smoother, as defined in (3.4.19), is displayed for $k=11.$ The reason for the ``wiggliness" is the so-called uniform weight sequence that is used in (3.4.19). Theoretically speaking, the $k$-$NN$ smooth is a discontinuous function.

Figure 3.22: A $\scriptstyle k$- $\scriptstyle NN$ smooth of the simulated data set (Table 2 in Appendix 2). The solid line (label 1) denotes the underlying regression curve $\scriptstyle m(x) = 1-x+e^{-200\ (x-1/2)^2}$. The dashed line (label 2) is the $\scriptstyle k$- $\scriptstyle NN$ smooth $\scriptstyle \hat m_k(x)$, $\scriptstyle k= 11$, computed from the simulated data given in Table 2 in Appendix 2. ANRsimknn.xpl
\includegraphics[scale=0.7]{ANRsimknn.ps}
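
The uniform-weight $k$-$NN$ smooth can be sketched in a few lines of Python ($k=11$ as in Figure 3.22; the function name is illustrative, and the per-point argsort is only meant for small samples):

\begin{verbatim}
import numpy as np

def knn_smooth(x, X, Y, k=11):
    """k-NN smooth with uniform weights: average Y over the k nearest X-neighbours."""
    x = np.atleast_1d(x)
    out = np.empty(len(x), dtype=float)
    for j, xj in enumerate(x):
        nearest = np.argsort(np.abs(X - xj))[:k]   # indices of the k nearest neighbours
        out[j] = Y[nearest].mean()                 # uniform (1/k) weights
    return out

# Example (X, Y as before): mk = knn_smooth(np.linspace(0, 1, 201), X, Y, k=11)
\end{verbatim}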

In practice, this means that as the window of weights moves over the observation interval, new observations enter at the boundary of the ``uniform window" according to the nearest neighbor rule. Any entering observation that has a value different from the current average will cause an abrupt change of the $k$-$NN$ smoother. Such an effect would be diminished if a smoother $k$-$NN$ weight sequence, for instance a kernel-based one as discussed in Section 3.4, were used.

Here also boundary effects need special discussion. As the estimation point $x$ moves to the boundary, the interval of observations entering the determination of the smoother becomes asymmetric. This was also the case for the kernel smoothers. Note, however, that in contrast to kernel smoothing, the asymmetric neighborhood of $x$ always contains the same number of points, namely $k$. The averaging procedure near the boundary thus involves many more points which, in general, have different mean values. This is not so drastic in our example since the regression curve is relatively flat at the boundaries. In cases where the regression function is steeper at the boundary the $k$-$NN$ smoother is expected to be more biased than the kernel smoother.

The smoothing parameter $k$ was also chosen subjectively here. A bigger $k$ seemed to me to yield too smooth a curve compared to the raw observations. A smaller $k$ produced too rough a curve, and amplified local spiky structures such as the one at $X \approx
0.85$.

The spline smoother $\hat m_{\lambda }(x)$ is shown in Figure 3.23. The algorithm of Reinsch (1967) was used to generate the smooth $\hat m_{\lambda }$. The smoothing parameter was chosen to be $\Lambda = 75$; that is, $\hat
m_{ \Lambda}$ is the solution to the minimization problem

\begin{displaymath}\int [g''(x)]^2 dx = \min!\end{displaymath}

under the constraint

\begin{displaymath}\sum^n_{i=1} (Y_i -g(X_i))^2 \le \Lambda.\end{displaymath}

The relation of $\Lambda $ to the smoothing parameter $\lambda $ was presented in Section 3.6.
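
As a computational stand-in, scipy provides a cubic smoothing spline in the penalized rather than the constrained (Reinsch) form, minimizing $\sum_i (Y_i - g(X_i))^2 + \lambda \int [g''(t)]^2 dt$. A minimal sketch, assuming this unnormalized residual-sum-of-squares convention and using the value $\lambda \approx 0.61 \times 10^{-5}$ that Table 3.5 below lists as equivalent to $\Lambda = 75$:

\begin{verbatim}
import numpy as np
from scipy.interpolate import make_smoothing_spline

def spline_smooth(X, Y, lam=0.61e-5):
    """Penalized cubic smoothing spline:
    minimize sum (Y_i - g(X_i))^2 + lam * int g''(t)^2 dt."""
    order = np.argsort(X)             # abscissas must be in increasing order
    return make_smoothing_spline(X[order], Y[order], lam=lam)

# Example (X, Y as before): spl = spline_smooth(X, Y); fit = spl(np.linspace(0, 1, 201))
\end{verbatim}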

In Table 3.4 the relation between $\Lambda $ and $G(\Lambda) = \int
(\hat m_{\Lambda}''(x))^2 dx$ is shown. As $\Lambda $ increases the spline curve becomes flatter and flatter. In the limit it is equivalent to fitting a least squares line.

Figure 3.23: A spline smooth of the simulated data set (Table 2 in Appendix 2). The solid line (label 1) denotes the underlying regression curve $\scriptstyle m(x) = 1-x+e^{-200\ (x-1/2)^2}$. The dashed line (label 2) is the spline smooth $\scriptstyle \hat m_\Lambda(x)$, $\scriptstyle \Lambda= 75$, computed with the IMSL routine ICSSCU. ANRsimspline.xpl
\includegraphics[scale=0.7]{ANRsimspline.ps}

Table 3.5 presents the equivalent parameter $\lambda $ for various values of $\Lambda.$ The $\lambda $ equivalent to $\Lambda = 75$ is therefore $\lambda \approx 0.61 \times 10^{-5}$.


Table 3.4: The function $G(\Lambda) = \int (\hat m_{\Lambda}''(x))^2 dx$ as a function of $\Lambda $.

\begin{tabular}{cc}
$\Lambda $ & $G(\Lambda) = \int (\hat m_{\Lambda}''(x))^2 dx$ \\
$45$ & $0.18\times10^9$ \\
$50$ & $0.72\times10^8$ \\
$55$ & $0.29\times10^8$ \\
$60$ & $0.11\times10^8$ \\
$65$ & $0.36\times10^7$ \\
$70$ & $0.10\times10^7$ \\
$75$ & $0.21\times10^6$ \\
$80$ & $0.24\times10^5$ \\
$85$ & $0.35\times10^4$ \\
$90$ & $0.38\times10^3$ \\
$95$ & $0.23\times10^2$ \\
$100$ & $0.23\times10^1$ \\
$105$ & $0.68\times10^{-26}$ \\
\end{tabular}


Table 3.5: The parameter $\lambda $ as a function of $\Lambda $.

\begin{tabular}{cc}
$\Lambda $ & $\lambda = -1/G'(\Lambda)$ \\
$50$ & $0.47 \times 10^{-7}$ \\
$55$ & $0.12 \times 10^{-6}$ \\
$60$ & $0.28 \times 10^{-6}$ \\
$65$ & $0.70 \times 10^{-6}$ \\
$70$ & $0.20 \times 10^{-5}$ \\
$75$ & $0.61 \times 10^{-5}$ \\
$80$ & $0.27 \times 10^{-4}$ \\
$85$ & $0.24 \times 10^{-3}$ \\
$90$ & $0.16 \times 10^{-2}$ \\
$95$ & $0.14 \times 10^{-1}$ \\
$100$ & $0.24$ \\
$105$ & $0.22 \times 10^1 $ \\
\end{tabular}
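
One way to motivate the second column of Table 3.5 (a sketch of the argument; the relation itself was presented in Section 3.6): introducing a Lagrange multiplier $\mu$ for the residual constraint, the constrained problem above has the same solution as the penalized problem $\sum^n_{i=1} (Y_i -g(X_i))^2 + \lambda \int [g''(x)]^2 dx = \min!$ with $\lambda = 1/\mu$, and the envelope theorem gives $G'(\Lambda) = -\mu$. Hence

\begin{displaymath}\lambda = \frac{1}{\mu} = - \frac{1}{G'(\Lambda)},\end{displaymath}

which is the quantity tabulated above.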

By construction, the spline fit looks very smooth since it is a function that is glued together from pieces of cubic polynomials. The overall shape of the spline function is the same as that of the kernel and $k$-$NN$ smooths. The peak in the middle of the observation interval is relatively well approximated (Figure 3.24), but to the left of it all smoothers exhibit a somewhat smaller bump that merely reflects the random data pattern.

Note that the spline smoother may produce a partially negative smooth even when all the response variables are positive. This can be understood from the asymptotic kernel representation (3.6.35) of $\hat m_{\lambda }$. Since the kernel $K_s$ has negative side lobes (Figure 3.10) it may happen that for a sparse data pattern the resulting spline smooth is negative although one averages over purely positive observations.

A comparison of the behavior of all three smoothers can be obtained from Figure 3.24, which shows the residual curves (``fit" minus ``true") for the three approximation methods. All three smoothers reproduce the artificial bump at $x \approx 0.2$ introduced by the data pattern, and otherwise show essentially the same behavior in their residual curves.

Figure 3.24: Residual plot of $\scriptstyle k$- $\scriptstyle NN$, kernel and spline smoother for the simulated data set (Table 2 in Appendix 2). The fine dashed line (label 1) denotes the $\scriptstyle k$- $\scriptstyle NN$ residual $\scriptstyle \hat m_k(x) - m(x)$, $\scriptstyle k=11$. The dashed line (label 2) is the spline residual $\scriptstyle \hat m_\Lambda(x) - m(x)$, $\scriptstyle \Lambda = 75$. The solid line (label 3) is the kernel residual $\scriptstyle \hat m_h(x) - m(x)$ with Gaussian kernel and bandwidth $\scriptstyle h=0.15$. ANRsimresidual.xpl
\includegraphics[scale=0.7]{ANRsimresidual.ps}

Exercises
3.11.1 Try the kernel, $k$-$NN$ and spline smoothing on the simulated data set with different smoothing parameters. Describe how you found a ``good" smoothing parameter.

3.11.2 Quantify the behavior of the above smoothers at the boundary of the observation interval. Suppose $\vert m'(x) \vert$ is relatively large at the boundary. What do you expect from a local averaging method in this situation?

3.11.1 Complements

In the simulated data set from Table 2 there are no repeated observations in the $X$-variable. For the case of repeated observations the spline smoothing algorithm of Reinsch (1967) needs some preparatory steps. If one observes multiple responses for a fixed $X=x$, one pools the corresponding $Y$-values into one observation by averaging them. Suppose that there are $N_i \ge 1$ observations at each $X_i$. Then the spline algorithm solves the weighted minimization problem

\begin{displaymath}\sum^n_{i=1} (Y_i - g(X_i))^2 / N_i+ \lambda \int [g''(x)]^2 dx.\end{displaymath}
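
A minimal Python sketch of this pooling step (illustrative preprocessing, not part of the Reinsch algorithm itself); it averages the responses at each repeated $X$ value and returns the multiplicities $N_i$ for use in the weighted problem:

\begin{verbatim}
import numpy as np

def pool_repeated(X, Y):
    """Average responses sharing the same X value.
    Returns the unique X values, the pooled Y means, and the counts N_i."""
    Xu, inverse, counts = np.unique(X, return_inverse=True, return_counts=True)
    Ybar = np.bincount(inverse, weights=Y) / counts    # group means
    return Xu, Ybar, counts
\end{verbatim}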