3.1 Kernel Smoothing

A conceptually simple approach to representing the weight sequence $\{W_{n i} (x) \}^n_{i=1}$ is to describe the shape of the weight function $W_{n i}(x)$ by a density function with a scale parameter that adjusts the size and the form of the weights near $x$. It is quite common to refer to this shape function as a kernel $K$. The kernel is a continuous, bounded and symmetric real function $K$ which integrates to one,

\begin{displaymath}
\int K(u)du=1.
\end{displaymath} (3.1.1)

The weight sequence for kernel smoothers (for one-dimensional $x$) is defined by

\begin{displaymath}
W_{ni} (x) = K_{h_n} (x-X_i)/\hat f_{h_n}(x),
\end{displaymath} (3.1.2)

where
\begin{displaymath}
\hat f_{h_n}(x) = n^{-1} \sum^n_{i=1} K_{h_n}(x-X_i)
\end{displaymath} (3.1.3)

and where

\begin{displaymath}K_{h_n}(u)=h_n^{-1} K(u/h_n)\end{displaymath}

is the kernel with scale factor $h_n$. Suppressing the dependence of $h=h_n$ on the sample size $n$, the kernel weight sequence (3.1.2) is conveniently abbreviated as $\{ W_{hi} (x) \}^n_{i=1}$. The function $\hat f_h(\cdot)$ is the Rosenblatt-Parzen kernel density estimator (Rosenblatt (1956); Parzen (1962)) of the (marginal) density of $X$. The form (3.1.2) of the kernel weights $W_{hi}(x)$ was proposed by Nadaraya (1964) and Watson (1964), and therefore

\begin{displaymath}\hat m_h(x) =
\frac{n^{-1} \sum^n_{i=1} K_h(x-X_i) Y_i}{n^{-1} \sum^n_{i=1} K_h(x-X_i)}\end{displaymath}

is often called the Nadaraya-Watson estimator. The shape of the kernel weights is determined by $K$, whereas the size of the weights is parameterized by $h$, which is called the bandwidth. The normalization of the weights $\hat f_h(x)$ makes it possible to adapt to the local intensity of the $X$-variables and, in addition, guarantees that the weights sum to one. A variety of kernel functions are possible in general, but both practical and theoretical considerations limit the choice. For instance, kernel functions that take on very small values can cause numerical underflow on a computer, so one might restrict attention to kernel functions that are zero outside some fixed interval. A commonly used kernel function, which enjoys some optimality properties to be discussed in Section 4.5, is of parabolic shape (Epanechnikov (1969); Bartlett (1963)):


\begin{displaymath}
K(u)=0.75 (1-u^2) I(\left\vert u \right\vert \le 1).
\end{displaymath} (3.1.4)

A plot of this so-called Epanechnikov kernel is given in Figure 3.1.

Figure 3.1: The Epanechnikov kernel. This kernel $ K(u)=0.75 (1-u^2) I(\left\vert u \right\vert \le 1) $ is of parabolic shape and has support [-1,1]. ANRepa.xpl
\includegraphics[scale=0.7]{ANRepa.ps}

Note that this kernel is not differentiable at $u=\pm 1$. The kernel smoother is not defined for a bandwidth with $\hat f_h(x) = 0 $. If such a ``0/0'' case occurs, one defines $\hat m_h(x)$ to be 0. Suppose that the kernel estimator is only evaluated at the observations $\{X_i\}^n_{i=1}$. Then, as $h \to 0,$

\begin{displaymath}\hat m_h(X_i) \to K(0)Y_i/K(0)=Y_i;\end{displaymath}

small bandwidths thus reproduce the data. Let us now investigate what happens as $h \to \infty$. Suppose that $K$ has support $[-1,1]$ as in Figure 3.1. Then $K({{x-X_i} \over {h}}) \to K(0)$ and thus

\begin{displaymath}{\hat m_h(x) \to n^{-1}\sum^n_{i=1} K(0)Y_i/n^{-1}\sum^n_{i=1} K(0)
= n^{-1} \sum^n_{i=1} Y_i;} \end{displaymath}

very large bandwidths thus result in an oversmooth curve, the average of the response variables.
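To make these definitions concrete, the following minimal Python sketch implements the Nadaraya-Watson estimator with the Epanechnikov kernel (3.1.4). The simulated data, the regression function $\sin(x)$ and all variable names are illustrative assumptions, not part of the text; the last two lines numerically reproduce the limiting behavior just discussed.

\begin{verbatim}
import numpy as np

def epanechnikov(u):
    # K(u) = 0.75 (1 - u^2) on [-1, 1], cf. (3.1.4)
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def nadaraya_watson(x, X, Y, h, kernel=epanechnikov):
    # m_hat_h(x) = (n^{-1} sum K_h(x - X_i) Y_i) / (n^{-1} sum K_h(x - X_i))
    w = kernel((x - X) / h) / h          # K_h(x - X_i)
    fhat = w.mean()                      # Rosenblatt-Parzen estimate, cf. (3.1.3)
    return 0.0 if fhat == 0.0 else (w * Y).mean() / fhat   # "0/0" convention

rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=200)
Y = np.sin(X) + 0.2 * rng.standard_normal(200)

print(nadaraya_watson(1.5, X, Y, h=0.3))            # smooth at x = 1.5
print(nadaraya_watson(X[0], X, Y, h=1e-6), Y[0])    # h -> 0 reproduces the data
print(nadaraya_watson(1.5, X, Y, h=1e6), Y.mean())  # huge h gives the mean response
\end{verbatim}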

How does this Epanechnikov kernel act on real data and what is the shape of the weights $\{ W_{hi}(x)\}_{i=1}^n $? To obtain some insight, consider the food versus net income data again (see Figures 2.1 and 2.2). The economist is interested in estimating the so-called statistical Engel curve, the average expenditure for food given a certain level of income. Kernel smoothing is a possible procedure for estimating this curve. The kernel weights $\{ W_{hi}(x)\} $ depend on the values of the $X$-observations through the density estimate $\hat f_h(x)$. In Figure 3.2 the effective weight function for estimating this Engel curve for food in 1973 is shown, centered at $x=1$ and $x=2.5$, for the bandwidths $h=0.1, 0.2, 0.3$. Note that the effective weight function depends only on the $X$-values.

Figure 3.2: The effective kernel weights for the food versus net income data set. $ K_h(x-\cdot)/\hat f_h(x)$ at $x=1$ and $x=2.5$ for $h=0.1$ (label 1), $h=0.2$ (label 2), $ h=0.3$ (label 3) with Epanechnikov kernel $ K(u)=0.75 (1-u^2) I(\left\vert u \right\vert \le 1) $ and density estimate as in Figure 1.5, year = 1973, $n=7125$. Survey (1968-1983). ANRpotakernel.xpl
\includegraphics[scale=0.7]{ANRpotakernel.ps}

One can learn two things from this picture. First, it is obvious that the smaller the bandwidth, the more concentrated the weights are around $x$. Second, in regions of sparse data, where the marginal density estimate $\hat f_h$ is small, the sequence $\{ W_{hi}(x)\} $ gives more weight to the observations around $x$. Indeed, around $x=1$ the density estimate $\hat f_h(x)$ reaches its maximum, whereas at $x=2.5$ the density is roughly a tenth of $\hat f_h(1)$. (See Figure 1.5; the density for the year 1973 is the fourth contour counting from the front.)

For multidimensional predictor variables $X_i=(X_{i1},\ldots ,X_{id})$ one can use a multidimensional product kernel function

\begin{displaymath}K(u_1,\ldots ,u_d)=\prod _{j=1}^d K(u_j).\end{displaymath}

The kernel weights for this case are then defined as

\begin{displaymath}W_{hi} (x) ={\prod _{j=1}^d K_h (x_j-X_{ij})\over\hat f_h(x)}, \end{displaymath}

where in the definition of the Rosenblatt-Parzen density estimator a product kernel is used as well.
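As an illustration of the product-kernel weights, here is a small sketch for a two-dimensional predictor; the common bandwidth $h$ for all coordinates, the data-generating process and the function names are assumptions made for the example.

\begin{verbatim}
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def nw_product_kernel(x, X, Y, h):
    # x: point of shape (d,); X: data of shape (n, d); Y: responses of shape (n,)
    d = X.shape[1]
    U = (x - X) / h
    w = np.prod(epanechnikov(U), axis=1) / h**d   # prod_j K_h(x_j - X_ij)
    fhat = w.mean()                               # product-kernel density estimate
    return 0.0 if fhat == 0.0 else (w * Y).mean() / fhat

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 2))
Y = X[:, 0] + X[:, 1]**2 + 0.1 * rng.standard_normal(500)
print(nw_product_kernel(np.array([0.5, 0.5]), X, Y, h=0.2))
\end{verbatim}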

There are applications in which the density $f(x)=F'(x)$ of the $X$-variables is known. The kernel weights that have been investigated for this sampling scheme are (Greblicki (1974); Johnston (1979); Johnston (1982); Greblicki and Krzyzak (1980); Georgiev (1984a); Georgiev (1984b))


\begin{displaymath}
W^{(1)}_{hi}(x)=K_h(x-X_i)/f(x).
\end{displaymath} (3.1.5)

Often the $X$-observations are taken at regular distances and form an equidistant grid of points of some interval. Examples are observations from longitudinal data or discretized analog processes; see Müller (1987). Without loss of generality we can assume that the $X$-observations have been taken in the unit interval $[0,1]$. In this case, one could use the modified kernel weights $\{ W_{hi}^{(1)}(x)\} $ with $f=I_{[0,1]}$, the density of the uniform distribution on $[0,1]$. In the fixed design model of nearly equispaced, nonrandom $\{X_i\}^n_{i=1}$ on $[0,1]$, Priestley and Chao (1972) and Benedetti (1977) considered the weight sequence

\begin{displaymath}
W^{(2)}_{hi}(x) = n (X_i-X_{i-1}) K_h(x-X_i), \quad (X_0=0).
\end{displaymath} (3.1.6)

An interpretation of this weight sequence in terms of (3.1.2) is possible by setting $\hat f(x) = {[n(X_i - X_{i-1})]}^{-1}$ for $x \in (X_{i-1},X_i]. $ Gasser and Müller (1979) defined a related weight sequence


\begin{displaymath}
W^{(3)}_{hi} (x)=n \int^{S_i}_{S_{i-1}} K_h(x-u)du,
\end{displaymath} (3.1.7)

where $X_{i-1} \le S_{i-1} \le X_i$ is chosen between the ordered $X$-data. The special case of $S_i=X_i$ has been investigated by Cheng and Lin (1981). A notion of an asymptotic equivalence of the weight sequences $\{ W_{hi}^{(2)}\} $ and $\{ W_{hi}^{(3)}\}$ is deferred to the Exercises. Note that $\{ W_{hi}^{(1)}\}$ and $\{ W_{hi}^{(2)}\} $ do not necessarily sum up to one, but $\{ W_{hi}^{(3)}\}$ does.

The weights $W^{(3)}_{hi}(x)$ are related to the so-called convolution smoothing as defined by Clark (1980); see Exercise 3.1.1. The weight sequences $\{ W_{hi}^{(2)}(x)\} $ and $\{ W_{hi}^{(3)}(x)\} $ have been used mostly in the fixed design model. Theoretical analysis of their stochastic behavior in the random design model indicates that they have a different variance than the Nadaraya-Watson kernel smoother; see Section 3.11.
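The fixed design weights (3.1.6) and (3.1.7) translate into code as follows. This is only a sketch: it assumes an equispaced design on $[0,1]$, the Epanechnikov kernel, and the choice of the $S_i$ as midpoints between consecutive design points (with $S_0=0$, $S_n=1$), none of which is prescribed by the text.

\begin{verbatim}
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def epanechnikov_cdf(u):
    # integral of the Epanechnikov kernel from -infinity to u
    v = np.clip(u, -1.0, 1.0)
    return 0.75 * (v - v**3 / 3.0) + 0.5

def priestley_chao(x, X, Y, h):
    # n^{-1} sum W^(2)_hi(x) Y_i = sum (X_i - X_{i-1}) K_h(x - X_i) Y_i, cf. (3.1.6)
    gaps = np.diff(np.concatenate(([0.0], X)))
    return np.sum(gaps * epanechnikov((x - X) / h) / h * Y)

def gasser_mueller(x, X, Y, h):
    # n^{-1} sum W^(3)_hi(x) Y_i with S_i chosen as midpoints, cf. (3.1.7)
    S = np.concatenate(([0.0], (X[:-1] + X[1:]) / 2.0, [1.0]))
    w = epanechnikov_cdf((x - S[:-1]) / h) - epanechnikov_cdf((x - S[1:]) / h)
    return np.sum(w * Y)

n = 200
X = np.arange(1, n + 1) / n
Y = np.sin(2 * np.pi * X) + 0.2 * np.random.default_rng(2).standard_normal(n)
print(priestley_chao(0.5, X, Y, h=0.1), gasser_mueller(0.5, X, Y, h=0.1))
\end{verbatim}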

The consistency of the kernel smoother $\hat m_h$ with the Nadaraya-Watson weights $W_{hi}(x)$ defined by (3.1.2) is shown in the following proposition. The proof of consistency for the other weight sequences is very similar and is deferred to the Exercises.

Proposition 3.1.1   Assume the stochastic design model with a one-dimensional predictor variable $X$ and

(A1) $\int \left\vert K(u) \right\vert du < \infty, $

(A2) $\lim_{\left\vert u \right\vert \to \infty} u K(u)=0, $

(A3) $ EY^2 < \infty, $

(A4) $ n\to \infty, \quad h_n \to 0, \quad nh_n \to \infty$.

Then, at every point of continuity of $m(x)$, $f(x)$ and $\sigma ^2(x)$, with $f(x) > 0,$

\begin{displaymath}n^{-1} \sum^n_{i=1} W_{hi} (x)Y_i \ {\buildrel p \over \to}\ m(x). \end{displaymath}

The proof of this proposition is in the Complements of this section. The above result states that the kernel smoother converges in probability to the true response curve $m(x)$. It is natural to ask how fast this convergence is going to happen. The mean squared error

\begin{displaymath}d_M(x, h) = E [\hat m_h (x) - m(x)]^2\end{displaymath}

at a point $x$ is one way of quantifying this convergence. The following theorem gives the speed of $d_M(x, h)$ as a function of $h$ and $n$. For simplicity it is stated for the fixed design model. The rate of convergence for the more complicated random design is the same; the constants are different, though, and are presented in Section 4.1.

Theorem 3.1.1   (Gasser et al. 1984a) Assume the fixed design model with a one-dimensional predictor variable $X$ and define

\begin{displaymath}c_K = \int K^2(u) du\end{displaymath}


\begin{displaymath}d_K = \int u^2 K(u) du.\end{displaymath}

Take the kernel weights $\{ W_{hi}^{(3)}\}$ and assume

(A0) $K$ has support $[-1,1]$ with $K(-1)=K(1)=0,$

(A1) $m \in {\cal C}^2, $

(A2) $\max_i \vert X_i - X_{i-1} \vert = O(n^{-1}), $

(A3) ${\rm var}(\varepsilon_i) = \sigma^2, \ i = 1, \ldots, n, $

(A4) $n\to \infty, \quad h \to 0, \quad nh \to \infty.$

Then

\begin{displaymath}d_M(x, h) \approx (nh)^{-1} \sigma^2 c_K + h^4 d_K^2 [m''(x)]^2 /4. \end{displaymath}

The mean squared error splits up into two parts, the variance and the bias$^2$. The above theorem says that the bias, as a function of $h$, is increasing, whereas the variance is decreasing. From this qualitative behavior one gets a feeling for what the smoothing problem is about:

Balance the variance versus the bias$^2$.
We will come back to this task in Chapter 4.
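The trade-off can be made visible numerically. The short sketch below evaluates the two terms of the approximation in Theorem 3.1.1 over a grid of bandwidths; the noise variance $\sigma^2$, the curvature $m''(x)$ and the sample size are purely illustrative assumptions, while $c_K=3/5$ and $d_K=1/5$ are the constants of the Epanechnikov kernel.

\begin{verbatim}
import numpy as np

c_K, d_K = 0.6, 0.2            # Epanechnikov kernel: int K^2 = 3/5, int u^2 K = 1/5
sigma2, m2, n = 0.04, 1.0, 200 # assumed noise variance, |m''(x)| and sample size

h = np.linspace(0.02, 1.0, 200)
variance = sigma2 * c_K / (n * h)       # decreasing in h
bias2 = h**4 * d_K**2 * m2**2 / 4.0     # increasing in h

print("h minimizing the approximate MSE:", h[np.argmin(variance + bias2)])
\end{verbatim}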

3.1.1 Kernel estimators are local polynomial fits

The kernel weights define a neighborhood of points around a grid point $x$. Let us investigate the question of fitting a polynomial in such a neighborhood.

The simplest polynomial to fit in such a neighborhood is a constant. There is a striking similarity between local polynomial fitting and kernel smoothing. For fixed $x$, the kernel estimator $\hat m_h(x)$ with positive weights $W_{hi}(x)$ is the solution to the following minimization problem

\begin{displaymath}
\min_{t} \sum^n_{i=1} K_h(x-X_i) (Y_i-t)^2 =
\sum^n_{i=1}K_h(x-X_i)(Y_i- \hat m_h(x))^2.
\end{displaymath} (3.1.8)

In this sense, the kernel smoother can be understood as a local constant polynomial fit: It minimizes, in a neighborhood around $x$ determined in shape and span by the sequence $K_h$, the sum of squared residuals. How are more complex polynomials related to kernel smoothing?

This question is investigated in the fixed design model. Consider equispaced $X_i=i/n$ and a local parabolic fit. Let us take a point $x$ that is not too close to the boundary of the observation interval. (The behavior of kernel smoothers at the boundary is discussed in Section 4.4.) Consider a uniform kernel $K^U(u)=1/2 \ I(\left\vert u\right\vert \le 1),$ which parameterizes the neighborhood around $x$. We then have to minimize

\begin{displaymath}n^{-1} \sum_i K^U_h (x-X_i) (Y_i-a-b(X_i-x)^2)^2 \end{displaymath}

with respect to $a$ and $b$. The linear term is not present here, since it is ``orthogonal'' to the symmetric, uniform kernel. The normal equations for this problem are
\begin{displaymath}
n^{-1} \sum_i K_h^U (x-X_i) (Y_i-a-b(X_i-x)^2)=0,
\end{displaymath} (3.1.9)


\begin{displaymath}
n^{-1} \sum_i K_h^U (x-X_i) (Y_i-a-b(X_i-x)^2) (X_i-x)^2=0.
\end{displaymath} (3.1.10)

Define $\hat Y$ as $n^{-1}\sum_i K_h^U (x-X_i)Y_i$ and approximate $n^{-1} \sum_i K_h^U (x-X_i)$ by one. For large $n$, the sum

\begin{displaymath}n^{-1} \sum_i K_h^U (x-X_i) (x-X_i)^2 \end{displaymath}

can be replaced by

\begin{displaymath}\int K_h^U (x-u) (x-u)^2\ du. \end{displaymath}

Integration by substitution shows that this is equal to $\int^h_{-h} (2h)^{-1} v^2\,dv = h^2/3$. Using similar arguments for $n^{-1} \displaystyle \sum_i K_h^U (x-X_i)(x-X_i)^4$ (which is approximately $h^4/5$) shows that the normal equations (3.1.9)-(3.1.10) can be rewritten as

\begin{eqnarray*}
\hat Y -a-(h^2/3)\ b & = & 0, \\
A -(h^2/3)\ a-(h^4/5)\ b & = & 0,
\end{eqnarray*}



where

\begin{displaymath}A=n^{-1} \sum_i K_h^U (x-X_i) (x-X_i)^2\ Y_i .\end{displaymath}

Elementary algebraic calculations show that $a$ satisfies the equation

\begin{displaymath}3h^2\ \hat Y -5A+(-3+5/3)\ h^2a=0, \end{displaymath}

which is solved by
\begin{displaymath}
\hat a=3/4\ \left( n^{-1}\ \sum_i K_h^U (x-X_i)\left(3-5\left({x-X_i\over
h}\right)^2\right) \ Y_i \right).
\end{displaymath} (3.1.11)

Similarly, an estimate $\hat b$ can be computed, which leads to the ``local parabola'' $\hat a + \hat b (x-u)^2$ in a small neighborhood around $x$. At the point $x$ itself the regression curve $\hat m$ is estimated by $\hat a.$ A closer look at (3.1.11) reveals that $\hat a$ can be written as

\begin{displaymath}\hat a= \hat m(x) = n^{-1}\ \sum_i K_h^* (x-X_i)\ Y_i, \end{displaymath}

where

\begin{displaymath}K_h^*(u) = 3/8\ (3-5u^2) I(\left\vert u \right\vert \le 1 ) \end{displaymath}

is a kernel with vanishing first, second and third moment. In this setting of equispaced predictor variables a kernel estimate with kernel $K^*$ is essentially a local parabolic fitting procedure. Figure 3.3 shows some of the ``local parabolas" together with a kernel estimate based on the kernel $K_h^*.$ The data set is a stretch from the acceleration versus time data as discussed later in Section 8.1.

Figure 3.3: Local parabolic fits. The kernel smooth for the acceleration versus time data (see Section 8.1). The kernel smooth is compared to the local parabolic fits at $x = 525$ and $x = 620$.
\includegraphics[scale=0.15]{ANR3,3.ps}
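A brief numerical check of this equivalence is easy to set up: on an equispaced design, compare the kernel estimate built from $K^*$ with the explicit least squares fit of a constant plus quadratic term over the window $\vert x-X_i\vert \le h$. The data below are simulated and purely illustrative; the agreement is only approximate, since the derivation above replaced sums by integrals.

\begin{verbatim}
import numpy as np

def kstar(u):
    # K*(u) = 3/8 (3 - 5 u^2) on [-1, 1]
    return 0.375 * (3.0 - 5.0 * u**2) * (np.abs(u) <= 1.0)

n = 400
X = np.arange(1, n + 1) / n
rng = np.random.default_rng(3)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(n)

x, h = 0.5, 0.1

# kernel estimate n^{-1} sum K*_h(x - X_i) Y_i
m_kernel = np.mean(kstar((x - X) / h) / h * Y)

# explicit local fit of a + b (X_i - x)^2 with the uniform kernel on |x - X_i| <= h
mask = np.abs(x - X) <= h
D = np.column_stack([np.ones(mask.sum()), (X[mask] - x) ** 2])
a_hat, b_hat = np.linalg.lstsq(D, Y[mask], rcond=None)[0]

print(m_kernel, a_hat)   # close for large n
\end{verbatim}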

The equivalence of local polynomial fitting and kernel smoothing has been studied in great detail by Müller (1987). Some numerical comparison has been done by Schmerling and Peil (1985, figure 1). They used a Gaussian kernel weight sequence with kernel

\begin{displaymath}K(u) = (2\pi )^{-1/2} \exp (-u^2/2) \end{displaymath}

and compared locally constant, linear and parabolic fits in an agricultural example.

3.1.2 Kernel estimators of derivatives

The technique of kernel estimation can also be used to estimate derivatives of the regression function. Kernel derivative estimators are defined by differentiating the weight function sequence with respect to $x$. If the weights are sufficiently smooth and the bandwidth sequence is correctly tuned then these estimators will converge to the corresponding derivatives of $m$. This can be easily illustrated in the equidistant design setting with the kernel smoother, using the Priestley-Chao weights $\{ W_{hi}^{(2)}(x)\} $. Taking the $k$-th derivative with respect to $x$ gives

\begin{displaymath}
\hat m_h^{(k)}(x)=n^{-1}h^{-(k+1)}\ \sum_{i=1}^n K^{(k)}\
\left({x-X_i \over h}\right)\ Y_i.
\end{displaymath} (3.1.12)

The kernel estimate of the $k$-th derivative of $m$ is thus a local average of the response variables in which the $k$-th derivatives of the kernel weights have been used as weights.
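A sketch of (3.1.12) for $k=1$ on simulated, equispaced data, using the quartic kernel $(15/16)(1-u^2)^2$ (an arbitrary choice that satisfies assumption (A0) of the proposition below):

\begin{verbatim}
import numpy as np

def quartic_deriv(u):
    # derivative of the quartic kernel K(u) = (15/16)(1 - u^2)^2 on [-1, 1]
    return -(15.0 / 4.0) * u * (1.0 - u**2) * (np.abs(u) <= 1.0)

def m_prime_hat(x, X, Y, h):
    # equation (3.1.12) with k = 1 for the equispaced design X_i = i/n
    n = len(X)
    return np.sum(quartic_deriv((x - X) / h) * Y) / (n * h**2)

n = 1000
X = np.arange(1, n + 1) / n
Y = np.sin(2 * np.pi * X) + 0.05 * np.random.default_rng(4).standard_normal(n)
print(m_prime_hat(0.5, X, Y, h=0.15), -2 * np.pi)   # estimate vs. true m'(0.5)
\end{verbatim}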

Proposition 3.1.2   Assume the fixed design model with a one-dimensional predictor variable $X$ and define

\begin{displaymath}c^{(k)}_K = \int {[K^{(k)}]}^2(u) du,\end{displaymath}


\begin{displaymath}d^{(k)}_K = \int u^{k+2} K^{(k)}(u) du.\end{displaymath}

Take the Priestley-Chao kernel weights $\{ W_{hi}^{(2)}(x)\} $ and assume

(A0) $ K \in {\cal C}^{(k)}$ has support $[-1,1]$ with $ K^{(j)}(-1)=K^{(j)}(1)=0,\ j=0,\ldots ,k-1$,

(A1) $ m^{(k)}(x) \in {\cal C}^2, $

(A2) $ X_i = i/n ,\ \ i = 1, \ldots, n $,

(A3) $ {\rm var}(\varepsilon_i) = \sigma^2, \ \ i = 1, \ldots, n $,

(A4) $ n\to \infty, \quad h_n \to 0, \quad nh_n^{k+1} \to \infty$.

Then

\begin{displaymath}d_M(x, h) \approx (nh^{2k+1})^{-1} \sigma^2 c^{(k)}_K
+ h^4 {d^{(k)}_K}^2 {[m^{(k+2)}(x)]}^2/(k+2)!^2. \end{displaymath}

A sketch of the proof of this proposition is given in the Complements to this section. Gasser and Müller (1984) studied slightly different weights based on derivatives of $\{ W_{hi}^{(3)}(x)\} $. In view of the asymptotic equivalence of the weight functions $\{ W_{hi}^{(2)}(x)\} $ and $\{ W_{hi}^{(3)}(x)\} $ (see Exercise 3.1.10), it is not surprising that the Gasser-Müller kernel estimator has the same mean squared error expansion as given in Proposition 3.1.2. Figure 3.4 is taken from an application of the Gasser-Müller method, in which they compute the velocity and acceleration of height growth. The upper graphs compare the growth velocity (first derivative) of boys to that of girls. The lower graphs depict the growth accelerations (second derivatives) for the two sexes.

Figure 3.4: First and second derivatives of kernel smoothers. Average velocity curves (above) and acceleration curves (below) for boys (dashed line) and girls (solid line). From Gasser et al. (1984a) with permission of the Institute of Mathematical Statistics.
\includegraphics[scale=0.25]{ANR3,4.ps}

In the case of non-equally spaced and random $X$-variables the weight sequence becomes more complicated. The principle of differentiating the kernel weights to obtain kernel estimates for derivatives of the regression function also works here. For instance, the first derivative $m^{\prime }(x)$ could be estimated using the effective weight sequence

\begin{displaymath}
W_{hi}(x)={K_h^{(1)}(x-X_i)\over \hat f_h(x)}-
{K_h(x-X_i)\hat f_h^{\prime }(x)\over (\hat f_h(x))^2},
\end{displaymath} (3.1.13)

where

\begin{displaymath}K_h^{(1)}(u)=h^{-2}K^{(1)}(u/h) \end{displaymath}

and

\begin{displaymath}\hat f_h^{\prime }(x)=n^{-1}\ \sum_{i=1}^n K_h^{(1)}(x-X_i) \end{displaymath}

is an estimate of the first derivative of the marginal density $f(x)$.
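A direct sketch of (3.1.13), again with the quartic kernel so that $K^{(1)}$ is continuous and vanishes at $\pm 1$; the uniform design density and the regression function are assumptions made for the example.

\begin{verbatim}
import numpy as np

def quartic(u):
    return (15.0 / 16.0) * (1.0 - u**2) ** 2 * (np.abs(u) <= 1.0)

def quartic_deriv(u):
    return -(15.0 / 4.0) * u * (1.0 - u**2) * (np.abs(u) <= 1.0)

def m_prime_hat(x, X, Y, h):
    # n^{-1} sum W_hi(x) Y_i with the effective weights (3.1.13)
    Kh = quartic((x - X) / h) / h             # K_h(x - X_i)
    K1h = quartic_deriv((x - X) / h) / h**2   # K_h^(1)(x - X_i)
    fhat, f1hat = Kh.mean(), K1h.mean()       # f_hat_h(x) and f_hat_h'(x)
    W = K1h / fhat - Kh * f1hat / fhat**2
    return np.mean(W * Y)

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, 2000)
Y = X**2 + 0.05 * rng.standard_normal(2000)
print(m_prime_hat(0.5, X, Y, h=0.2), 1.0)   # estimate vs. true m'(0.5) = 2 * 0.5
\end{verbatim}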

3.1.3 Computational aspects of kernel smoothing

Suppose that it is desired to compute the Nadaraya-Watson kernel estimate at $N$ distinct points. A direct application of formula (3.1.2) for a kernel with unbounded support would result in $O(Nn)$ operations for determination of the estimator at $N$ gridpoints. Some computer time can be saved by using kernels with bounded support, say $[-1,1]$. Local averaging is then performed only in a neighborhood of size $h$ around the gridpoints. The number of operations would then be $O(Nnh)$ since about $2nh$ points fall into an interval of length $2h$. Since $h=h_n$ tends to zero, the introduction of kernels with bounded support looks like a drastic improvement.

For optimization of the smoothing parameter one needs to repeat the kernel smoothing several times, so even for moderate sample sizes the algorithm would still be extremely slow. More efficient kernel smoothing algorithms can be defined by first discretizing the data into bins of the form

\begin{displaymath}B(x;x_0,h) = [x_0+kh,x_0+(k+1)h]\end{displaymath}

for some integer $k$. This means that one replaces the response variables by a step function with heights equal to the average of the response in the bins. Similarly the predictor variable is replaced by its frequency in the respective bins. This discretization step takes $O(n)$ operations.
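A sketch of this discretization step, with an assumed bin width $\delta$ (for the WARP method below one takes $\delta = h/M$); the helper name and the treatment of empty bins are choices made for the example.

\begin{verbatim}
import numpy as np

def bin_data(X, Y, x0, delta, n_bins):
    # round every X_i to a bin of width delta starting at x0: O(n) operations
    idx = np.clip(np.floor((X - x0) / delta).astype(int), 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)                     # frequencies per bin
    sums = np.bincount(idx, weights=Y, minlength=n_bins)            # response sums per bin
    ybar = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)  # bin averages
    return counts, ybar

rng = np.random.default_rng(6)
X = rng.uniform(0, 1, 1000)
Y = X + 0.1 * rng.standard_normal(1000)
counts, ybar = bin_data(X, Y, x0=0.0, delta=0.01, n_bins=100)
\end{verbatim}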

The computational advantage comes from building a weighted average of rounded points (WARP). In particular, consider the set of ``origins"

\begin{displaymath}\left\{ x_{0,k} = {kh\over M} \right\}, k=0,\ldots ,M-1,\end{displaymath}

and estimate, for example, the marginal density by an average over histograms with origin $ x_{0,k} $,

\begin{eqnarray*}
\hat f_{h,M}(x) & = & M^{-1} \sum_{k=0}^{M-1}
\# \{ i: X_i \in B(x;x_{0,k},h) \}/(nh) \\
& = & M^{-1} \sum_{\left\vert k \right\vert \le M}
(1 - \left\vert k \right\vert /M)\
\# \{i: X_i \in B_{i(x)+k} \}/(nh/M),
\end{eqnarray*}

where $B_\ell $ denotes the $\ell $-th bin of width $h/M$ and $i(x)$ is the bin in which $x$ falls.



The triangular weights $(1 - \left\vert k \right\vert /M)$ can be generalized in an obvious way to other weight sequences. For example, the quartic kernel

\begin{displaymath}K(u)=(15/16)(1-u^2)^2 I( \left\vert u \right\vert \le 1) \end{displaymath}

corresponds to the weights

\begin{displaymath}W_M(k)=(15/16)(1-k^2/M^2)^2, \ \ \left\vert k \right\vert \le M. \end{displaymath}

Using this generalization we can rewrite the above formula in the general form

\begin{displaymath}\hat f(x) = M^{-1} \sum_{ \left\vert k \right\vert \le M }W_M(k)RP_{i(x)+k}, \end{displaymath}

where $i(x)$ is the bin in which $x$ falls and where, in the above case of density smoothing, $RP_\ell $ is the frequency of rounded points $(=RP)$ in the $\ell $-th bin. Applying this idea to regression smoothing gives

\begin{displaymath}\hat m(x) = M^{-1} \sum_{ \left\vert k \right\vert \le M }
W_M(k)\bar{Y}_{i(x)+k}/\hat f(x), \end{displaymath}

where $\bar{Y}_\ell $ is the average of the response variable over the $\ell $-th bin. Estimates of this kind are discussed in Härdle and Scott (1988). After discretization of the data the number of operations is $O(NM)$.
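Putting the pieces together, the following sketch computes a WARP-type approximation to the Nadaraya-Watson smoother with quartic weights $W_M(k)$. It works with bin counts and bin response sums (equivalently, bin averages weighted by their frequencies), so that the normalization by $\hat f$ sits implicitly in the denominator; the data and parameter values are illustrative assumptions.

\begin{verbatim}
import numpy as np

def warp_smooth(X, Y, h, M, x_grid):
    # bin the data into small bins of width delta = h / M
    delta = h / M
    x0 = X.min()
    idx = np.floor((X - x0) / delta).astype(int)
    n_bins = idx.max() + 1
    counts = np.bincount(idx, minlength=n_bins).astype(float)
    ysums = np.bincount(idx, weights=Y, minlength=n_bins)

    k = np.arange(-M, M + 1)
    W = (15.0 / 16.0) * (1.0 - (k / M) ** 2) ** 2   # quartic weights W_M(k)

    out = np.empty(len(x_grid))
    for j, x in enumerate(x_grid):
        i = int(np.floor((x - x0) / delta))          # bin i(x) in which x falls
        lo, hi = max(i - M, 0), min(i + M, n_bins - 1)
        kk = np.arange(lo, hi + 1) - i
        num = np.sum(W[kk + M] * ysums[lo:hi + 1])   # ~ sum K_h(x - X_i) Y_i
        den = np.sum(W[kk + M] * counts[lo:hi + 1])  # ~ sum K_h(x - X_i)
        out[j] = 0.0 if den == 0.0 else num / den
    return out

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, 5000)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(5000)
print(warp_smooth(X, Y, h=0.1, M=10, x_grid=np.linspace(0.1, 0.9, 9)))
\end{verbatim}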

Another technique uses Fourier transforms

\begin{displaymath}\tilde{g}(t) = \int g(x)\exp(-itx)dx. \end{displaymath}

Observe that for $g(x) = n^{-1} \sum_{i=1}^n K_h(x-X_i)Y_i$, the numerator of the Nadaraya-Watson estimator, one has the Fourier transform

\begin{displaymath}\tilde{g}(t) = \tilde{K}(th)\ n^{-1} \sum_{i=1}^n \exp(-itX_i)Y_i. \end{displaymath}

If one uses the Gaussian kernel

\begin{displaymath}K(u) = \exp(-u^2/2)/\sqrt{2 \pi} \end{displaymath}

one has for example $ \tilde{K}(t)=\exp(-t^2/2)$. The numerical efficiency comes from decoupling the smoothing operation from the Fourier transform of the data. The Fourier transform of the data

\begin{displaymath}\sum_{i=1}^n \exp(-itX_i)Y_i \end{displaymath}

can be computed via the Fast Fourier Transform. If the data are discretized into $N$ bins as above, this takes $O(N\log N)$ operations. Note that for computing several smooths only the rescaled Fourier transform of the kernel function has to be multiplied with the Fourier transform of the data, which can be retained in the memory of the computer. An algorithm for this technique is presented in Härdle (1987a).
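The following sketch carries out this FFT idea for the Gaussian kernel: the data are binned on an equispaced grid, and numerator and denominator of the Nadaraya-Watson estimator are each obtained by one (circular) convolution. Grid size, padding and all names are choices made for the illustration, and the result agrees with direct smoothing only up to the binning error.

\begin{verbatim}
import numpy as np

def fft_kernel_smooth(X, Y, h, n_bins=512):
    a = X.min() - 3 * h
    b = X.max() + 3 * h
    grid = np.linspace(a, b, n_bins)
    delta = grid[1] - grid[0]
    idx = np.clip(np.floor((X - a) / delta).astype(int), 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    ysums = np.bincount(idx, weights=Y, minlength=n_bins)

    # Gaussian kernel K_h sampled on the grid offsets, wrapped for circular convolution
    offsets = (np.arange(n_bins) - n_bins // 2) * delta
    kern = np.exp(-0.5 * (offsets / h) ** 2) / (np.sqrt(2 * np.pi) * h)
    kern = np.roll(kern, -n_bins // 2)

    K_fft = np.fft.rfft(kern)          # kernel transform, reusable for further smooths
    num = np.fft.irfft(K_fft * np.fft.rfft(ysums), n_bins)   # ~ sum K_h(x - X_i) Y_i
    den = np.fft.irfft(K_fft * np.fft.rfft(counts), n_bins)  # ~ sum K_h(x - X_i)
    m_hat = np.where(den > 0, num / np.where(den > 0, den, 1.0), 0.0)
    return grid, m_hat

rng = np.random.default_rng(8)
X = rng.uniform(0, 1, 1000)
Y = np.cos(2 * np.pi * X) + 0.3 * rng.standard_normal(1000)
grid, m_hat = fft_kernel_smooth(X, Y, h=0.05)
\end{verbatim}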

Exercises
3.1.1 Recall the setting for the weight sequence $\{ W_{hi}^{(2)}(x)\} $. Consider linear interpolation between two successive observations $(X_{i-1},Y_{i-1})$ and $(X_i,Y_i)$ with $(X_0,Y_0)=(0,Y_1)$,

\begin{displaymath}g_i(u)={Y_i-Y_{i-1} \over X_i-X_{i-1}} (u-X_{i-1})+Y_{i-1}, \quad
i=1, \ldots, n. \end{displaymath}

The linear interpolant of the data can be written as

\begin{displaymath}G_n(u)=\sum^n_{i=1} g_i(u)\ I(X_{i-1}\le u<X_i).\end{displaymath}

Clark (1980) suggested convolving this linear interpolant with a kernel function with bandwidth $h$,

\begin{eqnarray*}
\hat m(x) & = & \int K_h(x-u) G_n(u)du \\
& = & \sum^n_{i=1} \int^{X_i}_{X_{i-1}} K_h(x-u)du\ Y_i
+ \sum^n_{i=1} \int^{X_i}_{X_{i-1}} K_h(x-u)(u-X_i)du\
\frac{Y_i-Y_{i-1}}{X_i-X_{i-1}}.
\end{eqnarray*}



Show that if the $x$-variables are equispaced on $[0,1]$, that is, $X_i={i\over n}$, then the last term converges in probability to zero.

3.1.2 Discuss the behavior of the kernel estimator when a single observation moves to a very large value, that is, study the case $(X_i,Y_i)\to (X_i,Y_i\pm c)$ with $c\to \infty$ for a fixed $i$. How does the curve change under such a distortion? What will happen for a distortion in the $X$-direction, $(X_i,Y_i)\to (X_i\pm c,Y_i)$?

3.1.3 When we had the situation of equispaced $X_i={i\over n}$ we said that a local linear fit would not make much sense with a symmetric kernel weight. Consider now the situation of random $X$s. Would you expect a gain in using a local linear fit now?

3.1.4 Prove, in analogy to Proposition 3.1.1, the asymptotic mean squared error decomposition of kernel smoothers with the weight sequences $\{ W_{hi}^{(2)}\} $ and $\{ W_{hi}^{(3)}\}$, respectively.

3.1.5 Recall the weighted local fitting of polynomials. If the order of the approximating polynomial is $p=0$, that is, $\varphi_0 \equiv 1$, then $\hat m_h(x)$ is just the ordinary kernel estimate with weights $W^*_{hi}(x)=W_{hi}(x)/ \hat f_h(x).$ For a local linear approximation one has

\begin{eqnarray*}
\varphi_0(u)&=&1,\\
\varphi_1(u)&=&(u-x_0)-M_{11}(x_0)/M_{10}(x_0),\\
M_{1j}(x)&=& \sum^n_{i=1}(X_i-x)^j W_{hi}(x),\quad j=0,\ldots,p.
\end{eqnarray*}



This results in

\begin{eqnarray*}
\hat m_h(x) &=& {M_{20}(x) M_{12}(x)-M_{21}(x) M_{11}(x)\over
M_{10}(x) M_{12}(x)-M^2_{11}(x)},\\
M_{2j}(x) & =& \sum^n_{i=1}(X_i-x)^j W_{hi}(x) Y_i,\quad j=0,\ldots,p.
\end{eqnarray*}



Try this method in practice. (Schmerling and Peil, 1977, present the ALGOL code for this procedure.) Comment on the difference from ordinary kernel smoothing.

3.1.6 Verify that the kernel $K^*$ from the local parabolic fit (see (3.1.11)) is indeed a kernel and has vanishing first, second and third moments.

3.1.7 Consider the positive food versus net income data set. Suppose you are asked to do a kernel smooth at the right end. What can happen if the kernel $K$ has negative ``sidelobes'', that is, if the tails of $K$ are allowed to take on negative values?

3.1.8 Give a rigorous proof of Proposition 3.1.2. (A sketch of the proof is in the Complements of this section.) Compare the remainder terms of the bias approximations for the weight sequence $\{W^{(2)}_{hi}(x) \}$ with those of the weight sequence $\{W^{(3)}_{hi}(x) \}.$

3.1.9 Derive the rate of convergence of $d_M(x, h)$ from Theorem 3.1.1 when $h$ is chosen optimally, that is,

\begin{displaymath}h = h_{opt} = \arg\min_h d_M(x, h).\end{displaymath}

3.1.10 Show the asymptotic equivalence of the weight sequences $\{W^{(2)}_{hi}(x) \}$ and $\{W^{(3)}_{hi}(x) \}$ in the following sense:

\begin{displaymath}W^{(2)}_{hi}(x) = W^{(3)}_{hi}(x) + O(n^{-1}) .\end{displaymath}

3.1.11 Give reasons why $\hat f(X_{(i)}) = {[n( X_{(i)} - X_{(i-1)})]}^{-1} $, as in the interpretation of the weight sequence (3.1.6), is a reasonable choice for a density estimate. [Hint: Consider the asymptotic distribution of the spacings $X_{(i)} - X_{(i-1)} $.]