4.2 Pointwise confidence intervals

The aim of this section is to develop the construction of pointwise confidence intervals for kernel estimators and to prepare the ground for the uniform confidence bands treated in the next section. The basic idea is to derive the asymptotic distribution of the kernel smoother and then to use either asymptotic quantiles or bootstrap approximations of these quantiles for the confidence intervals. If the bandwidth is chosen so that the optimal rate of convergence is achieved, the confidence intervals shrink at a rate proportional to $n^{-r}$, the optimal rate. It is certainly desirable to use smoothers that are asymptotically optimal, since they give the narrowest confidence intervals obtainable and keep the squared bias and the variance at the same order.

The reader more interested in practical aspects should not be discouraged by the rather theoretical beginning of this section, but instead should jump to Algorithm 4.2.1, which describes the construction of confidence intervals at $k$ different points.

The asymptotic distribution is normal. The center of this distribution is shifted by the asymptotic bias which depends on derivatives of the regression curve and the marginal density of $X$. The asymptotic variance is a function of

the conditional variance $\sigma^2(x);$

the kernel $K$; and

the marginal density $f(x)$.

The asymptotic bias is a function of

the kernel $K$; and

derivatives of $m$ and $f$.

Before I come to the theoretical statement of the asymptotic distribution of kernel smoothers let me point out some simplifications. The kernel smoother ${\hat{m}}_h(x)$ is a ratio of random variables, so central limit theorems cannot be applied directly and the smoother has to be linearized first. The kernel estimator has the same limit distribution as the right-hand side of the following linearization,

\begin{displaymath}{\hat{m}}_h(x)-m(x) \approx {{\hat{r}}_h(x)-m(x) {\hat{f}}(x) \over f(x)},\end{displaymath}

where

\begin{displaymath}{\hat{r}}_h(x)=n^{-1} \sum_{i=1}^n K_h(x-X_i) Y_i\end{displaymath}

denotes, as in Section 3.1, the numerator of the Nadaraya-Watson kernel smoother. The following theorem states the asymptotic distribution of the Nadaraya-Watson kernel estimator for one-dimensional predictor variables.

Theorem 4.2.1   Suppose that

(A1) $\int \left\vert K(u) \right\vert ^{2+\eta} d u<\infty$ for some $\eta>0$;

(A2) $h \sim n^{-1/5}$;

(A3) $m$ and $f$ are twice differentiable;

(A4) the distinct points $x_1, x_2,\ldots, x_k$ are continuity points of $\sigma ^2(x)$ and $E \{\left\vert Y \right\vert ^{2+\eta} \mid X=x \}$, and $f(x_j)>0, \quad j=1,2,\ldots,k$.

Then the suitably normalized Nadaraya-Watson kernel smoother
${\hat{m}}_h(x_j)$ at the $k$ different locations $x_1,\ldots, x_k$ converges in distribution to a multivariate normal random vector with mean vector $B$ and identity covariance matrix,

\begin{displaymath}
\Bigl((n h)^{1/2} \left\{ {({\hat{m}}_h(x_j)-m(x_j)) \over (\sigma^2 (x_j) c_K/f(x_j))^{1/2}} \right\} \Bigr)_{j=1}^k \buildrel {\cal L}\over \to N(B, I),
\end{displaymath} (4.2.7)

where
\begin{displaymath}
B=\Bigl(d_K \{m''(x_j)+2 m'(x_j) (f'(x_j)/f(x_j))\}\Bigr)_{j=1}^k.
\end{displaymath} (4.2.8)

The proof is given in the Complements extending results of Johnston (1979) and of Schuster (1972). The proof is based on the linearization given above.

The asymptotic bias (4.2.8) is proportional to the second moment of the kernel and a measure of local curvature of $m$. This measure of local curvature is not a function of $m$ alone but also of the marginal density. At maxima or minima of $m$, the bias is a multiple of $m''(x)$ alone; at inflection points it is a multiple of $m'(x)(f'(x)/f(x))$ only.

This theorem can be used to define confidence intervals. Suppose that the bias is negligible compared to the variance; then the following algorithm yields approximate confidence intervals.

Algorithm 4.2.1
STEP 1.

Compute the kernel smoother ${\hat{m}}_h$ and the density estimate ${\hat{f}}_h$ at distinct points $x_1,\ldots, x_k$.

STEP 2.

Construct an estimate of $\sigma ^2(x)$,

\begin{eqnarray*}
{\hat{\sigma}}^2(x)&=&n^{-1} \sum^n_{i=1} W_{h i}(x)(Y_i-{\hat{m}}_h(x))^2,\cr
W_{h i}(x)&=&K_h(x-X_i)/{\hat{f}}_h(x).
\end{eqnarray*}



STEP 3.

Take $c_\alpha$, the $(100-\alpha)$-percent quantile of the standard normal distribution, and let

\begin{eqnarray*}
CLO&=&{\hat{m}}_h(x)-c_\alpha c_K^{1/2} {\hat{\sigma}}(x)/(n h {\hat{f}}_h(x))^{1/2},\cr
CUP&=&{\hat{m}}_h(x)+c_\alpha c_K^{1/2} {\hat{\sigma}}(x)/(n h {\hat{f}}_h(x))^{1/2}.
\end{eqnarray*}



STEP 4.

Draw the interval $[ C L O, CUP ]$ around ${\hat{m}}_h(x)$ at the $k$ distinct points $x_1,\ldots, x_k$.
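
For readers who want to see these steps in code, here is a minimal Python sketch of Algorithm 4.2.1. The function names, the choice of the Epanechnikov kernel (for which $c_K=\int K^2(u)\,du=3/5$) and the two-sided convention $c_\alpha=\Phi^{-1}(1-\alpha/2)$ are my own illustrative choices, not part of the algorithm as stated.

\begin{verbatim}
import numpy as np
from scipy.stats import norm

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def nw_components(x, X, Y, h, kernel=epanechnikov):
    """Return (f_hat(x), m_hat(x)) for the Nadaraya-Watson smoother."""
    w = kernel((x - X) / h) / h           # K_h(x - X_i)
    f_hat = np.mean(w)                    # density estimate f_hat_h(x)
    r_hat = np.mean(w * Y)                # numerator r_hat_h(x)
    return f_hat, r_hat / f_hat

def pointwise_ci(xgrid, X, Y, h, alpha=0.05, kernel=epanechnikov):
    """Steps 1-4 of Algorithm 4.2.1 at the points in xgrid."""
    n = len(X)
    c_K = 0.6                             # int K^2(u) du for the Epanechnikov kernel
    c_alpha = norm.ppf(1 - alpha / 2)     # two-sided normal quantile
    intervals = []
    for x in xgrid:
        f_hat, m_hat = nw_components(x, X, Y, h, kernel)              # STEP 1
        w = kernel((x - X) / h) / (h * f_hat)                         # weights W_hi(x)
        sigma2_hat = np.mean(w * (Y - m_hat) ** 2)                    # STEP 2
        half = c_alpha * np.sqrt(c_K * sigma2_hat / (n * h * f_hat))  # STEP 3
        intervals.append((m_hat, m_hat - half, m_hat + half))         # STEP 4: (m_hat, CLO, CUP)
    return intervals
\end{verbatim}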

This algorithm does not take the bias of ${\hat{m}}_h(x)$ into account, since the bias is a complicated function of $m$ and $f$. Bias estimates could be built in by using estimates of the derivatives of $m$ and $f$, but they would considerably complicate the algorithm. Thus, if the bandwidth $h \sim n^{-1/5}$, the above steps do not lead to an asymptotically exact confidence interval. However, if $h$ is chosen proportional to $n^{-1/5}$ times a sequence that tends slowly to zero, then the bias vanishes asymptotically.

If the variation in $m$ and $f$ is moderate, little difference is to be expected between these two bandwidth sequences, so one could use the unshifted confidence intervals as well. However, at small peaks (high bias!) it may be desirable to shift the $[CLO, CUP]$ interval by a bias estimate. Whether such peaks are present is a decision the statistician has to make. The analysis of expenditure data is a field where we do not expect sudden and abrupt changes of $m(x)$. In Figure 4.1 an estimate of the regression curve for the potato versus net income example (Figure 1.1) is shown together with ten pointwise confidence intervals.

Figure 4.1: Approximate confidence intervals for the potato versus net income data. 95 percent confidence intervals are shown for the kernel smoother ${\hat{m}}_h(x)$, $h=0.6$, $n=7125$, Epanechnikov kernel. The vertical axis is normalized by the mean expenditure. The horizontal axis is normalized by the mean net income for that year. Survey (1968-1983). ANRpotconf.xpl
\includegraphics[scale=0.7]{ANRpotconf.ps}

It is apparent from this picture that the confidence interval lengths increase as they move toward the right boundary of the observation interval. Since the kernel is a fixed function, this must be due to the other factors controlling the variance of the kernel smoother. First, the conditional variance $\sigma ^2(x)$ increases as $x$ moves toward the right boundary (compare with Figure 1.1). Second, the inverse of the marginal density $f(x)$ enters as a proportionality factor. Since the data are sparser near the right boundary (compare with Figure 1.5), the variance estimate is also inflated for this reason. A plot of ${\hat{\sigma}}(x)$, an estimate of the conditional standard deviation curve $\sigma (x)$, is presented in Figure 4.2.

Figure 4.2: Conditional standard deviation curve for the potato versus net income data. The curve shown is
$\displaystyle{\hat{\sigma}}(x)=\left(n^{-1}\sum^n_{i=1}W_{h i}(x)(Y_i-{\hat{m}}_h(x))^2\right)^{1/2}$,
$n=7125$, $h=0.6$, an estimate of $\sigma (x)$. Survey (1968-1983). ANRpotdev.xpl
\includegraphics[scale=0.7]{ANRpotdev.ps}

As a possible way of visualizing both effects in one plot, I propose plotting confidence intervals at points $x_j$ such that the number of observations between $x_j$ and $x_{j+1}$ is constant. As the marginal density decreases, the Euclidean distance between neighboring points $x_j$ becomes bigger. This can be seen from Figure 4.3, which shows a kernel smooth ($h=0.6$, Epanechnikov kernel) of the potato versus net income data together with confidence intervals. These intervals are placed such that between successive intervals there is a fixed number of 700 data points.

Figure 4.3: Approximate confidence intervals for the potato versus net income data. 95 percent confidence intervals are shown for the kernel smoother ${\hat{m}}_h(x)$, $h=0.6$, $n=7125$, Epanechnikov kernel. The confidence intervals are placed such that the number of observations between successive intervals equals 700. The vertical axis is normalized by the mean expenditure. The horizontal axis is normalized by the mean net income for that year. Survey (1968-1983). ANRpotconf2.xpl
\includegraphics[scale=0.7]{ANRpotconf2.ps}

There are always 700 observations between neighboring gridpoints $x_j$. In this plot one sees not only an increase of the variance but also a decrease of the marginal density $f(x)$. This becomes apparent at the right boundary: The peak to the right of the approximate confidence interval at $x \approx 1.9$ is produced by fewer than fifty observations.
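
One simple way to implement this placement of gridpoints is to take every 700th order statistic of the $X$-observations; the following Python fragment (names are illustrative) does exactly this.

\begin{verbatim}
import numpy as np

def equal_count_grid(X, count=700):
    """Gridpoints x_j with `count` observations between successive points."""
    Xs = np.sort(np.asarray(X))
    return Xs[count - 1::count]           # every count-th order statistic
\end{verbatim}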

Another method of constructing confidence bands is based on the bootstrap. The bootstrap is a resampling technique that prescribes taking ``bootstrap samples'' using the same random mechanism that generated the data. This prescription makes it necessary to handle the case of stochastic $X$s differently from the case of deterministic $X$-values. To be precise, in the fixed design model the stochastic part of the data is contained only in the observation errors, so resampling should take place from residuals. If both $X$ and $Y$ are random, resampling can be done from the data pairs $\{(X_i, Y_i) \}^n_{i=1}$ according to the following algorithm.

Algorithm 4.2.2
b=0

REPEAT

b=b+1

STEP 1.

Sample $\{(X^*_i, Y^*_i)\}^n_{i=1}$ from the empirical distribution function of the data.

STEP 2.

Construct the kernel smoother ${\hat{m}}^*_h(x)$ from the bootstrap sample $\{(X^*_i, Y^*_i)\}^n_{i=1}$.

UNTIL b=B=number of bootstrap samples.

STEP 3.

Define $C L O^*$ as the $\alpha/2$ empirical quantile of the $B$ bootstrap estimates ${\hat{m}}^*_h(x)$. Define $CUP^*$ analogously.

STEP 4.

Draw the interval $[ C L O^*, CUP^*]$ around ${\hat{m}}_h(x)$ at the $k$ distinct points $x_1,\ldots, x_k$.
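
A minimal Python sketch of Algorithm 4.2.2 could look as follows. The argument smoother is assumed to return the Nadaraya-Watson estimate ${\hat{m}}_h(x)$ at a single point (for instance the one from the sketch following Algorithm 4.2.1); the names and defaults are my own.

\begin{verbatim}
import numpy as np

def naive_bootstrap_ci(xgrid, X, Y, h, smoother, B=100, alpha=0.05, rng=None):
    """Pointwise intervals by resampling pairs (X_i, Y_i) with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    boot = np.empty((B, len(xgrid)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)                    # STEP 1: resample pairs
        Xb, Yb = X[idx], Y[idx]
        boot[b] = [smoother(x, Xb, Yb, h) for x in xgrid]   # STEP 2: bootstrap smoother
    clo = np.quantile(boot, alpha / 2, axis=0)              # STEP 3: empirical quantiles
    cup = np.quantile(boot, 1 - alpha / 2, axis=0)
    return clo, cup                                         # STEP 4: [CLO*, CUP*]
\end{verbatim}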

This bootstrap algorithm was proposed by McDonald (1982) in his Orion workstation film. Theoretical properties of this so-called naive bootstrap have been considered by Dikta (1988), again disregarding bias terms. The above bootstrap procedure was applied with $B=100$ to the simulated data set of Table 3.2. The result is depicted in Figure 4.4.

Figure 4.4: Bootstrap confidence intervals for the simulated data set, Table 3.2. The kernel smoother ${\hat{m}}_h(x)$, $h=0.1$, $n = 100$, $B=100$, with Epanechnikov kernel has been applied. ANRsimboot.xpl
\includegraphics[scale=0.7]{ANRsimboot.ps}

The kernel smooth for Figure 4.4 was computed with the Epanechnikov kernel ( $K(u)=(3/4)(1-u^2)I(\left\vert u \right\vert \le 1)$) and bandwidth $h=0.1$. Note that this bandwidth is bigger than the one chosen to produce the kernel smooth of Figure 3.21 but the curve from Figure 4.4 is smoother than that of Figure 3.21. The reason for this is the use of different kernels. A possible means of comparing these bandwidths is presented in Section 5.4.

In the case of the fixed design model with homoscedastic error structure one may resample only from the estimated residuals

\begin{displaymath}{\hat{\epsilon}}_i=Y_i-{\hat{m}}_h(X_i).\end{displaymath}

Resampling them as $\{\epsilon^*_i \}^n_{i=1}$ gives bootstrap observations $Y^*_i={\hat{m}}_h (x_i)+\epsilon^*_i$. Of course, this makes sense only if the error distribution does not depend on $x$. If this is the case one constructs bootstrap smoothers ${\hat{m}}^*_h(x)$ and studies the distribution of ${\hat{m}}^*_h(x)$ suitably centered around ${\hat{m}}_h(x)$. Details of this procedure are given in Section 5.3.

This bootstrap from residuals can also be applied in the stochastic design setting. It is then also able to incorporate the bias term and is called the wild bootstrap by Härdle and Mammen (1988). It is called wild since at each observation point $X_i$ (in the fixed or stochastic design setting) a bootstrap observation is drawn from one single estimated residual. This is done to better retain the conditional distributional characteristics of the estimate. One does not resample from the entire set of residuals, as in Härdle and Bowman (1988).

A different possibility would be to resample from a set of residuals determined by a window function, but this has the disadvantage of requiring choice of the window width. To avoid this I propose wild bootstrapping, where each bootstrap residual is drawn from the two-point distribution which has mean zero, variance equal to the square of the residual, and third moment equal to the cube of the residual.

In particular, let

\begin{displaymath}{\hat{\epsilon}}_i=Y_i-{\hat{m}}_h(X_i)\end{displaymath}

be the observed residual at point $X_i$. Now define a new random variable $\epsilon^*_i$ having a two-point distribution ${\hat{G}}_i$, where
\begin{displaymath}{\hat{G}}_i=\gamma \delta_a+(1-\gamma) \delta_b
\end{displaymath} (4.2.9)

is defined through the three parameters $a, b, \gamma$, and where $\delta_a, \delta_b$ denote point measures at $a, b$, respectively. Some algebra reveals that the parameters $a, b, \gamma$ at each location $X_i$ are given by

\begin{eqnarray*}
a&=&{\hat{\epsilon}}_i (1-\sqrt 5)/2,\cr
b&=&{\hat{\epsilon}}_i (1+\sqrt 5)/2,\cr
\gamma&=&(5+\sqrt 5)/10.
\end{eqnarray*} (4.2.10)

These parameters ensure that $E \epsilon^*=0$, $E {\epsilon^*}^2={\hat{\epsilon}}_i^2$ and $E {\epsilon^*}^3={\hat{\epsilon}}_i^3$. A geometrical interpretation of these parameters is related to the Golden Section method of Euclid (-300, Second book, prop. 11); see Exercise 4.2.5. In a certain sense the resampling distribution ${\hat{G}}_i$ can be thought of as attempting to reconstruct the distribution of each residual through the use of one single observation. Therefore, it is called the wild bootstrap. It is actually the cumulative effect of all these residuals, though, that makes this bootstrap work. After resampling, new observations

\begin{displaymath}Y_i^*={\hat{m}}_g (X_i)+\epsilon^*_i\end{displaymath}

are defined, where ${\hat{m}}_g(x)$ is a kernel estimator with bandwidth $g$ taken to be larger than $h$ (a heuristic explanation of why it is essential to oversmooth $g$ is given below).
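
A quick numerical check (my own illustration, not part of the original derivation) confirms that draws from ${\hat{G}}_i$ with the parameters (4.2.10) reproduce the first three moments of a residual.

\begin{verbatim}
import numpy as np

def draw_wild(eps_hat, rng):
    """One draw from G_i of (4.2.9)-(4.2.10) for each residual in eps_hat."""
    a = eps_hat * (1 - np.sqrt(5)) / 2
    b = eps_hat * (1 + np.sqrt(5)) / 2
    gamma = (5 + np.sqrt(5)) / 10
    pick_a = rng.random(eps_hat.shape) < gamma   # point a with probability gamma
    return np.where(pick_a, a, b)

rng = np.random.default_rng(0)
eps = np.repeat(1.7, 200000)                     # many copies of a single residual
draws = draw_wild(eps, rng)
print(draws.mean(), np.mean(draws**2), np.mean(draws**3))
# approximately 0, 1.7**2 = 2.89 and 1.7**3 = 4.913
\end{verbatim}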

Then the kernel smoother is applied to the bootstrapped data $\{(X_i, Y_i^*)\}_{i=1}^n$ using bandwidth $h$. Let ${\hat{m}}_h^*(x)$ denote this kernel smooth. A number of replications of ${\hat{m}}_h^*(x)$ can be used as the basis for a confidence interval because the distribution of ${\hat{m}}_h(x)-m(x)$ is approximated by the distribution of ${\hat{m}}_h^*(x)-{\hat{m}}_g(x)$, as Theorem 4.2.2 shows. Here the symbol $Y \mid X$ is used to denote the conditional distribution of $Y_1, \ldots, Y_n \mid X_1, \ldots, X_n$, and the symbol $*$ is used to denote the bootstrap distribution of $Y^*_1, \ldots, Y^*_n \mid (X_1,Y_1), \ldots, (X_n,Y_n)$.

Theorem 4.2.2   Given the assumptions of Theorem 4.2.1, along almost all sample sequences and for all $z \in \mathbb{R}$

\begin{eqnarray*}
&&\vert P^{Y \vert X} \{\sqrt{n h^d}\, [{\hat{m}}_h(x)-m(x)]<z \}
- P^* \{\sqrt{n h^d}\, [{\hat{m}}_h^*(x)-{\hat{m}}_g(x)]<z \} \vert \to 0.
\end{eqnarray*}



For an intuitive understanding of why the bandwidth $g$ used in the construction of the bootstrap residuals should be oversmoothed, consider the means of ${\hat{m}}_h(x)-m(x)$ under the $Y \vert X$-distribution and ${\hat{m}}_h^*(x)-{\hat{m}}_g(x)$ under the $*$-distribution in the simple situation when the marginal density $f(x)$ is constant in a neighborhood of $x$. Asymptotic analysis as in Rosenblatt (1969) shows that

\begin{eqnarray*}
&&E^{Y \vert X} ({\hat{m}}_h(x)-m(x)) \approx h^2 d_K m''(x)/2,\cr
&&E^* ({\hat{m}}^*_h(x)-{\hat{m}}_g(x)) \approx h^2 d_K {\hat{m}}_g''(x)/2.
\end{eqnarray*}



Hence for these two distributions to have the same bias one needs ${\hat{m}}_g''(x) \to m''(x)$. This requires choosing $g$ to tend to zero at a rate slower than the optimal bandwidth $h$ (see Section 4.1) for estimating $m(x)$. The advantage of the bootstrap technique is that one does not have to compute various complicated constants, such as the bias described in Theorem 4.2.1. Algorithm 4.2.3 is thus easy to program but requires quite a bit of computer-intensive resampling. This computational burden can be eased by employing a discretization technique (WARPing) or an FFT-based approach as described in Section 3.1.

Algorithm 4.2.3
b=0

REPEAT

b=b+1

STEP 1.

Sample $\epsilon^*_i$ from the two-point distribution ${\hat{G}}_i$ (4.2.9), where

\begin{displaymath}{\hat{G}}_i=\gamma \delta_a+(1-\gamma) \delta_b
\end{displaymath}

and, as in (4.2.10),

\begin{eqnarray*}
a&=&{\hat{\epsilon}}_i (1-\sqrt 5)/2,\cr
b&=&{\hat{\epsilon}}_i (1+\sqrt 5)/2,\cr
\gamma&=&(5+\sqrt 5)/10.
\end{eqnarray*}



STEP 2.

Construct new observations

\begin{displaymath}Y_i^*={\hat{m}}_g (X_i)+\epsilon^*_i,\end{displaymath}

where ${\hat{m}}_g(x)$ is a slightly oversmoothed kernel estimator with bandwidth $g$. Compute ${\hat{m}}^*_h(x)$ from the bootstrap sample
$\{(X_i,Y^*_i)\}^n_{i=1}$.

UNTIL b=B=number of bootstrap samples.

STEP 3.

Define $C L O^*$ as the $\alpha/2$ empirical quantile of the $B$ bootstrap estimates ${\hat{m}}^*_h(x)$. Define $CUP^*$ analogously.

STEP 4.

Draw the interval $[ C L O^*, CUP^*]$ around ${\hat{m}}_h(x)$ at the $k$ distinct points $x_1,\ldots, x_k$.
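
For illustration, the whole of Algorithm 4.2.3 can be sketched in Python as follows, reusing the function draw_wild from the moment check above. The argument smoother is again assumed to return a kernel estimate at a single point; the names and defaults are my own.

\begin{verbatim}
import numpy as np

def wild_bootstrap_ci(xgrid, X, Y, h, g, smoother, B=100, alpha=0.05, rng=None):
    """Pointwise wild bootstrap intervals; g should be oversmoothed relative to h."""
    rng = np.random.default_rng() if rng is None else rng
    m_h_at_X = np.array([smoother(x, X, Y, h) for x in X])
    m_g_at_X = np.array([smoother(x, X, Y, g) for x in X])   # oversmoothed pilot fit
    eps_hat = Y - m_h_at_X                                   # observed residuals
    boot = np.empty((B, len(xgrid)))
    for b in range(B):
        eps_star = draw_wild(eps_hat, rng)                   # STEP 1: one draw per residual
        Y_star = m_g_at_X + eps_star                         # STEP 2: new observations
        boot[b] = [smoother(x, X, Y_star, h) for x in xgrid]
    clo = np.quantile(boot, alpha / 2, axis=0)               # STEP 3: empirical quantiles
    cup = np.quantile(boot, 1 - alpha / 2, axis=0)
    return clo, cup                                          # STEP 4: [CLO*, CUP*]
\end{verbatim}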

Exercises
4.2.1 Show that the difference between

\begin{displaymath}{\hat{m}}_h(x)-m(x)\end{displaymath}

and its linearization

\begin{displaymath}{{\hat{r}}_h(x)-m(x) {\hat{f}}(x) \over f(x)}\end{displaymath}

is of order $o_p((n h)^{-1/2})$ under the assumptions of Theorem 4.2.1.

[Hint: Write the difference as

\begin{displaymath}\left({{\hat{r}}-m {\hat{f}}\over f}\right) \left({f \over {\hat{f}}}-1\right)\end{displaymath}

and combine terms.]

4.2.2 Prove formula (4.2.12), that is, show that

\begin{eqnarray*}
&& nh\, \hbox{var}\{H_n(x)\} \to (f(x))^{-1} \sigma^2(x) \int K^2(u)d u;\cr
&& n\, \hbox{cov}\{H_n(x), H_n(y)\} \to 0, \quad \hbox{as } n \to \infty.
\end{eqnarray*}



4.2.3 Write an efficient program for wild bootstrapping using the WARPing technique as described in Section 3.1.

4.2.4 Use the wild bootstrap confidence intervals in a real example. Compare them with the asymptotic confidence intervals from Theorem 4.2.1.

4.2.5 Ricardo Cao has found that the wild bootstrap is related to the Golden Section method of Euclid (-300). Show that the two-point distribution of the wild bootstrap can be found by using the outer Golden Section of the interval $E=[0, \vert {\hat{\epsilon}}_i \vert]$.

[Hint: The outer Golden Section $E'$ is that interval containing $E$ such that the ratio of the length of $E'$ to the length of $E$ is the same as the ratio of the length of $E$ to that of $E'-E$.]

4.2.1 Complements

Define ${\hat{r}}_h (x)={\hat{m}}_h(x) {\hat{f}}_h(x)=n^{-1} \sum^n_{i=1} K_h (x-X_i) Y_i$. The difference between

\begin{displaymath}{\hat{m}}_h(x)-m(x)\end{displaymath}

and its linearization

\begin{displaymath}{{\hat{r}}_h-m {\hat{f}}\over f}(x)\end{displaymath}

is of lower order, that is, of order $o_p((n h)^{-1/2})$; see Exercise 4.2.1.

The bias can then be written as

\begin{displaymath}{E {\hat{r}}_h-m E {\hat{f}}\over f}(x).\end{displaymath}

This term equals

\begin{displaymath}\left({E {\hat{f}}\over f}(x)\right) \left({E {\hat{r}}_h \over E {\hat{f}}}-m\right)(x),\end{displaymath}

which is approximately equal to

\begin{displaymath}m_n(x)-m(x),\end{displaymath}

where $m_n(x)=E {\hat{r}}_h(x)/E {\hat{f}}_h(x)$. Observe that
\begin{eqnarray*}
m_n(x)-m(x)&=&(E \{K_h(x-X)\})^{-1} \left\{\int K_h(x-u) m(u) f(u) du-m(x)f(x)\right.\cr
&&\quad \quad \left. +m(x)f(x)-m(x) \int K_h (x-u) f(u) d u \right\}\cr
&=&{h^2 \over 2} \int u^2 K(u)d u\, \{m''(x)+2 m'(x) (f'(x)/f(x))\}.
\end{eqnarray*} (4.2.11)

Note that

\begin{displaymath}{\hat{m}}_h-m_n=[ {\hat{r}}_h/f-({\hat{f}}_h/f) m_n ] (f/{\hat{f}}_h)\end{displaymath}

and $f(x)/{\hat{f}}_h(x) \buildrel p \over \to 1$, so that ${\hat{m}}_h(x)-m_n(x)$ will have the same asymptotic distribution as the term within square brackets above. Call this term $H_n(x)$. It can be shown in a straightforward but tedious way that if $x \not=y$
\begin{eqnarray*}
&&nh\,\hbox{var}\{H_n(x)\} \to (f(x))^{-1} \sigma^2(x) \int K^2(u)d u;\cr
&&n\,\hbox{cov}\{H_n(x), H_n(y)\} \to 0, \quad \hbox{as } n \to \infty.
\end{eqnarray*} (4.2.12)

An application of the Cramér-Wold device (Serfling 1980, p. 18) then yields the asymptotic normality of the random vector

\begin{displaymath}\Bigl((n h)^{1/2} \left \{{({\hat{m}}_h(x_j)-m(x_j)) \over (\sigma^2 (x_j)
c_K/f(x_j))^{1/2}} \right \} \Bigr)_{j=1}^k.\end{displaymath}