4.3 Variability bands for functions

Variability bands for functions are intervals $[ C L O(x), CUP(x) ]$ (based on the sample $\{(X_i, Y_i) \}^n_{i=1}$) such that with probability $1-
\alpha$ the true curve is covered by the band $[ C L O(x), CUP(x) ]$, that is,
\begin{displaymath}
P \{C L O(x) \le m(x) \le CUP(x)\ \hbox{\rm {for all}}\ x \in {\cal X}\}=1-\alpha,
\end{displaymath} (4.3.13)

where $x$ ranges over some (compact) set ${\cal X}$ of interest. Even in parametric models such bands are hard to compute; see Working and Hotelling (1929) or Wynn (1984) for bounds in polynomial regression. In the more complicated nonparametric situation useful bands $[ C L O(\cdot), CUP(\cdot)
]$ have been computed, which are conservative, that is, with equality replaced by ``$\ge$'' in 4.3.13. The bands are usually of the form
\begin{eqnarray*}
C L O(x)&=&{\hat{m}}_h (x)-{\hat{b}}_n (x)-c_\alpha {\hat{D}}_n (x),\cr
CUP(x)&=&{\hat{m}}_h (x)+{\hat{b}}_n (x)+c_\alpha {\hat{D}}_n (x),
\end{eqnarray*} (4.3.14)

where ${\hat{b}}_n (x)$ denotes an estimate of the bias (mostly zero in parametric models), ${\hat{D}}_n (x)$ is a measure of dispersion, and $c_\alpha$ is the quantile chosen to achieve confidence level $1-\alpha$.

There are several approaches to computing upper and lower bands, $CUP(\cdot)$ and $C L O(\cdot)$. One approach is to use pointwise confidence intervals on a very fine grid of the observation interval. The level of these confidence intervals can be adjusted by the Bonferroni method in order to obtain uniform confidence bands. The gaps between the grid points can be bridged via smoothness conditions on the regression curve.
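For the first approach, the following is a minimal Python sketch of the Bonferroni adjustment (it assumes that the pointwise estimates, their standard errors, and approximate normality of the pointwise errors are already available from some estimator; the bridging of gaps between grid points is not shown):

\begin{verbatim}
import numpy as np
from scipy.stats import norm

def bonferroni_band(m_hat, se, alpha=0.05):
    # m_hat, se: pointwise estimates and standard errors on a grid of M points
    M = len(m_hat)
    z = norm.ppf(1 - alpha / (2 * M))   # pointwise level adjusted to alpha/M
    return m_hat - z * se, m_hat + z * se
\end{verbatim}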

Another approach is to consider ${\hat{m}}_h(x)-m(x)$ as a stochastic process (in $x$) and then to derive asymptotic Gaussian approximations to that process. The extreme value theory of Gaussian processes yields the level of these confidence bands.

A third approach is based on the bootstrap. By resampling one attempts to approximate the distribution of

\begin{displaymath}Z_n=\sup_{x \in {\cal X}}\left\vert {\hat{m}}_h (x)-m(x) \right\vert ,\end{displaymath}

which yields $C L O(\cdot)$ and $CUP(\cdot)$ as bands computed from the $(\alpha/2)$- and $(1-\alpha/2)$-quantiles of $Z_n$, respectively. Another bootstrap method is based on approximating the distribution of ${\hat{m}}_h(x)-m(x)$ at distinct points $x$ and then simultaneously correcting the pointwise confidence intervals in order to obtain the joint coverage probability $1-\alpha$.

4.3.1 Connected error bars

The approach by Hall and Titterington (1986b) is based on the discretization method, that is, constructing simultaneous error bars at different locations. They considered a fixed design regression model on the unit interval, that is,

\begin{displaymath}Y_i=m(X_i)+\epsilon_i, \end{displaymath}

where $X_i=i/n$ and the observation errors are normal with mean zero and variance $\sigma^2$. Take the Priestley-Chao-type kernel estimator with weight sequence

\begin{displaymath}W^{(2)}_{h i}(x)=K_h (x-X_i),\end{displaymath}

and uniform kernel $K$. Their construction works as follows.

Divide the region of interest into $M$ cells, the $j$th cell containing those observations $(X_i,Y_i)$ such that

\begin{displaymath}(j-1)q \le i \le j q,\ 1 \le j \le M.\end{displaymath}

The kernel estimator with uniform kernel for the $M$ cells can be written as
\begin{displaymath}
{\hat{m}}_h(x_j)=(n h)^{-1} \sum^{j n h-1}_{i=(j-1)n h} Y_i,
\end{displaymath} (4.3.15)

where $x_j$ is in block $j$ and the bandwidth is chosen so that $q=n h$. The expected value of ${\hat{m}}_h(x_j)$ is

\begin{eqnarray*}
\mu_j&=&q^{-1}(m(q(j-1)/n)+m(q(j-1)/n+1/n)\cr
&&+\cdots+m(q(j-1)/n+(q-1)/n))
\end{eqnarray*}



and the variance is $\sigma^2/q$. The variability bands are computed from simultaneous intervals for the $\mu_j$s, with
\begin{displaymath}
P \{{\hat{\mu}}_j^L \le \mu_j \le {\hat{\mu}}_j^U, 1 \le j \le M \}=\alpha.
\end{displaymath} (4.3.16)

The gap between adjacent cells is handled through the smoothness of $m$. Assume that the first derivative of the regression function in adjacent cells is bounded by constants $c_j$, that is,
\begin{displaymath}
\sup_{(j-1)h \le u \le (j+1)h} \left\vert m'(u) \right\vert \le c_j,\ 1 \le j \le
M.
\end{displaymath} (4.3.17)

For $x$ in the $j$th cell the difference between $m(x)$ and $\mu_j$ can then be bounded by

\begin{displaymath}V_j(\eta)=(1/2)((2 \eta+1)h+1/n)c_j,\ 0<\eta<1.\end{displaymath}

This motivates defining the upper and lower confidence bands in a fixed cell $j$ by

\begin{eqnarray*}
CUP((j+\eta)h)&=&{\hat{\mu}}_j^U+V_j(\eta),\ 0<\eta<1;\\
C L O((j+\eta)h)&=&{\hat{\mu}}_j^L-V_j(\eta),\ 0<\eta<1.
\end{eqnarray*}



It remains to construct the simultaneous confidence intervals ${\hat{\mu}}_j^L$ and ${\hat{\mu}}_j^U$. Suppose first that the error variance $\sigma^2$ is known. Let $\Phi$ denote the standard normal distribution function, and $c_\gamma$ denote the solution of $2 \Phi (c_\gamma)-1=\gamma$. Define

\begin{eqnarray*}
{\hat{\mu}}_j^U&=&{\hat{m}}_h(x)+(n h)^{-1/2} \sigma c_\gamma,\cr
{\hat{\mu}}_j^L&=&{\hat{m}}_h(x)-(n h)^{-1/2} \sigma c_\gamma
\end{eqnarray*}



for $x$ in the $j$th cell. Note that ${\hat{m}}_h(x)$ is normal with variance $(n
h)^{-1} \sigma^2$ and mean $\mu_j$. Then
\begin{displaymath}
P \{{\hat{\mu}}_j^L \le \mu_j \le {\hat{\mu}}_j^U, 1 \le j \le M
\}=\gamma^M.
\end{displaymath} (4.3.18)

Taking $\gamma=\alpha^{1/M}$ will give simultaneous coverage probability $\alpha$.

If the error variance is not known, Hall and Titterington (1986b) recommend constructing an estimate of $\sigma^2$ using differences of observations, that is,

\begin{displaymath}{\hat{\sigma}}^2=(2 n)^{-1} \sum_{i=2}^n (Y_i-Y_{i-1})^2.\end{displaymath}

Algorithm 4.3.1
Define the $M$ cells and compute the kernel estimator as in 4.3.15.

STEP 1.

Define

\begin{displaymath}{\hat{\mu}}_l^L={\hat{m}}_h (x_l)-(n h)^{-1/2}{\hat{\sigma}}c_\gamma\end{displaymath}

and

\begin{displaymath}{\hat{\mu}}_l^U={\hat{m}}_h (x_l)+(n h)^{-1/2} {\hat{\sigma}}c_\gamma,\end{displaymath}

where $2 \Phi (c_\gamma)-1=\gamma,\ \gamma^M=\alpha$.

STEP 2.

Let the bound of variation be

\begin{displaymath}V_l(\eta)=(1/2)((2 \eta+1)h+1/n)c_l,\ 0<\eta<1,\end{displaymath}

where $c_l$ is the upper bound on $m'(x)$ as in 4.3.17.

STEP 3.

Define as in 4.3.18

\begin{eqnarray*}
CUP((l+\eta)h)&=&{\hat{\mu}}_l^U+V_l(\eta),\ 0<\eta<1;\cr
C L O((l+\eta)h)&=&{\hat{\mu}}_l^L-V_l(\eta),\ 0<\eta<1.
\end{eqnarray*}



STEP 4.

Draw $C L O(\cdot)$ and $CUP(\cdot)$ around ${\hat{m}}_h(x_{l})$.
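A minimal Python sketch of this algorithm follows (equal-sized cells, the difference-based variance estimate, and evaluation of the smoothness correction at its worst case $\eta=1$ are illustrative simplifications):

\begin{verbatim}
import numpy as np
from scipy.stats import norm

def connected_error_bars(y, M, alpha=0.95, c_bound=1.0):
    # y: responses on the fixed design X_i = i/n; M: number of cells
    y = np.asarray(y, dtype=float)
    n = len(y)
    q = n // M                              # observations per cell, q = n*h
    h = q / n
    sigma2 = np.sum(np.diff(y) ** 2) / (2 * n)      # difference-based variance
    m_hat = np.array([y[j*q:(j+1)*q].mean() for j in range(M)])   # (4.3.15)
    gamma = alpha ** (1.0 / M)              # per-cell level, gamma^M = alpha
    c_gamma = norm.ppf((1 + gamma) / 2)     # solves 2*Phi(c) - 1 = gamma
    half = c_gamma * np.sqrt(sigma2 / q)
    V = 0.5 * (3 * h + 1.0 / n) * c_bound   # V_l(eta) at eta = 1, bound (4.3.17)
    return m_hat, m_hat - half - V, m_hat + half + V
\end{verbatim}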

The band $[ C L O(x), CUP(x) ]$ is a level $\alpha$ uniform confidence band for $m$ as was shown by Hall and Titterington (1986b). The discretization method was also employed by Knafl, Sacks and Ylvisaker (1985) and Knafl et al. (1984).

Figure 4.5 gives an example of a uniform confidence band for the radiocarbon data set (Suess 1980). The variables are those of radiocarbon age and tree-ring age, both measured in years before 1950 A.D. and thinned and rounded so as to achieve equal spacing of the tree-ring ages. For more background on this calibration problem see Scott, Baxter and Aitchison (1984). Altogether, 180 points were included. Hall and Titterington have chosen $M=30$ so that $q=6$. The bands are constructed with $\alpha=0.05$ and under the assumption of a single derivative with uniform bound $c=1$ on $m'(x)$.

Figure 4.5: Uniform confidence band for the radiocarbon data (Suess 1980). The uniform confidence band is constructed with bandwidth $h=15$. From Hall and Titterington (1986b).
\includegraphics[scale=0.2]{ANR4,5.ps}

A different approach to handling the fluctuation of $m(x)$ in between the grid points could be based on the arc length of $m(x)$ between two successive knot points. Adrian Bowman suggested that instead of bounding derivatives of $m$, one could assume an upper bound on the arc length,

\begin{displaymath}\int^{X_{i+1}}_{X_i}\sqrt{1+(m'(x))^2} d x,\end{displaymath}

between two grid points $X_i,X_{i+1}$. In this case the gaps between mesh points are bridged by a chain of ellipse shells with foci at the endpoints of neighboring confidence intervals.

4.3.2 Smooth confidence bands

It was shown in Theorem 4.2.1 that the suitably scaled kernel smoother ${\hat{m}}_h(x)$ has an asymptotic normal distribution,

\begin{displaymath}
\sqrt{n h} ({\hat{m}}_h(x)-m(x))\buildrel {\cal L}\over \to N(B(x),V(x)),
\end{displaymath} (4.3.19)

where $B(x)$ denotes the bias term and $V(x)$ is the asymptotic variance (depending on the kernel $K$ and the bandwidth). More generally consider now the left-hand side of 4.3.19 as a stochastic process in $x$. If this process converges in a suitable way to a Gaussian process $G(x)$ with known covariance structure, then a uniform confidence band can be constructed from the distribution of $\sup_x \left \vert G(x) \right \vert$. This latter distribution is well studied in extreme value theory of Gaussian processes.

In the setting of nonparametric regression use of this approach was made by Johnston (1982) for the kernel weights (with known marginal density $f$)

\begin{displaymath}W^{(1)}_{h i}(x)=K_h(x-X_i)/f(X_i).\end{displaymath}

Major (1973) used this method to find variability bands for the regressogram. Liero (1982) and Härdle (1987b) extended Johnston's results by allowing $W_{hi}(x)$ to be of the Nadaraya-Watson form.

The basic idea of the approach taken by these authors is to standardize the process $\sqrt{n h} [{\hat{m}}_h(x)-m(x)]$ and to approximate it by a suitable Gaussian process. More precisely, Johnston (1982) has shown that a suitably rescaled version of

\begin{displaymath}\sqrt{n h} [{\hat{m}}_h(x)-m(x)]\end{displaymath}

has approximately the same distribution as the stationary Gaussian process
\begin{displaymath}
G(x)=\int K(x-u) d W(u),
\end{displaymath} (4.3.20)

with covariance function $\int K(x) K(u-x) d x$ and standard Wiener process $W(x)$. For a definition of a Wiener process see, for instance, Serfling (1980, p. 41). Bickel and Rosenblatt (1973) derived the asymptotic distribution of the supremum $\sup_x \left \vert G(x) \right \vert$ of the process 4.3.20. This result permits construction of approximate confidence bands.

Theorem 4.3.1   Define

\begin{displaymath}{\hat{\sigma}}^2_h(x)=n^{-1} \sum^n_{i=1} K_h(x-X_i) Y^2_i/{\hat{f}}_h(x)-
{\hat{m}}_h^2(x).\end{displaymath}

Suppose that ${\cal X}=[0, 1]$ and

(A1) $m(x), f(x)$ and $\sigma ^2(x)$ are twice differentiable;

(A2) $K$ is a differentiable kernel with bounded support $[-A,A]$;

(A3) $E(\left \vert Y \right \vert^k \vert X=x)<C_k,\quad k=1,2,\ldots$;

(A4) $f(\cdot)$ is strictly positive on ${\cal X}$;

(A5) $h=n^{-\delta},\ 1/5<\delta<1/3$.

Then the maximal deviation between ${\hat{m}}_h(x)$ and $m(x)$ over ${\cal X}$ has the limiting distribution

\begin{eqnarray*}
&&P \left\{(2 \delta \log n )^{1/2} \left(\left({n h \over
c_K}\right)^{1/2}\sup_{x \in {\cal X}}\left({{\hat{f}}_h(x) \over {\hat{\sigma}}^2_h(x)}\right)^{1/2}
\left\vert {\hat{m}}_h (x)-m(x) \right\vert-d_n\right)<z \right\}\cr
&&\qquad\to \exp(-2 \exp(-z)),\ n \to \infty.
\end{eqnarray*}



Here

\begin{displaymath}d_n=(2 \delta \log n)^{1/2}+{1 \over {(2 \delta \log n)^{1/2}}} \{\log C_1
\pi^{-1/2}+(1/2)\log(\log n^\delta) \},\end{displaymath}

where

\begin{displaymath}C_1={K^2(A)+K^2(-A) \over 2 c_K}\end{displaymath}

if $C_1>0$, and otherwise

\begin{displaymath}d_n=(2 \delta \log n)^{1/2}+{1 \over {(2 \delta \log n)^{1/2}}}\left\{\log
\left({{C_2} \over {2 \pi^2}}\right)^{1/2} \right\}, \end{displaymath}

where

\begin{displaymath}C_2={{\int (K'(x))^2 d x} \over {2 c_K}}.\end{displaymath}

From this theorem one can obtain approximate confidence bands for $m$. Take the quartic kernel $K(u)=(15/16) (1-u^2)^2 I(\vert u \vert \le 1)$, for instance. For this kernel $c_K=5/7$ and it vanishes at the boundary of its support $[-1,1]$, so $C_1=0$ in Theorem 4.3.1. The following algorithm is designed for the quartic kernel.
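Both constants are easily verified for the quartic kernel:

\begin{eqnarray*}
c_K&=&\left({15 \over 16}\right)^2\int_{-1}^{1}(1-u^2)^4\, d u={225 \over 256}\cdot{256 \over 315}={5 \over 7},\cr
\int (K'(u))^2\, d u&=&\left({15 \over 4}\right)^2\int_{-1}^{1}u^2 (1-u^2)^2\, d u={225 \over 16}\cdot{16 \over 105}={15 \over 7},
\end{eqnarray*}

so that $C_2=(15/7)/(2 c_K)=3/2$.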

Algorithm 4.3.2
STEP 1.

Define from $c_K=5/7$ and $C_2=(15/7)/(2 c_K)$

\begin{eqnarray*}
d_n&=&(2 \log 1/h)^{1/2}+(2 \log 1/h)^{-1/2} \log\left({C_2 \over 2
\pi^2}\right)^{1/2},\cr
C L O(x)&=&{\hat{m}}_h(x)-\left[d_n+c_\alpha (2 \log 1/h)^{-1/2}\right]\cr
&&\times[c_K {\hat{\sigma}}^2_h(x)/({\hat{f}}_h(x) n h)]^{1/2}
\end{eqnarray*}



and $CUP(x)$ analogously, where $c_\alpha$ is such that

\begin{displaymath}\exp(-2 \exp (-c_\alpha))=1-\alpha.\end{displaymath}

STEP 2.

Plot the asymptotic confidence bands $CUP(x)$ and $C L O(x)$ around ${\hat{m}}_h(x)$.
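The following Python sketch implements these two steps (the evaluation grid, the vectorized Nadaraya-Watson computation, and the placement of $c_K$, which follows the studentization in Theorem 4.3.1, are illustrative choices, not a definitive implementation):

\begin{verbatim}
import numpy as np

def quartic_uniform_band(x_grid, X, Y, h, alpha=0.05):
    # x_grid, X, Y: numpy arrays; grid points should lie inside the data range
    n = len(X)
    cK = 5.0 / 7.0                        # int K^2 for the quartic kernel
    C2 = (15.0 / 7.0) / (2 * cK)          # = 3/2
    u = (x_grid[:, None] - X[None, :]) / h
    K = np.where(np.abs(u) <= 1, (15.0 / 16.0) * (1 - u**2)**2, 0.0) / h
    f_hat = K.mean(axis=1)                            # kernel density estimate
    m_hat = (K * Y).mean(axis=1) / f_hat              # Nadaraya-Watson smoother
    s2_hat = (K * Y**2).mean(axis=1) / f_hat - m_hat**2
    log_term = np.log(1.0 / h)
    d_n = np.sqrt(2 * log_term) \
        + np.log(np.sqrt(C2 / (2 * np.pi**2))) / np.sqrt(2 * log_term)
    c_alpha = -np.log(-0.5 * np.log(1 - alpha))       # exp(-2 exp(-c)) = 1 - alpha
    half = (d_n + c_alpha / np.sqrt(2 * log_term)) \
        * np.sqrt(cK * s2_hat / (f_hat * n * h))
    return m_hat - half, m_hat + half
\end{verbatim}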


Note that this theorem does not involve a bias correction as in Algorithm 4.3.1. The bias term is suppressed by assuming that the bandwidth $h$ tends to zero slightly faster than the optimal rate $n^{-1/5}$. A bias term could be included, but it has the rather complicated form 4.2.8; see Theorem 4.2.1. A means of automatically correcting for the bias is presented subsequently in the wild bootstrapping algorithm.

Figure 4.6 shows an application of Algorithm 4.3.2, a uniform confidence band for the expenditure Engel curve for food with $1-\alpha=0.95$.

Figure 4.6: Uniform confidence band for the food expenditure Engel curve for 1973. Shown is the kernel smooth $m_h(x)$, $h=0.35$ (quartic kernel), and a parametric fit, a third-order polynomial through zero. Family Expenditure Survey (1968-1981).
\includegraphics[scale=0.14]{ANR4,6.ps}

4.3.3 Bootstrap bands

The idea of bootstrap bands is to approximate the distribution of

\begin{displaymath}\sup_{x \in {\cal X}} \left \vert {\hat{m}}_h(x)-m(x) \right \vert\end{displaymath}

by the bootstrap technique. The following procedure, proposed by McDonald (1982), is based on the naive bootstrap. For this algorithm, unfortunately, no theoretical result is available.

Algorithm 4.3.3
b=0

REPEAT

b=b+1

STEP 1.

Sample $\{(X^*_i, Y^*_i)\}^n_{i=1}$ from the data $\{(X_i, Y_i) \}^n_{i=1}$.

STEP 2.

Construct a kernel smooth ${\hat{m}}_h^*(\cdot)$ from the bootstrap sample $\{(X^*_i, Y^*_i)\}^n_{i=1}$ and define $Z_b={\displaystyle \sup_{x \in
{\cal X}} \left \vert {\hat{m}}^*_h(x)-{\hat{m}}_h(x) \right \vert}$.

UNTIL b=B (=number of bootstrap samples).

STEP 3.

Define $C L O^*$ as the $\alpha/2$ quantile of the B bootstrap deviations $Z_b$. By analogy, define $CUP^*$ as the $1-\alpha/2$ quantile.

STEP 4.

Draw the interval $[ C L O^*, CUP^*]$ around every point $x$ of the observation interval.
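A sketch of this algorithm in Python follows (kernel\_smoother is an assumed helper returning the kernel estimate on a grid; the number of replications, the seed, and the bookkeeping are illustrative):

\begin{verbatim}
import numpy as np

def naive_bootstrap_band(X, Y, x_grid, h, kernel_smoother,
                         B=500, alpha=0.05, seed=0):
    # X, Y: data arrays; kernel_smoother(X, Y, grid, h) -> estimate on grid
    rng = np.random.default_rng(seed)
    n = len(X)
    m_hat = kernel_smoother(X, Y, x_grid, h)
    Z = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # STEP 1: resample the pairs
        m_star = kernel_smoother(X[idx], Y[idx], x_grid, h)
        Z[b] = np.max(np.abs(m_star - m_hat))     # STEP 2: sup deviation Z_b
    clo_star = np.quantile(Z, alpha / 2)          # STEP 3: quantiles of Z_b
    cup_star = np.quantile(Z, 1 - alpha / 2)
    return m_hat, clo_star, cup_star              # STEP 4: drawn around m_hat
\end{verbatim}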

The above algorithm is extremely computer intensive since on a fine grid of points the statistic $Z_b$ has to be computed $B$ times. A computationally less intensive procedure is to consider not a band but rather a collection of error bars based on bootstrapping. This method is based on a pointwise bootstrap approximation to the distribution of ${\hat{m}}_h(x)-m(x)$. In the following I describe the approach taken by Härdle and Marron (1988), which uses the wild bootstrap technique to construct pointwise confidence intervals. How can these pointwise confidence intervals be modified in order to cover the true curve $m(x)$ with simultaneous coverage probability $1-
\alpha$?

A straightforward way of extending bootstrap pointwise intervals to $M$ simultaneous confidence intervals is by applying the Bonferroni method. A drawback to the Bonferroni approach is that the resulting intervals will quite often be too long. The reason is that this method does not make use of the substantial positive correlation of the curve estimates at nearby points.

A more direct approach to finding simultaneous error bars is to consider the simultaneous coverage on pointwise error bars, and then adjust the pointwise level to give a simultaneous coverage probability of $1-
\alpha$. Fisher (1987, p. 394) called this a ``confidence ribbon'' since the pointwise confidence intervals are extended until they have the desired simultaneous coverage probability of $1-
\alpha$. A general framework, which includes both the Bonferroni and direct methods, can be formulated by thinking in terms of groups of grid points.

First, partition the set of locations where error bars are to be computed into $M$ groups, as in the Hall and Titterington approach. Suppose the groups are indexed by $j=1,\ldots,M$ and the locations within each group are denoted by $x_{j, k}, k=1,\ldots, N_j$. The groups should be chosen so that for each $j$ the $x_{j, k}$ values in each group are within $2h$ of each other. In the one-dimensional case this is easily accomplished by dividing the $x$-axis into intervals of length roughly $2h$.

In order to define a bootstrap procedure that takes advantage of this positive correlation consider a set of grid points $x_{j, k}$, $k=1,\ldots, N_j$ that have the same asymptotic location $c_k$ (not depending on $n$) relative to some reference point $x_{j,0}$ in each group $j$. Define

\begin{displaymath}x_{j, k}=c_k h+x_{j,0},\quad k=1,\ldots, N_j.\end{displaymath}

In the multidimensional case, the simplest formulation is to have each group lying in a hypercube with length $2h$.

Now, within each group $j$ use the wild bootstrap replications to approximate the joint distribution of

\begin{displaymath}\{{\hat{m}}_h(x_{j, k})-m(x_{j, k})\!: k=1,\ldots, N_j \}.\end{displaymath}

Recall Theorem 4.2.2 for the wild bootstrap approximation to this distribution. There it was shown that

\begin{displaymath}\sqrt{n h} [{\hat{m}}_h(x)-m(x)]\end{displaymath}

and

\begin{displaymath}\sqrt{n h}[{\hat{m}}_h^*(x)-{\hat{m}}_g(x)]\end{displaymath}

have the same limiting normal distribution. For each group $j$ this joint distribution is used to obtain error bars that are simultaneous over $k=1,\ldots, N_j$ at level $1-\alpha/M$, as follows. Let $\beta>0$ denote a generic size for individual confidence intervals. The goal is to choose $\beta$ so that the resulting simultaneous size is $1-\alpha/M$.

For each $x_{j, k}$, $k=1,\ldots, N_j$, define the interval $I_{j, k} (\beta)$ to have endpoints which are the $\beta/2$ and the $1-\beta/2$ quantiles of the $({\hat{m}}_h^*(x_{j, k})-{\hat{m}}_g(x_{j, k}))$ distribution. Then define $\alpha_\beta$ to be the empirical simultaneous size of the $\beta$ confidence intervals, that is, the proportion of curves which lie outside at least one of the intervals in the group $j$. Next find the value of $\beta$, denoted by $\beta_j $, which makes $\alpha_{\beta_j}=\alpha/M$. The resulting $\beta_j $ intervals within each group $j$ will then have confidence coefficient $1-\alpha/M$. Hence, by the Bonferroni bound the entire collection of intervals $I_{j, k}(\beta_j)$, $k=1,\ldots, N_j$, $j=1,\ldots,M$, will simultaneously contain at least $1-
\alpha$ of the distribution of ${\hat{m}}^*_h(x_{j,k})$ about ${\hat{m}}_g(x_{j,k})$. Thus the intervals $I_{j,k}(\beta_j)-{\hat{m}}_g(x_{j,k})+{\hat{m}}_h(x_{j,k})$ will be simultaneous confidence intervals with confidence coefficient at least $1-
\alpha$. The result of this process is summarized as the following theorem.

Theorem 4.3.2   Define $M$ groups of locations $x_{j, k}$, $k=1,\ldots, N_j$, $j=1,\ldots,M$, where simultaneous error bars are to be established. Compute uniform confidence intervals for each group. Correct the significance level across groups by the Bonferroni method. Then the bootstrap error bars establish asymptotic simultaneous confidence intervals, that is,

\begin{eqnarray*}
&&\lim_{n \to \infty} P \{m(x_{j, k}) \in I_{j, k}(\beta_j)-{\hat{m}}_g(x_{j,k})+{\hat{m}}_h(x_{j,k}),\cr
&&\qquad k=1,\ldots, N_j,\ j=1,\ldots, M \} \geq 1-\alpha.
\end{eqnarray*}



As a practical method for finding $\beta_j $ for each group $j$ we suggest the following ``halving'' approach. In particular, first try $\beta=\alpha/2 M$, and calculate $\alpha_\beta$. If the result is more than $\alpha/M$, then try $\beta=\alpha/4 M $, otherwise next try $\beta=3 \alpha/4 M$. Continue this halving approach until neighboring (since only finitely many bootstrap replications are made, there is only a finite grid of possible $\beta$s available) values $\beta_*$ and $\beta^*$ are found so that $\alpha_{\beta_*}<\alpha/M<\alpha_{\beta^*}$. Finally, take a weighted average of the $\beta_*$ and the $\beta^*$ intervals where the weights are $(\alpha_{\beta^*}-\alpha/M)/(\alpha_{\beta^*}-\alpha_{\beta_*})$ and $(\alpha/M-\alpha_{\beta_*})/(\alpha_{\beta^*}-\alpha_{\beta_*})$, respectively.
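A sketch of this halving search follows (alpha\_of is an assumed helper that evaluates the empirical simultaneous size $\alpha_\beta$ from the stored bootstrap replications of one group; the bisection bookkeeping and the iteration count are illustrative):

\begin{verbatim}
def halving_beta(alpha_of, target, beta0, iters=20):
    # target = alpha/M; beta0 = alpha/(2M) is the first trial value
    lo, hi = 0.0, 2.0 * beta0
    beta = beta0
    for _ in range(iters):
        if alpha_of(beta) > target:      # too much non-coverage: widen intervals,
            hi = beta                    # i.e. move to a smaller beta
        else:
            lo = beta
        beta = 0.5 * (lo + hi)
    beta_star, beta_sup = lo, hi         # alpha_{beta_*} < target < alpha_{beta^*}
    a_lo, a_hi = alpha_of(beta_star), alpha_of(beta_sup)
    w_star = (a_hi - target) / (a_hi - a_lo)    # weight for the beta_* intervals
    w_sup = (target - a_lo) / (a_hi - a_lo)     # weight for the beta^* intervals
    return beta_star, beta_sup, w_star, w_sup
\end{verbatim}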

Note that Theorem 4.3.2 contains, as a special case, the asymptotic validity of both the Bonferroni and the direct simultaneous error bars. Bonferroni is the special case $N_1=\cdots=N_M=1$, and the direct method is where $M=1$.

The wild bootstrap simultaneous error bars are constructed according to the following algorithm.

Algorithm 4.3.4
b=0

REPEAT

b=b+1

STEP 1.

Sample $\epsilon^*_i$ from the two-point distribution ${\hat{G}}_i$ of 4.2.9, where

\begin{displaymath}{\hat{G}}_i=\gamma \delta_a+(1-\gamma) \delta_b.
\end{displaymath}

STEP 2.

Construct wild bootstrap observations

\begin{displaymath}Y_i^*={\hat{m}}_g (X_i)+\epsilon^*_i,\end{displaymath}

where ${\hat{m}}_g(x)$ is a slightly oversmoothed kernel estimator with bandwidth $g$. Compute ${\hat{m}}^*_h(x)$ at $M$ different locations from the bootstrap sample $\{(X_i,Y^*_i)\}^n_{i=1}$.

UNTIL b=B=number of bootstrap samples.

STEP 3.

Calculate $I_{j, k}(\beta_j)$ as follows.

First try $\beta=\alpha/2 M$, and calculate $\alpha_\beta$.

If the result is more than $\alpha/M$, then try $\beta=\alpha/4 M $, otherwise next try $\beta=3 \alpha/4 M$.

Continue this halving approach until neighboring values $\beta_*$ and $\beta^*$ are found so that $\alpha_{\beta_*}<\alpha/M<\alpha_{\beta^*}$.

Finally, take a weighted average of the $\beta_*$ and the $\beta^*$ intervals where the weights are

\begin{displaymath}(\alpha_{\beta^*}-\alpha/M)/(\alpha_{\beta^*}-\alpha_{\beta_*})\end{displaymath}

and

\begin{displaymath}(\alpha/M-\alpha_{\beta_*})/(\alpha_{\beta^*}-\alpha_{\beta_*}),\end{displaymath}

respectively.

Define

\begin{displaymath}[C L O^*(x_{j,k}), CUP^*(x_{j,k})]={\hat{m}}_h(x_{j,k})-
{\hat{m}}_g(x_{j,k})+I_{j,k}(\beta_j).\end{displaymath}

STEP 4.

Draw the interval $[ C L O^*, CUP^*]$ around ${\hat{m}}_h(x)$ at the $M$ distinct points $x_1,\ldots, x_M$.
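A sketch of the resampling core of this algorithm follows (kernel\_smoother is an assumed helper; the two-point distribution used here is the common golden-section choice, which matches the first three moments of the residuals and is one way to realize ${\hat{G}}_i$ of 4.2.9):

\begin{verbatim}
import numpy as np

def wild_bootstrap_replications(X, Y, x_points, h, g, kernel_smoother,
                                B=500, seed=0):
    rng = np.random.default_rng(seed)
    m_g = kernel_smoother(X, Y, X, g)          # oversmoothed pilot fit at the X_i
    resid = Y - m_g
    gamma = (5 + np.sqrt(5)) / 10              # P(eps*_i = a_i)
    a = resid * (1 - np.sqrt(5)) / 2
    b = resid * (1 + np.sqrt(5)) / 2
    m_stars = np.empty((B, len(x_points)))
    for rep in range(B):
        pick = rng.random(len(Y)) < gamma      # STEP 1: draw eps*_i from G_i
        Y_star = m_g + np.where(pick, a, b)    # STEP 2: wild bootstrap data
        m_stars[rep] = kernel_smoother(X, Y_star, x_points, h)
    return m_stars                             # m*_h - m_g deviations feed STEP 3
\end{verbatim}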

This wild bootstrap technique was applied to the potato versus net income example. Figure 4.7 displays the error bars for this data.

Figure 4.7: Uniform error bars for the potato versus net income data set for 1973. The solid line is the kernel smooth $m_h(x)$, $h=0.35$ (quartic kernel). Family Expenditure Survey (1968-1981).
\includegraphics[scale=0.2]{ANR4,7.ps}

To study the practical difference between the various types of error bars, Härdle and Marron (1988) considered the distribution of ${\hat{m}}_h(x)-m(x)$ at a grid of $x$ values for some specific examples. They chose the underlying curve to be of linear form, with an added bell-shaped hump,

\begin{displaymath}m(x)=x+4 e^{-2 x^2}/\sqrt{2 \pi}.\end{displaymath}

To see what this looks like, consider Figure 4.8. The solid curve in each part of Figure 4.8 is this $m(x)$.

Figure 4.8: Solid curve is $m(x)$, crosses are a realization of $Y_1$, ..., $Y_{200}$ for (a) $\sigma=.3$, (b) $\sigma=.6$, (c) $\sigma=1$, (d) $\sigma=1.5$. Dotted curve is ${\hat{m}}_{h_0}(x)$; curve at bottom shows effective window width used by ${\hat{m}}_{h_0}(x)$. (ANR200obs.xpl) From Härdle and Marron (1988).
\includegraphics[scale=0.7]{ANR200obs.ps}

The marginal distribution of $X$ is $N(0,1)$, and the conditional distribution of $Y \vert X$ is $N(m(X),\sigma^2)$, for $\sigma=.3, .6, 1, 1.5$. For each of these four distributions 200 observations were generated. Figure 4.8 shows one realization from each of the four settings. Figure 4.8 also shows ${\hat{m}}_{h_0}(x)$, as calculated from the crosses, for each setting, together with a plot of the kernel function at the bottom which shows the effective amount of local averaging being done in each case. The bandwidth for these is $h_0$, the optimal bandwidth as in Härdle and Marron (1985b), where the weight function $w(x)$ in that paper was taken to be the indicator function of $[-2,2]$. As expected, more smoothing is required when the error variance is larger.
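A sketch of this simulation design (function and variable names, and the seed, are arbitrary):

\begin{verbatim}
import numpy as np

def simulate(n=200, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal(n)                            # X ~ N(0,1)
    m = X + 4 * np.exp(-2 * X**2) / np.sqrt(2 * np.pi)    # bumped linear curve
    Y = m + sigma * rng.standard_normal(n)                # Y | X ~ N(m(X), sigma^2)
    return X, Y
\end{verbatim}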

To study the differences among the various error bars, for each setting, 500 pseudo data sets were generated. Then kernel estimates were calculated, at the points $x=-2,-1.8,-1.6,\ldots, 1.8, 2$ using a standard normal density as kernel. The bandwidth was chosen to be $h_0$. Figure 4.9 shows, for the $\sigma=1$ distribution, $m(x)$ overlayed with error bars whose endpoints are various types of quantiles of the distribution of ${\hat{m}}_h(x)$. The centers of the error bars are at the means of these distributions, and show clearly the bias that is inherent in nonparametric regression estimation. Note, in particular, how substantial bias is caused by both the curvature of $m(x)$ near the hump, and by the curvature of $f(x)$ near $x=-2,2$. The bars in Figure 4.9a are 80% pointwise error bars. In Figure 4.9b they are 80% simultaneous bars. In Figure 4.9c, the $x$ values were split up into the neighborhoods $\{-2,\ldots,-1 \}$, $\{-.8,\ldots,0 \}$, $\{.2,\ldots,1 \}$, $\{1.2,\ldots,2 \}$ and the neighborhood method of Theorem 4.3.2 was used. Figure 4.9d shows the completely Bonferroni 80% error bars.

Figure 4.9: Overlay of $m(x)$ with empirical (from 500 simulation runs) quantiles of ${\hat{m}}_{h_0}(x)$ distribution. Centers of bars are means of distributions. Error bars are (a) 80% pointwise,(b) 80% simultaneous, (c) 80% neighborhood, (d) 80% Bonferroni. From Härdle and Marron (1988).
\includegraphics[scale=0.2]{ANR4,9a.ps} \includegraphics[scale=0.2]{ANR4,9b.ps}

For easy comparison of the lengths of these intervals, consider Figure 4.10. This shows, for the same $x$ values, the lengths of the various bars in Figure 4.9. Of course, these bars are shorter near the center, which reflects the fact that there is more data there, so the estimates are more accurate. As expected, the lengths increase from pointwise, to actual simultaneous, to neighborhood, to Bonferroni bars. Also note that, as stated above, the difference between the actual simultaneous bars and the neighborhood simultaneous bars is really quite small, whereas the pointwise bars are a lot narrower.

Figure 4.10: Lengths of the bars in Figure 4.9, where the $x$ locations are the same. From Härdle and Marron (1990).
\includegraphics[scale=0.2]{ANR4,10.ps}

Exercises
4.3.1 Refine Algorithm 4.3.2 to allow a kernel with $K(A)>0$.

4.3.2 Use the WARPing algorithm of Exercise 4.2.3 on the wild bootstrap smoothing to program Algorithm 4.3.4 for simultaneous error bars.

4.3.3 Compare the naive bootstrap bands with the wild bootstrap error bars. Where do you see the essential difference?

[Hint: Consider the bias of ${\hat{m}}_h(x)$.]

4.3.4 Use Algorithm 4.3.2 to find a smooth uniform confidence band for the motorcycle data set.

4.3.5 Could you translate Theorem 4.3.1 into the world of $k$-$NN$ smoothing using the equivalence statements of Section 3.11?

4.3.4 Complements

An important issue is how to fine tune the choice of the pilot bandwidth $g$. Though it is true that the bootstrap works (in the sense of giving asymptotically correct coverage probabilities) with a rather crude choice of $g$, it is intuitively clear that specification of $g$ will play a role in how well it works for finite samples. Since the main role of the pilot smooth is to provide a correct adjustment for the bias, we use the goal of bias estimation as a criterion. We think theoretical analysis of the above type will be more straightforward than allowing the $N_j$ to increase, which provides further motivation for considering this general grouping framework.

In particular, recall that the bias in the estimation of $m(x)$ by ${\hat{m}}_h(x)$ is given by

\begin{displaymath}b_h(x)=E^{Y \vert X}[{\hat{m}}_h(x)]-m(x).\end{displaymath}

The bootstrap bias of the estimator constructed from the resampled data is

\begin{eqnarray*}
{\hat{b}}_{h,g}(x)&=&E^*[{\hat{m}}_h^*(x)]-{\hat{m}}_g(x)\cr
&=&n^{-1} \sum_{i=1}^n K_h (x-X_i) {\hat{m}}_g(X_i)/{\hat{f}}_h(x)-{\hat{m}}_g(x).
\end{eqnarray*}



The following theorem gives an asymptotic representation of the mean square error for the problem of estimating $b_h(x)$ by ${\hat{b}}_{h,g}(x)$. It is then straightforward to find a $g$ that minimizes this representation. Such a choice of $g$ will make the means of the $Y \vert X$- and $*$-distributions close to each other.

Theorem 4.3.3   Under the conditions of Theorem 4.2.2, along almost all sample sequences,

\begin{displaymath}E[({\hat{b}}_{h,g}(x)-b_h(x))^2 \vert X_1,\ldots,X_n] \sim h^4[C_1 n^{-1}g^{-
5}+C_2 g^4],\end{displaymath}

in the sense that the ratio tends in probability to one, where

\begin{eqnarray*}
C_1&=&\int(K'')^2 {1 \over 2} d_K \sigma^2 (x)/f(x),\cr
C_2&=&({1 \over 2} d_K)^4 [(m f)^{(4)}-(m f'')'']^2 (x)/f(x)^2.
\end{eqnarray*}



An immediate consequence of Theorem 4.3.3 is that the rate of convergence of $g$ for $d=1$ should be $n^{-1/9}$. This makes precise the above intuition which indicated that $g$ should be slightly oversmoothed. In addition, under these assumptions reasonable choices of $h$ will be of the order $n^{-1/5}$. Hence, Theorem 4.3.3 shows once again that $g$ should tend to zero more slowly than $h$. A proof of Theorem 4.3.3 is contained in Härdle and Marron (1988).
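Indeed, since the factor $h^4$ does not involve $g$, the rate follows by minimizing the bracketed expression with respect to $g$:

\begin{displaymath}
{\partial \over \partial g}\left[C_1 n^{-1} g^{-5}+C_2 g^4\right]=-5 C_1 n^{-1} g^{-6}+4 C_2 g^3=0
\quad\Longrightarrow\quad
g_{opt}=\left({5 C_1 \over 4 C_2}\right)^{1/9} n^{-1/9}.
\end{displaymath}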