4.1 The speed at which the smooth curve converges

This section is rather theoretical in spirit. The reader more interested in practical consequences of the theoretical results should go directly to the Exercises and Complements.

In parametric problems the speed at which estimated parameters tend to the true value is typically $n^{-1/2}$ (Bickel and Doksum 1977, chapter 4.4). By contrast, in the nonparametric curve estimation setting, the rate of convergence, if quantified, for instance, by the square root of the quadratic deviation, is usually of slower order $n^{-r}, 0<r<1/2$. The subject of this section is to shed some light on the dependence of this rate $r$ on four important qualitative features of regression smoothing:

(S) the Smoothness of $m$;

(D) the Dimension $d$ of $X$;

(O) the Object $m^{(k)}$, the $k$th derivative of $m$, that is to be estimated;

(T) the Type of estimator that is used.

Other distance measures, for example, the uniform deviation $d_{L_\infty} ({\hat{m}}, m)$, also depend on these four characteristics of regression smoothing but might have a slightly different rate. Consider for the moment just the mean integrated squared error $d_M ({\hat{m}}, m)$, for which we would like to analyze the speed of convergence. Let ${b_n}$ be a sequence of positive constants. It is called a lower rate of convergence if for some $c>0$ and $n \ge n_0$

\begin{displaymath}\inf_{{\hat{m}}} \sup_{m \in {\cal M}} d_M ({\hat{m}}, m) \ge c b_n^2.\end{displaymath}

Here the $\inf_{\hat{m}}$ denotes the infimum over all possible estimators ${\hat{m}}$ of $m$ and the $\sup_{m \in {\cal M}}$ denotes the supremum over a class of functions ${\cal M}$ with certain smoothness properties. A lower rate of convergence ${b_n}$ is thus a sequence of constants that tends to zero faster than any smoother ${\hat{m}}$ converges to $m$ in a uniform sense. The sequence is said to be an achievable rate of convergence if there is an estimator ${\hat{m}}$ and a $C>0$ such that for $n \ge
n_0$

\begin{displaymath}\sup_{m \in {\cal M}} d_M ({\hat{m}}, m) \le C b_n^2.\end{displaymath}

An achievable rate of convergence is thus a sequence that tends to zero slower than a specific smoother converges to the true regression function. An optimal rate of convergence is a rate that is both a lower and an achievable rate of convergence. A consequence of these definitions is that if ${b_n}$ is a lower rate of convergence and ${b'_n}$ is an achievable rate of convergence, the sequence ${b_n}$ must tend to zero faster than ${b'_n}$ in the sense that $b'_n \ge c_1 b_n$ for some $c_1>0$ and $n \ge
n_0$. The notion of optimal rate of convergence is not unique since, with an optimal rate ${b_n}$, ${b_n (1+o(1))}$ also is optimal. Asymptotically, optimal rates of convergence differ by only a constant, so it makes sense to call any optimal rate of convergence the optimal rate of convergence. A smoother ${\hat{m}}$ that achieves the optimal rate is called asymptotically optimal.
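
For an optimal rate ${b_n}$ both requirements hold simultaneously: there is an estimator ${\hat{m}}^{*}$ and constants $c, C>0$ such that for $n \ge n_0$

\begin{displaymath}c\, b_n^2 \le \inf_{{\hat{m}}} \sup_{m \in {\cal M}} d_M ({\hat{m}}, m) \le \sup_{m \in {\cal M}} d_M ({\hat{m}}^{*}, m) \le C\, b_n^2,\end{displaymath}

so no estimator can improve on ${\hat{m}}^{*}$ uniformly over ${\cal M}$ by more than a constant factor.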

So far the concept of optimal rates of convergence has been defined through the mean integrated squared error (MISE) $d_M$. It turns out that for kernel estimators one could equally well have stated the concept of optimal rates in terms of the integrated squared error $d_I$ or some other distance measure, such as $d_A$; see Härdle (1986b). Asymptotically these distances are equally sharp, as is shown in the following theorem by Marron and Härdle (1986, theorem 3.4).

Theorem 4.1.1   Assume that

(A1) $E(Y^k \vert X=x) \le C_k<\infty$, $k=1,2,\ldots;$

(A2) $f(x)$ is Hölder continuous and is positive on the support of $w$;

(A3) $K$ is Hölder continuous.

Then for kernel estimators

\begin{eqnarray*}
\sup_{h \in H_n} \left\vert d_A(h)-d_M(h) \right\vert /d_M(h) & \to 0 & \quad a.s.,\\
\sup_{h \in H_n} \left\vert d_I(h)-d_M(h) \right\vert /d_M(h) & \to 0 & \quad a.s.,
\end{eqnarray*}



where $H_n=\left[n^{\delta-1/d},n^{-\delta}\right]$, $0<\delta<1/(2 d)$ and $d_\bullet (h)$ is short for
$d_\bullet({\hat{m}}_h,m)$.

The optimal global rates of convergence for nonparametric regression were derived for the case $d=1$ by Ibragimov and Hasminskii (1980) and for the multidimensional case by Stone (1982), Nussbaum (1985) and Nemirovskii, Polyak and Tsybakov (1985). Nemirovskii et al. (1985) gave very general results on the optimal rates of convergence, including the setup with an $L_p$ loss function. Stone (1982) derived optimal global rates of convergence using $d_I$ among other distance measures. The following theorem is an immediate consequence of Stone's (1982) theorem 1. To formulate it I need some notation. Let ${\cal M}={\cal M}_{p,\beta}$ be the smoothness class of all $p$-times differentiable functions $m$ on the real line whose $p$th derivative is Hölder continuous with exponent $\beta$,

\begin{displaymath}\vert m^{(p)} (u)-m^{(p)} (v) \vert \le L_\beta \vert u-v \vert^\beta,\end{displaymath}

where $0<\beta \le 1$.

Theorem 4.1.2   Suppose that

(A1) $w(x)$ is the indicator function of the compact set ${\cal X}$;

(A2) the conditional distribution of $Y$ given $X=x$ is normal with variance $\sigma ^2(x)$;

(A3) the conditional variance $\sigma ^2(x)$ is bounded from above as well as bounded away from zero on some compact set ${\cal X}' \subset {\cal X}$;

(A4) the marginal density $f(x)$ is bounded away from zero on ${\cal X}'$;

(S) $m(x)$ is in the smoothness class ${\cal M}_{p,\beta}$;

(D) $X$ is one-dimensional;

(O) $m^{(k)}$, $k \le p$, is to be estimated.

Then the lower rate of convergence is $n^{-r}$ with

\begin{displaymath}
r=\frac{p+\beta-k}{2(p+\beta)+1}.
\end{displaymath} (4.1.1)

Stone (1982) proved that under the assumptions of this theorem the rate $n^{-r}$ with $r$ as in 4.1.1 is also achievable in some weaker sense than defined earlier. He also showed that for a suitable generalization of ${\cal M}_{p,\beta}$ the optimal rate in this weaker sense is $n^{-r}$ with

\begin{displaymath}
r=\frac{p+\beta-k}{2(p+\beta)+d},
\end{displaymath} (4.1.2)

with $d$ denoting the dimension of $X$. Ibragimov and Hasminskii (1980), Nussbaum (1985) and Nemirovskii et al. (1985) proved that this rate is achievable under equispaced (or nearly equispaced) design. The achievability for the one-dimensional case will become evident from the calculations given in this section and in Section 4.5. Pointwise convergence rates are analogous and were derived by Stone (1980).

Note that the optimal rate tends to zero faster if the regression curve possesses more derivatives. It tends to zero more slowly if the $X$-variable is of higher dimension or if higher order derivatives $m^{(k)}$ of $m$ are to be estimated.
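
The dependence of $r$ on (S), (D) and (O) in 4.1.2 is easy to tabulate. The following short Python sketch (not part of the original text; the function name is arbitrary, and $\beta=0$ is used for simplicity) prints $r$ for a few combinations of $p$, $k$ and $d$:

\begin{verbatim}
# Optimal rate exponent r = (p + beta - k) / (2*(p + beta) + d), cf. (4.1.2).
def rate_exponent(p, beta=0.0, k=0, d=1):
    return (p + beta - k) / (2.0 * (p + beta) + d)

# More smoothness (p) helps; higher dimension (d) or derivative order (k) hurts.
for p, k, d in [(2, 0, 1), (4, 0, 1), (4, 2, 1), (2, 0, 3)]:
    print("p=%d, k=%d, d=%d: r = %.3f" % (p, k, d, rate_exponent(p, k=k, d=d)))
\end{verbatim}

For $(p,k,d)=(4,2,1)$ this gives $r=2/9$, which matches the example below, where $d_M$ decreases like $n^{-2r}=n^{-4/9}$.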

Kernel estimators are asymptotically optimal if the bandwidth sequence and the kernel function are suitably chosen. Consider the fixed design model in which the data are taken at $X_i=i/n$ and $Y_i=m(X_i)+\epsilon_i$, where $\epsilon_i$ is normal with variance $\sigma^2$. Suppose that $m$ is four-times differentiable and it is desired to estimate the second derivative $m^{(2)}
(x)$.

Theorem 4.1.2 says that for this estimation problem $(p=4, k=2, d=1)$ the MISE $d_M$ cannot tend to zero faster than $n^{-4/9}$. If this rate is also achievable it is optimal. In particular, $n^{-4/9}$ is a lower rate of convergence for $d_M$. (Recall the definition of optimal rate of convergence.) I shall show that the rate $n^{-4/9}$ is achievable over ${\cal M}_{4,0}$ for a certain kernel. Take the Priestley-Chao kernel estimate with weight sequence $\{ W_{hi}^{(2)}\} $:

\begin{displaymath}
m^{(2)}_h (x)=n^{-1} h^{-3} \sum^n_{i=1} K^{(2)} \left(\frac{x-X_i}{h}\right)
Y_i,
\end{displaymath} (4.1.3)

where $K^{(2)}$ denotes the second derivative of a symmetric kernel function. It is not hard to derive that (see the Complements of this section and of Section 3.1),
\begin{eqnarray*}
\hbox{var}\{m^{(2)}_h (x)\} &=& O(n^{-1} h^{-5}),\\
\hbox{bias}^2 \{m^{(2)}_h (x)\} &=& O\left(\int (m^{(4)}(x))^2 w(x)\, d x\ h^4\right),
\end{eqnarray*} (4.1.4)

as $h \to 0, n h^5 \to \infty $. So if the bandwidth is set at $h \sim n^{- 1/9}$,

\begin{displaymath}\sup_{m \in {\cal M}_{4,0}} d_M ({\hat{m}}^{(2)}_h, m^{(2)}) \le Cn^{-4/9}\end{displaymath}

for some $C>0$ and for $n \ge
n_0$. Thus $n^{-4/9}$ is by definition an achievable rate of convergence.
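
The choice $h \sim n^{-1/9}$ comes from balancing the two terms in 4.1.4,

\begin{displaymath}n^{-1} h^{-5} \sim h^4 \quad \Longleftrightarrow \quad h \sim n^{-1/9},\end{displaymath}

so that both terms are of order $n^{-4/9}$. The following Python sketch illustrates the estimator 4.1.3 in this setting; it is not taken from the text, and the kernel (second derivative of the Gaussian density), the test curve $m(x)=x^3$, the noise level and the constant in front of $n^{-1/9}$ are arbitrary choices.

\begin{verbatim}
# Priestley-Chao estimate of m''(x), cf. (4.1.3), with h of order n^(-1/9).
import numpy as np

def K2(u):
    # second derivative of the standard Gaussian density
    return (u**2 - 1.0) * np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def m2_hat(x, X, Y, h):
    # n^{-1} h^{-3} sum_i K''((x - X_i)/h) Y_i
    return np.sum(K2((x - X) / h) * Y) / (len(X) * h**3)

rng = np.random.default_rng(0)
n = 10000
X = np.arange(1, n + 1) / n                  # fixed design X_i = i/n
Y = X**3 + 0.1 * rng.standard_normal(n)      # m(x) = x^3, so m''(x) = 6x
h = 0.25 * n ** (-1.0 / 9.0)                 # rate-optimal order; constant is ad hoc
x0 = 0.5
print(m2_hat(x0, X, Y, h), "vs true value", 6 * x0)
\end{verbatim}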

Note that if $h$ is not of the order $n^{-1/9}$, the kernel estimator in this example will not achieve the optimal rate. To illustrate this, consider a bandwidth sequence $h=n^{-1/9+\delta}$ with $\delta$ positive or negative. If $\delta >0$ then the squared bias component of $d_M ({\hat{m}}^{(2)}_h, m^{(2)})$ dominates and $d_M \approx n^{-4/9}\, n^{4 \delta}$. If $\delta<0$ then the variance component of $d_M ({\hat{m}}^{(2)}_h, m^{(2)})$ dominates and $d_M \approx n^{-4/9}\, n^{-5 \delta}$. In either case, the rate of convergence is slower than the optimal rate $n^{-4/9}$.

This example shows that it is very important to tune the smoothing parameter $h$ to the right speed in order to balance bias$^2$ and variance. In Section 5.1 it will be seen how data-driven bandwidths can be constructed which automatically achieve the correct rate of convergence, giving asymptotically optimal estimates. The need for such smoothness assumptions might seem somewhat at odds with the merits of the nonparametric smoothing approach: one motivates nonparametric regression estimation by the desire to assume less about the structure of $m$ than in a parametric framework, yet in order to construct optimal estimators one seems to need the very specific assumption that derivatives up to a certain order exist. A way out of this dilemma is presented in Section 5.1, where it is seen that the smoothing parameter can, in fact, be adapted to the degree of smoothness of $m$ without prior knowledge of its degree of differentiability.

On the other hand, the aim of achieving the optimal rate over a specific smoothness class should not be taken too literally, since in a practical situation the number $n^{-r_1},\ r_1=p_1/(2 p_1+1)$, will not be much different from $n^{-r_2},\ r_2=p_2/(2 p_2+1)$. Suppose that $p_1=16$. Even if we double the degree of differentiability to $p_2=32$ to achieve a better rate of convergence, the relative improvement, $n^{-r_2}/n^{-r_1}$, for a sample size of $n = 100$ is only about 3.5 percent.
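
Indeed, with $r_1=16/33$, $r_2=32/65$ and $n=100$,

\begin{displaymath}\frac{n^{-r_2}}{n^{-r_1}}=100^{-(32/65-16/33)} \approx 0.966,\end{displaymath}

an improvement of less than 4 percent.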

Note that there are kernel smoothers of the form 4.1.3 which do not achieve the optimal rate of $n^{-4/9}$. This addresses the fourth characteristic (T) of regression smoothing: the type of estimator has to be selected in a reasonable way in order to achieve the optimal rate. Suppose that we had taken an asymmetric kernel in 4.1.3 which does not fulfill the orthogonality condition $\int u\ K(u)\, d u=0$. A little calculus and integration by parts yield

\begin{eqnarray*}
\hbox{bias}\{{\hat{m}}^{(2)}_h (x)\} &=& \int K_h\ (x-u)\ [m^{(2)}(x)-m^{(2)}(u)]\, du +O(n^{-1}h^{-1})+o(h)\\
&\approx& h\, m^{(3)}(x) \int u K(u)\, d u,\quad h \to 0,\ n h \to \infty,
\end{eqnarray*} (4.1.5)

which tends to zero more slowly than $h^2$, the order of the bias for symmetric kernels.
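
To make the role of the orthogonality condition explicit, substitute $u=x-hv$ and Taylor expand the leading term in 4.1.5:

\begin{displaymath}\int K_h (x-u)\,[m^{(2)}(x)-m^{(2)}(u)]\, du \approx h\, m^{(3)}(x) \int v K(v)\, d v-\frac{h^2}{2}\, m^{(4)}(x) \int v^2 K(v)\, d v .\end{displaymath}

The $O(h)$ term vanishes precisely when $\int u K(u)\, d u=0$, leaving the $O(h^2)$ bias of symmetric kernels.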

4.1.1 Rates of convergence for nonquadratic distance measures

More general distance measures of the form

\begin{displaymath}d_{L_\nu}\ ({\hat{m}},m)=\left[ \int \left\vert {\hat{m}}(x)-m(x) \right\vert ^\nu w(x) d x
\right]^{1/\nu},\quad \nu \ge 1,\end{displaymath}

have also been considered in the literature (Prakasa Rao 1983, p. 244). The distance for $\nu=\infty$ is defined, as usual, as the uniform maximal deviation

\begin{displaymath}\sup_{x \in {\cal X}}\ \left\vert {\hat{m}}(x)-m(x) \right\vert .\end{displaymath}

As with quadratic distance measures one can define an optimal rate of convergence. Stone (1982) proved that, with $r$ as in Theorem 4.1.2, $n^{-r}$ is the optimal rate of convergence for $d_{L_\nu}$ if $1 \le \nu<\infty $. If $\nu=\infty$, the optimal rate is $n^{-r}(\log n)^r$. The uniform maximal deviation thus converges to zero slightly more slowly. In the Complements of this section I show that under some weak conditions
\begin{displaymath}
\sup_{x \in {\cal X}} \left\vert {\hat{m}}(x)-m(x) \right\vert =O_p \left(\max \left\{(n h/(\log n))^{-1/2},\ h \right\}\right).
\end{displaymath} (4.1.6)

This result was also obtained by Mack and Silverman (1982, section 3). If the bandwidth sequence is chosen as $h=h_n=O((n/\log n)^{-1/3})$, one obtains the rate $O_p((n/\log n)^{-1/3})$ in 4.1.6, which is the optimal rate $n^{-r}(\log n)^r$ with $r$ from 4.1.2 for $p=1, k=0, d=1$ and $\nu=\infty$. In Härdle, Janssen and Serfling (1988) this was shown to be an achievable rate of convergence not only for estimation of $m$ but also for other smooth functionals of the conditional distribution function. Such functionals include the nonparametric scale curve (see also Section 6.2) and trimmed $L$-smoothers or $M$-smoothers (Section 6.1).
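
A small simulation gives a feel for the two accuracy measures and may serve as a starting point for Exercise 4.1.2 below. The Python sketch is not from the text; the Priestley-Chao smoother, the Gaussian kernel, the test curve, the noise level and the bandwidth constant are ad hoc choices, and at these sample sizes the $\log n$ effect is subtle.

\begin{verbatim}
# Monte Carlo comparison of sup-norm and averaged squared deviation of a
# Priestley-Chao kernel smooth on the fixed design X_i = i/n.
import numpy as np

rng = np.random.default_rng(1)

def K(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel

def pc_smooth(xgrid, X, Y, h):
    W = K((xgrid[:, None] - X[None, :]) / h) / (len(X) * h)
    return W @ Y

m = lambda x: np.sin(2.0 * np.pi * x)
for n in (100, 400, 1600):
    X = np.arange(1, n + 1) / n
    h = 0.25 * (n / np.log(n)) ** (-1.0 / 3.0)   # sup-norm rate-optimal order
    xg = np.linspace(0.1, 0.9, 200)              # interior grid, avoids boundary bias
    sup_dev, msd = 0.0, 0.0
    for _ in range(50):                          # Monte Carlo repetitions
        Y = m(X) + 0.2 * rng.standard_normal(n)
        err = pc_smooth(xg, X, Y, h) - m(xg)
        sup_dev += np.max(np.abs(err)) / 50
        msd += np.mean(err**2) / 50
    print(n, round(sup_dev, 3), round(msd, 4))
\end{verbatim}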

Exercises
4.1.1 From the discussion after Theorem 4.1.2 we have seen that the type of estimator also has to be chosen appropriately in order to achieve the optimal rate of convergence. Consider now the fixed design setting with $m \in C^4$ and a positive kernel weight sequence. Such a kernel smoother cannot achieve the optimal rate, which is

\begin{displaymath}n^{-((2 \cdot 4)/(2 \cdot 4+1))}.\end{displaymath}

Why? Which rate is achieved instead?

[Hint: Compute the bias as in Section 3.1]

4.1.2 Conduct a small Monte Carlo study in which you compare the distances $d_{L_\infty}$ and $d_M$ for the same bandwidth. Do you observe the slower rate of convergence for $d_{L_\infty}$?

4.1.3 Describe the qualitative differences of the accuracy measures $d_{L_\infty}$ and $d_M$.

[Hint: Consider a situation where the smooth curve contains a wild single spike or wiggles around the true curve with a small variation.]

4.1.4 Give exact arguments for 4.1.4 in the fixed design setting.

[Hint: Look up the paper by Gasser and Müller (1984).]

4.1.5 Compute the optimal bandwidth that minimizes the first two dominant terms of MISE. Interpret the constants occurring in this asymptotically optimal bandwidth. When will $h$ tend to be large? When would you expect $h$ to be small?

4.1.6 Compute the bandwidth that balances the stochastic and the bias term for the supremum distance. Does it tend to zero faster or more slowly than the MSE-optimal bandwidth?

[Hint: The stochastic term is of order $O_p((n h)^{-1/2} (\log n)^{1/2})$ as is shown in the Complements. The systematic bias term is, as seen above, of order $O(h^2)$ for $m \in {\cal C}^2$.]

4.1.2 Complements

To gain some insight into why the $d_{L_\infty}$ rate has this additional log term, consider the one-dimensional case $d=1$. We have to estimate the following probability.

\begin{eqnarray*}
\lefteqn{P \left\{\sup_{x \in {\cal X}}\vert\hat{m}(x)-m(x)\vert>\delta b_n\right\}}\\
&\le& P \left\{\max_{1 \le l \le M_n}\vert\hat{m}(x_l)-m(x_l)\vert>\frac{\delta}{2}b_n\right\}\\
&&+\;P \left\{\max_{1 \le l \le M_n}\sup_{\vert x-x_l\vert \le \eta_n}\vert(\hat{m}(x)-m(x))-\;(\hat{m}(x_l)-m(x_l))\vert>\frac{\delta}{2}b_n\right\},
\end{eqnarray*}



where the $M_n$ intervals $\{x: \left\vert x-x_l \right\vert \le \eta_n
\}$ cover the compact set of interest ${\cal X}$. If $\eta_n$ is chosen small enough, then the second term is negligible compared to the first one. The first term can be estimated via the Bonferroni inequality, so it remains to bound the following

\begin{eqnarray*}
&& \sum^{M_n}_{l=1} P \left\{\left\vert {\hat{m}}(x_l)-m(x_l) \right\vert >{\delta \over 2} b_n \right\}\\
&&\qquad \le\ M_n \sup_{l=1,\ldots,M_n} P \left\{\left\vert {\hat{m}}(x_l)-m(x_l) \right\vert >{\delta \over 2} b_n \right\}.
\end{eqnarray*}



Suppose now that ${\hat{m}}$ is the kernel estimator, $d=1, X_i=i/n$ and $m$ is Lipschitz continuous, that is, $p=1$. Then

\begin{eqnarray*}
\vert {\hat{m}}(x_l)-m(x_l) \vert&\le&\vert E({\hat{m}}(x_l))-m(x_l) \vert
+\left\vert n^{-1} \sum^n_{i=1} K_h (x_l-X_i) \epsilon_i \right\vert.
\end{eqnarray*}



Choosing $h=C b_n$, where $C>0$ is some small constant, the bias term is of order $h=C b_n$ and thus at most $({\delta / 4})\, b_n$ for $\delta$ large enough, so that

\begin{eqnarray*}
&&P \left\{\vert {\hat{m}}(x_l)-m(x_l) \vert>{\delta \over 2} b_n \right\}\\
&&\qquad \le\ P \left\{\left\vert n^{-1} \sum^n_{i=1} K_h (x_l-X_i) \epsilon_i \right\vert>{\delta \over 4} b_n \right\}.
\end{eqnarray*}



If we assume that the errors are bounded and $\vert K(u) \vert \le 1$ then we can estimate the last probability using Bernstein's inequality (Uspensky 1937).

\begin{eqnarray*}
&&P \left\{\left\vert n^{-1} \sum^n_{i=1} K_h (x_l-X_i) \epsilon_i \right\vert>{\delta \over 4} b_n \right\}\\
&&\qquad \le\ 2 \exp \Bigl( -{n ({\delta b_n \over 4})^2 \over 2 \sigma_n^2+{2 \over 3 h}
({\delta b_n \over 4})} \Bigr),
\end{eqnarray*}



where

\begin{displaymath}\sigma_n^2=E [h^{-2} K^2 ((x_l-X_i)/h)]=O(h^{-1}).\end{displaymath}
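
Indeed, with $h=C b_n$ and $\sigma_n^2=O(h^{-1})=O(b_n^{-1})$, the exponent in the Bernstein bound is, for each fixed $\delta$, of order

\begin{displaymath}{n ({\delta b_n \over 4})^2 \over 2 \sigma_n^2+{2 \over 3 h}({\delta b_n \over 4})} \sim \hbox{const}\ \delta^2\, n\, b_n^3,\end{displaymath}

so each of the $M_n$ probabilities is bounded by $2 \exp (-\hbox{const}\ \delta^2\, n\, b_n^3)$.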

It is easily seen now that if

\begin{displaymath}b_n \sim n^{-1/3} (\log n)^{1/3},\end{displaymath}

so that $n b_n^3 \sim \log n$, then for $M_n=O(n)$ the term

\begin{displaymath}M_n \sup_{l=1,\ldots,M_n} P \left\{\left\vert {\hat{m}}(x_l)-m(x_l) \right\vert >{\delta
\over 2} b_n \right\}\end{displaymath}

tends to zero as $\delta \to \infty$. This means that $b_n^{-1} \sup_{x \in {\cal X}} \left\vert {\hat{m}}(x)-m(x) \right\vert$ is bounded in probability, that is,

\begin{displaymath}\sup_{x \in {\cal X}} \left\vert {\hat{m}}(x)-m(x) \right\vert =O_p (n^{-1/3} (\log n)^{1/3}).\end{displaymath}
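
Note that with this choice of $b_n$ and $h=C b_n$ the two terms in 4.1.6 balance,

\begin{displaymath}\left({n h \over \log n}\right)^{-1/2} \sim \left({n b_n \over \log n}\right)^{-1/2}=\left({n \over \log n}\right)^{-1/3} \sim h,\end{displaymath}

so neither the stochastic nor the bias part dominates.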