4.5 Multivariate Kernel Regression

In the previous section several techniques for estimating the conditional expectation function $ m$ of the bivariate distribution of the random variables $ Y$ and $ X$ were presented. Recall that the conditional expectation function is an interesting target for estimation since it tells us how $ Y$ and $ X$ are related on average. In practice, however, we will mostly be interested in specifying how the response variable $ Y$ depends on a vector of exogenous variables, denoted by $ {\boldsymbol{X}}$. This means we aim to estimate the conditional expectation

$\displaystyle E(Y\vert{\boldsymbol{X}})=E\left(Y\vert X_{1},\ldots,X_{d}\right) =m({\boldsymbol{X}}),$ (4.67)

where $ {\boldsymbol{X}}=(X_{1},\ldots, X_{d})^\top$. Consider the relation

$\displaystyle E(Y\vert{\boldsymbol{X}}={\boldsymbol{x}})=\int y f(y\vert{\boldsymbol{x}})\,dy = \frac{\int y f(y,{\boldsymbol{x}})\,dy}{f_{{\boldsymbol{X}}}({\boldsymbol{x}})}\,.$

If we replace the multivariate density $ f(y,{\boldsymbol{x}})$ by its kernel density estimate

$\displaystyle \widehat{f}_{h,{\mathbf{H}}}(y,{\boldsymbol{x}})=\frac{1}{n}\sum_{i=1}^{n} K_{h}(Y_{i}-y)\, {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{X}}_{i}-{\boldsymbol{x}})$

and $ f_{{\boldsymbol{X}}}({\boldsymbol{x}})$ by (3.60) we arrive at the multivariate generalization of the Nadaraya-Watson estimator:

$\displaystyle \widehat{m}_{{\mathbf{H}}}({\boldsymbol{x}}) = \frac{\sum_{i=1}^{n} {\mathcal{K}}_{{\mathbf{H}}}\left({\boldsymbol{X}}_{i}-{\boldsymbol{x}}\right) Y_{i}}{\sum_{i=1}^{n} {\mathcal{K}}_{{\mathbf{H}}}\left({\boldsymbol{X}}_{i}-{\boldsymbol{x}}\right)}\,.$ (4.68)

Hence, the multivariate kernel regression estimator is again a weighted sum of the observed responses $ Y_{i}$. Depending on the choice of the kernel, $ \widehat{m}_{{\mathbf{H}}}({\boldsymbol{x}})$ is a weighted average of those $ Y_{i}$ where $ {\boldsymbol{X}}_{i}$ lies in a ball or cube around $ {\boldsymbol{x}}$.
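To make the estimator concrete, here is a minimal Python sketch of (4.68). It assumes a diagonal bandwidth matrix $ {\mathbf{H}}=\mathop{\hbox{diag}}(h_{1},\ldots,h_{d})$ and a product Gaussian kernel; the function name nadaraya_watson and its interface are illustrative only, not taken from the text.

    import numpy as np

    def nadaraya_watson(x, X, Y, h):
        # x: (d,) evaluation point, X: (n, d) regressors, Y: (n,) responses,
        # h: (d,) bandwidths, i.e. H = diag(h); product Gaussian kernel.
        U = (X - x) / h                          # componentwise scaled differences
        K = np.exp(-0.5 * np.sum(U**2, axis=1))  # kernel weights K_H(X_i - x)
        # the kernel's normalizing constant cancels in the ratio (4.68)
        return np.sum(K * Y) / np.sum(K)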

Note also that the multivariate Nadaraya-Watson estimator is a local constant estimator. The definition of local polynomial kernel regression is a straightforward generalization of the univariate case. Let us illustrate this with the example of a local linear regression estimate. The minimization problem here is

$\displaystyle \min_{\beta_{0},{\boldsymbol{\beta}}_{1}} \sum_{i=1}^{n} \left\{ Y_{i} - \beta_{0} - ({\boldsymbol{X}}_{i}-{\boldsymbol{x}})^\top {\boldsymbol{\beta}}_{1} \right\}^{2} {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{X}}_{i} -{\boldsymbol{x}}).$

As in the univariate case, the solution to this weighted least squares problem can be written as

$\displaystyle \widehat{{\boldsymbol{\beta}}} = (\widehat{\beta}_{0}, \widehat{{\boldsymbol{\beta}}}_{1}^{\top})^\top = \left( {\mathbf{X}}^\top{\mathbf{W}}{\mathbf{X}} \right)^{-1} {\mathbf{X}}^\top{\mathbf{W}}{\boldsymbol{Y}}$ (4.69)

using the notations

$\displaystyle {\mathbf{X}}= \left(\begin{array}{cc} 1&({\boldsymbol{X}}_{1}-{\boldsymbol{x}})^\top\\ \vdots&\vdots\\ 1&({\boldsymbol{X}}_{n}-{\boldsymbol{x}})^\top \end{array}\right),\qquad {\boldsymbol{Y}}= \left(\begin{array}{c} Y_{1}\\ \vdots \\ Y_{n} \end{array}\right),$

and $ {\mathbf{W}}= \mathop{\hbox{diag}}\left({\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{X}}_{1}-{\boldsymbol{x}}),\ldots, {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{X}}_{n}-{\boldsymbol{x}})\right)$. In (4.69) $ \widehat{\beta}_{0}$ estimates the regression function itself, whereas $ \widehat{{\boldsymbol{\beta}}}_{1}$ estimates the partial derivatives of $ m$ w.r.t. the components of $ {\boldsymbol{x}}$. In the following we denote the multivariate local linear estimator as

$\displaystyle \widehat{m}_{1,{\mathbf{H}}}({\boldsymbol{x}})= \widehat{\beta}_{0}({\boldsymbol{x}}).$ (4.70)
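A corresponding Python sketch of (4.69)-(4.70), again under the assumption of a diagonal bandwidth matrix and a product Gaussian kernel (the name local_linear is ours), solves the weighted least squares problem directly:

    import numpy as np

    def local_linear(x, X, Y, h):
        # returns beta_0 = m_hat_{1,H}(x); beta[1:] estimates the gradient of m at x
        n, d = X.shape
        U = (X - x) / h
        w = np.exp(-0.5 * np.sum(U**2, axis=1))     # kernel weights K_H(X_i - x)
        D = np.column_stack([np.ones(n), X - x])    # design matrix X of (4.69)
        beta = np.linalg.solve(D.T @ (w[:, None] * D),   # X^T W X
                               D.T @ (w * Y))            # X^T W Y
        return beta[0]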

4.5.1 Statistical Properties

The asymptotic conditional variances of the Nadaraya-Watson estimator $ \widehat{m}_{{\mathbf{H}}}$ and the local linear estimator $ \widehat{m}_{1,{\mathbf{H}}}$ are identical; their derivation can be found in detail in Ruppert & Wand (1994):

$\displaystyle \mathop{\mathit{Var}}\left\{\widehat{m}_{{\mathbf{H}}}({\boldsymbol{x}})\vert{\boldsymbol{X}}_{1},\ldots,{\boldsymbol{X}}_{n}\right\} = \frac{1}{n \mathop{\rm {det}}({\mathbf{H}})}\, \Vert{\mathcal{K}}\Vert_{2}^{2}\, \frac{\sigma({\boldsymbol{x}})}{f_{{\boldsymbol{X}}}({\boldsymbol{x}})}\, \{1+o_{p}(1)\},$ (4.71)

with $ \sigma({\boldsymbol{x}})$ denoting $ \mathop{\mathit{Var}}(Y\vert{\boldsymbol{X}}={\boldsymbol{x}})$.

In the following we sketch the derivation of the asymptotic conditional bias; here the two estimators differ remarkably, as we have already seen in the univariate case. Denote by $ {\boldsymbol{M}}=\left(m({\boldsymbol{X}}_{1}),\ldots,m({\boldsymbol{X}}_{n})\right)^\top$ the vector of regression function values at the observations. A second order Taylor expansion around $ {\boldsymbol{x}}$ yields

$\displaystyle {\boldsymbol{M}}\approx m({\boldsymbol{x}})\,{{ {1}\hskip-4pt{1}}}_{n} + {\boldsymbol{L}}({\boldsymbol{x}}) + \frac{1}{2}\, {\boldsymbol{Q}}({\boldsymbol{x}}),$ (4.72)

where

$\displaystyle {\boldsymbol{L}}({\boldsymbol{x}})= \left(\begin{array}{c} ({\boldsymbol{X}}_{1}-{\boldsymbol{x}})^\top\gradi_{m}({\boldsymbol{x}})\\ \vdots\\ ({\boldsymbol{X}}_{n}-{\boldsymbol{x}})^\top\gradi_{m}({\boldsymbol{x}}) \end{array}\right),\qquad {\boldsymbol{Q}}({\boldsymbol{x}})= \left(\begin{array}{c} ({\boldsymbol{X}}_{1}-{\boldsymbol{x}})^\top{\mathcal{H}}_{m}({\boldsymbol{x}})({\boldsymbol{X}}_{1}-{\boldsymbol{x}})\\ \vdots\\ ({\boldsymbol{X}}_{n}-{\boldsymbol{x}})^\top{\mathcal{H}}_{m}({\boldsymbol{x}})({\boldsymbol{X}}_{n}-{\boldsymbol{x}}) \end{array}\right)$

with $ \gradi_{m}$ and $ {\mathcal{H}}_{m}$ denoting the gradient and the Hessian of $ m$, respectively. In addition to (3.62) it holds that

$\displaystyle \frac{1}{n} \sum_{i=1}^{n} {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{X}}_{i}-{\boldsymbol{x}})\,({\boldsymbol{X}}_{i}-{\boldsymbol{x}}) = \mu_{2}({\mathcal{K}})\,{\mathbf{H}}{\mathbf{H}}^\top\gradi_{f}({\boldsymbol{x}}) + o_{p}({\mathbf{H}} {\mathbf{H}}^\top{{ {1}\hskip-4pt{1}}}_{d}),$

$\displaystyle \frac{1}{n} \sum_{i=1}^{n} {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{X}}_{i}-{\boldsymbol{x}})\,({\boldsymbol{X}}_{i}-{\boldsymbol{x}})^\top\gradi_{m}({\boldsymbol{x}}) = \mu_{2}({\mathcal{K}})\,\gradi_{m}({\boldsymbol{x}})^\top{\mathbf{H}}{\mathbf{H}}^\top\gradi_{f}({\boldsymbol{x}}) + o_{p}({\mathbf{H}}{\mathbf{H}}^\top),$

see Ruppert & Wand (1994). Therefore the denominator of the conditional asymptotic expectation of the Nadaraya-Watson estimator $ \widehat{m}_{{\mathbf{H}}}$ is approximately $ f_{{\boldsymbol{X}}}({\boldsymbol{x}})$. Using $ E({\boldsymbol{Y}}\vert{\boldsymbol{X}}_{1},\ldots,{\boldsymbol{X}}_{n})={\boldsymbol{M}}$ and the Taylor expansion for $ {\boldsymbol{M}}$ we have
$\displaystyle E \left\{\widehat{m}_{{\mathbf{H}}}\vert{\boldsymbol{X}}_{1},\ldots, {\boldsymbol{X}}_{n}\right\} \approx \{f_{{\boldsymbol{X}}}({\boldsymbol{x}})+o_{p}(1)\}^{-1}\, \frac{1}{n}\bigg\{\sum_{i=1}^{n} {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{X}}_{i}-{\boldsymbol{x}})\, m({\boldsymbol{x}})$

$\displaystyle \quad + \sum_{i=1}^{n} {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{X}}_{i}-{\boldsymbol{x}})\, ({\boldsymbol{X}}_{i}-{\boldsymbol{x}})^\top\gradi_{m}({\boldsymbol{x}})$

$\displaystyle \quad + \frac{1}{2}\sum_{i=1}^{n} {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{X}}_{i}-{\boldsymbol{x}})\, ({\boldsymbol{X}}_{i}-{\boldsymbol{x}})^\top{\mathcal{H}}_{m}({\boldsymbol{x}})({\boldsymbol{X}}_{i}-{\boldsymbol{x}}) \bigg\}\,.$

Hence

$\displaystyle E \left\{\widehat{m}_{{\mathbf{H}}}\vert{\boldsymbol{X}}_{1},\ldots, {\boldsymbol{X}}_{n}\right\} \approx \{f_{{\boldsymbol{X}}}({\boldsymbol{x}})\}^{-1}\bigg[ f_{{\boldsymbol{X}}}({\boldsymbol{x}})\,m({\boldsymbol{x}}) + \mu_{2}({\mathcal{K}})\, \gradi_{m}({\boldsymbol{x}})^\top{\mathbf{H}}{\mathbf{H}}^\top\gradi_{f}({\boldsymbol{x}})$

$\displaystyle \quad\quad + \frac{1}{2}\mu_{2}({\mathcal{K}})\,f_{{\boldsymbol{X}}}({\boldsymbol{x}})\, \mathop{\hbox{tr}}\{{\mathbf{H}}^\top{\mathcal{H}}_{m}({\boldsymbol{x}}){\mathbf{H}}\} \bigg]\,,$

such that we obtain the following theorem.

THEOREM 4.8  
The conditional asymptotic bias and variance of the multivariate Nadaraya-Watson kernel regression estimator are
$\displaystyle \bias\left\{\widehat{m}_{{\mathbf{H}}}\vert{\boldsymbol{X}}_{1},\ldots,{\boldsymbol{X}}_{n}\right\} \approx \mu_{2}({\mathcal{K}})\, \frac{\gradi_{m}({\boldsymbol{x}})^\top{\mathbf{H}}{\mathbf{H}}^\top \gradi_{f}({\boldsymbol{x}})}{f_{{\boldsymbol{X}}}({\boldsymbol{x}})} + \frac{1}{2}\mu_{2}({\mathcal{K}})\, \mathop{\hbox{tr}}\{{\mathbf{H}}^\top{\mathcal{H}}_{m}({\boldsymbol{x}}){\mathbf{H}}\},$

$\displaystyle \mathop{\mathit{Var}}\left\{\widehat{m}_{{\mathbf{H}}}\vert{\boldsymbol{X}}_{1},\ldots,{\boldsymbol{X}}_{n}\right\} \approx \frac{1}{n \mathop{\rm {det}}({\mathbf{H}})}\, \Vert{\mathcal{K}}\Vert_{2}^{2}\, \frac{\sigma({\boldsymbol{x}})}{f_{{\boldsymbol{X}}}({\boldsymbol{x}})}$

in the interior of the support of $ f_X$.

Let us now turn to the local linear case. Recall that we use the notation $ {\boldsymbol{e}}_{0}=(1,0,\ldots,0)^\top$ for the first unit vector in $ \mathbb{R}^{d+1}$. Then we can write the local linear estimator as

$\displaystyle \widehat{m}_{1,{\mathbf{H}}}({\boldsymbol{x}})={\boldsymbol{e}}_{0}^\top \left( {\mathbf{X}}^\top{\mathbf{W}}{\mathbf{X}} \right)^{-1} {\mathbf{X}}^\top{\mathbf{W}}{\boldsymbol{Y}}.$

Using (4.69) and (4.72) we now have

$\displaystyle E \left\{\widehat{m}_{1,{\mathbf{H}}}\vert{\boldsymbol{X}}_{1},\ldots,{\boldsymbol{X}}_{n}\right\} - m({\boldsymbol{x}})$

$\displaystyle \quad = {\boldsymbol{e}}_{0}^\top \left( {\mathbf{X}}^\top{\mathbf{W}}{\mathbf{X}} \right)^{-1} {\mathbf{X}}^\top{\mathbf{W}} \left\{ m({\boldsymbol{x}})\,{{ {1}\hskip-4pt{1}}}_{n} + {\boldsymbol{L}}({\boldsymbol{x}}) + \frac{1}{2} {\boldsymbol{Q}}({\boldsymbol{x}})\right\} -m({\boldsymbol{x}})$

$\displaystyle \quad = \frac{1}{2} {\boldsymbol{e}}_{0}^\top \left( {\mathbf{X}}^\top{\mathbf{W}}{\mathbf{X}} \right)^{-1} {\mathbf{X}}^\top{\mathbf{W}}{\boldsymbol{Q}}({\boldsymbol{x}})$

since $ {\boldsymbol{e}}_{0}^\top\left[m({\boldsymbol{x}}),\gradi_{m}({\boldsymbol{x}})^\top\right]^\top=m({\boldsymbol{x}})$. Hence, the numerator of the asymptotic conditional bias only depends on the quadratic term. This is one of the key points in the asymptotics of local polynomial estimators. If we were to use local polynomials of order $ p$ and expand (4.72) up to order $ p+1$, then only the term of order $ p+1$ would appear in the numerator of the asymptotic conditional bias. Of course, the price to pay is a more complicated structure of the denominator. The following theorem summarizes bias and variance for the estimator $ \widehat{m}_{1,{\mathbf{H}}}$.

THEOREM 4.9  
The conditional asymptotic bias and variance of the multivariate local linear regression estimator are
$\displaystyle \bias\left\{\widehat{m}_{1,{\mathbf{H}}}\vert{\boldsymbol{X}}_{1},\ldots,{\boldsymbol{X}}_{n}\right\} \approx \frac{1}{2}\mu_{2}({\mathcal{K}})\, \mathop{\hbox{tr}}\{{\mathbf{H}}^\top{\mathcal{H}}_{m}({\boldsymbol{x}}){\mathbf{H}}\},$

$\displaystyle \mathop{\mathit{Var}}\left\{\widehat{m}_{1,{\mathbf{H}}}\vert{\boldsymbol{X}}_{1},\ldots,{\boldsymbol{X}}_{n}\right\} \approx \frac{1}{n \mathop{\rm {det}}({\mathbf{H}})}\, \Vert{\mathcal{K}}\Vert_{2}^{2}\, \frac{\sigma({\boldsymbol{x}})}{f_{{\boldsymbol{X}}}({\boldsymbol{x}})}$

in the interior of the support of $ f_X$.

For all omitted details we refer again to Ruppert & Wand (1994). They also point out that the local linear estimator has a conditional bias of the same order in the interior as well as at the boundary of the support of $ f_X$.

4.5.2 Practical Aspects

The computation of local polynomial estimators can be done by any statistical package that is able to run weighted least squares regression. However, since we estimate a function, this weighted least squares regression has to be performed at all observation points or on a grid of points in $ \mathbb{R}^{d}$. Therefore explicit formulae, which can be derived at least for lower dimensions $ d$, are useful.

EXAMPLE 4.17  
Consider $ d=2$ and for fixed $ {\boldsymbol{x}}$ the sums
$\displaystyle S_{jk} = S_{jk}({\boldsymbol{x}}) = \sum_{i=1}^n {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{x}}-{\boldsymbol{X}}_{i})\, (X_{1i}-x_{1})^{j} (X_{2i}-x_{2})^{k},$

$\displaystyle T_{jk} = T_{jk}({\boldsymbol{x}}) = \sum_{i=1}^n {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{x}}-{\boldsymbol{X}}_{i})\, (X_{1i}-x_{1})^{j} (X_{2i}-x_{2})^{k}\, Y_{i}.$

Then for the local linear estimate we can write

$\displaystyle \widehat{{\boldsymbol{\beta}}} = \left(\begin{array}{ccc} S_{00}& S_{10}&S_{01}\\ S_{10}&S_{20}&S_{11}\\ S_{01}&S_{11}&S_{02} \end{array}\right)^{-1} \left(\begin{array}{c} T_{00}\\ T_{10} \\ T_{01} \end{array}\right).$ (4.73)

Here it is still possible to write down the explicit formula for the estimated regression function in one line:

$\displaystyle \widehat{m}_{1,{\mathbf{H}}}({\boldsymbol{x}}) = \frac{(S_{20}S_{02}-S_{11}^{2})\,T_{00} + (S_{01}S_{11}-S_{10}S_{02})\,T_{10} + (S_{10}S_{11}-S_{01}S_{20})\,T_{01}} {2S_{01}S_{10}S_{11} - S_{02}S_{10}^{2} - S_{00}S_{11}^{2} - S_{01}^{2}S_{20} + S_{00} S_{02} S_{20} }.$ (4.74)

To estimate the regression plane we have to apply (4.74) on a two-dimensional grid of points.
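The following Python sketch implements (4.74) directly. As before, a product Gaussian kernel with bandwidths $ h_{1},h_{2}$ stands in for $ {\mathcal{K}}_{{\mathbf{H}}}$, and the function name local_linear_2d is ours.

    import numpy as np

    def local_linear_2d(x, X, Y, h):
        # x: (2,) point, X: (n, 2) design, Y: (n,) responses, h: (2,) bandwidths
        u1, u2 = X[:, 0] - x[0], X[:, 1] - x[1]
        K = np.exp(-0.5 * ((u1 / h[0])**2 + (u2 / h[1])**2))   # kernel weights
        S = lambda j, k: np.sum(K * u1**j * u2**k)             # S_jk(x)
        T = lambda j, k: np.sum(K * u1**j * u2**k * Y)         # T_jk(x)
        num = ((S(2, 0) * S(0, 2) - S(1, 1)**2) * T(0, 0)
               + (S(0, 1) * S(1, 1) - S(1, 0) * S(0, 2)) * T(1, 0)
               + (S(1, 0) * S(1, 1) - S(0, 1) * S(2, 0)) * T(0, 1))
        den = (2 * S(0, 1) * S(1, 0) * S(1, 1) - S(0, 2) * S(1, 0)**2
               - S(0, 0) * S(1, 1)**2 - S(0, 1)**2 * S(2, 0)
               + S(0, 0) * S(0, 2) * S(2, 0))
        return num / den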

Figure 4.18 shows the Nadaraya-Watson and the local linear two-dimensional estimate for simulated data. We use $ 500$ design points uniformly distributed in $ [0,1]\times[0,1]$ and the regression function

$\displaystyle m({\boldsymbol{x}})=\sin(2\pi x_1)+x_2.$

The error term is $ N(0,\frac{1}{4})$. The bandwidth is chosen as $ h_{1}=h_{2}=0.3$ for both estimators. $ \Box$

Figure 4.18: Two-dimensional local linear estimate (quantlet SPMtruenadloc)
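The setting of Example 4.17 and Figure 4.18 can be reproduced along the following lines. This is only a sketch: it reuses local_linear_2d from the sketch above, and the random seed and grid size are arbitrary choices, not taken from the quantlet.

    import numpy as np

    rng = np.random.default_rng(1)                # arbitrary seed
    n = 500
    X = rng.uniform(size=(n, 2))                  # uniform design on [0,1]x[0,1]
    Y = (np.sin(2 * np.pi * X[:, 0]) + X[:, 1]    # m(x) = sin(2 pi x1) + x2
         + rng.normal(scale=0.5, size=n))         # N(0, 1/4) errors (sd = 0.5)
    h = np.array([0.3, 0.3])                      # h1 = h2 = 0.3

    grid = np.linspace(0, 1, 30)
    mhat = np.array([[local_linear_2d(np.array([u, v]), X, Y, h)
                      for u in grid] for v in grid])   # surface on a 2D grid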

Nonparametric kernel regression function estimation is not limited to bivariate distributions. Everything can be generalized to higher dimensions but unfortunately some problems arise. A practical problem is the graphical display for higher dimensional multivariate functions. This problem has already been considered in Chapter 3 when we discussed the graphical representation of multivariate density estimates. The corresponding remarks for plotting functions of up to three-dimensional arguments apply here again.

A general problem in multivariate nonparametric estimation is the so-called curse of dimensionality. Recall that the nonparametric regression estimators are based on the idea of local (weighted) averaging. In higher dimensions the observations are sparsely distributed even for large sample sizes, and consequently estimators based on local averaging perform unsatisfactorily in this situation. Technically, one can explain this effect by looking at the $ \amse$ again. Consider a multivariate regression estimator with the same bandwidth $ h$ for all components, e.g. a Nadaraya-Watson or local linear estimator with bandwidth matrix $ {\mathbf{H}}=h\cdotp{\mathbf{I}}$. Here the asymptotic $ \mse$ will also depend on $ d$:

$\displaystyle \amse(n,h)=\frac{C_{1}}{nh^{d}}+h^{4} C_{2},$

where $ C_1$ and $ C_2$ are constants that depend neither on $ n$ nor on $ h$. If we derive the optimal bandwidth we find that $ h_{opt}\sim n^{-1/(4+d)}$ and hence the rate of convergence for the $ \amse$ is $ n^{-4/(4+d)}$. You can clearly see that the speed of convergence decreases dramatically for higher dimensions $ d$.
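The rate follows by setting the derivative of the $ \amse$ with respect to $ h$ to zero:

$\displaystyle \frac{\partial}{\partial h}\amse(n,h) = -\frac{d\,C_{1}}{n h^{d+1}} + 4 C_{2} h^{3} = 0 \quad\Longrightarrow\quad h_{opt} = \left(\frac{d\,C_{1}}{4 C_{2}}\right)^{1/(4+d)} n^{-1/(4+d)},$

and substituting $ h_{opt}$ back into the $ \amse$ shows that both the variance and the squared bias term are of order $ n^{-4/(4+d)}$.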

An introduction to kernel regression methods can be found in the monographs of Silverman (1986), Härdle (1990), Bowman & Azzalini (1997), Simonoff (1996) and Pagan & Ullah (1999). The books of Scott (1992) and Wand & Jones (1995) deal particularly with the multivariate case. For detailed derivations of the asymptotic properties we refer to Collomb (1985) and Gasser & Müller (1984). The latter reference also considers boundary kernels to reduce the bias at the boundary regions of the explanatory variables.

Locally weighted least squares were originally studied by Stone (1977), Cleveland (1979) and Lejeune (1985). Technical details of asymptotic expansions for bias and variance can be found in Ruppert & Wand (1994). Monographs concerning local polynomial fitting are Wand & Jones (1995), Fan & Gijbels (1996) and Simonoff (1996). Computational aspects, in particular the WARPing technique (binning) for kernel and local polynomial regression, are discussed in Härdle & Scott (1992) and Fan & Marron (1994). The monograph of Loader (1999) discusses local regression in combination with likelihood-based estimation.

For comprehensive works on spline smoothing see Eilers & Marx (1996), Wahba (1990) and Green & Silverman (1994). Good resources for wavelets are Daubechies (1992), Donoho & Johnstone (1994), Donoho & Johnstone (1995) and Donoho et al. (1995). The books of Eubank (1999) and Schimek (2000b) provide extensive overviews on a variety of different smoothing methods.

For a monograph on testing in nonparametric models see Hart (1997). The concepts presented in equations (4.63)-(4.66) are studied in particular in the following articles: González Manteiga & Cao (1993) and Härdle & Mammen (1993) introduced (4.63); Gozalo & Linton (2001) studied (4.64), motivated by Lagrange multiplier tests. Equation (4.65) was originally introduced by Zheng (1996) and independently discussed by Fan & Li (1996). Finally, (4.66) was proposed by Dette (1999) in the context of testing for parametric structures in the regression function. For an introductory presentation see Yatchew (2003). In general, all of these test approaches can also be applied to multivariate regressors.

More sophisticated is the minimax approach for testing nonparametric alternatives studied by Ingster (1993). This approach tries to maximize the power for the worst alternative case, i.e. the one that is closest to the hypothesis but can still be detected. Rather different approaches have been introduced by Bierens (1990) and Bierens & Ploberger (1997), who consider (integrated) conditional moment tests, and by Stute (1997) and Stute et al. (1998), who verify, via bootstrap, whether the residuals of the hypothesis, integrated over the empirical distribution of the regressor variable $ X$, converge to a centered Gaussian process. There is further literature on adaptive testing which tries to find the smoothing parameter that maximizes the power while holding the type one error level, see for example Ledwina (1994), Kallenberg & Ledwina (1995), Spokoiny (1996) and Spokoiny (1998).

EXERCISE 4.1   Consider $ X,Y$ with a bivariate normal distribution, i.e.

$\displaystyle {X\choose Y} \sim N \left( {\mu\choose\eta}, \left( \begin{array}{cc} \sigma^{2} & \rho \sigma \tau \\ \rho \sigma \tau & \tau^{2} \end{array} \right) \right) $

with density

$\displaystyle f \left( x,y \right) = \frac{1}{2 \pi \sigma \tau \sqrt{1-\rho^{2}}} \exp\left\{ - \frac{\left( \frac{x-\mu}{\sigma} \right)^{2} + \left( \frac{y-\eta}{\tau} \right)^{2} - 2 \rho \left( \frac{x-\mu}{\sigma} \right) \left( \frac{y-\eta}{\tau} \right)}{2 \left( 1- \rho^{2} \right)} \right\}\,. $

This means $ X \sim N \left( \mu, \sigma^{2} \right) , Y \sim N \left( \eta, \tau^{2} \right)$ and $ corr \left( X,Y \right) = \rho$. Show that the regression function $ m \left( x \right) = E \left( Y \vert X=x \right)$ is linear, more exactly

$\displaystyle E \left( Y \vert X=x \right) = \alpha + \beta x,
$

with $ \alpha = \eta - \mu \rho \tau/\sigma$ and $ \beta = \rho \tau/\sigma$.

EXERCISE 4.2   Explain the effects of

$\displaystyle 2 \frac{m'(x)f'_X(x)}{f_X(x)}$

in the bias part of the Nadaraya-Watson $ \mse$, see equation (4.13). When will the bias be large/small?

EXERCISE 4.3   Calculate the pointwise optimal bandwidth $ h_{opt}(x)$ from the Nadaraya-Watson $ \amse$, see equation (4.13).

EXERCISE 4.4   Show that $ \widehat{m}_{0,h}(x)=\widehat{m}_{h}(x)$, i.e. the local constant estimator (4.17) equals the Nadaraya-Watson estimator.

EXERCISE 4.5   Show that from (4.33) and (4.34) indeed the spline formula $ {\widehat{{\boldsymbol{m}}}_\lambda}=({\mathbf{I}}+\lambda {\mathbf{K}})^{-1} {\boldsymbol{Y}}$ in equation (4.35) follows.

EXERCISE 4.6   Compare the kernel, the $ k$-NN, the spline and a linear regression fit for a data set.

EXERCISE 4.7   Show that the Legendre polynomials $ p_0$ and $ p_1$ are indeed orthogonal functions.

EXERCISE 4.8   Compute and plot the confidence intervals for the Nadaraya-Watson kernel estimate and the local polynomial estimate for a data set.

EXERCISE 4.9   Prove that the solution of the minimization problem (4.31) is a piecewise cubic polynomial function which is twice continuously differentiable (cubic spline).

EXERCISE 4.10   Discuss the behavior of the smoothing parameters $ h$ in kernel regression, $ k$ in nearest-neighbor estimation, $ \lambda$ in spline smoothing and $ N$ in orthogonal series estimation.


Summary
$ \ast$
The regression function $ m(\bullet)$, which relates an independent variable $ X$ and a dependent variable $ Y$, is the conditional expectation

$\displaystyle m(x) = E(Y\vert X=x). $

$ \ast$
A natural kernel regression estimate for a random design can be obtained by replacing the unknown densities in $ E(Y\vert X=x)$ by kernel density estimates. This yields the Nadaraya-Watson estimator

$\displaystyle \widehat{m}_h(x) = \frac{1}{n}\sum_{i=1}^{n} W_{hi}(x)Y_{i}$

with weights

$\displaystyle W_{hi}(x)= \frac{n\,K_h(x-X_i)}
{\sum^n_{j=1} K_h(x-X_j)}\,. $

For a fixed design we can use the Gasser-Müller estimator with

$\displaystyle W_{hi}^{GM}(x)=n \int_{s_{i-1}}^{s_{i}}K_{h}(x-u)\,du.$
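A minimal Python sketch of these weights follows, assuming a Gaussian kernel, a sorted design and one common convention for the $ s_{i}$ (midpoints between consecutive design points, with the outer limits extended to $ \pm\infty$); the function name gasser_mueller is ours.

    import numpy as np
    from scipy.stats import norm

    def gasser_mueller(x, X, Y, h):
        # X must be sorted; s_i are midpoints of consecutive design points
        s = np.concatenate(([-np.inf], (X[:-1] + X[1:]) / 2, [np.inf]))
        # W_hi^GM(x) / n = integral of K_h(x - u) over [s_{i-1}, s_i]
        w = norm.cdf((x - s[:-1]) / h) - norm.cdf((x - s[1:]) / h)
        return np.sum(w * Y)            # (1/n) sum_i W_hi^GM(x) Y_i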

$ \ast$
The asymptotic $ \mse$ of the Nadaraya-Watson estimator is
$\displaystyle \amse\{\widehat{m}_{h}(x)\} = \frac{1}{nh}\;\frac{\sigma^{2}(x)}{f_X(x)}\;\Vert K\Vert^{2}_{2} +\frac{h^{4}}{4}\left\{m''(x)+2 \frac{m'(x)f'_X(x)}{f_X(x)}\right\}^{2}\mu^{2}_{2}(K),$

the asymptotic $ \mse$ of the Gasser-Müller estimator is identical up to the $ 2\frac{m'(x)f'_X(x)}{f_X(x)}$ term.
$ \ast$
The Nadaraya-Watson estimator is a local constant least squares estimator. Extending the local constant approach to local polynomials of degree $ p$ yields the minimization problem:

$\displaystyle \min_{\beta} \sum_{i=1}^n \left \{ Y_i - \beta_0 - \beta_1 (X_i - x) - \ldots -\beta_p (X_i - x)^p \right\}^2 K_h(x-X_i),$

where $ \widehat{\beta}_0$ is the estimator of the regression curve and the $ \widehat{\beta}_\nu$ are proportional to the estimates for the derivatives.
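As a sketch, this weighted least squares problem can be solved directly; the Gaussian kernel and the function name local_polynomial are our own choices, not prescribed by the text.

    import numpy as np

    def local_polynomial(x, X, Y, h, p=1):
        # beta[0] estimates m(x); beta[nu] is proportional to the nu-th derivative
        u = X - x
        w = np.exp(-0.5 * (u / h)**2)                  # kernel weights K_h(x - X_i)
        D = np.vander(u, N=p + 1, increasing=True)     # columns 1, u, ..., u^p
        beta = np.linalg.solve(D.T @ (w[:, None] * D), D.T @ (w * Y))
        return beta[0]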
$ \ast$
For the regression problem, odd order local polynomial estimators outperform even order regression fits. In particular, the asymptotic $ \mse$ for the local linear kernel regression estimator is

$\displaystyle \amse\{\widehat{m}_{1,h}(x)\} = \frac{1}{nh}\;\frac{\sigma^{2}(x)}{f_X(x)}\;\Vert K\Vert^{2}_{2} +\frac{h^{4}}{4}\left\{m''(x)\right\}^{2} \mu^{2}_{2}(K).$

The bias does not depend on $ f_X$, $ f'_X$ and $ m'$, which makes the local linear estimator more design adaptive and improves its behavior in boundary regions.
$ \ast$
The $ k$-NN estimator has in its simplest form the representation

$\displaystyle \widehat{m}_k(x)=\frac{1}{k} \sum_{i=1}^n Y_{i}\, \Ind\{\textrm{$X_i$ is among the $k$ nearest neighbors of $x$}\}.$

This can be refined using kernel weights instead of uniform weighting of all observations nearest to $ x$. The variance of the $ k$-NN estimator does not depend on $ f_X(x)$.
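In its simplest (uniform weight) form the estimator amounts to the following sketch; the function name knn_regression is ours and $ X$ is taken to be univariate.

    import numpy as np

    def knn_regression(x, X, Y, k):
        # average the responses of the k observations with X_i closest to x
        nearest = np.argsort(np.abs(X - x))[:k]
        return np.mean(Y[nearest])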
$ \ast$
Median smoothing is a version of $ k$-NN which estimates the conditional median rather than the conditional mean:

$\displaystyle \widehat{m}_k(x)=\med\{Y_{i}: X_{i}\ \textrm{is among the $k$ nearest neighbors of $x$}\}.$

It is more robust to outliers.
$ \ast$
The smoothing spline minimizes a penalized residual sum of squares over all twice differentiable functions $ m(\bullet)$:

$\displaystyle S_{\lambda}({m})=\sum_{i=1}^n \left\{Y_{i}-{m}(X_{i})
\right\}^2+\lambda \Vert{m}''\Vert^{2}_{2},$

the solution consists of piecewise cubic polynomials in $ [X_{(i)},X_{(i+1)}]$. Under regularity conditions the spline smoother is equivalent to a kernel estimator with higher order kernel and design dependent bandwidth.
$ \ast$
Orthogonal series regression (Fourier regression, wavelet regression) uses the fact that under certain conditions, functions can be represented by a series of basis functions. The coefficients of the basis functions have to be estimated. The smoothing parameter $ N$ is the number of terms in the series.
$ \ast$
Bandwidth selection in regression is usually done by cross-validation or the penalized residual sum of squares.
$ \ast$
Pointwise confidence intervals and uniform confidence bands can be constructed analogously to the density estimation case.
$ \ast$
Nonparametric regression estimators for univariate data can be easily generalized to the multivariate case. A general problem is the curse of dimensionality.