# 4.5 Multivariate Kernel Regression

In the previous section several techniques for estimating the conditional expectation function of the bivariate distribution of the random variables $X$ and $Y$ were presented. Recall that the conditional expectation function is an interesting target for estimation since it tells us how $X$ and $Y$ are related on average. In practice, however, we will mostly be interested in specifying how the response variable $Y$ depends on a vector of exogenous variables, denoted by $X$. This means we aim to estimate the conditional expectation

$$ E(Y \mid X) = E(Y \mid X_1, \ldots, X_d) = m(X), \qquad (4.67) $$

where $X = (X_1, \ldots, X_d)^\top$. Consider the relation

$$ m(x) = E(Y \mid X = x) = \frac{\int y\, f(x, y)\, dy}{f_X(x)}. $$
If we replace the multivariate density $f(x, y)$ by its kernel density estimate

$$ \widehat f_{H,h}(x, y) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - X_i)\, K_h(y - Y_i) $$
and $f_X(x)$ by its kernel density estimate (3.60), we arrive at the multivariate generalization of the Nadaraya-Watson estimator:

$$ \widehat m_H(x) = \frac{\sum_{i=1}^{n} K_H(x - X_i)\, Y_i}{\sum_{i=1}^{n} K_H(x - X_i)}. \qquad (4.68) $$

Hence, the multivariate kernel regression estimator is again a weighted sum of the observed responses $Y_i$. Depending on the choice of the kernel, $\widehat m_H(x)$ is a weighted average of those $Y_i$ for which $X_i$ lies in a ball or cube around $x$.
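To make (4.68) concrete, here is a minimal sketch, not taken from the book, of the multivariate Nadaraya-Watson estimator; the product Gaussian kernel and the diagonal bandwidth matrix are illustrative assumptions:

```python
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """Multivariate Nadaraya-Watson estimate at the point x.

    X : (n, d) design matrix, Y : (n,) responses,
    h : (d,) bandwidths, i.e. a diagonal bandwidth matrix (assumed here).
    A product Gaussian kernel is used purely for illustration.
    """
    X, Y, h = np.asarray(X, float), np.asarray(Y, float), np.asarray(h, float)
    u = (x - X) / h                                    # (n, d) scaled differences
    w = np.exp(-0.5 * u ** 2).prod(axis=1) / h.prod()  # K_H(x - X_i)
    return np.sum(w * Y) / np.sum(w)                   # weighted average of Y_i
```

Since the kernel weights are nonnegative and normalized by their sum, the estimate always lies between the smallest and the largest observed response.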

Note also that the multivariate Nadaraya-Watson estimator is a local constant estimator. The definition of local polynomial kernel regression is a straightforward generalization of the univariate case. Let us illustrate this with the example of a local linear regression estimate. The minimization problem here is

$$ \min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left\{ Y_i - \beta_0 - \beta_1^\top (X_i - x) \right\}^2 K_H(x - X_i). $$
The solution to the problem can hence be equivalently written as

$$ \widehat\beta = \begin{pmatrix} \widehat\beta_0 \\ \widehat\beta_1 \end{pmatrix} = \left( \mathbf{X}^\top \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{W} \mathbf{Y} \qquad (4.69) $$

using the notations

$$ \mathbf{X} = \begin{pmatrix} 1 & (X_1 - x)^\top \\ \vdots & \vdots \\ 1 & (X_n - x)^\top \end{pmatrix}, \qquad \mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} $$

and $\mathbf{W} = \operatorname{diag}\{K_H(x - X_1), \ldots, K_H(x - X_n)\}$. In (4.69), $\widehat\beta_0$ estimates the regression function itself, whereas $\widehat\beta_1$ estimates the vector of partial derivatives of $m$ w.r.t. the components of $x$. In the following we denote the multivariate local linear estimator as

$$ \widehat m_{1,H}(x) = e_0^\top \left( \mathbf{X}^\top \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{W} \mathbf{Y}, \qquad (4.70) $$

with $e_0 = (1, 0, \ldots, 0)^\top$ denoting the first unit vector in $\mathbb{R}^{d+1}$.
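The weighted least squares representation lends itself directly to implementation. The following sketch (an illustration under assumptions, not the book's code) computes the local linear estimate and the estimated gradient by solving the normal equations of the weighted least squares problem with a product Gaussian kernel:

```python
import numpy as np

def local_linear(x, X, Y, h):
    """Local linear estimate and gradient estimate at x, cf. (4.69)/(4.70).

    Solves the weighted least squares problem with design matrix rows
    (1, (X_i - x)^T) and kernel weights K_H(x - X_i) on the diagonal of W.
    """
    X, Y, h = np.asarray(X, float), np.asarray(Y, float), np.asarray(h, float)
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X - x])     # rows (1, (X_i - x)^T)
    u = (x - X) / h
    w = np.exp(-0.5 * u ** 2).prod(axis=1)       # Gaussian kernel weights
    WZ = Z * w[:, None]
    beta = np.linalg.solve(Z.T @ WZ, WZ.T @ Y)   # (Z'WZ)^{-1} Z'WY
    return beta[0], beta[1:]                     # m_hat(x), gradient estimate
```

Because the local linear fit reproduces linear functions exactly, applying it to data generated from a plane returns the plane's level and slope without bias.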

## 4.5.1 Statistical Properties

The asymptotic conditional variances of the Nadaraya-Watson estimator $\widehat m_H$ and the local linear estimator $\widehat m_{1,H}$ are identical and their derivation can be found in detail in Ruppert & Wand (1994):

$$ \operatorname{Var}\{\widehat m_H(x) \mid X_1, \ldots, X_n\} = \frac{1}{n \det(H)}\, \|K\|_2^2\, \frac{\sigma^2(x)}{f_X(x)}\, \{1 + o_p(1)\}, \qquad (4.71) $$

with $\sigma^2(x)$ denoting $\operatorname{Var}(Y \mid X = x)$.

In the following we will sketch the derivation of the asymptotic conditional bias. We have seen this remarkable difference between both estimators already in the univariate case. Consider the second order Taylor expansion of $m(X_i)$ around $x$, i.e.

$$ m(X_i) \approx m(x) + (X_i - x)^\top \nabla m(x) + \frac{1}{2}\,(X_i - x)^\top \mathcal{H}_m(x)\,(X_i - x), \qquad (4.72) $$

where

$$ \nabla m(x) = \left( \frac{\partial m(x)}{\partial x_1}, \ldots, \frac{\partial m(x)}{\partial x_d} \right)^{\!\top}, \qquad \mathcal{H}_m(x) = \left( \frac{\partial^2 m(x)}{\partial x_j\, \partial x_k} \right)_{j,k = 1, \ldots, d}, $$

with $\nabla m$ and $\mathcal{H}_m$ being the gradient vector and the Hessian matrix of $m$, respectively. Additionally to (3.62) it holds that

$$ \frac{1}{n} \sum_{i=1}^{n} K_H(x - X_i)\,(X_i - x) \approx \mu_2(K)\, H H^\top\, \nabla f_X(x), $$

see Ruppert & Wand (1994). Therefore the denominator of the conditional asymptotic expectation of the Nadaraya-Watson estimator $\widehat m_H$ is approximately $f_X(x)$. Using $(1 + u)^{-1} \approx 1 - u$ and the Taylor expansion (4.72) for $m$ we have

$$ E\{\widehat m_H(x) \mid X_1, \ldots, X_n\} \approx m(x) + \mu_2(K)\, \frac{\nabla m(x)^\top H H^\top \nabla f_X(x)}{f_X(x)} + \frac{1}{2}\, \mu_2(K)\, \operatorname{tr}\{ H^\top \mathcal{H}_m(x)\, H \}. $$

Hence

$$ E\{\widehat m_H(x) \mid X_1, \ldots, X_n\} - m(x) \approx \mu_2(K)\, \frac{\nabla m(x)^\top H H^\top \nabla f_X(x)}{f_X(x)} + \frac{1}{2}\, \mu_2(K)\, \operatorname{tr}\{ H^\top \mathcal{H}_m(x)\, H \}, $$

such that we obtain the following theorem.

THEOREM 4.8
The conditional asymptotic bias and variance of the multivariate Nadaraya-Watson kernel regression estimator are

$$ \operatorname{Bias}\{\widehat m_H(x) \mid X_1, \ldots, X_n\} \approx \mu_2(K)\, \frac{\nabla m(x)^\top H H^\top \nabla f_X(x)}{f_X(x)} + \frac{1}{2}\, \mu_2(K)\, \operatorname{tr}\{ H^\top \mathcal{H}_m(x)\, H \}, $$

$$ \operatorname{Var}\{\widehat m_H(x) \mid X_1, \ldots, X_n\} \approx \frac{1}{n \det(H)}\, \|K\|_2^2\, \frac{\sigma^2(x)}{f_X(x)} $$

in the interior of the support of $f_X$.

Let us now turn to the local linear case. Recall that we use the notation $e_0 = (1, 0, \ldots, 0)^\top$ for the first unit vector in $\mathbb{R}^{d+1}$. Then we can write the local linear estimator as

$$ \widehat m_{1,H}(x) = e_0^\top \left( \mathbf{X}^\top \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{W} \mathbf{Y}. $$

Now, using (4.69) and (4.72), we have

$$ E\{\widehat m_{1,H}(x) \mid X_1, \ldots, X_n\} \approx e_0^\top (\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{W} \left\{ \mathbf{X} \begin{pmatrix} m(x) \\ \nabla m(x) \end{pmatrix} + \frac{1}{2}\, \mathbf{Q}_m(x) \right\} = m(x) + \frac{1}{2}\, e_0^\top (\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{W}\, \mathbf{Q}_m(x), $$

with $\mathbf{Q}_m(x) = \big( (X_1 - x)^\top \mathcal{H}_m(x)\,(X_1 - x), \ldots, (X_n - x)^\top \mathcal{H}_m(x)\,(X_n - x) \big)^\top$, since

$$ e_0^\top (\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{W}\, \mathbf{X} \begin{pmatrix} m(x) \\ \nabla m(x) \end{pmatrix} = m(x). $$

Hence, the numerator of the asymptotic conditional bias depends only on the quadratic term. This is one of the key points in the asymptotics for local polynomial estimators: if we were to use local polynomials of order $p$ and expand (4.72) up to order $p + 1$, then only the term of order $p + 1$ would appear in the numerator of the asymptotic conditional bias. Of course, this has to be paid for with a more complicated structure of the denominator. The following theorem summarizes bias and variance for the estimator $\widehat m_{1,H}$.

THEOREM 4.9
The conditional asymptotic bias and variance of the multivariate local linear regression estimator are

$$ \operatorname{Bias}\{\widehat m_{1,H}(x) \mid X_1, \ldots, X_n\} \approx \frac{1}{2}\, \mu_2(K)\, \operatorname{tr}\{ H^\top \mathcal{H}_m(x)\, H \}, $$

$$ \operatorname{Var}\{\widehat m_{1,H}(x) \mid X_1, \ldots, X_n\} \approx \frac{1}{n \det(H)}\, \|K\|_2^2\, \frac{\sigma^2(x)}{f_X(x)} $$

in the interior of the support of $f_X$.

For all omitted details we refer again to Ruppert & Wand (1994). They also point out that the local linear estimate has the same order conditional bias in the interior as well as at the boundary of the support of $f_X$.

## 4.5.2 Practical Aspects

The computation of local polynomial estimators can be done by any statistical package that is able to run weighted least squares regression. However, since we estimate a function, this weighted least squares regression has to be performed in all observation points or on a grid of points in $\mathbb{R}^d$. Therefore explicit formulae, which can be derived at least for lower dimensions, are useful.

EXAMPLE 4.17
Consider $d = 2$ and for fixed $x = (x_1, x_2)^\top$ the sums

$$ S_{jk} = \sum_{i=1}^{n} K_H(x - X_i)\,(X_{i1} - x_1)^j\,(X_{i2} - x_2)^k, \qquad T_{jk} = \sum_{i=1}^{n} K_H(x - X_i)\,(X_{i1} - x_1)^j\,(X_{i2} - x_2)^k\, Y_i. $$
Then for the local linear estimate we can write

$$ \widehat\beta = \begin{pmatrix} S_{00} & S_{10} & S_{01} \\ S_{10} & S_{20} & S_{11} \\ S_{01} & S_{11} & S_{02} \end{pmatrix}^{-1} \begin{pmatrix} T_{00} \\ T_{10} \\ T_{01} \end{pmatrix}. \qquad (4.73) $$

Here it is still possible to fit the explicit formula for the estimated regression function on one line: applying Cramer's rule to (4.73) yields

$$ \widehat m_{1,H}(x) = \frac{\det\!\begin{pmatrix} T_{00} & S_{10} & S_{01} \\ T_{10} & S_{20} & S_{11} \\ T_{01} & S_{11} & S_{02} \end{pmatrix}}{\det\!\begin{pmatrix} S_{00} & S_{10} & S_{01} \\ S_{10} & S_{20} & S_{11} \\ S_{01} & S_{11} & S_{02} \end{pmatrix}}. \qquad (4.74) $$

To estimate the regression plane we have to apply (4.74) on a two-dimensional grid of points.
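The sums-based route can be checked numerically against the matrix form of the weighted least squares solution. The following sketch (illustrative assumptions: product Gaussian kernel, random data, arbitrary evaluation point) builds the kernel-weighted moment sums for $d = 2$, applies Cramer's rule for the first coefficient, and compares the result with the direct solve:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = rng.uniform(size=(n, 2))                     # random two-dimensional design
Y = rng.uniform(size=n)
x = np.array([0.5, 0.5])                         # evaluation point (assumed)
h = np.array([0.3, 0.3])                         # bandwidths (assumed)

u = (x - X) / h
K = np.exp(-0.5 * (u ** 2).sum(axis=1))          # kernel weights K_H(x - X_i)
d1, d2 = X[:, 0] - x[0], X[:, 1] - x[1]

S = lambda j, k: np.sum(K * d1 ** j * d2 ** k)         # moment sums S_jk
T = lambda j, k: np.sum(K * d1 ** j * d2 ** k * Y)     # moment sums T_jk

Smat = np.array([[S(0, 0), S(1, 0), S(0, 1)],
                 [S(1, 0), S(2, 0), S(1, 1)],
                 [S(0, 1), S(1, 1), S(0, 2)]])
Tvec = np.array([T(0, 0), T(1, 0), T(0, 1)])

# Cramer's rule for the first coefficient: replace the first column by T
Num = Smat.copy()
Num[:, 0] = Tvec
m_hat = np.linalg.det(Num) / np.linalg.det(Smat)

# cross-check against the direct weighted least squares solution
Z = np.column_stack([np.ones(n), X - x])
beta = np.linalg.solve(Z.T @ (Z * K[:, None]), (Z * K[:, None]).T @ Y)
print(m_hat, beta[0])
```

Both routes solve the same normal equations, so the two printed values agree up to floating-point error.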

Figure 4.18 shows the Nadaraya-Watson and the local linear two-dimensional estimate for simulated data. We use uniformly distributed design points, a smooth regression function and a normally distributed error term. The same bandwidth is chosen for both estimators.
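A setup of this kind can be replayed along the following lines. Since the exact regression function, sample size, noise level and bandwidth of the figure are not reproduced here, the choices below ($m(x) = \sin(2\pi x_1) + x_2$, $n = 500$, $\sigma = 0.25$, $h = 0.1$) are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, sigma = 500, 0.1, 0.25                     # all three values assumed
X = rng.uniform(size=(n, 2))                     # uniform design on the unit square
m = lambda t: np.sin(2 * np.pi * t[..., 0]) + t[..., 1]   # hypothetical m
Y = m(X) + rng.normal(0.0, sigma, size=n)

def nw(x):                                       # Nadaraya-Watson (local constant)
    w = np.exp(-0.5 * (((x - X) / h) ** 2).sum(axis=1))
    return w @ Y / w.sum()

def ll(x):                                       # local linear estimate
    w = np.exp(-0.5 * (((x - X) / h) ** 2).sum(axis=1))
    Z = np.column_stack([np.ones(n), X - x])
    WZ = Z * w[:, None]
    return np.linalg.solve(Z.T @ WZ, WZ.T @ Y)[0]

grid = [np.array([a, b]) for a in np.linspace(0.1, 0.9, 5)
                         for b in np.linspace(0.1, 0.9, 5)]
err_nw = max(abs(nw(g) - m(g)) for g in grid)
err_ll = max(abs(ll(g) - m(g)) for g in grid)
print(f"max abs error  NW: {err_nw:.3f}  local linear: {err_ll:.3f}")
```

Evaluating both estimators on a grid of interior points mirrors how the regression surfaces in such figures are computed.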

Nonparametric kernel regression function estimation is not limited to bivariate distributions. Everything can be generalized to higher dimensions but unfortunately some problems arise. A practical problem is the graphical display for higher dimensional multivariate functions. This problem has already been considered in Chapter 3 when we discussed the graphical representation of multivariate density estimates. The corresponding remarks for plotting functions of up to three-dimensional arguments apply here again.

A general problem in multivariate nonparametric estimation is the so-called curse of dimensionality. Recall that the nonparametric regression estimators are based on the idea of local (weighted) averaging. In higher dimensions the observations are sparsely distributed even for large sample sizes, and consequently estimators based on local averaging perform unsatisfactorily in this situation. Technically, one can explain this effect by looking at the AMSE again. Consider a multivariate regression estimator with the same bandwidth $h$ for all components, e.g. a Nadaraya-Watson or local linear estimator with bandwidth matrix $H = h\, I_d$. Here the asymptotic MSE will also depend on the dimension $d$:

$$ \mathrm{AMSE}(n, h) = \frac{1}{n h^d}\, C_1 + h^4\, C_2, $$
where $C_1$ and $C_2$ are constants that depend neither on $n$ nor $h$. If we derive the optimal bandwidth we find that $h_{\mathrm{opt}} \sim n^{-1/(4+d)}$ and hence the rate of convergence for the AMSE is $n^{-4/(4+d)}$. You can clearly see that the speed of convergence decreases dramatically for higher dimensions $d$.
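A quick back-of-the-envelope computation illustrates this rate. Assuming the AMSE behaves like $n^{-4/(4+d)}$, the sketch below computes how many observations are needed in dimension $d$ to match the accuracy that $n = 100$ observations achieve for $d = 1$ (the baseline $n = 100$ is an arbitrary illustrative choice):

```python
# AMSE rate n^{-4/(4+d)}: sample size needed in dimension d to match
# the accuracy of n = 100 observations in dimension d = 1.
target = 100.0 ** (-4.0 / 5.0)                 # AMSE order for d = 1, n = 100
for d in range(1, 6):
    n_needed = target ** (-(4.0 + d) / 4.0)    # solve n^{-4/(4+d)} = target
    print(f"d = {d}: n = {n_needed:,.0f}")
```

Already for moderate dimensions the required sample sizes grow into the thousands, which is the curse of dimensionality in numbers.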

An introduction to kernel regression methods can be found in the monographs of Silverman (1986), Härdle (1990), Bowman & Azzalini (1997), Simonoff (1996) and Pagan & Ullah (1999). The books of Scott (1992) and Wand & Jones (1995) deal particularly with the multivariate case. For detailed derivations of the asymptotic properties we refer to Collomb (1985) and Gasser & Müller (1984). The latter reference also considers boundary kernels to reduce the bias at the boundary regions of the explanatory variables.

Locally weighted least squares were originally studied by Stone (1977), Cleveland (1979) and Lejeune (1985). Technical details of asymptotic expansions for bias and variance can be found in Ruppert & Wand (1994). Monographs concerning local polynomial fitting are Wand & Jones (1995), Fan & Gijbels (1996) and Simonoff (1996). Computational aspects, in particular the WARPing technique (binning) for kernel and local polynomial regression, are discussed in Härdle & Scott (1992) and Fan & Marron (1994). The monograph of Loader (1999) discusses local regression in combination with likelihood-based estimation.

For comprehensive works on spline smoothing see Eilers & Marx (1996), Wahba (1990) and Green & Silverman (1994). Good resources for wavelets are Daubechies (1992), Donoho & Johnstone (1994), Donoho & Johnstone (1995) and Donoho et al. (1995). The books of Eubank (1999) and Schimek (2000b) provide extensive overviews on a variety of different smoothing methods.

For a monograph on testing in nonparametric models see Hart (1997). The concepts presented by equations (4.63)-(4.66) are in particular studied by the following articles: González Manteiga & Cao (1993) and Härdle & Mammen (1993) introduced (4.63), Gozalo & Linton (2001) studied (4.64) motivated by Lagrange multiplier tests. Equation (4.65) was originally introduced by Zheng (1996) and independently discussed by Fan & Li (1996). Finally, (4.66) was proposed by Dette (1999) in the context of testing for parametric structures in the regression function. For an introductory presentation see Yatchew (2003). In general, all test approaches are also possible for multivariate regressors.

More sophisticated is the minimax approach for testing nonparametric alternatives studied by Ingster (1993). This approach tries to maximize the power for the worst alternative case, i.e. the one that is closest to the hypothesis but could still be detected. Rather different approaches have been introduced by Bierens (1990) and Bierens & Ploberger (1997), who consider (integrated) conditional moment tests or by Stute (1997) and Stute et al. (1998), who verify, via bootstrap, whether the residuals of the hypothesis integrated over the empirical distribution of the regressor variable converge to a centered Gaussian process. There is further literature about adaptive testing which tries to find the smoothing parameter that maximizes the power when holding the type one error, see for example Ledwina (1994), Kallenberg & Ledwina (1995), Spokoiny (1996) and Spokoiny (1998).

EXERCISE 4.1   Consider $(X, Y)$ with a bivariate normal distribution, i.e.

$$ \begin{pmatrix} X \\ Y \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix},\; \begin{pmatrix} \sigma_X^2 & \rho\, \sigma_X \sigma_Y \\ \rho\, \sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix} \right) $$

with density

$$ f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\!\left[ -\frac{1}{2(1 - \rho^2)} \left\{ \frac{(x - \mu_X)^2}{\sigma_X^2} - \frac{2\rho\, (x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} \right\} \right]. $$

This means $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$. Show that the regression function $m(x) = E(Y \mid X = x)$ is linear, more exactly

$$ m(x) = \alpha + \beta x $$

with $\beta = \rho\, \sigma_Y / \sigma_X$ and $\alpha = \mu_Y - \beta \mu_X$.

EXERCISE 4.2   Explain the effects of

$$ \left\{ m''(x) + 2\, \frac{m'(x)\, f_X'(x)}{f_X(x)} \right\}^2 $$

in the bias part of the AMSE of the Nadaraya-Watson estimator, see equation (4.13). When will the bias be large/small?

EXERCISE 4.3   Calculate the pointwise optimal bandwidth $h_{\mathrm{opt}}(x)$ from the AMSE of the Nadaraya-Watson estimator, see equation (4.13).

EXERCISE 4.4   Show that $\widehat m_{0,h}(x) = \widehat m_h(x)$, i.e. the local constant estimator (4.17) equals the Nadaraya-Watson estimator.

EXERCISE 4.5   Show that from (4.33) and (4.34) indeed the spline formula in equation (4.35) follows.

EXERCISE 4.6   Compare the kernel, the $k$-NN, the spline and a linear regression fit for a data set.

EXERCISE 4.7   Show that the Legendre polynomials are indeed orthogonal functions.

EXERCISE 4.8   Compute and plot the confidence intervals for the Nadaraya-Watson kernel estimate and the local polynomial estimate for a data set.

EXERCISE 4.9   Prove that the solution of the minimization problem (4.31) is a piecewise cubic polynomial function which is twice continuously differentiable (cubic spline).

EXERCISE 4.10   Discuss the behavior of the smoothing parameters: the bandwidth $h$ in kernel regression, the number of neighbors $k$ in nearest-neighbor estimation, the penalty parameter $\lambda$ in spline smoothing and the number of terms in orthogonal series estimation.

Summary
The regression function $m(\cdot)$, which relates an independent variable $X$ and a dependent variable $Y$, is the conditional expectation

$$ m(x) = E(Y \mid X = x). $$
A natural kernel regression estimate for a random design can be obtained by replacing the unknown densities in $m(x)$ by kernel density estimates. This yields the Nadaraya-Watson estimator

$$ \widehat m_h(x) = \frac{1}{n} \sum_{i=1}^{n} W_{hi}(x)\, Y_i $$

with weights

$$ W_{hi}(x) = \frac{K_h(x - X_i)}{\frac{1}{n} \sum_{j=1}^{n} K_h(x - X_j)}. $$
For a fixed design we can use the Gasser-Müller estimator with weights

$$ W_{hi}^{GM}(x) = n \int_{s_{i-1}}^{s_i} K_h(x - u)\, du, \qquad s_i = \frac{X_{(i)} + X_{(i+1)}}{2}. $$
The asymptotic MSE of the Nadaraya-Watson estimator is

$$ \mathrm{AMSE}\{\widehat m_h(x)\} = \frac{1}{nh}\, \frac{\sigma^2(x)}{f_X(x)}\, \|K\|_2^2 + \frac{h^4}{4} \left\{ m''(x) + 2\, \frac{m'(x)\, f_X'(x)}{f_X(x)} \right\}^2 \mu_2(K)^2; $$

the asymptotic MSE of the Gasser-Müller estimator is identical up to the $2\, m'(x) f_X'(x) / f_X(x)$ term.
The Nadaraya-Watson estimator is a local constant least squares estimator. Extending the local constant approach to local polynomials of degree $p$ yields the minimization problem:

$$ \min_{\beta_0, \ldots, \beta_p} \sum_{i=1}^{n} \left\{ Y_i - \sum_{j=0}^{p} \beta_j (X_i - x)^j \right\}^2 K_h(x - X_i), $$

where $\widehat\beta_0$ is the estimator of the regression curve and the $\widehat\beta_j$, $j \geq 1$, are proportional to the estimates for the derivatives.
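The minimization problem translates directly into a weighted polynomial fit. A minimal univariate sketch (Gaussian kernel assumed; not the book's implementation):

```python
import numpy as np

def local_poly(x, X, Y, h, p=1):
    """Local polynomial estimate of degree p at the point x.

    Minimizes sum_i {Y_i - sum_j beta_j (X_i - x)^j}^2 K_h(x - X_i);
    beta_0 estimates m(x) and j! * beta_j estimates the j-th derivative.
    """
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    w = np.exp(-0.5 * ((x - X) / h) ** 2)            # Gaussian kernel weights
    Z = np.vander(X - x, p + 1, increasing=True)     # columns (X_i - x)^j
    WZ = Z * w[:, None]
    beta = np.linalg.solve(Z.T @ WZ, WZ.T @ Y)
    return beta[0]                                   # the regression estimate
```

A degree-$p$ fit reproduces polynomials up to degree $p$ exactly, and for $p = 0$ the estimator reduces to the Nadaraya-Watson estimator.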
For the regression problem, odd order local polynomial estimators outperform even order regression fits. In particular, the asymptotic MSE for the local linear kernel regression estimator is

$$ \mathrm{AMSE}\{\widehat m_{1,h}(x)\} = \frac{1}{nh}\, \frac{\sigma^2(x)}{f_X(x)}\, \|K\|_2^2 + \frac{h^4}{4}\, \{m''(x)\}^2\, \mu_2(K)^2. $$

The bias does not depend on $f_X$, $f_X'$ and $m'$, which makes the local linear estimator more design adaptive and improves its behavior in boundary regions.
The $k$-NN estimator has in its simplest form the representation

$$ \widehat m_k(x) = \frac{1}{k} \sum_{i=1}^{n} Y_i\, I\{ X_i \text{ is among the } k \text{ nearest neighbors of } x \}. $$

This can be refined using kernel weights instead of uniform weighting of all observations nearest to $x$. The variance of the $k$-NN estimator does not depend on $f_X$.
Median smoothing is a version of $k$-NN which estimates the conditional median rather than the conditional mean:

$$ \widehat m(x) = \operatorname{med}\{ Y_i : X_i \text{ is among the } k \text{ nearest neighbors of } x \}. $$

It is more robust to outliers.
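Both variants fit in a few lines. The sketch below (an illustrative univariate implementation, not taken from the book) switches between the $k$-NN mean and median smoothing:

```python
import numpy as np

def knn_smooth(x, X, Y, k, median=False):
    """k-NN regression at x; median=True performs median smoothing."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    idx = np.argsort(np.abs(X - x))[:k]      # indices of the k nearest X_i
    return np.median(Y[idx]) if median else Y[idx].mean()
```

With an outlier among the neighbors, the mean is pulled toward it while the median stays put, which is the robustness advantage mentioned above.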
The smoothing spline minimizes a penalized residual sum of squares over all twice differentiable functions $m$:

$$ \sum_{i=1}^{n} \{ Y_i - m(X_i) \}^2 + \lambda \int \{ m''(u) \}^2\, du; $$

the solution consists of piecewise cubic polynomials between the ordered observations $X_{(i)}$. Under regularity conditions the spline smoother is equivalent to a kernel estimator with a higher order kernel and a design dependent bandwidth.
Orthogonal series regression (Fourier regression, wavelet regression) uses the fact that under certain conditions, functions can be represented by a series of basis functions. The coefficients of the basis functions have to be estimated. The smoothing parameter is the number of terms in the series.
Bandwidth selection in regression is usually done by cross-validation or the penalized residual sum of squares.
Pointwise confidence intervals and uniform confidence bands can be constructed analogously to the density estimation case.
Nonparametric regression estimators for univariate data can be easily generalized to the multivariate case. A general problem is the curse of dimensionality.