6.2 Kernel Regression

Regression smoothing investigates the association between an explanatory variable $ X$ and a response variable $ Y$. This section explains how to apply Nadaraya-Watson and local polynomial kernel regression.

Nonparametric regression aims to estimate the functional relation between $ Y$ and $ X$, i.e. the conditional expectation

$\displaystyle E(Y\vert X)=m(X).$

The function $ m(\bullet)$ is the regression function to estimate. Alternatively, we can write

$\displaystyle Y= m(X) + \varepsilon, \quad E(\varepsilon) = 0.$

The variance of $ Y$ given $ X$ need not be constant. Typically one assumes

$\displaystyle \ \textrm{Var}(Y\vert X) = \sigma^2(X),$ (6.11)

where $ \sigma(\bullet)$ is a continuous and bounded function.

Suppose that we have independent observations $ (x_1,y_1),\ldots,(x_n,y_n)$. The Nadaraya-Watson estimator is defined as

$\displaystyle \widehat m_{h}(x)=
\frac{\sum\limits_{i=1}^n K\left(\frac{\displaystyle x_{i}-x}{\displaystyle h}\right)\,y_{i}}
{\sum\limits_{i=1}^n K\left(\frac{\displaystyle x_{i}-x}{\displaystyle h}\right)}\,.$

It is easy to see that this kernel regression estimator is just a weighted sum of the observed responses $ y_{i}$. The denominator ensures that the weights sum up to 1. The kernel function $ K(\bullet)$ can be any kernel function from Table 6.1.
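
To make the weighting explicit, here is a minimal sketch of this formula in Python (not XploRe code; the quartic kernel and the function names are illustrative assumptions), computing the Nadaraya-Watson estimate at a single point:

  import numpy as np

  def quartic(u):
      # quartic kernel K(u) = 15/16 (1 - u^2)^2 for |u| <= 1, and 0 otherwise
      return np.where(np.abs(u) <= 1, 15.0 / 16 * (1 - u**2)**2, 0.0)

  def nadaraya_watson(x0, x, y, h):
      # kernel weights K((x_i - x0)/h); dividing by their sum makes them sum to 1
      w = quartic((x - x0) / h)
      return np.sum(w * y) / np.sum(w)   # assumes some observation lies within h of x0

Evaluating this at every observation or grid point is, conceptually, what regxest computes exactly and regest approximates via WARPing.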


6.2.1 Computational Aspects

The computational effort for calculating a Nadaraya-Watson or local polynomial regression is of the same order as for kernel density estimation (see Section 6.1.1). As in density estimation, all routines are offered in an exact and in a WARPing version:

Functionality                           Exact        WARPing
Nadaraya-Watson regression              regxest      regest
Nadaraya-Watson confidence intervals    regxci       regci
Nadaraya-Watson confidence bands        regxcb       regcb
Nadaraya-Watson bandwidth selection     regxbwsel    regbwsel
local polynomial regression             lpregxest    lpregest
local polynomial derivatives            lpderxest    lpderest


6.2.2 Computing Kernel Regression Estimates


mh = regest(x {,h {,K} {,d}})
computes the kernel regression on a grid using the WARPing method
mh = regxest(x {,h {,K} {,v}})
computes the kernel regression estimate for all observations or on a grid v by exact computation

The WARPing-based function regest offers the fastest way to compute the Nadaraya-Watson regression estimator for exploratory purposes. We apply this routine to the nicfoo data, which contain observations on household netincome in the first column and on food expenditures in the second column. The following quantlet computes and plots the regression curve together with the data:

  nicfoo=read("nicfoo")
  h=0.2*(max(nicfoo[,1])-min(nicfoo[,1]))
  mh=regest(nicfoo,h)
  mh=setmask(mh,"line","blue")
  xy=setmask(nicfoo,"cross","small")
  plot(xy,mh)
  setgopt(plotdisplay,1,1,"title","regression estimate")
XLGsmoo08.xpl

The function regest needs to be replaced by regxest if one is interested in the fitted values for all observations $ x_1,\ldots,x_n$. The following quantlet code computes the regression function at all observations and shows a residual plot together with the zero line:
  mh=regxest(nicfoo,0.2)
  res=nicfoo[,1] ~ (nicfoo[,2]-mh[,2])
  res=setmask(res,"cross")
  zline=(min(nicfoo[,1])|max(nicfoo[,1])) ~ (0|0)
  zline=setmask(zline,"line","red")
  plot(res,zline)
  setgopt(plotdisplay,1,1,"title","regression residuals")
XLGsmoo09.xpl

Figure 6.8: Regression estimate for the netincome vs. food expenditures data.
\includegraphics[scale=0.425]{smootherreg1}

Figure 6.9: Residuals of kernel regression.
\includegraphics[scale=0.425]{smootherreg2}

The resulting regression function is shown in Figure 6.8, and Figure 6.9 shows the corresponding residual plot. We observe that most of the nonlinear structure of the data is captured by the nonparametric regression function. However, the residual graph shows that the data are heteroskedastic: the residual variance increases with increasing netincome.


6.2.3 Bandwidth Selection


{hcrit,crit} = regbwsel()
starts an interactive tool for kernel regression bandwidth selection using the WARPing method
{hcrit,crit} = regxbwsel()
starts an interactive tool for kernel regression bandwidth selection using exact computation

As in kernel density estimation, kernel regression involves choosing the kernel function and the bandwidth parameter. One observes the same phenomenon as in kernel density estimation here: the difference between estimates obtained with two different kernel functions is almost negligible once the bandwidths are appropriately rescaled. To make the bandwidths for two different kernels comparable, the same technique as described in Subsection 6.1.3 can be used.

Consequently, we now concentrate on the problem of bandwidth selection. In the regression case, typically the averaged squared error

$\displaystyle \textrm{ASE}(h) = \sum_{i=1}^n \left\{m(x_i)-\widehat{m}_h(x_i) \right\}^2$ (6.12)

is used as the criterion for goodness of fit. A simple estimate of $ \textrm{ASE}$ is obtained by replacing the unknown function values $ m(x_i)$ by the observations $ y_i$, which yields the residual sum of squares

$\displaystyle \ \textrm{RSS}(h) = \sum_{i=1}^n \left\{y_i-\widehat{m}_h(x_i)\right\}^2.$

However, in the case of nonparametric kernel regression the fitted values $ \widehat{m}_h(x_i)$ approach $ y_i$ as the bandwidth $ h$ approaches zero, so that $ \textrm{RSS}(h)$ is minimized by interpolating the data. A more appropriate estimate of $ \textrm{ASE}$ is obtained by the cross-validation criterion

$\displaystyle \textrm{CV}(h)= \sum_{i=1}^n \left\{y_i-\widehat{m}_{h,-i}(x_i) \right\}^2,$ (6.13)

where $ \widehat{m}_{h,-i}(\bullet)$ denotes the kernel regression estimate which is obtained without using the $ i$-th observation $ (x_i,y_i)$. Some calculation shows (Härdle; 1990) that $ \textrm{CV}(h)$ can be written as

$\displaystyle \textrm{CV}(h) = \sum_{i=1}^n \{y_{i}-\widehat{m}_{h}(x_{i})\}^2 \ \,\Xi\left(W_{hi}(x_{i})\right)$ (6.14)

with $ \Xi(u)=(1-u)^{-2}$ and

$\displaystyle W_{hi}(x_{i})= \frac{K(0)}{\sum\limits_{j=1}^n K\left(\frac{x_{i}-x_{j}}{h}\right)}.$

Therefore, the cross-validation criterion can also be interpreted as a residual sum of squares ($ \textrm{RSS}$), where small values of $ h$ are penalized by the inclusion of the function $ \Xi$. The function $ \Xi(\bullet)$ coincides with the penalty function that Craven and Wahba (1979) proposed for their generalized cross-validation criterion. Table 6.3 lists some more penalty functions.


Table 6.3: Penalizing functions.

Criterion                            Penalty function
(Generalized) Cross-validation       $ \Xi_{\textrm{GCV}}(u)=(1-u)^{-2}$
Shibata's model selector             $ \Xi_{\textrm{S}}(u)=1+2u$
Akaike's Information Criterion       $ \Xi_{\textrm{AIC}}(u)=\exp(2u)$
Akaike's Finite Prediction Error     $ \Xi_{\textrm{FPE}}(u)=(1+u)/(1-u)$
Rice's $ T$                          $ \Xi_{\textrm{T}}(u)=(1-2u)^{-1}$


All the mentioned penalty functions have the same asymptotic properties. In finite samples, however, the functions differ in the relative weight they give to variance and bias of $ \widehat{m}_{h}(x)$. Rice's $ T$ gives the most weight to variance reduction while Shibata's model selector stresses bias reduction the most.
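
To illustrate how the criterion (6.14) and the penalty functions of Table 6.3 fit together, the following sketch (Python rather than XploRe; the quartic kernel, the function names and the bandwidth grid are illustrative assumptions) evaluates the penalized residual sum of squares and picks the minimizing bandwidth:

  import numpy as np

  def quartic(u):
      return np.where(np.abs(u) <= 1, 15.0 / 16 * (1 - u**2)**2, 0.0)

  # penalty functions Xi(u) from Table 6.3
  penalties = {
      "GCV":     lambda u: (1 - u)**(-2),
      "Shibata": lambda u: 1 + 2 * u,
      "AIC":     lambda u: np.exp(2 * u),
      "FPE":     lambda u: (1 + u) / (1 - u),
      "Rice":    lambda u: (1 - 2 * u)**(-1),
  }

  def penalized_rss(x, y, h, xi):
      # criterion (6.14): sum of {y_i - mh(x_i)}^2 * Xi(W_hi(x_i))
      crit = 0.0
      for i in range(len(x)):
          w = quartic((x - x[i]) / h)        # kernel weights at x_i (the kernel is symmetric)
          mh_i = np.sum(w * y) / np.sum(w)   # Nadaraya-Watson fit at x_i
          w_hi = quartic(0.0) / np.sum(w)    # W_hi(x_i) = K(0) / sum_j K((x_i - x_j)/h)
          crit += (y[i] - mh_i)**2 * xi(w_hi)
      return crit

  def select_bandwidth(x, y, grid, criterion="GCV"):
      # return the bandwidth from 'grid' that minimizes the chosen criterion
      xi = penalties[criterion]
      return grid[int(np.argmin([penalized_rss(x, y, h, xi) for h in grid]))]

With criterion="GCV" this corresponds to the cross-validation choice; the other entries simply swap the penalty function.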

In XploRe, all penalizing functions can be applied via the functions regbwsel and regxbwsel. As can be seen from (6.13), criteria like $ \textrm{CV}$ need to be evaluated at all observations. Thus, the function regxbwsel, which uses exact computations, is to be preferred here. regbwsel uses the WARPing approximation and may select bandwidths far from the optimum if the discretization binwidth $ d$ is too large. Note that both regbwsel and regxbwsel may suffer from numerical problems if the studied bandwidths are too small.

The following quantlet gives an example for calling regxbwsel, using the nicfoo data again:

  nicfoo=read("nicfoo")
  tmp=regxbwsel(nicfoo)
XLGsmoo10.xpl

At first, a selection box opens which offers the choice between the different penalty functions. Let us choose the first item, Cross Validation, which optimizes the $ \textrm{CV}$ criterion (6.13). A graphical display appears which shows the $ \textrm{CV}$ criterion in the upper left, the chosen optimal bandwidth in the upper right, the resulting kernel regression in the lower left, and information about the search grid and the kernel in the lower right.

Figure 6.10 shows this graphical display. The menu now allows the modification of the search grid and the kernel or the usage of other bandwidth selectors.



Figure 6.10: Cross-validation bandwidth selector for Netincome vs. Food data.
\includegraphics[scale=0.45]{smootherregcv}


6.2.4 Confidence Intervals and Bands


{mh, mhl, mhu} = regci(x {,h {,alpha {,K} {,d}}})
computes the kernel regression estimate and pointwise confidence intervals on a grid using the WARPing method
{mh, mhl, mhu} = regcb(x {,h {,alpha {,K} {,d}}})
computes the kernel regression and uniform confidence bands on a grid using the WARPing method
{mh, mhl, mhu} = regxci(x {,h {,alpha {,K} {,d}}})
computes the kernel regression estimate and pointwise confidence intervals for all observations or on a grid v by exact computation
{mh, mhl, mhu} = regxcb(x {,h {,alpha {,K} {,d}}})
computes the kernel regression and uniform confidence bands for all observations or on a grid v by exact computation

As in the case of density estimation, it can be shown that the regression estimates $ \widehat{m}_h(x)$ have an asymptotic normal distribution. Suppose that $ m$ and $ f$ (the density of the explanatory variable $ X$) are twice differentiable, and that $ h=cn^{-1/5}$. Then

$\displaystyle n^{2/5} \left\{\widehat{m}_{h}(x)-m(x)\right\} \mathrel{\mathop{\longrightarrow}\limits^{L}} N\left( \frac{c^{2}}{2}\,\mu_{2}(K)\,b(x)\,,\; \frac{1}{c}\,\frac{\sigma^2(x)}{f(x)}\,\Vert K \Vert^{2}_{2} \right),$ (6.15)

with $ \mu_{2}(K)=\int u^{2}K(u)\,du$ and

$\displaystyle b(x)=m''(x)+2\,\frac{m'(x)f'(x)}{f(x)}. $

Thus, similar to the density case, an approximate confidence interval can be obtained by

$\displaystyle \left[\widehat{m}_h (x) - z_{1-\frac{\alpha}{2}} \sqrt{\frac{ \widehat{\sigma}_h^2 (x)\Vert K \Vert^2_2 }{nh\widehat{f}_h(x)}}\,,\;\; \widehat{m}_h (x) + z_{1-\frac{\alpha}{2}} \sqrt{\frac{ \widehat{\sigma}_h^2 (x)\Vert K \Vert^2_2 }{nh\widehat{f}_h(x)}}\, \right]\,,$ (6.16)

where $ z_{1-\frac{\alpha}{2}}$ is the $ (1-\frac{\alpha}{2})$-quantile of the standard normal distribution and the estimate of the variance $ \textrm{Var}(Y\vert x)= \sigma^2(x)$ is given by

$\displaystyle \widehat{\sigma}_h^2(x)=
\frac{\sum\limits_{i=1}^n K\left(\frac{x_{i}-x}{h}\right)\{y_{i}-\widehat{m}_h(x)\}^2}
{\sum\limits_{i=1}^n K\left(\frac{x_{i}-x}{h}\right)}\,.$
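
As an illustration of (6.16) together with this variance estimate, the following sketch (Python, not XploRe; the quartic kernel, for which $ \Vert K\Vert_2^2=5/7$, and the use of scipy for the normal quantile are assumptions) computes a pointwise interval at a single point:

  import numpy as np
  from scipy.stats import norm

  def quartic(u):
      return np.where(np.abs(u) <= 1, 15.0 / 16 * (1 - u**2)**2, 0.0)

  K_NORM_SQ = 5.0 / 7.0   # ||K||_2^2 of the quartic kernel

  def nw_pointwise_ci(x0, x, y, h, alpha=0.05):
      n = len(x)
      w = quartic((x - x0) / h)
      mh = np.sum(w * y) / np.sum(w)              # Nadaraya-Watson estimate at x0
      fh = np.sum(w) / (n * h)                    # kernel density estimate f_h(x0)
      s2 = np.sum(w * (y - mh)**2) / np.sum(w)    # conditional variance estimate
      half = norm.ppf(1 - alpha / 2) * np.sqrt(s2 * K_NORM_SQ / (n * h * fh))
      return mh - half, mh + half                 # interval (6.16)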

Also similar to the density case, uniform confidence bands for $ m(\bullet)$ require rather restrictive assumptions (Härdle; 1990, p. 116). Suppose that $ f$ is a density on $ [0,1]$ and $ h=n^{-\delta}$, $ \delta \in (\frac{1}{5},\frac{1}{2})$. Then, under some regularity conditions, it holds for all $ x \in [0,1]$:

$\displaystyle P\Bigg(\widehat{m}_{h}(x)- z_{n,\alpha}
\sqrt{\frac{\widehat{\sigma}^2_{h}(x)\Vert K \Vert _{2}^{2}}{nh\widehat{f}_{h}(x)}}
\;\leq\; m(x) \;\leq\;
\widehat{m}_{h}(x)+ z_{n,\alpha}
\sqrt{\frac{\widehat{\sigma}^2_{h}(x)\Vert K \Vert _{2}^{2}}{nh\widehat{f}_{h}(x)}}\Bigg)
\mathrel{\mathop{\approx}\limits_{\textrm{\small as.}}^{}} 1-\alpha,$

where

$\displaystyle z_{n,\alpha}=\left\{\frac{-\log\{-\frac12 \log(1-\alpha)\}}{(2\delta\log{n})^{1/2}}+d_{n}\right\}^{1/2},$

$\displaystyle d_{n}=(2\delta\log{n})^{1/2}+(2\delta\log{n})^{-1/2}
\log{\left(\frac{1}{2\pi}\frac{\Vert K'\Vert _{2}}{\Vert K \Vert _{2}}\right)}^{1/2}.$

Pointwise confidence intervals and uniform confidence bands using the WARPing approximation are provided by regci and regcb, respectively. The equivalents for exact computations are regxci and regxcb. The functions regcb and regxcb can be directly applied to the original data $ x_1,\ldots,x_n$; the transformation to $ [0,1]$ is performed internally. The following quantlet code extends the above regression function by confidence intervals and confidence bands:

  {mh,mli,mui}=regci(nicfoo,0.18)  ; intervals
  {mh,mlb,mub}=regcb(nicfoo,0.18)  ; bands
  mh =setmask(mh,"line","blue","thick")
  mli=setmask(mli,"line","blue","thin","dashed")
  mui=setmask(mui,"line","blue","thin","dashed")
  mlb=setmask(mlb,"line","blue","thin")
  mub=setmask(mub,"line","blue","thin")
  plot(mh,mli,mui,mlb,mub)
  setgopt(plotdisplay,1,1,"title","Confidence Intervals & Bands")
XLGsmoo11.xpl

Both confidence intervals and bands were computed using bandwidth 0.18. The significance level and the kernel are not specified, so the defaults 0.05 and "qua" are used. The resulting plot is shown in Figure 6.11.

Figure 6.11: Pointwise confidence intervals (thin dashed) and uniform confidence bands (thin solid) in regression.
\includegraphics[scale=0.425]{smootherrcb}


6.2.5 Local Polynomial Regression and Derivative Estimation


mh = lpregest(x {,h {,p {,K} {,d}}})
computes the local polynomial kernel regression on a grid using the WARPing method
mh = lpregxest(x {,h {,p {,v}}})
computes the local polynomial kernel regression estimate for all observations or on a grid v by exact computation

Note that the Nadaraya-Watson estimator is a local constant estimator, i.e. the solution of

$\displaystyle \min_{\beta_{0}} \sum_{i=1}^{n} \left\{ y_{i} - \beta_{0} \right\}^{2} K\left(\frac{x_i-x}{h}\right).$
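
Indeed, setting the derivative of this criterion with respect to $ \beta_{0}$ equal to zero recovers the Nadaraya-Watson formula:

$\displaystyle -2\sum_{i=1}^{n} \left\{ y_{i} - \beta_{0} \right\} K\left(\frac{x_i-x}{h}\right) = 0
\quad\Longrightarrow\quad
\widehat{\beta}_{0} = \frac{\sum\limits_{i=1}^{n} K\left(\frac{x_i-x}{h}\right) y_{i}}{\sum\limits_{i=1}^{n} K\left(\frac{x_i-x}{h}\right)} = \widehat{m}_{h}(x).$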

Replacing $ \beta_{0}$ by a $ p$-th order polynomial in $ x_i-x$ yields the local polynomial kernel regression estimator. For details, see Fan and Gijbels (1996). Let us illustrate this with the example of a local linear regression estimate ($ p=1$). The minimization problem is

$\displaystyle \min_{\beta_{0},\beta_{1}} \sum_{i=1}^{n} \left\{ y_{i} - \beta_{0} - \beta_{1}(x_{i}-x) \right\}^{2} K\left(\frac{x_i-x}{h}\right).$

The solution of the problem can be written as

$\displaystyle \widehat{\beta} = (\widehat{\beta}_{0}, \widehat{\beta}_{1})^{T} = \left( {\cal X}^{T} {\cal W} {\cal X} \right)^{-1} {\cal X}^{T} {\cal W} {\cal Y}$ (6.17)

using the notations

$\displaystyle {\cal X}= \left(\begin{array}{cc}
1&(x_{1}-x)\\
\vdots&\vdots\\
1&(x_{n}-x)
\end{array}\right), \qquad
{\cal Y}= \left(\begin{array}{c}
y_{1}\\ \vdots \\ y_{n}
\end{array}\right),
$

and $ {\cal W}=\textrm{diag}\left\{K\left(\frac{x_{1}-x}{h}\right),\ldots,K\left(\frac{x_{n}-x}{h}\right)\right\}$. In (6.17), $ \widehat{m}_h(x)=\widehat{\beta}_{0}$ estimates the regression function itself, whereas $ \widehat{m}_{1,h}(x)=\widehat{\beta}_{1}$ estimates the first derivative of the regression function.
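
As a compact illustration of (6.17), the following sketch (Python rather than XploRe; the quartic kernel is again an illustrative choice) solves the weighted least squares problem at a single point and returns both the local fit and the slope, i.e. the derivative estimate:

  import numpy as np

  def quartic(u):
      return np.where(np.abs(u) <= 1, 15.0 / 16 * (1 - u**2)**2, 0.0)

  def local_linear(x0, x, y, h):
      # design matrix with columns (1, x_i - x0) and kernel weight matrix W
      X = np.column_stack([np.ones_like(x), x - x0])
      W = np.diag(quartic((x - x0) / h))
      # beta_hat = (X'WX)^{-1} X'WY, cf. (6.17); needs >= 2 observations within the window
      beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
      return beta[0], beta[1]   # m_hat(x0) and its first-derivative estimate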

The functions lpregest and lpregxest for local polynomial regression have essentially the same input as their Nadaraya-Watson equivalents, except that an additional parameter $ p$ specifying the degree of the polynomial can be given. For local polynomial regression, an odd value of $ p$ is recommended, since odd-order local polynomial fits outperform even-order fits.

Derivatives of regression functions are computed with lpderest or lpderxest. For derivative estimation, a polynomial order whose difference from the derivative order is odd should be used. Typically, one uses $ p=1$ (local linear) for the estimation of the regression function and $ p=2$ (local quadratic) for the estimation of its first derivative.

lpregxest and lpderxest automatically use local linear and local quadratic estimation, respectively, if no order is specified. The default kernel function is the quartic kernel "qua". Appropriate bandwidths can be found by means of rules of thumb that replace the unknown regression function by a higher-order polynomial (Fan and Gijbels; 1996). The following quantlet code estimates the regression function and its first derivative by the local polynomial method. Both functions and the data are plotted together in Figure 6.12.

  motcyc=read("motcyc")
  hh=lpregrot(motcyc)     ; rule-of-thumb bandwidth
  hd=lpderrot(motcyc)     ; rule-of-thumb bandwidth
  mh=lpregest(motcyc,hh)  ; local linear regression
  md=lpderest(motcyc,hd)  ; local quadratic derivative
  mh=setmask(mh,"line","black")
  md=setmask(md,"line","blue","dashed")
  xy=setmask(motcyc,"cross","small","red")
  plot(xy,mh,md)
  setgopt(plotdisplay,1,1,"title","Local Polynomial Estimation")
XLGsmoo12.xpl

Figure 6.12: Local linear regression estimate (solid), derivative estimate (dashed), and data.
\includegraphics[scale=0.425]{smootherlld}