6.2 Kernel Regression

Regression smoothing investigates the association between an explanatory variable $ X$ and a response variable $ Y$. This section explains how to apply Nadaraya-Watson and local polynomial kernel regression.

Nonparametric regression aims to estimate the functional relation between $ Y$ and $ X$, i.e. the conditional expectation

$\displaystyle E(Y\vert X)=m(X).$

The function $ m(\bullet)$ is the regression function to estimate. Alternatively, we can write

$\displaystyle Y= m(X) + \varepsilon, \quad E(\varepsilon) = 0.$

The variance of $ Y$ given $ X$ need not be constant. Typically one assumes

$\displaystyle \ \textrm{Var}(Y\vert X) = \sigma^2(X),$ (6.11)

where $ \sigma(\bullet)$ is a continuous and bounded function.

Suppose that we have independent observations $ (x_1,y_1),\ldots,(x_n,y_n)$. The Nadaraya-Watson estimator is defined as

$\displaystyle \widehat m_{h}(x)=
\frac{\sum\limits_{i=1}^n K\left(\frac{\displaystyle x_{i}-x}{\displaystyle h}\right)\,y_{i}}
{\sum\limits_{i=1}^n K\left(\frac{\displaystyle x_{i}-x}{\displaystyle h}\right)}\,.$

It is easy to see that this kernel regression estimator is just a weighted sum of the observed responses $ y_{i}$. The denominator ensures that the weights sum up to 1. The kernel function $ K(\bullet)$ can be any kernel function from Table 6.1.
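
To make the weighting explicit, here is a minimal sketch of this formula in Python (not XploRe code; the quartic kernel and the function names are illustrative assumptions), computing the Nadaraya-Watson estimate at a single point:

  import numpy as np

  def quartic(u):
      # quartic kernel K(u) = 15/16 (1 - u^2)^2 for |u| <= 1, and 0 otherwise
      return np.where(np.abs(u) <= 1, 15.0 / 16 * (1 - u**2)**2, 0.0)

  def nadaraya_watson(x0, x, y, h):
      # kernel weights K((x_i - x0)/h); dividing by their sum makes them sum to 1
      w = quartic((x - x0) / h)
      return np.sum(w * y) / np.sum(w)   # assumes some observation lies within h of x0

Evaluating this at every observation or grid point is, conceptually, what regxest computes exactly and regest approximates via WARPing.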


6.2.1 Computational Aspects

The computational effort for calculating a Nadaraya-Watson or local polynomial regression is of the same order as for kernel density estimation (see Section 6.1.1). As in density estimation, all routines are offered in an exact and in a WARPing version:

Functionality                           Exact        WARPing
Nadaraya-Watson regression              regxest      regest
Nadaraya-Watson confidence intervals    regxci       regci
Nadaraya-Watson confidence bands        regxcb       regcb
Nadaraya-Watson bandwidth selection     regxbwsel    regbwsel
local polynomial regression             lpregxest    lpregest
local polynomial derivatives            lpderxest    lpderest


6.2.2 Computing Kernel Regression Estimates


mh = regest(x {,h {,K} {,d}})
computes the kernel regression on a grid using the WARPing method
mh = regxest(x {,h {,K} {,v}})
computes the kernel regression estimate for all observations or on a grid v by exact computation

The WARPing-based function regest offers the fastest way to compute the Nadaraya-Watson regression estimator for exploratory purposes. We apply this routine to the nicfoo data, which contain observations on household netincome in the first column and on food expenditures in the second column. The following quantlet computes and plots the regression curve together with the data:

  nicfoo=read("nicfoo")
  h=0.2*(max(nicfoo[,1])-min(nicfoo[,1]))
  mh=regest(nicfoo,h)
  mh=setmask(mh,"line","blue")
  xy=setmask(nicfoo,"cross","small")
  plot(xy,mh)
  setgopt(plotdisplay,1,1,"title","regression estimate")
XLGsmoo08.xpl

The function regest needs to be replaced by regxest if one is interested in the fitted values for all observations $ x_1,\ldots,x_n$. The following quantlet code computes the regression function at all observations and shows a residual plot together with the zero line:
  mh=regxest(nicfoo,0.2)
  res=nicfoo[,1] ~ (nicfoo[,2]-mh[,2])
  res=setmask(res,"cross")
  zline=(min(nicfoo[,1])|max(nicfoo[,1])) ~ (0|0)
  zline=setmask(zline,"line","red")
  plot(res,zline)
  setgopt(plotdisplay,1,1,"title","regression residuals")
XLGsmoo09.xpl

Figure 6.8: Regression estimate for the netincome vs. food expenditures data.
\includegraphics[scale=0.425]{smootherreg1}

Figure 6.9: Residuals of kernel regression.
\includegraphics[scale=0.425]{smootherreg2}

The resulting regression function is shown in Figure 6.8, and Figure 6.9 shows the corresponding residual plot. We observe that most of the nonlinear structure of the data is captured by the nonparametric regression function. However, the residual graph shows that the data are heteroskedastic: the residual variance increases with increasing netincome.


6.2.3 Bandwidth Selection


{hcrit,crit} = regbwsel()
starts an interactive tool for kernel regression bandwidth selection using the WARPing method
{hcrit,crit} = regxbwsel()
starts an interactive tool for kernel regression bandwidth selection using exact computation

As in kernel density estimation, kernel regression involves choosing the kernel function and the bandwidth parameter. One observes the same phenomenon as in kernel density estimation here: the difference between estimates obtained with two different kernel functions is almost negligible once the bandwidths are appropriately rescaled. To make the bandwidths for two different kernels comparable, the same technique as described in Subsection 6.1.3 can be used.

Consequently, we now concentrate on the problem of bandwidth selection. In the regression case, typically the averaged squared error

$\displaystyle \textrm{ASE}(h) = \sum_{i=1}^n \left\{m(x_i)-\widehat{m}_h(x_i) \right\}^2$ (6.12)

is used as the criterion for goodness of fit. A simple estimate of $ \textrm{ASE}$ is obtained by replacing the unknown function values $ m(x_i)$ by the observations $ y_i$, which yields the residual sum of squares

$\displaystyle \ \textrm{RSS}(h) = \sum_{i=1}^n \left\{y_i-\widehat{m}_h(x_i)\right\}^2.$

However, in the case of nonparametric kernel regression the fitted values $ \widehat{m}_h(x_i)$ approach $ y_i$ as the bandwidth $ h$ approaches zero, so that $ \textrm{RSS}(h)$ is minimized by interpolating the data. A more appropriate estimate of $ \textrm{ASE}$ is obtained by the cross-validation criterion

$\displaystyle \textrm{CV}(h)= \sum_{i=1}^n \left\{y_i-\widehat{m}_{h,-i}(x_i) \right\}^2,$ (6.13)

where $ \widehat{m}_{h,-i}(\bullet)$ denotes the kernel regression estimate which is obtained without using the $ i$-th observation $ (x_i,y_i)$. Some calculation shows (Härdle; 1990) that $ \textrm{CV}(h)$ can be written as

$\displaystyle \textrm{CV}(h) = \sum_{i=1}^n \{y_{i}-\widehat{m}_{h}(x_{i})\}^2 \ \,\Xi\left(W_{hi}(x_{i})\right)$ (6.14)

with $ \Xi(u)=(1-u)^{-2}$ and

$\displaystyle W_{hi}(x_{i})= \frac{K(0)}{\sum\limits_{j=1}^n K\left(\frac{x_{i}-x_{j}}{h}\right)}.$

Therefore, the cross-validation criterion can also be interpreted as a residual sum of squares ($ \textrm{RSS}$), where small values of $ h$ are penalized by the inclusion of the function $ \Xi$. The function $ \Xi(\bullet)$ coincides with the penalty function that Craven and Wahba (1979) proposed for their generalized cross-validation criterion. Table 6.3 lists some more penalty functions.


Table 6.3: Penalizing functions.

Criterion                            Penalty function
(Generalized) Cross-validation       $ \Xi_{\textrm{GCV}}(u)=(1-u)^{-2}$
Shibata's model selector             $ \Xi_{\textrm{S}}(u)=1+2u$
Akaike's Information Criterion       $ \Xi_{\textrm{AIC}}(u)=\exp(2u)$
Akaike's Finite Prediction Error     $ \Xi_{\textrm{FPE}}(u)=(1+u)/(1-u)$
Rice's $ T$                          $ \Xi_{\textrm{T}}(u)=(1-2u)^{-1}$


All the mentioned penalty functions have the same asymptotic properties. In finite samples, however, the functions differ in the relative weight they give to variance and bias of $ \widehat{m}_{h}(x)$. Rice's $ T$ gives the most weight to variance reduction while Shibata's model selector stresses bias reduction the most.
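
To illustrate how the criterion (6.14) and the penalty functions of Table 6.3 fit together, the following sketch (Python rather than XploRe; the quartic kernel, the function names and the bandwidth grid are illustrative assumptions) evaluates the penalized residual sum of squares and picks the minimizing bandwidth:

  import numpy as np

  def quartic(u):
      return np.where(np.abs(u) <= 1, 15.0 / 16 * (1 - u**2)**2, 0.0)

  # penalty functions Xi(u) from Table 6.3
  penalties = {
      "GCV":     lambda u: (1 - u)**(-2),
      "Shibata": lambda u: 1 + 2 * u,
      "AIC":     lambda u: np.exp(2 * u),
      "FPE":     lambda u: (1 + u) / (1 - u),
      "Rice":    lambda u: (1 - 2 * u)**(-1),
  }

  def penalized_rss(x, y, h, xi):
      # criterion (6.14): sum of {y_i - mh(x_i)}^2 * Xi(W_hi(x_i))
      crit = 0.0
      for i in range(len(x)):
          w = quartic((x - x[i]) / h)        # kernel weights at x_i (the kernel is symmetric)
          mh_i = np.sum(w * y) / np.sum(w)   # Nadaraya-Watson fit at x_i
          w_hi = quartic(0.0) / np.sum(w)    # W_hi(x_i) = K(0) / sum_j K((x_i - x_j)/h)
          crit += (y[i] - mh_i)**2 * xi(w_hi)
      return crit

  def select_bandwidth(x, y, grid, criterion="GCV"):
      # return the bandwidth from 'grid' that minimizes the chosen criterion
      xi = penalties[criterion]
      return grid[int(np.argmin([penalized_rss(x, y, h, xi) for h in grid]))]

With criterion="GCV" this corresponds to the cross-validation choice; the other entries simply swap the penalty function.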

In XploRe, all penalizing functions can be applied via the functions regbwsel and regxbwsel. As can be seen from (6.13), criteria like $ \textrm{CV}$ need to be evaluated at all observations. Thus, the function regxbwsel, which uses exact computations, is to be preferred here. regbwsel uses the WARPing approximation and may select bandwidths far from the optimum if the discretization binwidth $ d$ is too large. Note that both regbwsel and regxbwsel may suffer from numerical problems if the studied bandwidths are too small.

The following quantlet gives an example for calling regxbwsel, using the nicfoo data again:

  nicfoo=read("nicfoo")
  tmp=regxbwsel(nicfoo)
XLGsmoo10.xpl

At first, a selection box opens which offers the choice between the different penalty functions. Let us choose the first item, Cross Validation, which optimizes the $ \textrm{CV}$ criterion (6.13). A graphical display appears which shows the $ \textrm{CV}$ criterion in the upper left, the chosen optimal bandwidth in the upper right, the resulting kernel regression in the lower left, and information about the search grid and the kernel in the lower right.

Figure 6.10 shows this graphical display. The menu now allows the modification of the search grid and the kernel or the usage of other bandwidth selectors.



Figure 6.10: Cross-validation bandwidth selector for Netincome vs. Food data.
\includegraphics[scale=0.45]{smootherregcv}


6.2.4 Confidence Intervals and Bands


{mh, mhl, mhu} = regci(x {,h {,alpha {,K} {,d}}})
computes the kernel regression estimate and pointwise confidence intervals on a grid using the WARPing method
{mh, mhl, mhu} = regcb(x {,h {,alpha {,K} {,d}}})
computes the kernel regression and uniform confidence bands on a grid using the WARPing method
{mh, mhl, mhu} = regxci(x {,h {,alpha {,K} {,d}}})
computes the kernel regression estimate and pointwise confidence intervals for all observations or on a grid v by exact computation
{mh, mhl, mhu} = regxcb(x {,h {,alpha {,K} {,d}}})
computes the kernel regression and uniform confidence bands for all observations or on a grid v by exact computation

As in the case of density estimation, it can be shown that the regression estimates $ \widehat{m}_h(x)$ have an asymptotic normal distribution. Suppose that $ m$ and $ f$ (the density of the explanatory variable $ X$) are twice differentiable, and that $ h=cn^{-1/5}$. Then

$\displaystyle n^{2/5} \left\{\widehat{m}_{h}(x)-m(x)\right\} \mathrel{\mathop{\longrightarrow}\limits^{L}} N\left( \frac{c^{2}}{2}\,\mu_{2}(K)\,b(x)\,,\; \frac{1}{c}\,\frac{\sigma^2(x)}{f(x)}\,\Vert K \Vert^{2}_{2} \right),$ (6.15)

with $ \mu_{2}(K)=\int u^{2}K(u)\,du$ and

$\displaystyle b(x)=m''(x)+2\,\frac{m'(x)f'(x)}{f(x)}. $

Thus, similar to the density case, an approximate confidence interval can be obtained by

$\displaystyle \left[\widehat{m}_h (x) - z_{1-\frac{\alpha}{2}} \sqrt{\frac{ \widehat{\sigma}_h^2 (x)\Vert K \Vert^2_2 }{nh\widehat{f}_h(x)}}\,,\;\; \widehat{m}_h (x) + z_{1-\frac{\alpha}{2}} \sqrt{\frac{ \widehat{\sigma}_h^2 (x)\Vert K \Vert^2_2 }{nh\widehat{f}_h(x)}}\, \right]\,,$ (6.16)

where $ z_{1-\frac{\alpha}{2}}$ is the $ (1-\frac{\alpha}{2})$-quantile of the standard normal distribution and the estimate of the variance $ \textrm{Var}(Y\vert x)= \sigma^2(x)$ is given by

$\displaystyle \widehat{\sigma}_h^2(x)=
\frac{\sum\limits_{i=1}^n K\left(\frac{x_{i}-x}{h}\right)\{y_{i}-\widehat{m}_h(x)\}^2}
{\sum\limits_{i=1}^n K\left(\frac{x_{i}-x}{h}\right)}\,.$
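
As an illustration of (6.16) together with this variance estimate, the following sketch (Python, not XploRe; the quartic kernel, for which $ \Vert K\Vert_2^2=5/7$, and the use of scipy for the normal quantile are assumptions) computes a pointwise interval at a single point:

  import numpy as np
  from scipy.stats import norm

  def quartic(u):
      return np.where(np.abs(u) <= 1, 15.0 / 16 * (1 - u**2)**2, 0.0)

  K_NORM_SQ = 5.0 / 7.0   # ||K||_2^2 of the quartic kernel

  def nw_pointwise_ci(x0, x, y, h, alpha=0.05):
      n = len(x)
      w = quartic((x - x0) / h)
      mh = np.sum(w * y) / np.sum(w)              # Nadaraya-Watson estimate at x0
      fh = np.sum(w) / (n * h)                    # kernel density estimate f_h(x0)
      s2 = np.sum(w * (y - mh)**2) / np.sum(w)    # conditional variance estimate
      half = norm.ppf(1 - alpha / 2) * np.sqrt(s2 * K_NORM_SQ / (n * h * fh))
      return mh - half, mh + half                 # interval (6.16)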

Also similar to the density case, uniform confidence bands for $ m(\bullet)$ require rather restrictive assumptions (Härdle; 1990, p. 116). Suppose that $ f$ is a density on $ [0,1]$ and $ h=n^{-\delta}$, $ \delta \in (\frac{1}{5},\frac{1}{2})$. Then, under some regularity conditions, it holds for all $ x \in [0,1]$:

$\displaystyle P\Bigg(\widehat{m}_{h}(x)- z_{n,\alpha}
\sqrt{\frac{\widehat{\sigma}^2_{h}(x)\Vert K \Vert _{2}^{2}}{nh\widehat{f}_{h}(x)}}
\;\leq\; m(x) \;\leq\;
\widehat{m}_{h}(x)+ z_{n,\alpha}
\sqrt{\frac{\widehat{\sigma}^2_{h}(x)\Vert K \Vert _{2}^{2}}{nh\widehat{f}_{h}(x)}}\Bigg)
\mathrel{\mathop{\approx}\limits_{\textrm{\small as.}}^{}} 1-\alpha,$

where

$\displaystyle z_{n,\alpha}=\left\{\frac{-\log\{-\frac12 \log(1-\alpha)\}}{(2\delta\log{n})^{1/2}}+d_{n}\right\}^{1/2},$

$\displaystyle d_{n}=(2\delta\log{n})^{1/2}+(2\delta\log{n})^{-1/2}
\log{\left(\frac{1}{2\pi}\frac{\Vert K'\Vert _{2}}{\Vert K \Vert _{2}}\right)}^{1/2}.$

Pointwise confidence intervals and uniform confidence bands using the WARPing approximation are provided by regci and regcb, respectively. The equivalents for exact computations are regxci and regxcb. The functions regcb and regxcb can be directly applied to the original data $ x_1,\ldots,x_n$; the transformation to $ [0,1]$ is performed internally. The following quantlet code extends the above regression function by confidence intervals and confidence bands:

  {mh,mli,mui}=regci(nicfoo,0.18)  ; intervals
  {mh,mlb,mub}=regcb(nicfoo,0.18)  ; bands
  mh =setmask(mh,"line","blue","thick")
  mli=setmask(mli,"line","blue","thin","dashed")
  mui=setmask(mui,"line","blue","thin","dashed")
  mlb=setmask(mlb,"line","blue","thin")
  mub=setmask(mub,"line","blue","thin")
  plot(mh,mli,mui,mlb,mub)
  setgopt(plotdisplay,1,1,"title","Confidence Intervals & Bands")
XLGsmoo11.xpl

Both confidence intervals and bands were computed using bandwidth 0.18. The significance level and the kernel are not specified, so the defaults 0.05 and "qua" are used. The resulting plot is shown in Figure 6.11.

Figure 6.11: Pointwise confidence intervals (thin dashed) and uniform confidence bands (thin solid) in regression.
\includegraphics[scale=0.425]{smootherrcb}


6.2.5 Local Polynomial Regression and Derivative Estimation


mh = lpregest(x {,h {,p {,K} {,d}}})
computes the local polynomial kernel regression on a grid using the WARPing method
mh = lpregxest(x {,h {,p {,v}}})
computes the local polynomial kernel regression estimate for all observations or on a grid v by exact computation

Note that the Nadaraya-Watson estimator is a local constant estimator, i.e. the solution of

$\displaystyle \min_{\beta_{0}} \sum_{i=1}^{n} \left\{ y_{i} - \beta_{0} \right\}^{2} K\left(\frac{x_i-x}{h}\right).$
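
Indeed, setting the derivative of this criterion with respect to $ \beta_{0}$ equal to zero recovers the Nadaraya-Watson formula:

$\displaystyle -2\sum_{i=1}^{n} \left\{ y_{i} - \beta_{0} \right\} K\left(\frac{x_i-x}{h}\right) = 0
\quad\Longrightarrow\quad
\widehat{\beta}_{0} = \frac{\sum\limits_{i=1}^{n} K\left(\frac{x_i-x}{h}\right) y_{i}}{\sum\limits_{i=1}^{n} K\left(\frac{x_i-x}{h}\right)} = \widehat{m}_{h}(x).$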

Replacing $ \beta_{0}$ by a $ p$-th order polynomial in $ x_i-x$ yields the local polynomial kernel regression estimator. For details, see Fan and Gijbels (1996). Let us illustrate this with the example of a local linear regression estimate ($ p=1$). The minimization problem is

$\displaystyle \min_{\beta_{0},\beta_{1}} \sum_{i=1}^{n} \left\{ y_{i} - \beta_{0} - \beta_{1}(x_{i}-x) \right\}^{2} K\left(\frac{x_i-x}{h}\right).$

The solution of the problem can be written as

$\displaystyle \widehat{\beta} = (\widehat{\beta}_{0}, \widehat{\beta}_{1})^{T} = \left( {\cal X}^{T} {\cal W} {\cal X} \right)^{-1} {\cal X}^{T} {\cal W} {\cal Y}$ (6.17)

using the notations

$\displaystyle {\cal X}= \left(\begin{array}{cc}
1&(x_{1}-x)\\
\vdots&\vdots\\
1&(x_{n}-x)
\end{array}\right), \qquad
{\cal Y}= \left(\begin{array}{c}
y_{1}\\ \vdots \\ y_{n}
\end{array}\right),
$

and $ {\cal W}=\textrm{diag}\left\{K\left(\frac{x_{1}-x}{h}\right),\ldots,K\left(\frac{x_{n}-x}{h}\right)\right\}$. In (6.17), $ \widehat{m}_h(x)=\widehat{\beta}_{0}$ estimates the regression function itself, whereas $ \widehat{m}_{1,h}(x)=\widehat{\beta}_{1}$ estimates the first derivative of the regression function.
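
As a compact illustration of (6.17), the following sketch (Python rather than XploRe; the quartic kernel is again an illustrative choice) solves the weighted least squares problem at a single point and returns both the local fit and the slope, i.e. the derivative estimate:

  import numpy as np

  def quartic(u):
      return np.where(np.abs(u) <= 1, 15.0 / 16 * (1 - u**2)**2, 0.0)

  def local_linear(x0, x, y, h):
      # design matrix with columns (1, x_i - x0) and kernel weight matrix W
      X = np.column_stack([np.ones_like(x), x - x0])
      W = np.diag(quartic((x - x0) / h))
      # beta_hat = (X'WX)^{-1} X'WY, cf. (6.17); needs >= 2 observations within the window
      beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
      return beta[0], beta[1]   # m_hat(x0) and its first-derivative estimate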

The functions lpregest and lpregxest for local polynomial regression have essentially the same input as their Nadaraya-Watson equivalents, except that an additional parameter $ p$ specifying the degree of the polynomial can be given. For local polynomial regression, an odd value of $ p$ is recommended, since odd-order local polynomial fits outperform even-order fits.

Derivatives of regression functions are computed with lpderest or lpderxest. For derivative estimation, a polynomial order whose difference from the derivative order is odd should be used. Typically, one uses $ p=1$ (local linear) for the estimation of the regression function and $ p=2$ (local quadratic) for the estimation of its first derivative.

lpregxest and lpderxest automatically use local linear and local quadratic estimation, respectively, if no order is specified. The default kernel function is the quartic kernel "qua". Appropriate bandwidths can be found by means of rules of thumb that replace the unknown regression function by a higher-order polynomial (Fan and Gijbels; 1996). The following quantlet code estimates the regression function and its first derivative by the local polynomial method. Both functions and the data are plotted together in Figure 6.12.

  motcyc=read("motcyc")
  hh=lpregrot(motcyc)     ; rule-of-thumb bandwidth
  hd=lpderrot(motcyc)     ; rule-of-thumb bandwidth
  mh=lpregest(motcyc,hh)  ; local linear regression
  md=lpderest(motcyc,hd)  ; local quadratic derivative
  mh=setmask(mh,"line","black")
  md=setmask(md,"line","blue","dashed")
  xy=setmask(motcyc,"cross","small","red")
  plot(xy,mh,md)
  setgopt(plotdisplay,1,1,"title","Local Polynomial Estimation")
XLGsmoo12.xpl

Figure 6.12: Local linear regression estimate (solid), derivative estimate (dashed), and data.
\includegraphics[scale=0.425]{smootherlld}