4.1 Univariate Kernel Regression

An important question in many fields of science is the relationship between two variables, say $ X$ and $ Y$. Regression analysis is concerned with the question of how $ Y$ (the dependent variable) can be explained by $ X$ (the independent or explanatory or regressor variable). This means a relation of the form

$\displaystyle Y=m(X),$

where $ m(\bullet)$ is a function in the mathematical sense. In many cases theory does not put any restrictions on the form of $ m(\bullet)$, i.e. theory does not say whether $ m(\bullet)$ is linear, quadratic, increasing in $ X$, etc. Hence, it is up to empirical analysis to use data to find out more about $ m(\bullet)$.

4.1.1 Introduction

Let us consider an example from Economics. Suppose $ Y$ is expenditure on potatoes and $ X$ is net-income. If we draw a graph with the quantity of potatoes on the vertical axis and income on the horizontal axis then we have drawn an Engel curve. That is, Engel curves relate optimal quantities of a commodity to income, holding prices constant. If we derive the Engel curve analytically, then it takes the form $ Y=m(X)$, where $ Y$ denotes the quantity of potatoes bought at income level $ X$. Depending on individual preferences, several shapes are possible: the quantity bought may, for instance, increase with income throughout, or increase at low income levels but eventually decrease as income grows further.

Are potatoes inferior goods? There is just one way to find out: collect appropriate data and estimate an Engel curve for potatoes. We can interpret the statement ``potatoes are inferior'' in the sense that, on average, consumers will buy fewer potatoes if their income grows while prices are held constant. The principle that theoretical laws usually do not hold in every individual case but merely on average can be formalized as
$\displaystyle y_{i}$ $\displaystyle =$ $\displaystyle m(x_{i})+ \varepsilon_{i}, \quad i=1,\ldots,n,$ (4.1)
$\displaystyle E(Y \vert X=x)$ $\displaystyle =$ $\displaystyle m(x).$ (4.2)

Equation (4.1) says that the relationship $ Y=m(X)$ need not hold exactly for the $ i$th observation (household) but is ``disturbed'' by the random variable $ \varepsilon$. Yet, (4.2) says that the relationship holds on average, i.e. the expectation of $ Y$, conditional on $ X=x$, is given by $ m(x)$. The goal of the empirical analysis is to use a finite set of observations $ (x_{i},y_{i})$, $ i=1,\ldots,n$, to estimate $ m(\bullet)$.

EXAMPLE 4.1  
In Figure 4.1, we have $ n=7125$ observations of net-income and expenditure on food (not only potatoes), taken from the Family Expenditure Survey of British households in 1973. Graphically, we try to fit an (Engel) curve to the scatterplot of food expenditure versus net-income. Clearly, the graph of the estimate of $ m(\bullet)$ will not run through every point in the scatterplot, i.e. we will not be able to use this graph to perfectly predict the food consumption of every household, given that we know the household's income. But this does not constitute a serious problem (or any problem at all) if you recall that our theoretical statement refers to average behavior. $ \Box$

Figure 4.1: Nadaraya-Watson kernel regression, $ h=0.2$, U.K. Family Expenditure Survey 1973 (quantlet SPMengelcurve1)

Let us point out that, in a parametric approach, it is often assumed that $ m(x)=\alpha + \beta\cdotp x$, and the problem of estimating $ m(\bullet)$ is reduced to the problem of estimating $ \alpha$ and $ \beta$. But note that this approach is not useful in our example. After all, the alleged shape of the Engel curve for potatoes, upward sloping for smaller income levels but eventually downward sloping as income is increased, is ruled out by the specification $ m(x)=\alpha + \beta\cdotp x$. The nonparametric approach does not put such prior restrictions on $ m(\bullet)$. However, as we will see below, there is a price to pay for this flexibility.


4.1.1.1 Conditional Expectation

In this section we will recall two concepts that you should already be familiar with, conditional expectation and conditional expectation function. However, these concepts are central to regression analysis and deserve to be treated accordingly. Let $ X$ and $ Y$ be two random variables with joint probability density function $ f(x,y)$. The conditional expectation of $ Y$ given that $ X=x$ is defined as

$\displaystyle E(Y\vert X=x) = \int y\;f(y\vert x)\;dy = \int y\,\frac{f(x,y)}{f_X(x)}\;dy = m(x),$

where $ f(y\vert x)$ is the conditional probability density function (conditional pdf) of $ Y$ given $ X=x$, and $ f_X(x)$ is the marginal pdf of $ X$. The mean function might be quite nonlinear even for simple-looking densities.

EXAMPLE 4.2  
Consider the roof distribution with joint pdf

$\displaystyle f(x,y)= x+y \qquad \textrm{for} \qquad 0\le x \le 1 \quad
\textrm{and} \quad 0\le y \le 1,$

with $ f(x,y)=0$ elsewhere, and marginal pdf

$\displaystyle f_X(x)=\int_{0}^{1}f(x,y)\;dy=x+\frac{1}{2} \qquad \textrm{for}
\quad 0 \le x \le 1,$

with $ f_X(x)=0$ elsewhere. Hence we get

$\displaystyle E(Y\vert X=x) = \int y\,\frac{f(x,y)}{f_X(x)}\;dy
= \int_{0}^{1} y\,\frac{x+y}{x+\frac{1}{2}}\;dy
= \frac{\frac{1}{2}x+\frac{1}{3}}{x+\frac{1}{2}} = m(x),$

which is obviously a nonlinear function of $ x$. $ \Box$
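The conditional expectation in this example can also be checked numerically. The following sketch (in Python, not part of the original text) compares the analytic formula for $ m(x)$ with a simple numerical integration over $ y$:

\begin{verbatim}
import numpy as np

# Roof distribution of Example 4.2: f(x, y) = x + y on the unit square
def f_joint(x, y):
    return x + y

def f_marginal_x(x):
    # f_X(x) = int_0^1 (x + y) dy = x + 1/2
    return x + 0.5

def m_analytic(x):
    # E(Y | X = x) = (x/2 + 1/3) / (x + 1/2)
    return (0.5 * x + 1.0 / 3.0) / (x + 0.5)

def m_numeric(x, num_points=10001):
    # E(Y | X = x) = int_0^1 y f(x, y) / f_X(x) dy, via the trapezoidal rule
    y = np.linspace(0.0, 1.0, num_points)
    integrand = y * f_joint(x, y) / f_marginal_x(x)
    return np.sum((integrand[:-1] + integrand[1:]) / 2.0) * (y[1] - y[0])

for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"x = {x:4.2f}:  analytic {m_analytic(x):.6f}   numeric {m_numeric(x):.6f}")
\end{verbatim}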

Note that $ E(Y\vert X=x)$ is a function of $ x$ alone. Consequently, we may abbreviate this term as $ m(x)$. If we vary $ x$ we get a set of conditional expectations. This mapping from $ x$ to $ m(x)$ is called the conditional expectation function and is often denoted as $ E(Y \vert X)$. This tells us how $ Y$ and $ X$ are related ``on average''. Therefore, it is of immediate interest to estimate $ m(\bullet)$.


4.1.1.2 Fixed and Random Design

We started the discussion in the preceding section by assuming that both $ X$ and $ Y$ are random variables with joint pdf $ f(x,y)$. The natural sampling scheme in this setup is to draw a random sample from the bivariate distribution that is characterized by $ f(x,y)$. That is, we randomly draw observations of the form $ \{X_i,Y_i\},\ i=1,\ldots,n$. Before the sample is drawn, we can view the $ n$ pairs $ \{X_{i},Y_{i}\}$ as independent and identically distributed pairs of random variables. This sampling scheme will be referred to as the random design.

We will concentrate on the random design in the following derivations. However, there are applications (especially in the natural sciences) where the researcher is able to control the values of the predictor variable $ X$ and $ Y$ is the sole random variable. As an example, imagine an experiment that is supposed to provide evidence for the link between a person's beer consumption ($ X$) and his or her reaction time ($ Y$) in a traffic incident. Here the researcher is able to specify the amount of beer the subject is given before the experiment is conducted. Hence $ X$ will no longer be a random variable, while $ Y$ still will be. This setup is usually referred to as the fixed design. In the fixed design case the density $ f_X(x)$ is known in repeated sampling (it is induced by the researcher). This additional knowledge (relative to the random design case, where $ f_X(x)$ is unknown) will simplify the estimation of $ m(\bullet)$, as well as the derivation of the statistical properties of the estimators used, as we shall see below. A special case of the fixed design model is, for example, the equispaced sequence $ x_i=i/n$, $ i=0,\ldots,n$, on $ [0,1]$.

4.1.2 Kernel Regression

As we just mentioned, kernel regression estimators depend on the type of the design.


4.1.2.1 Random Design

The derivation of the estimator in the random design case starts with the definition of conditional expectation:

$\displaystyle m(x)=E(Y\vert X=x)=\int y\,\frac{f(x,y)}{f_X(x)}\;dy =\frac{\int y\;f(x,y)\;dy}{f_X(x)}.$ (4.3)

Given that we have observations of the form $ \{X_{i},Y_{i}\},\ i=1,\ldots,n$, the only unknown quantities on the right hand side of (4.3) are $ f(x,y)$ and $ f_X(x)$. From our discussion of kernel density estimation we know how to estimate probability density functions. Consequently, we plug in kernel estimates for $ f_X(x)$ and $ f(x,y)$ in (4.3). Estimating $ f_X(x)$ is straightforward. To estimate $ f(x,y)$ we employ the multiplicative kernel density estimator (with product kernel) of Section 3.6

$\displaystyle \widehat{f}_{h,g}(x,y)=\frac{1}{n}\sum_{i=1}^{n} K_h\left(x-X_{i}\right) K_g\left(y-Y_{i}\right).$ (4.4)

Hence, for the numerator of (4.3) we get
$\displaystyle \int y\;\widehat{f}_{h,g}(x,y)\; dy$ $\displaystyle =$ $\displaystyle \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\left(\frac{x-X_{i}}{h}\right) \int \frac{y}{g}\, K\left(\frac{y-Y_{i}}{g}\right) dy$ (4.5)
  $\displaystyle =$ $\displaystyle \frac{1}{n} \sum_{i=1}^{n} K_{h}(x-X_{i}) \int (sg+Y_{i})\, K(s)\;ds$  
  $\displaystyle =$ $\displaystyle \frac{1}{n} \sum_{i=1}^{n} K_{h}(x-X_{i})\,Y_{i},$  

where we substituted $ s=(y-Y_{i})/g$ and used the facts that kernel functions integrate to 1 and are symmetric around zero. Plugging this numerator and the kernel density estimate of $ f_X(x)$ into (4.3) leads to the Nadaraya-Watson estimator introduced by Nadaraya (1964) and Watson (1964)

$\displaystyle \widehat{m}_h(x) = \frac{n^{-1} \sum^n_{i=1} K_h(x-X_i) Y_i} {n^{-1} \sum^n_{j=1} K_h(x-X_j)}\,$ (4.6)

which is the natural extension of kernel estimation to the problem of estimating an unknown conditional expectation function. Several points are noteworthy:

Figure 4.2: Four kernel regression estimates for the 1973 U.K. Family Expenditure data with bandwidths $ h=0.05$, $ h=0.1$, $ h=0.2$, and $ h=0.5$ (quantlet SPMregress)
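The Nadaraya-Watson estimator (4.6) is easy to implement. The following sketch (in Python, with a Gaussian kernel and simulated data standing in for the Family Expenditure data, which are not reproduced here) evaluates $ \widehat{m}_h$ on a grid for the four bandwidths used in Figure 4.2:

\begin{verbatim}
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def nadaraya_watson(x_grid, X, Y, h, kernel=gaussian_kernel):
    """Evaluate the Nadaraya-Watson estimator (4.6) on x_grid."""
    # weights[i, j] = K_h(x_grid[i] - X[j])
    weights = kernel((x_grid[:, None] - X[None, :]) / h) / h
    return weights @ Y / weights.sum(axis=1)

# simulated data in place of the Engel curve example, cf. model (4.1)
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(0.0, 3.0, n)                  # "net-income"
m = lambda x: np.sin(x) + 0.5 * x             # some smooth regression function
Y = m(X) + rng.normal(scale=0.3, size=n)      # disturbed by noise

x_grid = np.linspace(0.0, 3.0, 200)
for h in (0.05, 0.1, 0.2, 0.5):               # bandwidths as in Figure 4.2
    m_hat = nadaraya_watson(x_grid, X, Y, h)
    print(f"h = {h}: max abs error on the grid = {np.max(np.abs(m_hat - m(x_grid))):.3f}")
\end{verbatim}

As in Figure 4.2, small bandwidths produce a wiggly estimate while large bandwidths oversmooth; the bias-variance trade-off behind this is made precise in Section 4.1.2.3.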


4.1.2.2 Fixed Design

In the fixed design model, $ f_X(x)$ is assumed to be known and a possible kernel estimator for this sampling scheme employs weights of the form

$\displaystyle W_{hi}^{FD}(x)=\frac{K_{h}(x-x_{i})}{f_X(x)}.$ (4.8)

Thus, estimators for the fixed design case have a simpler structure, and their statistical properties are easier to analyze.

Since our main interest is the random design case, we will only mention a very particular fixed design kernel regression estimator: For the case of ordered design points $ x_{(i)}$, $ i=1,\ldots,n,$ from some interval $ [a,b]$ Gasser & Müller (1984) suggested the following weight sequence

$\displaystyle W_{hi}^{GM}(x)=n \int_{s_{i-1}}^{s_{i}}K_{h}(x-u)du,$ (4.9)

where $ s_i=\left(x_{(i)}+x_{(i+1)}\right)/2$, $ s_0=a$, $ s_{n+1}=b$. Note that as for the Nadaraya-Watson estimator, the weights $ W_{hi}^{GM}(x)$ sum to 1.

To show how the weights (4.9) are related to the intuitively appealing formula (4.8) note that by the mean value theorem

$\displaystyle (s_{i}-s_{i-1})\;K_{h}(x-\xi)=\int_{s_{i-1}}^{s_{i}}K_{h}(x-u)\;du$ (4.10)

for some $ \xi$ between $ s_{i}$ and $ s_{i-1}$. Moreover,

$\displaystyle n(s_{i}-s_{i-1}) \approx \frac{1}{f_X(x)}\,.$ (4.11)

Plugging (4.10) and (4.11) into (4.8) gives

$\displaystyle W_{hi}^{FD}(x)
= \frac{ K_{h}(x-x_{(i)}) }{f_X(x)}
\approx n \int_{s_{i-1}}^{s_{i}}K_{h}(x-u)\;du
= W_{hi}^{GM}(x).$

We will meet the Gasser-Müller estimator $ \frac{1}{n}\sum_{i=1}^{n}W_{hi}^{GM}(x)Y_{i}$ again in the following section where the statistical properties of kernel regression estimators are discussed.
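A small sketch may help to see how the Gasser-Müller weights (4.9) are computed in practice. The example below (Python, Gaussian kernel, simulated equispaced design; all of these are illustrative choices, not taken from the text) uses the Gaussian cdf to evaluate the integrals $ \int_{s_{i-1}}^{s_{i}}K_{h}(x-u)\,du$ in closed form:

\begin{verbatim}
import numpy as np
from scipy.stats import norm

def gasser_mueller(x, x_design, Y, h, a=0.0, b=1.0):
    """Gasser-Mueller estimate (1/n) sum_i W_hi^GM(x) Y_i at a single point x.
    x_design is assumed ordered, x_(1) <= ... <= x_(n), as in the text."""
    n = len(x_design)
    # s_0 = a, s_i = (x_(i) + x_(i+1)) / 2, last midpoint set to b
    s = np.concatenate(([a], (x_design[:-1] + x_design[1:]) / 2.0, [b]))
    # int_{s_{i-1}}^{s_i} K_h(x - u) du for a Gaussian kernel, via its cdf
    integrals = norm.cdf((x - s[:-1]) / h) - norm.cdf((x - s[1:]) / h)
    weights = n * integrals                   # W_hi^GM(x), cf. (4.9)
    return np.mean(weights * Y)

# equispaced fixed design x_i = i/n on [0, 1]
n = 200
x_design = np.arange(1, n + 1) / n
m = lambda x: np.sin(2.0 * np.pi * x)
rng = np.random.default_rng(1)
Y = m(x_design) + rng.normal(scale=0.2, size=n)

print("estimate at x = 0.5:", gasser_mueller(0.5, x_design, Y, h=0.05))
print("true value m(0.5): ", m(0.5))
\end{verbatim}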


4.1.2.3 Statistical Properties

Are kernel regression estimators consistent? In the previous chapters we showed that an estimator is consistent by deriving its mean squared error ($ \mse$), showing that the $ \mse$ converges to zero, and appealing to the fact that convergence in mean square implies convergence in probability (the latter being the condition stated in the definition of consistency).

Moreover, the $ \mse$ helped in assessing the speed with which convergence is attained. In the random design case it is very difficult to derive the $ \mse$ of the Nadaraya-Watson estimator since it is the ratio (and not the sum) of two estimators. It turns out that one can show that the Nadaraya-Watson estimator is consistent in the random design case without explicit recourse to its $ \mse$. The conditions under which this result holds are summarized in the following theorem:

THEOREM 4.1  
Assume the univariate random design model and the regularity conditions
$ \int \vert K(u)\vert du < \infty$, $ u\;K(u)\to 0$ for $ \vert u\vert\rightarrow \infty$, and $ EY^{2} < \infty$. Suppose also $ h\to 0$, $ nh\to\infty$, then

$\displaystyle \frac{1}{n}\sum_{i=1}^{n}W_{hi}(x)Y_{i}=\widehat{m}_h(x)\mathrel{\mathop{\longrightarrow}\limits_{}^{P}}m(x) $

provided that $ f_X(x)>0$ and $ x$ is a point of continuity of $ m(x)$, $ f_X(x)$, and $ \sigma^2(x)=\mathop{\mathit{Var}}(Y\vert X=x)$.

The proof involves showing that -- considered separately -- both the numerator and the denominator of $ \widehat{m}_h(x)$ converge. Then, as a consequence of Slutsky's theorem, it can be shown that $ \widehat{m}_h(x)$ converges. For more details see Härdle (1990, p. 39ff).

Certainly, we would like to know the speed with which the estimator converges but we have already pointed out that the $ \mse$ of the Nadaraya-Watson estimator in the random design case is very hard to derive. For the fixed design case, Gasser & Müller (1984) have derived the $ \mse$ of the estimator named after them:

THEOREM 4.2  
Assume the univariate fixed design model and the conditions: $ K(\bullet)$ has support $ [-1,1]$ with $ K(-1)=K(1)=0$, $ m(\bullet)$ is twice continuously differentiable, and $ \max_{i}\vert x_{i}-x_{i-1}\vert=O(n^{-1})$. Assume also that $ \mathop{\mathit{Var}}(\varepsilon_{i})=\sigma^{2}$, $ i=1,\ldots,n$. Then, under $ h\to 0$ and $ nh\to\infty$, it holds that

$\displaystyle \mse\left\{ \frac{1}{n}\sum_{i=1}^{n}W_{hi}^{GM}(x)Y_{i}
\right\} \approx \frac{1}{nh}\, \sigma^{2}\, \Vert K\Vert^2_{2}
+ \frac{h^{4}}{4}\, \mu^2_{2}(K)\, \{m''(x)\}^{2}.$

As usual, the (asymptotic) $ \mse$ has two components, the variance term $ \sigma^{2} \Vert K\Vert^2_{2}/(nh)$ and the squared bias term $ h^{4} \mu_{2}^2(K) \{m''(x)\}^{2}/4$. Hence, if we increase the bandwidth $ h$ we face the familiar trade-off: the variance decreases while the squared bias increases.
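The structure of this trade-off is easy to explore numerically. The following sketch (Python; the kernel constants are those of the quartic kernel, for which $ \Vert K\Vert^2_{2}=5/7$ and $ \mu_{2}(K)=1/7$, and the values of $ \sigma^{2}$ and $ m''(x)$ are purely illustrative) evaluates the asymptotic $ \mse$ formula of Theorem 4.2 over a grid of bandwidths:

\begin{verbatim}
import numpy as np

# Asymptotic MSE of the Gasser-Mueller estimator (Theorem 4.2):
#   amse(h) = sigma^2 ||K||_2^2 / (n h)  +  (h^4 / 4) mu_2(K)^2 m''(x)^2
K_norm_sq = 5.0 / 7.0      # ||K||_2^2 for the quartic kernel
mu2_K = 1.0 / 7.0          # mu_2(K) for the quartic kernel
sigma2 = 0.25              # illustrative error variance
m2 = 2.0                   # illustrative value of m''(x)
n = 1000

def amse(h):
    variance = sigma2 * K_norm_sq / (n * h)
    bias_sq = h**4 / 4.0 * mu2_K**2 * m2**2
    return variance + bias_sq

h_grid = np.linspace(0.01, 1.0, 1000)
h_best = h_grid[np.argmin(amse(h_grid))]
h_formula = (sigma2 * K_norm_sq / (mu2_K**2 * m2**2 * n)) ** 0.2
print(f"amse-minimizing bandwidth on the grid: {h_best:.3f}")
print(f"closed-form minimizer ~ n^(-1/5):      {h_formula:.3f}")
\end{verbatim}

Varying $ n$ in this sketch shows the familiar $ n^{-1/5}$ behavior of the optimal bandwidth, which is derived formally below.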

To get a similar result for the random design case, we linearize the Nadaraya-Watson estimator. Write it as the ratio

$\displaystyle \widehat{m}_{h}(x)
= \frac{\widehat{r}_{h}(x)}{\widehat{f}_{h}(x)},
\qquad \widehat{r}_{h}(x)=\frac{1}{n}\sum_{i=1}^{n}K_{h}(x-X_{i})\,Y_{i},$

where $ \widehat{f}_{h}(x)$ denotes the kernel density estimator of $ f_X(x)$;

thus
$\displaystyle \widehat{m}_{h}(x) - m(x)$ $\displaystyle =$ $\displaystyle \left\{\frac{\widehat{r}_{h}(x)}{\widehat{f}_{h}(x)}-m(x)\right\} \left[\frac{\widehat{f}_{h}(x)}{f_X(x)}+\left\{1-\frac{\widehat{f}_{h}(x)}{f_X(x)}\right\}\right]$  
  $\displaystyle =$ $\displaystyle \frac{\widehat{r}_{h}(x)-m(x)\,\widehat{f}_{h}(x)}{f_X(x)}
+\{\widehat{m}_{h}(x)-m(x)\}\, \frac{f_X(x)-\widehat{f}_{h}(x)}{f_X(x)}.$ (4.12)

It can be shown that of the two terms on the right hand side, the first term is the leading term in the distribution of $ \widehat{m}_{h}(x) - m(x)$, whereas the second term can be neglected. Hence, the $ \mse$ of $ \widehat{m}_{h}(x)$ can be approximated by calculating

$\displaystyle E\left\{\frac{\widehat{r}_{h}(x)-m(x)\,\widehat{f}_{h}(x)}{f_X(x)}\right\}^{2}.$

The following theorem can be derived this way:

THEOREM 4.3  
Assume the univariate random design model and the conditions
$ \int \vert K(u)\vert du < \infty$, $ u\,K(u)\to 0$ for $ \vert u\vert\rightarrow \infty$, and $ EY^{2} < \infty$ hold. Suppose $ h\to 0$, $ nh\to\infty$, then

$\displaystyle \mse\{\widehat{m}_{h}(x)\} \approx \underbrace{\frac{1}{nh}\;\frac{\sigma^{2}(x)}{f_X(x)}\;\Vert K\Vert^{2}_{2}}_{\textrm{variance part}} + \underbrace{\frac{h^{4}}{4}\left\{m''(x)+2\frac{m'(x)f'_X(x)}{f_X(x)}\right\}^{2} \mu^{2}_{2}(K) }_{\textrm{bias part}},$ (4.13)

provided that $ f_X(x)>0$ and $ x$ is a point of continuity of $ m'(x)$, $ m''(x)$, $ f_X(x)$, $ f'_X(x)$, and $ \sigma^2(x)=\mathop{\mathit{Var}}(Y\vert X=x)$.

Let $ \amse$ denote the asymptotic $ \mse$. Most components of formula (4.13) are constant with respect to $ n$ and $ h$; denoting the constant factors of the variance and bias parts by $ C_{1}$ and $ C_{2}$, respectively, we may write

$\displaystyle \amse(n,h)=\frac{1}{nh}C_{1}+h^{4}C_{2}.$

Minimizing this expression with respect to $ h$ gives the optimal bandwidth $ h_{opt}\sim n^{-1/5}$. If you plug a bandwidth $ h\sim n^{-1/5}$ into (4.13), you will find that the $ \amse$ is of order $ O(n^{-4/5})$, a rate of convergence that is slower than the rate obtained by the LS estimator in linear regression but is the same as for estimating a density function (cf. Section 3.2).
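For completeness, here is the short calculation behind these two rates. Setting the derivative of $ \amse(n,h)$ with respect to $ h$ equal to zero gives

$\displaystyle \frac{\partial}{\partial h}\amse(n,h)=-\frac{1}{nh^{2}}\,C_{1}+4h^{3}C_{2}=0
\quad\Longleftrightarrow\quad
h_{opt}=\left(\frac{C_{1}}{4C_{2}}\right)^{1/5}n^{-1/5},$

and plugging $ h_{opt}$ back into $ \amse(n,h)$ yields $ \amse(n,h_{opt})=\left\{C_{1}^{4/5}(4C_{2})^{1/5}+C_{2}^{1/5}(C_{1}/4)^{4/5}\right\}n^{-4/5}$, so that both the variance part and the squared bias part are indeed of order $ n^{-4/5}$.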

As in the density estimation case, $ \amse$ depends on unknown quantities like $ \sigma^{2}(x)$ or $ m''(x)$. Once more, we are faced with the problem of finding a bandwidth-selection rule that has desirable theoretical properties and is applicable in practice. We have displayed Nadaraya-Watson kernel regression estimates with different bandwidths in Figure 4.2. The issue of bandwidth selection will be discussed later on in Section 4.3.

4.1.3 Local Polynomial Regression and Derivative Estimation    

The Nadaraya-Watson estimator can be seen as a special case of a larger class of kernel regression estimators: Nadaraya-Watson regression corresponds to a local constant least squares fit. To motivate local linear and higher order local polynomial fits, let us first consider a Taylor expansion of the unknown conditional expectation function $ m(\bullet)$:

$\displaystyle m(t) \approx m(x) + m'(x) (t - x) + \cdots + m^{(p)}(x) (t - x)^p \frac{1}{p!}$ (4.14)

for $ t$ in a neighborhood of the point $ x$. This suggests local polynomial regression, namely to fit a polynomial in a neighborhood of $ x$. The neighborhood is realized by including kernel weights into the minimization problem

$\displaystyle \min_{\beta} \sum_{i=1}^n \left \{ Y_i - \beta_0 - \beta_1 (X_i - x) - \ldots -\beta_p (X_i - x)^p \right\}^2 K_h(x-X_i),$ (4.15)

where $ \beta$ denotes the vector of coefficients $ (\beta_{0},\beta_{1},\ldots,\beta_{p})^\top$. The result is therefore a weighted least squares estimator with weights $ K_h(x-X_i)$. Using the notations

$\displaystyle {\mathbf{X}}= \left(\begin{array}{ccccc}
1&X_{1}-x&(X_{1}-x)^2&\ldots&(X_{1}-x)^p\\
1&X_{2}-x&(X_{2}-x)^2&\ldots&(X_{2}-x)^p\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
1&X_{n}-x&(X_{n}-x)^2&\ldots&(X_{n}-x)^p
\end{array}\right), \qquad
{\boldsymbol{Y}}= \left(\begin{array}{c}
Y_{1}\\ Y_{2}\\ \vdots \\ Y_{n}
\end{array}\right),$

$\displaystyle {\mathbf{W}}= \left(\begin{array}{cccc}
K_{h}(x-X_{1})&0 &\ldots&0\\
0 &K_{h}(x-X_{2})&\ldots&0\\
\vdots&\vdots&\ddots&\vdots\\
0 &0 &\ldots&K_{h}(x-X_{n})
\end{array}\right),$

we can compute $ \widehat\beta$ which minimizes (4.15) by the usual formula for a weighted least squares estimator

$\displaystyle \widehat\beta(x) = \left( {\mathbf{X}}^\top {\mathbf{W}}{\mathbf{X}}\right)^{-1} {\mathbf{X}}^\top {\mathbf{W}}{\boldsymbol{Y}}.$ (4.16)

It is important to note that -- in contrast to parametric least squares -- this estimator varies with $ x$. Hence, this is really a local regression at the point $ x$. Denote the components of $ \widehat\beta(x)$ by $ \widehat{\beta}_0(x)$, ..., $ \widehat{\beta}_p(x)$. The local polynomial estimator of the regression function $ m$ is

$\displaystyle \widehat{m}_{p,h}(x) = \widehat{\beta}_0 (x)$ (4.17)

since comparing (4.14) and (4.15) shows that $ m(x)\approx\beta_{0}(x)$. The whole curve $ \widehat{m}_{p,h}(\bullet)$ is obtained by running the above local polynomial regression with varying $ x$. We have included the parameter $ h$ in the notation since the final estimator obviously depends on the bandwidth $ h$, just as the Nadaraya-Watson estimator does.
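The weighted least squares formula (4.16) translates directly into code. The sketch below (Python, Gaussian kernel, simulated data; these choices are illustrative and not taken from the text) computes $ \widehat\beta(x)$ and evaluates the local linear fit $ \widehat{m}_{1,h}$ on a grid:

\begin{verbatim}
import numpy as np

def local_polynomial(x, X, Y, h, p=1):
    """Local polynomial fit at the point x: returns beta_hat(x) from (4.16).
    beta_hat[0] estimates m(x); see (4.22) for derivatives."""
    # design matrix with columns 1, (X_i - x), ..., (X_i - x)^p
    D = np.vander(X - x, N=p + 1, increasing=True)
    # Gaussian kernel weights K_h(x - X_i)
    w = np.exp(-0.5 * ((x - X) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    # weighted least squares (X'WX)^{-1} X'WY, without forming W explicitly
    DW = D * w[:, None]
    return np.linalg.solve(DW.T @ D, DW.T @ Y)

rng = np.random.default_rng(2)
n = 500
X = rng.uniform(0.0, 3.0, n)
m = lambda x: np.sin(x) + 0.5 * x
Y = m(X) + rng.normal(scale=0.3, size=n)

x_grid = np.linspace(0.1, 2.9, 50)
m_hat = np.array([local_polynomial(x, X, Y, h=0.2, p=1)[0] for x in x_grid])
print("max abs error of the local linear fit:",
      np.max(np.abs(m_hat - m(x_grid))))
\end{verbatim}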

Let us gain some more insight into this by computing the estimators for special values of $ p$. For $ p=0$, $ \widehat\beta$ reduces to $ \widehat\beta_{0}$, which means that the local constant estimator is nothing other than our well-known Nadaraya-Watson estimator, i.e.

$\displaystyle \widehat{m}_{0,h}(x)=\widehat{m}_{h}(x)
= \frac{ \sum_{i=1}^n K_{h}(x-X_{i})\, Y_{i} }
{ \sum_{i=1}^n K_{h}(x-X_{i})}.$

Now turn to $ p=1$. Denote
$\displaystyle S_{h,j}(x)$ $\displaystyle =$ $\displaystyle \sum_{i=1}^n K_{h}(x-X_{i}) (X_{i}-x)^j,$  
$\displaystyle T_{h,j}(x)$ $\displaystyle =$ $\displaystyle \sum_{i=1}^n K_{h}(x-X_{i}) (X_{i}-x)^j Y_{i},$  

then we can write

$\displaystyle \widehat\beta(x) = \left(\begin{array}{cc} S_{h,0}(x)& S_{h,1}(x)\\ S_{h,1}(x)& S_{h,2}(x)\end{array}\right)^{-1} \left(\begin{array}{c} T_{h,0}(x)\\ T_{h,1}(x)\end{array}\right)$ (4.18)

which yields the local linear estimator

$\displaystyle \widehat m_{1,h}(x)= \widehat\beta_{0}(x) = \frac{T_{h,0}(x)\;S_{h,2}(x)-T_{h,1}(x)\;S_{h,1}(x)} {S_{h,0}(x)\;S_{h,2}(x)-S^2_{h,1}(x)}.$ (4.19)

Here we used the usual matrix inversion formula for $ 2\times 2$ matrices. Of course, (4.18) can be generalized to arbitrarily large $ p$. The general formula is

$\displaystyle \widehat\beta(x) = \left(\begin{array}{llll}
S_{h,0}(x)& S_{h,1}(x)&\ldots&S_{h,p}(x)\\
S_{h,1}(x)& S_{h,2}(x)&\ldots&S_{h,p+1}(x)\\
\vdots&\vdots&\ddots&\vdots\\
S_{h,p}(x)& S_{h,p+1}(x)&\ldots&S_{h,2p}(x)
\end{array}\right)^{-1} \left(\begin{array}{c} T_{h,0}(x)\\ T_{h,1}(x)\\ \vdots\\ T_{h,p}(x)\end{array}\right).$ (4.20)

Introducing the notation $ {\boldsymbol{e}}_{0}=(1,0,\ldots,0)^\top$ for the first unit vector in $ \mathbb{R}^{p+1}$, we can write the local linear estimator as

$\displaystyle \widehat{m}_{1,h}(x)={\boldsymbol{e}}_{0}^\top
\left({\mathbf{X}}^\top{\mathbf{W}}{\mathbf{X}}\right)^{-1} {\mathbf{X}}^\top{\mathbf{W}}{\boldsymbol{Y}}.$

Note that the Nadaraya-Watson estimator could also be written as

$\displaystyle \widehat{m}_{h}(x)=\frac{T_{h,0}(x)}{S_{h,0}(x)}.$
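As a quick numerical sanity check (not part of the original text), the following Python sketch verifies at a single point that the matrix formula (4.16) reduces, for $ p=0$, to the Nadaraya-Watson ratio $ T_{h,0}(x)/S_{h,0}(x)$ and, for $ p=1$, to the closed form (4.19); a Gaussian kernel and simulated data are used for illustration:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 3.0, 300)
Y = np.sin(X) + rng.normal(scale=0.3, size=300)
x, h = 1.5, 0.2

w = np.exp(-0.5 * ((x - X) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))  # K_h(x - X_i)
S = lambda j: np.sum(w * (X - x) ** j)        # S_{h,j}(x)
T = lambda j: np.sum(w * (X - x) ** j * Y)    # T_{h,j}(x)

def wls_beta(p):
    """Solve the weighted least squares problem (4.16) for order p."""
    D = np.vander(X - x, N=p + 1, increasing=True)
    return np.linalg.solve((D * w[:, None]).T @ D, (D * w[:, None]).T @ Y)

# p = 0: matrix formula equals the Nadaraya-Watson ratio T_{h,0}/S_{h,0}
print(np.isclose(wls_beta(0)[0], T(0) / S(0)))
# p = 1: matrix formula equals the closed form (4.19)
ll_closed = (T(0) * S(2) - T(1) * S(1)) / (S(0) * S(2) - S(1) ** 2)
print(np.isclose(wls_beta(1)[0], ll_closed))
\end{verbatim}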

EXAMPLE 4.3  
The local linear estimator $ \widehat m_{1,h}$ for our running example is displayed in Figure 4.3. What can we conclude from comparing this fit with the Nadaraya-Watson fit in Figure 4.1? The most noticeable difference is that the local linear fit reacts more sensitively at the boundaries of the estimation interval.

Another graphical difference will appear when we compare local linear and Nadaraya-Watson estimates with optimized bandwidths (see Section 4.3). There we will see that the local linear fit is less influenced by outliers, such as those which cause the ``bump'' in the right part of both Engel curves. $ \Box$

Figure 4.3: Local polynomial regression, $ p=1$, $ h=0.2$, U.K. Family Expenditure Survey 1973 (quantlet SPMlocpolyreg)

Here we can discuss this effect by looking at the asymptotic $ \mse$ of the local linear regression estimator:

$\displaystyle \amse\{\widehat{m}_{1,h}(x)\} = \frac{1}{nh}\;\frac{\sigma^{2}(x)...
...;\Vert K\Vert^{2}_{2} +\frac{h^{4}}{4}\left\{m''(x)\right\}^{2} \mu^{2}_{2}(K).$ (4.21)

This formula is dealt with in more detail when we come to multivariate regression, see Section 4.5. The $ \amse$ in the local linear case differs from that for the Nadaraya-Watson estimator (4.13) only with regard to the bias. It is easy to see that the bias of the local linear fit is design-independent and disappears when $ m(\bullet)$ is linear. Thus, a local linear fit can improve the function estimation in regions with sparse observations, for instance in the high net-income region in our Engel curve example. Let us also mention that the bias of the local linear estimator has the same form as that of the Gasser-Müller estimator, i.e. the bias in the fixed design case.

The local linear estimator achieves further improvement in the boundary regions. In the case of Nadaraya-Watson estimates we typically observe problems due to the one-sided neighborhoods at the boundaries. The reason is that in local constant modeling, essentially the same (one-sided) set of points is used to estimate the curve near the boundary. Local polynomial regression compensates for this asymmetry by fitting a higher degree polynomial there.

For estimating regression functions, the order $ p$ is usually taken to be one (local linear) or three (local cubic regression). As we have seen, the local linear fit performs (asymptotically) better than the Nadaraya-Watson estimator (local constant). This holds generally: Odd order fits outperform even order fits. Some additional remarks should be made in summary:

A further advantage of the local polynomial approach is that it provides an easy way of estimating derivatives of the function $ m(\bullet)$. The natural approach would be to estimate $ m$ by $ \widehat m$ and then to compute the derivative $ \widehat{m}'$. But an alternative and more efficient method is obtained by comparing (4.14) and (4.15) again. From this we get the local polynomial derivative estimator

$\displaystyle \widehat{m}^{(\nu)}_{p,h}(x) = \nu{!}\, \widehat\beta_{\nu}(x)$ (4.22)

for the $ \nu$th derivative of $ m(\bullet)$. Usually the order of the polynomial is $ p=\nu+1$ or $ p=\nu+3$ in analogy to the regression case (recall that the zero derivative of a function is always the function itself). Also in analogy, the ``odd'' order $ p=\nu+2\ell+1$ outperforms the ``even'' order $ p=\nu+2\ell$.
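The derivative estimator (4.22) reuses the same weighted least squares machinery; only a different component of $ \widehat\beta(x)$ is read off. The sketch below (Python, Gaussian kernel, simulated data, and a fixed illustrative bandwidth rather than the rule-of-thumb bandwidth referred to in Example 4.4) estimates the first derivative via a local quadratic fit ($ \nu=1$, $ p=2$):

\begin{verbatim}
import numpy as np
from math import factorial

def local_poly_derivative(x, X, Y, h, nu=1, p=2):
    """Estimate m^(nu)(x) by nu! * beta_hat_nu(x), cf. (4.22)."""
    D = np.vander(X - x, N=p + 1, increasing=True)
    w = np.exp(-0.5 * ((x - X) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    beta = np.linalg.solve((D * w[:, None]).T @ D, (D * w[:, None]).T @ Y)
    return factorial(nu) * beta[nu]

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 3.0, 1000)
Y = np.sin(X) + rng.normal(scale=0.3, size=1000)

x_grid = np.linspace(0.3, 2.7, 25)
m1_hat = np.array([local_poly_derivative(x, X, Y, h=0.3) for x in x_grid])
print("max abs error of the derivative estimate:",
      np.max(np.abs(m1_hat - np.cos(x_grid))))   # true derivative is cos(x)
\end{verbatim}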

EXAMPLE 4.4  
To estimate the first derivative ($ \nu=1$) of our Engel curve we could take $ p=2$ (local quadratic derivative estimator). This was done to obtain Figure 4.4. Note that we have used a rule-of-thumb bandwidth here, see Fan & Müller (1995, p. 92) and Fan & Gijbels (1996, p. 111). $ \Box$

Figure 4.4: Local polynomial regression ($ p=1$) and derivative estimation ($ p=2$), $ h$ chosen by rule of thumb, U.K. Family Expenditure Survey 1973 (quantlet SPMderivest)