An important question in many fields of science is the relationship between two variables, say $X$ and $Y$. Regression analysis is concerned with the question of how $Y$ (the dependent variable) can be explained by $X$ (the independent, explanatory, or regressor variable). This means a relation of the form
$$Y = m(X),$$
where $m(\bullet)$ is a function in the mathematical sense. In many cases theory does not put any restrictions on the form of $m(\bullet)$, i.e. theory does not say whether $m(\bullet)$ is linear, quadratic, increasing in $X$, etc. Hence, it is up to empirical analysis to use data to find out more about $m(\bullet)$.
Let us consider an example from Economics.
Suppose $Y$ is expenditure on potatoes and $X$ is net-income. If we draw a graph with quantity of potatoes on the vertical axis and income on the horizontal axis then we have drawn an Engel curve. Engel curves relate optimal quantities of a commodity to income, holding prices constant. If we derive the Engel curve analytically, then it takes the form $Y = m(X)$, where $Y$ denotes the quantity of potatoes bought at income level $X$. Depending on individual preferences several possibilities arise:
- The Engel curve slopes upward, i.e. $m(\bullet)$ is an increasing function of $X$. As income increases the consumer buys more of the good. In this case the good is said to be normal. A special case of an upward sloping Engel curve is a straight line through the origin, i.e. $m(\bullet)$ is a linear function.
- The Engel curve slopes downward or eventually slopes downward
(after sloping upward first) as income grows.
If the Engel curve slopes downward the good is said to be inferior.
Are potatoes inferior goods? There is just one way to find out:
collect appropriate data and estimate an Engel curve for potatoes.
We can interpret the statement ``potatoes are inferior''
in the sense that, on average, consumers will buy fewer potatoes
if their income grows while prices are held constant.
The principle that theoretical laws usually do not hold in every individual case but merely on average can be formalized as
$$y_i = m(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{4.1}$$
$$E(Y|X=x) = m(x). \tag{4.2}$$
Equation (4.1) says that the relationship $Y = m(X)$ doesn't need to hold exactly for the $i$th observation (household) but is ``disturbed'' by the random variable $\varepsilon$. Yet, (4.2) says that the relationship holds on average, i.e. the expectation of $Y$ on the condition that $X = x$ is given by $m(x)$. The goal of the empirical analysis is to use a finite set of observations $(x_i, y_i)$, $i = 1, \ldots, n$, to estimate $m(\bullet)$.
EXAMPLE 4.1
In Figure 4.1, we have $n=7125$ observations of net-income and expenditure on food (not only potatoes), taken from the Family Expenditure Survey of British households in 1973. Graphically, we try to fit an (Engel) curve to the scatterplot of food versus net-income. Clearly, the graph of the estimate of $m(\bullet)$ will not run through every point in the scatterplot, i.e. we will not be able to use this graph to perfectly predict food consumption of every household, given that we know the household's income. But this does not constitute a problem if you recall that our theoretical statement refers to average behavior.
$\Box$
Figure 4.1: Nadaraya-Watson kernel regression, U.K. Family Expenditure Survey 1973 (SPMengelcurve1)
Let us point out that, in a parametric approach, it is often assumed that $m(x) = \alpha + \beta \cdot x$, and the problem of estimating $m(\bullet)$ is reduced to the problem of estimating $\alpha$ and $\beta$. But note that this approach is not useful in our example. After all, the alleged shape of the Engel curve for potatoes, upward sloping for smaller income levels but eventually downward sloping as income is increased, is ruled out by the specification $m(x) = \alpha + \beta \cdot x$. The nonparametric approach does not put such prior restrictions on $m(\bullet)$. However, as we will see below, there is a price to pay for this flexibility.
4.1.1.1 Conditional Expectation
In this section we will recall two concepts that you should already be familiar with, conditional expectation and conditional expectation function. These concepts are central to regression analysis and deserve to be treated accordingly. Let $X$ and $Y$ be two random variables with joint probability density function $f(x,y)$. The conditional expectation of $Y$ given that $X=x$ is defined as
$$E(Y|X=x) = \int y\, f(y|x)\, dy = \int y\, \frac{f(x,y)}{f_X(x)}\, dy = m(x),$$
where $f(y|x)$ is the conditional probability density function (conditional pdf) of $Y$ given $X=x$, and $f_X(x)$ is the marginal pdf of $X$. The mean function might be quite nonlinear even for simple-looking densities.
EXAMPLE 4.2
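Consider the roof distribution with joint pdf
$$f(x,y) = x + y \quad \textrm{for } 0 \le x \le 1 \textrm{ and } 0 \le y \le 1,$$
with $f(x,y)=0$ elsewhere, and marginal pdf
$$f_X(x) = \int_0^1 f(x,y)\, dy = x + \frac{1}{2} \quad \textrm{for } 0 \le x \le 1,$$
with $f_X(x)=0$ elsewhere. Hence we get
$$E(Y|X=x) = \int y\, \frac{f(x,y)}{f_X(x)}\, dy = \int_0^1 y\, \frac{x+y}{x+\frac{1}{2}}\, dy = \frac{\frac{1}{2}x + \frac{1}{3}}{x + \frac{1}{2}} = m(x),$$
which is an obviously nonlinear function.
$\Box$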
Note that $E(Y|X=x)$ is a function of $x$ alone. Consequently, we may abbreviate this term as $m(x)$. If we vary $x$ we get a set of conditional expectations. This mapping from $x$ to $m(x)$ is called the conditional expectation function and is often denoted as $E(Y|X)$. This tells us how $Y$ and $X$ are related ``on average''. Therefore, it is of immediate interest to estimate $m(\bullet)$.
4.1.1.2 Fixed and Random Design
We started the discussion in the preceding section by assuming that both $X$ and $Y$ are random variables with joint pdf $f(x,y)$. The natural sampling scheme in this setup is to draw a random sample from the bivariate distribution that is characterized by $f(x,y)$. That is, we randomly draw observations of the form $\{X_i, Y_i\}$, $i = 1, \ldots, n$. Before the sample is drawn, we can view the $n$ pairs $\{X_i, Y_i\}$ as identically and independently distributed pairs of random variables. This sampling scheme will be referred to as the random design.
We will concentrate on random design in the following derivations. However, there are applications (especially in the natural sciences) where the researcher is able to control the values of the predictor variable $X$ and $Y$ is the sole random variable. As an example, imagine an experiment that is supposed to provide evidence for the link between a person's beer consumption ($X$) and his or her reaction time ($Y$) in a traffic incident. Here the researcher will be able to specify the amount of beer the test subject is given before the experiment is conducted. Hence $X$ will no longer be a random variable, while $Y$ still will be. This setup is usually referred to as the fixed design. In repeated sampling, in the fixed design case the density $f_X(x)$ is known (it is induced by the researcher). This additional knowledge (relative to the random design case, where $f_X(x)$ is unknown) will simplify the estimation of $m(\bullet)$, as well as deriving statistical properties of the estimator used, as we shall see below.
A special case of the fixed design model is, for example, the equispaced sequence $x_i = i/n$, $i = 0, \ldots, n$, on $[0,1]$.
As we just mentioned, kernel
regression estimators depend on the type of the design.
4.1.2.1 Random Design
The derivation of the estimator in the random design case starts with the definition of conditional expectation:
$$m(x) = E(Y|X=x) = \int y\, \frac{f(x,y)}{f_X(x)}\, dy = \frac{\int y\, f(x,y)\, dy}{f_X(x)}. \tag{4.3}$$
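Given that we have observations of the form $\{X_i, Y_i\}$, $i = 1, \ldots, n$, the only unknown quantities on the right hand side of (4.3) are $f(x,y)$ and $f_X(x)$. From our discussion of kernel density estimation we know how to estimate probability density functions. Consequently, we plug in kernel estimates for $f_X(x)$ and $f(x,y)$ in (4.3). Estimating $f_X(x)$ is straightforward. To estimate $f(x,y)$ we employ the multiplicative kernel density estimator (with product kernel) of Section 3.6
$$\widehat{f}_{h,g}(x,y) = \frac{1}{n} \sum_{i=1}^n K_h(x - X_i)\, K_g(y - Y_i), \tag{4.4}$$
with $K_h(u) = K(u/h)/h$. Hence, for the numerator of (4.3) we get
$$\int y\, \widehat{f}_{h,g}(x,y)\, dy = \frac{1}{n} \sum_{i=1}^n K_h(x - X_i) \int \frac{y}{g}\, K\left(\frac{y - Y_i}{g}\right) dy = \frac{1}{n} \sum_{i=1}^n K_h(x - X_i) \int (sg + Y_i)\, K(s)\, ds = \frac{1}{n} \sum_{i=1}^n K_h(x - X_i)\, Y_i, \tag{4.5}$$
where we used the facts that kernel functions integrate to 1 and are symmetric around zero (so that $\int s K(s)\, ds = 0$).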
Plugging in leads to the Nadaraya-Watson estimator introduced by Nadaraya (1964) and Watson (1964)
$$\widehat{m}_h(x) = \frac{n^{-1} \sum_{i=1}^n K_h(x - X_i)\, Y_i}{n^{-1} \sum_{j=1}^n K_h(x - X_j)}, \tag{4.6}$$
which is the natural extension of kernel estimation to the problem of estimating an unknown conditional expectation function.
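The ratio form of the Nadaraya-Watson estimator translates almost directly into code. The following is a minimal sketch under stated assumptions: a Gaussian kernel, simulated data with a sine-shaped regression function, and the function names `gaussian_kernel` and `nadaraya_watson` are all illustrative choices that do not come from the text.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density, used as the kernel K (an illustrative choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(x, xs, ys, h, kernel=gaussian_kernel):
    """Local constant estimate at x: sum K_h(x - X_i) Y_i / sum K_h(x - X_j)."""
    weights = [kernel((x - xi) / h) / h for xi in xs]   # K_h(x - X_i)
    denominator = sum(weights)
    if denominator == 0.0:
        return float("nan")   # estimate undefined where no data carry weight
    return sum(w * yi for w, yi in zip(weights, ys)) / denominator


# Simulated random design (hypothetical data): Y = sin(2*pi*X) + noise.
rng = random.Random(0)
xs = [rng.random() for _ in range(400)]
ys = [math.sin(2.0 * math.pi * xi) + rng.gauss(0.0, 0.3) for xi in xs]

for x in (0.25, 0.5, 0.75):
    print(x, round(nadaraya_watson(x, xs, ys, h=0.05), 3))
```

Because the kernel weights are nonnegative, every estimate is a weighted average of the observed responses and therefore lies between the smallest and the largest $Y_i$.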
Several points are noteworthy:
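- Rewriting (4.6) as
$$\widehat{m}_h(x) = \frac{1}{n} \sum_{i=1}^n \left( \frac{K_h(x - X_i)}{n^{-1} \sum_{j=1}^n K_h(x - X_j)} \right) Y_i = \frac{1}{n} \sum_{i=1}^n W_{hi}(x)\, Y_i \tag{4.7}$$
reveals that the Nadaraya-Watson estimator can be seen as a weighted (local) average of the response variables $Y_i$ (note $\frac{1}{n} \sum_{i=1}^n W_{hi}(x) = 1$). In fact, the Nadaraya-Watson estimator shares this weighted local average property with several other smoothing techniques, e.g. $k$-nearest-neighbor and spline smoothing, see Subsections 4.2.1 and 4.2.3.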
- Note that just as in kernel density estimation the bandwidth $h$ determines the degree of smoothness of $\widehat{m}_h$, see Figure 4.2. To motivate this, let $h$ go to either extreme. If $h \to 0$ then $W_{hi}(x) \to n$ if $x = X_i$ and is not defined elsewhere. Hence, at an observation $X_i$, $\widehat{m}_h(X_i)$ converges to $Y_i$, i.e. we get an interpolation of the data. On the other hand if $h \to \infty$ then $W_{hi}(x) \to 1$ for all values of $x$, and $\widehat{m}_h(x) \to \overline{Y}$, i.e. the estimator is a constant function that assigns the sample mean of $Y$ to each $x$. Choosing $h$ so that a good compromise between over- and undersmoothing is achieved is once again a crucial problem.
- You may wonder what happens if the denominator of $W_{hi}(x)$ is equal to zero. In this case, the numerator is also equal to zero, and the estimate is not defined. This can happen in regions of sparse data.
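The two bandwidth extremes and the zero-denominator case can be illustrated on a tiny data set. This is a sketch under assumptions: the Epanechnikov kernel is chosen because its compact support makes the zero denominator easy to trigger, and the five data points are made up for illustration.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero outside."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def nw_estimate(x, xs, ys, h):
    """Nadaraya-Watson estimate at x; returns None if the denominator is zero."""
    k = [epanechnikov((x - xi) / h) / h for xi in xs]
    denom = sum(k)
    if denom == 0.0:
        return None
    return sum(ki * yi for ki, yi in zip(k, ys)) / denom


# Small made-up sample with a gap between x = 0.4 and x = 0.9.
xs = [0.1, 0.2, 0.3, 0.4, 0.9]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
mean_y = sum(ys) / len(ys)

tiny = nw_estimate(0.3, xs, ys, h=0.01)   # only the observation at 0.3 gets weight
huge = nw_estimate(0.3, xs, ys, h=1e6)    # all weights are nearly equal
gap = nw_estimate(0.65, xs, ys, h=0.1)    # no observation within distance h

print(tiny, huge, mean_y, gap)
```

With the tiny bandwidth the estimate at an observation reproduces that observation's response, with the huge bandwidth it approaches the sample mean, and inside the gap the estimate is undefined.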
Figure 4.2: Four kernel regression estimates for the 1973 U.K. Family Expenditure data, one for each of four bandwidths $h$ (SPMregress)
4.1.2.2 Fixed Design
In the fixed design model, $f_X(x)$ is assumed to be known and a possible kernel estimator for this sampling scheme employs weights of the form
$$W^{FD}_{hi}(x) = \frac{K_h(x - x_i)}{f_X(x)}. \tag{4.8}$$
Thus, estimators for the fixed design case are of simpler structure and are easier to analyze in their statistical properties.
Since our main interest is the random design case, we will only mention a very particular fixed design kernel regression estimator: For the case of ordered design points $x_{(i)}$, $i = 1, \ldots, n$, from some interval $[a,b]$ Gasser & Müller (1984) suggested the following weight sequence
$$W^{GM}_{hi}(x) = n \int_{s_{i-1}}^{s_i} K_h(x - u)\, du, \tag{4.9}$$
where $s_i = \left(x_{(i)} + x_{(i+1)}\right)/2$, $s_0 = a$, $s_n = b$. Note that as for the Nadaraya-Watson estimator, the weights $W^{GM}_{hi}(x)$ sum to 1 (after division by $n$).
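To show how the weights (4.9) are related to the intuitively appealing formula (4.8) note that by the mean value theorem
$$(s_i - s_{i-1})\, K_h(x - \xi) = \int_{s_{i-1}}^{s_i} K_h(x - u)\, du \tag{4.10}$$
for some $\xi$ between $s_i$ and $s_{i-1}$. Moreover,
$$n\, (s_i - s_{i-1}) \approx \frac{1}{f_X(x)}. \tag{4.11}$$
Plugging in (4.10) and (4.11) into (4.8) gives
$$W^{FD}_{hi}(x) = \frac{K_h(x - x_{(i)})}{f_X(x)} \approx n \int_{s_{i-1}}^{s_i} K_h(x - u)\, du = W^{GM}_{hi}(x).$$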
We will meet the Gasser-Müller
estimator
again in the following section where the statistical properties of kernel
regression estimators are discussed.
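The Gasser-Müller weights are integrals of the kernel over cells around the ordered design points, so they can be computed from the kernel's antiderivative. The sketch below assumes an Epanechnikov kernel (whose antiderivative is a cubic polynomial), cell boundaries at the midpoints of neighbouring design points with the interval ends $a$ and $b$ as outer boundaries, and an equispaced design; all names and example values are hypothetical.

```python
def epanechnikov_cdf(u):
    """Integral of the Epanechnikov kernel from -infinity to u."""
    if u <= -1.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return 0.5 + 0.75 * u - 0.25 * u ** 3


def gasser_mueller_weights(x, design, h, a, b):
    """Weights n * integral of K_h(x - u) du over the i-th cell [s_{i-1}, s_i]."""
    pts = sorted(design)
    n = len(pts)
    # Cell boundaries: a, midpoints of neighbouring design points, b.
    s = [a] + [(pts[i] + pts[i + 1]) / 2.0 for i in range(n - 1)] + [b]
    return [
        n * (epanechnikov_cdf((x - s[i]) / h) - epanechnikov_cdf((x - s[i + 1]) / h))
        for i in range(n)
    ]


def gasser_mueller(x, design, ys, h, a, b):
    """Estimate (1/n) * sum_i W_i(x) * Y_i; ys must be ordered like sorted(design)."""
    w = gasser_mueller_weights(x, design, h, a, b)
    return sum(wi * yi for wi, yi in zip(w, ys)) / len(w)


# Equispaced fixed design on [0, 1] (hypothetical example values).
n = 50
design = [(i + 0.5) / n for i in range(n)]
weights = gasser_mueller_weights(0.5, design, h=0.1, a=0.0, b=1.0)
print(sum(weights) / n)   # average weight; equals 1 when [x - h, x + h] lies in [a, b]
```

The sum of the cell integrals telescopes to the integral of $K_h(x - u)$ over $[a,b]$, which is why the averaged weights equal one away from the boundary. Near $a$ or $b$ part of the kernel mass falls outside the interval and the average drops below one.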
4.1.2.3 Statistical Properties
Are kernel regression estimators consistent? In the previous chapters we showed that an estimator is consistent by deriving its mean squared error ($MSE$), showing that the $MSE$ converges to zero, and appealing to the fact that convergence in mean square implies convergence in probability (the latter being the condition stated in the definition of consistency). Moreover, the $MSE$ helped in assessing the speed with which convergence is attained. In the random design case it is very difficult to derive the $MSE$ of the Nadaraya-Watson estimator since it is the ratio (and not the sum) of two estimators. It turns out that one can show that the Nadaraya-Watson estimator is consistent in the random design case without explicit recourse to the $MSE$ of this estimator. The conditions under which this result holds are summarized in the following theorem:
THEOREM 4.1
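Assume the univariate random design model and the regularity conditions $\int |K(u)|\, du < \infty$, $u\,K(u) \to 0$ for $|u| \to \infty$, $EY^2 < \infty$. Suppose also $h \to 0$, $nh \to \infty$, then
$$\frac{1}{n} \sum_{i=1}^n W_{hi}(x)\, Y_i = \widehat{m}_h(x) \stackrel{P}{\longrightarrow} m(x),$$
where for $x$ holds $f_X(x) > 0$ and $x$ is a point of continuity of $m(x)$, $f_X(x)$, and $\sigma^2(x) = \mathrm{Var}(Y|X=x)$.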
The proof involves showing that, considered separately, both the numerator and the denominator of $\widehat{m}_h(x)$ converge. Then, as a consequence of Slutsky's theorem, it can be shown that $\widehat{m}_h(x)$ converges. For more details see Härdle (1990, p. 39ff).
Certainly, we would like to know the speed with which the estimator converges but we have already pointed out that the $MSE$ of the Nadaraya-Watson estimator in the random design case is very hard to derive. For the fixed design case, Gasser & Müller (1984) have derived the $MSE$ of the estimator named after them:
THEOREM 4.2
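Assume the univariate fixed design model and the conditions: $K(\bullet)$ has support $[-1,1]$ with $K(-1)=K(1)=0$, $m(\bullet)$ is twice continuously differentiable, $\max_i |x_i - x_{i-1}| = O(n^{-1})$. Assume $\mathrm{Var}(\varepsilon_i) = \sigma^2$, $i = 1, \ldots, n$. Then, under $n \to \infty$, $h \to 0$, $nh \to \infty$ it holds
$$MSE\left\{ \frac{1}{n} \sum_{i=1}^n W^{GM}_{hi}(x)\, Y_i \right\} \approx \frac{1}{nh}\, \sigma^2 \|K\|_2^2 + \frac{h^4}{4}\, \mu_2^2(K)\, \{m''(x)\}^2.$$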
As usual, the (asymptotic) $MSE$ has two components, the variance term $\sigma^2 \|K\|_2^2 / (nh)$ and the squared bias term $h^4 \mu_2^2(K) \{m''(x)\}^2 / 4$. Hence, if we increase the bandwidth $h$ we face the familiar trade-off between decreasing the variance while increasing the squared bias.
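To get a similar result for the random design case, we linearize the Nadaraya-Watson estimator as follows. Write $\widehat{m}_h(x) = \widehat{r}_h(x) / \widehat{f}_h(x)$, where $\widehat{r}_h(x) = n^{-1} \sum_{i=1}^n K_h(x - X_i)\, Y_i$ and $\widehat{f}_h(x) = n^{-1} \sum_{i=1}^n K_h(x - X_i)$ are the numerator and the denominator of (4.6), thus
$$\widehat{m}_h(x) - m(x) = \left\{ \frac{\widehat{r}_h(x)}{\widehat{f}_h(x)} - m(x) \right\} \left[ \frac{\widehat{f}_h(x)}{f_X(x)} + \left\{ 1 - \frac{\widehat{f}_h(x)}{f_X(x)} \right\} \right] = \frac{\widehat{r}_h(x) - m(x)\, \widehat{f}_h(x)}{f_X(x)} + \{\widehat{m}_h(x) - m(x)\}\, \frac{f_X(x) - \widehat{f}_h(x)}{f_X(x)}. \tag{4.12}$$
It can be shown that of the two terms on the right hand side, the first term is the leading term in the distribution of $\widehat{m}_h(x) - m(x)$, whereas the second term can be neglected. Hence, the $MSE$ of $\widehat{m}_h(x)$ can be approximated by calculating
$$E\left\{ \frac{\widehat{r}_h(x) - m(x)\, \widehat{f}_h(x)}{f_X(x)} \right\}^2.$$
The following theorem can be derived this way: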
THEOREM 4.3
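Assume the univariate random design model and that the conditions $\int |K(u)|\, du < \infty$, $u\,K(u) \to 0$ for $|u| \to \infty$ and $EY^2 < \infty$ hold. Suppose $h \to 0$, $nh \to \infty$, then
$$MSE\{\widehat{m}_h(x)\} \approx \frac{1}{nh}\, \frac{\sigma^2(x)}{f_X(x)}\, \|K\|_2^2 + \frac{h^4}{4} \left\{ m''(x) + 2\, \frac{m'(x)\, f_X'(x)}{f_X(x)} \right\}^2 \mu_2^2(K), \tag{4.13}$$
where for $x$ holds $f_X(x) > 0$ and $x$ is a point of continuity of $m'(x)$, $m''(x)$, $f_X(x)$, $f_X'(x)$, and $\sigma^2(x) = \mathrm{Var}(Y|X=x)$.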
Let $AMSE$ denote the asymptotic $MSE$. Most components of this formula are constants w.r.t. $n$ and $h$. Denoting the constant terms by $C_1$ and $C_2$, respectively, we may write
$$AMSE(n,h) = \frac{1}{nh}\, C_1 + h^4\, C_2.$$
Minimizing this expression with respect to $h$ gives the optimal bandwidth $h_{opt} \sim n^{-1/5}$.
If you plug a bandwidth $h \sim n^{-1/5}$ into (4.13), you will find that the $AMSE$ is of order $O(n^{-4/5})$, a rate of convergence that is slower than the rate obtained by the LS estimator in linear regression but is the same as for estimating a density function (cf. Section 3.2).
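The rate can be verified in a few lines. As a sketch, write the asymptotic $MSE$ as $C_1/(nh) + h^4 C_2$, where $C_1$ and $C_2$ are used here as labels for the terms of (4.13) that do not depend on $n$ and $h$. Setting the derivative with respect to $h$ to zero gives
$$\frac{\partial}{\partial h} \left( \frac{C_1}{nh} + h^4 C_2 \right) = -\frac{C_1}{n h^2} + 4 h^3 C_2 = 0 \quad \Longrightarrow \quad h_{opt} = \left( \frac{C_1}{4\, C_2\, n} \right)^{1/5} \sim n^{-1/5}.$$
Since $h_{opt}^5 = C_1 / (4 C_2 n)$, the squared bias term at the optimum equals one quarter of the variance term, $h_{opt}^4 C_2 = C_1 / (4 n h_{opt})$, so that
$$\frac{C_1}{n h_{opt}} + h_{opt}^4 C_2 = \frac{5}{4}\, \frac{C_1}{n h_{opt}} = \frac{5}{4}\, C_1^{4/5}\, (4 C_2)^{1/5}\, n^{-4/5}.$$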
As in the density estimation case, $AMSE$ depends on unknown quantities like $\sigma^2(x)$ or $m''(x)$. Once more, we are faced with the problem of finding a bandwidth-selection rule that has desirable theoretical properties and is applicable in practice.
We have displayed Nadaraya-Watson kernel regression estimates with
different bandwidths in Figure 4.2.
The issue of bandwidth selection will be discussed later on
in Section 4.3.
The Nadaraya-Watson estimator can be seen as a special case of a larger class of kernel regression estimators: Nadaraya-Watson regression corresponds to a local constant least squares fit. To motivate local linear and higher order local polynomial fits, let us first consider a Taylor expansion of the unknown conditional expectation function $m(\bullet)$:
$$m(t) \approx m(x) + m'(x)(t - x) + \cdots + m^{(p)}(x)(t - x)^p\, \frac{1}{p!} \tag{4.14}$$
for $t$ in a neighborhood of the point $x$. This suggests local polynomial regression, namely to fit a polynomial in a neighborhood of $x$. The neighborhood is realized by including kernel weights into the minimization problem
$$\min_{\beta} \sum_{i=1}^n \left\{ Y_i - \beta_0 - \beta_1 (X_i - x) - \cdots - \beta_p (X_i - x)^p \right\}^2 K_h(x - X_i), \tag{4.15}$$
where $\beta$ denotes the vector of coefficients $(\beta_0, \beta_1, \ldots, \beta_p)^\top$. The result is therefore a weighted least squares estimator with weights $K_h(x - X_i)$.
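Using the notations
$$\mathbf{X} = \begin{pmatrix} 1 & X_1 - x & (X_1 - x)^2 & \ldots & (X_1 - x)^p \\ 1 & X_2 - x & (X_2 - x)^2 & \ldots & (X_2 - x)^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_n - x & (X_n - x)^2 & \ldots & (X_n - x)^p \end{pmatrix}, \quad \mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad \mathbf{W} = \mathrm{diag}\{K_h(x - X_1), \ldots, K_h(x - X_n)\},$$
we can compute $\widehat{\beta}$ which minimizes (4.15) by the usual formula for a weighted least squares estimator
$$\widehat{\beta}(x) = \left( \mathbf{X}^\top \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{W} \mathbf{Y}. \tag{4.16}$$
It is important to note that, in contrast to parametric least squares, this estimator varies with $x$. Hence, this is really a local regression at the point $x$.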
Denote the components of $\widehat{\beta}(x)$ by $\widehat{\beta}_0(x)$, ..., $\widehat{\beta}_p(x)$. The local polynomial estimator of the regression function $m$ is
$$\widehat{m}_{p,h}(x) = \widehat{\beta}_0(x) \tag{4.17}$$
due to the fact that we have $m(x) \approx \beta_0(x)$ by comparing (4.14) and (4.15). The whole curve $\widehat{m}_{p,h}(\bullet)$ is obtained by running the above local polynomial regression with varying $x$. We have included the parameter $h$ in the notation since the final estimator obviously depends on the bandwidth parameter $h$, as does the Nadaraya-Watson estimator.
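Let us gain some more insight into this by computing the estimators for special values of $p$. For $p = 0$, $\widehat{\beta}$ reduces to $\widehat{\beta}_0$, which means that the local constant estimator is nothing other than our well known Nadaraya-Watson estimator, i.e.
$$\widehat{m}_{0,h}(x) = \widehat{m}_h(x) = \frac{\sum_{i=1}^n K_h(x - X_i)\, Y_i}{\sum_{i=1}^n K_h(x - X_i)}.$$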
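Now turn to $p = 1$. Denote
$$S_{h,j}(x) = \sum_{i=1}^n K_h(x - X_i)\, (X_i - x)^j, \qquad T_{h,j}(x) = \sum_{i=1}^n K_h(x - X_i)\, (X_i - x)^j\, Y_i,$$
then we can write
$$\widehat{\beta}(x) = \begin{pmatrix} S_{h,0}(x) & S_{h,1}(x) \\ S_{h,1}(x) & S_{h,2}(x) \end{pmatrix}^{-1} \begin{pmatrix} T_{h,0}(x) \\ T_{h,1}(x) \end{pmatrix}, \tag{4.18}$$
which yields the local linear estimator
$$\widehat{m}_{1,h}(x) = \widehat{\beta}_0(x) = \frac{T_{h,0}(x)\, S_{h,2}(x) - T_{h,1}(x)\, S_{h,1}(x)}{S_{h,0}(x)\, S_{h,2}(x) - S_{h,1}^2(x)}. \tag{4.19}$$
Here we used the usual matrix inversion formula for $2 \times 2$ matrices. Of course, (4.18) can be generalized for arbitrarily large $p$. The general formula is
$$\widehat{\beta}(x) = \begin{pmatrix} S_{h,0}(x) & S_{h,1}(x) & \ldots & S_{h,p}(x) \\ S_{h,1}(x) & S_{h,2}(x) & \ldots & S_{h,p+1}(x) \\ \vdots & \vdots & \ddots & \vdots \\ S_{h,p}(x) & S_{h,p+1}(x) & \ldots & S_{h,2p}(x) \end{pmatrix}^{-1} \begin{pmatrix} T_{h,0}(x) \\ T_{h,1}(x) \\ \vdots \\ T_{h,p}(x) \end{pmatrix}. \tag{4.20}$$
Introducing the notation $e_0 = (1, 0, \ldots, 0)^\top$ for the first unit vector in $\mathbb{R}^{p+1}$, we can write the local linear estimator as
$$\widehat{m}_{1,h}(x) = e_0^\top \left( \mathbf{X}^\top \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{W} \mathbf{Y}.$$
Note that the Nadaraya-Watson estimator could also be written as
$$\widehat{m}_h(x) = \frac{T_{h,0}(x)}{S_{h,0}(x)}.$$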
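The local constant and local linear estimators can both be computed from kernel-weighted sums of powers of $X_i - x$. The following sketch assumes a Gaussian kernel and made-up noise-free linear data; the helper name `moments` and the lists `S` and `T` (kernel-weighted sums of $(X_i - x)^j$ without and with the factor $Y_i$) are illustrative labels.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the kernel K (an illustrative choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def moments(x, xs, ys, h, max_j):
    """S[j] = sum K_h(x-X_i)(X_i-x)^j and T[j] = sum K_h(x-X_i)(X_i-x)^j Y_i."""
    S = [0.0] * (max_j + 1)
    T = [0.0] * (max_j + 1)
    for xi, yi in zip(xs, ys):
        k = gaussian_kernel((x - xi) / h) / h
        d = xi - x
        for j in range(max_j + 1):
            S[j] += k * d ** j
            T[j] += k * d ** j * yi
    return S, T


def local_constant(x, xs, ys, h):
    """Nadaraya-Watson estimate T_0 / S_0."""
    S, T = moments(x, xs, ys, h, 0)
    return T[0] / S[0]


def local_linear(x, xs, ys, h):
    """Local linear estimate (T_0 S_2 - T_1 S_1) / (S_0 S_2 - S_1^2)."""
    S, T = moments(x, xs, ys, h, 2)
    return (T[0] * S[2] - T[1] * S[1]) / (S[0] * S[2] - S[1] ** 2)


# Noise-free linear data (made up): the local linear fit reproduces the line,
# while the local constant fit is pulled inward at the left boundary.
xs = [i / 20.0 for i in range(21)]
ys = [1.0 + 2.0 * xi for xi in xs]
print(local_linear(0.0, xs, ys, h=0.2), local_constant(0.0, xs, ys, h=0.2))
```

At the left boundary all neighbours lie to the right and have larger responses, so the local constant average overshoots the true value 1, whereas the weighted least squares line recovers it up to rounding.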
EXAMPLE 4.3
The local linear estimator $\widehat{m}_{1,h}$ for our running example is displayed in Figure 4.3. What can we conclude from comparing this fit with the Nadaraya-Watson fit in Figure 4.1? The main difference to see is that the local linear fit reacts more sensitively on the boundaries of the fit. Another graphical difference will appear when we compare local linear and Nadaraya-Watson estimates with optimized bandwidths (see Section 4.3). Then we will see that the local linear fit will be influenced less by outliers like those which cause the ``bump'' in the right part of both Engel curves.
$\Box$
Figure 4.3: Local polynomial regression, $p=1$, U.K. Family Expenditure Survey 1973 (SPMlocpolyreg)
Here we can discuss this effect by looking at the asymptotic $MSE$ of the local linear regression estimator:
$$AMSE\{\widehat{m}_{1,h}(x)\} = \frac{1}{nh}\, \frac{\sigma^2(x)}{f_X(x)}\, \|K\|_2^2 + \frac{h^4}{4}\, \{m''(x)\}^2\, \mu_2^2(K). \tag{4.21}$$
This formula is dealt with in more detail when we come to multivariate regression, see Section 4.5. The $AMSE$ in the local linear case differs from that for the Nadaraya-Watson estimator (4.13) only with regard to the bias. It is easy to see that the bias of the local linear fit is design-independent and disappears when $m(\bullet)$ is linear. Thus, a local linear fit can improve the function estimation in regions with sparse observations, for instance in the high net-income region in our Engel curve example. Let us also mention that the bias of the local linear estimator has the same form as that of the Gasser-Müller estimator, i.e. the bias in the fixed design case.
The local linear estimator achieves further improvement in the boundary regions. In the case of Nadaraya-Watson estimates we typically observe problems due to the one-sided neighborhoods at the boundaries. The reason is that in local constant modeling, more or less the same points are used to estimate the curve near the boundary. Local polynomial regression overcomes this by fitting a higher degree polynomial here.
For estimating regression functions, the order $p$ is usually taken to be one (local linear) or three (local cubic regression). As we have seen, the local linear fit performs (asymptotically) better than the Nadaraya-Watson estimator (local constant). This holds generally: Odd order fits outperform even order fits.
Some additional remarks should be made in summary:
- As the Nadaraya-Watson estimator, the local polynomial estimator is a weighted (local) average of the response variables $Y_i$.
- As for all other kernel methods the bandwidth $h$ determines the degree of smoothness of $\widehat{m}_{p,h}$. For $h \to 0$ we observe the same result as for the Nadaraya-Watson estimator, namely at an observation $X_i$, $\widehat{m}_{p,h}(X_i)$ converges to $Y_i$. The behavior is different for $h \to \infty$. An infinitely large $h$ makes all weights equal, thus we obtain a parametric $p$th order polynomial fit in that case.
A further advantage of the local polynomial approach is that it provides an easy way of estimating derivatives of the function $m(\bullet)$. The natural approach would be to estimate $m$ by $\widehat{m}$ and then to compute the derivative $\widehat{m}'$. But an alternative and more efficient method is obtained by comparing (4.14) and (4.15) again. From this we get the local polynomial derivative estimator
$$\widehat{m}^{(\nu)}_{p,h}(x) = \nu!\, \widehat{\beta}_\nu(x) \tag{4.22}$$
for the $\nu$th derivative of $m(\bullet)$. Usually the order of the polynomial is $p = \nu + 1$ or $p = \nu + 3$ in analogy to the regression case (recall that the zeroth derivative of a function is always the function itself). Also in analogy, the ``odd'' order $p = \nu + 2\ell + 1$ outperforms the ``even'' order $p = \nu + 2\ell$.
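Local polynomial fitting of arbitrary order, including derivative estimation, amounts to solving a small weighted least squares system at each point $x$. The sketch below is one possible implementation under assumptions: a Gaussian kernel, a hand-written Gaussian elimination instead of a linear algebra library, and made-up noise-free quadratic data; none of the function names come from the text.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the kernel K (an illustrative choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def local_poly_coefficients(x, xs, ys, h, p):
    """Weighted least squares coefficients beta_0(x), ..., beta_p(x)."""
    S = [0.0] * (2 * p + 1)   # S[j] = sum K_h(x - X_i) (X_i - x)^j
    T = [0.0] * (p + 1)       # T[j] = sum K_h(x - X_i) (X_i - x)^j Y_i
    for xi, yi in zip(xs, ys):
        k = gaussian_kernel((x - xi) / h) / h
        d = xi - x
        for j in range(2 * p + 1):
            S[j] += k * d ** j
        for j in range(p + 1):
            T[j] += k * d ** j * yi
    A = [[S[r + c] for c in range(p + 1)] for r in range(p + 1)]   # X'WX
    return solve(A, T)                                             # rhs is X'WY


def local_poly_derivative(x, xs, ys, h, p, nu=0):
    """Estimate of the nu-th derivative at x: nu! times beta_nu(x)."""
    beta = local_poly_coefficients(x, xs, ys, h, p)
    return math.factorial(nu) * beta[nu]


# Noise-free quadratic (made-up data): m(x) = x^2, so m'(x) = 2x and m''(x) = 2.
xs = [i / 50.0 for i in range(51)]
ys = [xi ** 2 for xi in xs]
for nu in (0, 1, 2):
    print(nu, local_poly_derivative(0.5, xs, ys, h=0.2, p=2, nu=nu))
```

For exactly quadratic data a local quadratic fit recovers the function value and both derivatives up to rounding error. With noisy data the derivative estimates are more variable than the level estimate, so larger bandwidths are typically needed.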
Figure 4.4: Local polynomial regression and derivative estimation, $h$ by rule of thumb, U.K. Family Expenditure Survey 1973 (SPMderivest)