An important question in many fields of science is the relationship
between two variables, say $X$ and $Y$. Regression analysis is
concerned with the question of how $Y$ (the dependent variable)
can be explained by $X$ (the independent or explanatory
or regressor variable). This means a relation of the form
$$Y = m(X),$$
where $m(\cdot)$ is an a priori unknown function.
Let us consider an example from Economics.
Suppose $Y$ is expenditure on potatoes and $X$ is net-income.
If we draw a graph with the quantity of potatoes on the vertical axis
and income on the horizontal axis, then we have drawn an Engel curve.
Engel curves relate optimal quantities of a commodity to income,
holding prices constant. If we derive the Engel curve analytically,
then it takes the form $Y = m(X)$, where $Y$ denotes the quantity of
potatoes bought at income level $X$. Depending on individual preferences,
several possibilities arise:
Let us point out that, in a parametric approach, it is often
assumed that $m(x) = \alpha + \beta x$, and the problem of estimating
$m(\cdot)$ is reduced to the problem of estimating $\alpha$ and $\beta$.
But note that this approach is not useful in our example. After all,
the alleged shape of the Engel curve for potatoes, upward sloping
for smaller income levels but eventually downward sloping as income is
increased, is ruled out by the specification $m(x) = \alpha + \beta x$.
The nonparametric approach does not put such prior restrictions on
$m(\cdot)$. However, as we will see below,
there is a price to pay for this flexibility.
In this section we will recall two concepts that you should already
be familiar with: conditional expectation and the conditional
expectation function. However, these concepts are central to regression
analysis and deserve to be treated accordingly. Let $X$ and $Y$ be
two random variables with joint probability density function $f(x,y)$.
The conditional expectation of $Y$ given that $X = x$ is defined as
$$E(Y \mid X = x) = \int y\, f(y \mid x)\, dy = \int y\, \frac{f(x,y)}{f_X(x)}\, dy,$$
where $f(y \mid x)$ denotes the conditional density of $Y$ given $X = x$ and
$f_X(x)$ the marginal density of $X$.
Note that $E(Y \mid X = x)$ is a function of $x$ alone. Consequently, we may
abbreviate this term as $m(x)$. If we vary $x$ we get a set of
conditional expectations. This mapping from $x$ to $m(x)$ is called the
conditional expectation function and is often denoted as $m(\cdot)$.
This tells us how $Y$ and $X$ are related ``on average''. Therefore, it is
of immediate interest to estimate $m(\cdot)$.
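To see the concept at work in a case where everything can be computed explicitly, consider (as an illustration, not one of the examples of this text) two jointly normal random variables $X$ and $Y$ with means $\mu_X, \mu_Y$, standard deviations $\sigma_X, \sigma_Y$ and correlation $\rho$. Then
$$m(x) = E(Y \mid X = x) = \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}\,(x - \mu_X),$$
i.e. the conditional expectation function is exactly linear in $x$. In general, however, $m(\cdot)$ need not have any such simple parametric form, which is precisely why we look for nonparametric estimators of it.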
We started the discussion in the preceding section by assuming that
both $X$ and $Y$ are random variables with joint pdf $f(x,y)$.
The natural sampling scheme in this setup is to draw a random sample
from the bivariate distribution that is characterized by $f(x,y)$.
That is, we randomly draw observations of the form $\{(X_i, Y_i)\}_{i=1}^{n}$.
Before the sample is drawn, we can view the $n$ pairs $(X_i, Y_i)$ as
identically and independently distributed pairs of random variables.
This sampling scheme will be referred to as the random design.
We will concentrate on the random design in the following derivations.
However, there are applications (especially in the natural sciences)
where the researcher is able to control the values of the predictor
variable $X$, and $Y$ is the sole random variable. As an
example, imagine an experiment that is supposed to provide evidence for
the link between a person's beer consumption ($X$) and his or her
reaction time ($Y$) in a traffic incident.
Here the researcher will be able to specify the amount of beer the testee
is given before the experiment is conducted. Hence $X$ will no longer
be a random variable, while $Y$ still will be.
This setup is usually referred to as the fixed design. In repeated
sampling, in the fixed design case, the density $f_X(x)$ of the design
points is known (it is induced by the researcher). This additional
knowledge (relative to the random design case, where $f_X(x)$ is unknown)
will simplify the estimation of $m(\cdot)$, as well as the derivation of
the statistical properties of the estimator used, as we shall see below.
A special case of the fixed design model is the equispaced
sequence $x_i = i/n$, $i = 1, \ldots, n$, on $[0,1]$.
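A minimal simulation sketch may help to fix the distinction between the two sampling schemes (the regression function and the distributions below are arbitrary choices for illustration, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

def m(x):
    """Illustrative regression function (arbitrary choice)."""
    return x * np.exp(-x)

# Random design: (X_i, Y_i) drawn jointly, so X_i is itself random
X_rand = rng.exponential(scale=1.0, size=n)           # X_i drawn from some density f_X
Y_rand = m(X_rand) + rng.normal(scale=0.1, size=n)    # Y_i = m(X_i) + noise

# Fixed design: x_i set by the researcher, only Y_i is random
x_fixed = np.arange(1, n + 1) / n                     # equispaced x_i = i/n on (0, 1]
Y_fixed = m(x_fixed) + rng.normal(scale=0.1, size=n)
```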
As we just mentioned, kernel regression estimators depend on the type of design.
The derivation of the estimator in the random design case starts with the definition of conditional expectation:
$$m(x) = E(Y \mid X = x) = \int y\, \frac{f(x,y)}{f_X(x)}\, dy. \qquad (4.4)$$
Replacing the unknown densities $f(x,y)$ and $f_X(x)$ by kernel density estimates leads to the Nadaraya-Watson estimator
$$\widehat m_h(x) = \frac{n^{-1}\sum_{i=1}^{n} K_h(x - X_i)\, Y_i}{n^{-1}\sum_{j=1}^{n} K_h(x - X_j)},$$
where $K_h(u) = h^{-1} K(u/h)$ denotes the kernel function $K$ scaled by the bandwidth $h$.
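The following sketch computes the Nadaraya-Watson estimate directly from this formula (a bare-bones illustration with a Gaussian kernel; the function name and the data are not from the text):

```python
import numpy as np

def nadaraya_watson(x_grid, X, Y, h):
    """Nadaraya-Watson estimate of m on x_grid, Gaussian kernel, bandwidth h."""
    # K_h(x - X_i) = (1/h) * K((x - X_i)/h) with the standard normal density as K
    u = (x_grid[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)
    # ratio of the kernel-weighted average of the Y_i to the kernel density estimate
    return (K @ Y) / K.sum(axis=1)

# usage with the hypothetical random-design sample from above:
# m_hat = nadaraya_watson(np.linspace(0.1, 4.0, 200), X_rand, Y_rand, h=0.3)
```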
In the fixed design model, $f_X$ is assumed to be known and a possible
kernel estimator for this sampling scheme employs weights of the form
$$W_{hi}(x) = \frac{K_h(x - x_i)}{f_X(x_i)}, \qquad (4.8)$$
leading to the estimate $\widehat m_h(x) = n^{-1}\sum_{i=1}^{n} W_{hi}(x)\, Y_i$.
Since our main interest is the random design case, we will only
mention a very particular fixed design kernel regression estimator:
for the case of ordered design points $x_{(i)}$, $i = 1, \ldots, n$,
from some interval $[a,b]$,
Gasser & Müller (1984) suggested the following weight sequence
$$W_{hi}^{GM}(x) = n \int_{s_{i-1}}^{s_i} K_h(x - u)\, du, \qquad (4.9)$$
where $s_i = \frac{x_{(i)} + x_{(i+1)}}{2}$ for $i = 1, \ldots, n-1$, $s_0 = a$ and $s_n = b$.
To show how the weights (4.9) are related to the intuitively appealing formula (4.8), note that by the mean value theorem
$$\int_{s_{i-1}}^{s_i} K_h(x - u)\, du = (s_i - s_{i-1})\, K_h(x - \xi_i)$$
for some $\xi_i \in (s_{i-1}, s_i)$. Since the spacing $s_i - s_{i-1}$ of the design points is approximately $\{n\, f_X(x_{(i)})\}^{-1}$, the weight $W_{hi}^{GM}(x)$ is approximately $K_h(x - x_{(i)})/f_X(x_{(i)})$, i.e. a weight of the form (4.8).
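To make the construction of these weights explicit, here is a small sketch (an illustration assuming ordered design points on $[0,1]$ and a Gaussian kernel, so that the kernel integral can be evaluated with the normal cdf):

```python
import numpy as np
from scipy.stats import norm

def gasser_mueller_weights(x, x_design, h):
    """Weights n * int_{s_{i-1}}^{s_i} K_h(x - u) du for ordered x_design in [0, 1]."""
    n = len(x_design)
    # interval boundaries: s_0 = 0, s_i = midpoints of neighbouring design points, s_n = 1
    s = np.concatenate(([0.0], (x_design[:-1] + x_design[1:]) / 2, [1.0]))
    # for a Gaussian kernel the integral is a difference of normal cdfs
    cdf = norm.cdf((x - s) / h)
    return n * (cdf[:-1] - cdf[1:])

# usage (hypothetical): Gasser-Mueller estimate at x = 0.5 for the equispaced design
# w = gasser_mueller_weights(0.5, x_fixed, h=0.1)
# m_hat_gm = np.mean(w * Y_fixed)          # n^{-1} * sum_i W_i(x) * Y_i
```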
Are kernel regression estimators consistent? In the previous chapters we
showed that an estimator is consistent by deriving its mean squared error
(MSE), showing that the MSE converges, and appealing to the fact that
convergence in mean square implies convergence in probability (the latter
being the condition stated in the definition of consistency).
Moreover, the MSE helped in
assessing the speed with which convergence is attained. In the random
design case it is very difficult to derive the MSE of the
Nadaraya-Watson estimator since it is the ratio (and not the sum) of
two estimators. It turns out that one can show that the Nadaraya-Watson
estimator is consistent in the random design case without explicit
recourse to the MSE of this estimator. The conditions under
which this result holds are summarized in the following theorem:
The proof involves showing that both the numerator and the denominator
of $\widehat m_h(x)$, considered separately, converge.
Then, as a consequence of Slutsky's theorem, it can be shown that
$\widehat m_h(x)$ converges. For more details see
Härdle (1990, p. 39ff).
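The structure of this argument can be made explicit in a few lines (a sketch in the notation used above, with $\widehat f_h$ denoting the kernel density estimator of $f_X$). Write the Nadaraya-Watson estimator as a ratio,
$$\widehat m_h(x) = \frac{\widehat r_h(x)}{\widehat f_h(x)}, \qquad \widehat r_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - X_i)\, Y_i, \qquad \widehat f_h(x) = \frac{1}{n}\sum_{j=1}^{n} K_h(x - X_j).$$
Under the conditions of the theorem, $\widehat r_h(x)$ converges in probability to $m(x)\, f_X(x)$ and $\widehat f_h(x)$ converges in probability to $f_X(x) > 0$, so Slutsky's theorem yields $\widehat m_h(x) \stackrel{P}{\longrightarrow} m(x)$.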
Certainly, we would like to know the speed with which the estimator
converges, but we have already pointed out that the MSE of the
Nadaraya-Watson estimator in the random design case is very hard to
derive. For the fixed design case, Gasser & Müller (1984) have derived
the MSE of the estimator named after them:
$$MSE\{\widehat m_h^{GM}(x)\} \approx \frac{1}{nh}\,\sigma^2\,\|K\|_2^2 + \frac{h^4}{4}\,\mu_2^2(K)\,\{m''(x)\}^2,$$
where $\sigma^2$ denotes the variance of the error term, $\|K\|_2^2 = \int K^2(u)\, du$ and $\mu_2(K) = \int u^2 K(u)\, du$.
As usual, the (asymptotic) MSE has two components, the variance
term $\frac{1}{nh}\sigma^2\|K\|_2^2$ and the squared bias term
$\frac{h^4}{4}\mu_2^2(K)\{m''(x)\}^2$. Hence, if we increase the bandwidth $h$
we face the familiar trade-off between decreasing the variance while
increasing the squared bias.
To get a similar result for the random design case, we linearize the Nadaraya-Watson estimator as follows:
$$\widehat m_h(x) - m(x) \approx \frac{\widehat r_h(x) - m(x)\,\widehat f_h(x)}{f_X(x)},$$
with $\widehat r_h(x) = n^{-1}\sum_{i=1}^{n} K_h(x - X_i)\, Y_i$ and $\widehat f_h(x) = n^{-1}\sum_{j=1}^{n} K_h(x - X_j)$ denoting the numerator and the denominator of the Nadaraya-Watson estimator. This leads to the asymptotic MSE
$$AMSE\{\widehat m_h(x)\} = \frac{1}{nh}\,\frac{\sigma^2(x)}{f_X(x)}\,\|K\|_2^2 + \frac{h^4}{4}\,\mu_2^2(K)\left\{m''(x) + 2\,\frac{m'(x)\, f_X'(x)}{f_X(x)}\right\}^2,$$
where $\sigma^2(x) = Var(Y \mid X = x)$ denotes the conditional variance.
Most components of this formula are constants with respect to $n$ and $h$,
and, denoting the constant terms by $C_1$ and $C_2$ respectively, we may write
$$AMSE(n,h) = \frac{1}{nh}\, C_1 + h^4\, C_2.$$
As in the density estimation case, the AMSE depends on unknown quantities like
$\sigma^2(x)$ or $m''(x)$. Once more, we are faced with the problem of finding a
bandwidth-selection rule that has desirable theoretical properties
and is applicable in practice.
We have displayed Nadaraya-Watson kernel regression estimates with
different bandwidths in Figure 4.2.
The issue of bandwidth selection will be discussed later on
in Section 4.3.
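Minimizing the simplified expression $AMSE(n,h) = \frac{C_1}{nh} + C_2 h^4$ over $h$ already reveals the familiar rates (a short derivation, assuming $C_1, C_2 > 0$): setting the derivative with respect to $h$ to zero gives
$$-\frac{C_1}{n h^2} + 4\, C_2\, h^3 = 0 \quad\Longrightarrow\quad h_{opt} = \left(\frac{C_1}{4\, C_2}\right)^{1/5} n^{-1/5},$$
so the optimal bandwidth shrinks at the rate $n^{-1/5}$ and the resulting AMSE is of order $n^{-4/5}$, slower than the parametric rate $n^{-1}$. Of course, $h_{opt}$ still involves the unknown constants $C_1$ and $C_2$, which is exactly why practicable bandwidth-selection rules are needed.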
The Nadaraya-Watson estimator can be seen as a special case of a larger
class of kernel regression estimators: Nadaraya-Watson regression
corresponds to a local constant least squares fit. To motivate local
linear and higher order local polynomial fits, let us first consider a
Taylor expansion of the unknown conditional expectation function $m(\cdot)$:
$$m(t) \approx m(x) + m'(x)\,(t - x) + \ldots + \frac{m^{(p)}(x)}{p!}\,(t - x)^p \qquad (4.16)$$
for $t$ in a neighborhood of the point $x$.
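This expansion suggests fitting a polynomial of order $p$ locally around each point $x$ by weighted least squares. In the usual notation of the local polynomial literature (a sketch; the coefficient symbols $\beta_j$ are the conventional ones and may differ from the labels in the omitted displays of this text), one solves
$$\min_{\beta_0, \ldots, \beta_p}\; \sum_{i=1}^{n} \Big\{ Y_i - \beta_0 - \beta_1 (X_i - x) - \ldots - \beta_p (X_i - x)^p \Big\}^2 K_h(x - X_i)$$
and takes $\widehat m_{p,h}(x) = \widehat\beta_0(x)$ as the estimate of $m(x)$, since $\beta_0$ plays the role of $m(x)$ in the expansion (4.16), while $\widehat\beta_j(x)$ estimates $m^{(j)}(x)/j!$.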
Let us gain some more insight into this by computing the estimators
for special values of the order $p$. For $p = 0$ the local polynomial fit
reduces to a locally weighted constant fit, which means
that the local constant estimator is nothing other than our well known
Nadaraya-Watson estimator, i.e.
$$\widehat m_{0,h}(x) = \frac{\sum_{i=1}^{n} K_h(x - X_i)\, Y_i}{\sum_{j=1}^{n} K_h(x - X_j)} = \widehat m_h(x). \qquad (4.19)$$
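A compact way to compute such fits is to solve the weighted least squares problem at each grid point directly. The sketch below (an illustration, not the text's own code) does this for arbitrary degree $p$, so that $p = 0$ reproduces the Nadaraya-Watson values and $p = 1$ gives the local linear fit:

```python
import numpy as np

def local_polynomial(x_grid, X, Y, h, p=1):
    """Local polynomial regression estimate of m of degree p (Gaussian kernel, bandwidth h)."""
    est = np.empty(len(x_grid))
    for k, x in enumerate(x_grid):
        u = (X - x) / h
        w = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)   # kernel weights K_h(x - X_i)
        D = np.vander(X - x, N=p + 1, increasing=True)       # columns (X_i - x)^j, j = 0..p
        sw = np.sqrt(w)
        # weighted least squares via ordinary lstsq on rescaled data
        beta, *_ = np.linalg.lstsq(D * sw[:, None], Y * sw, rcond=None)
        est[k] = beta[0]                                     # beta_0 estimates m(x)
    return est

# usage (hypothetical): compare local constant (p=0) and local linear (p=1) fits
# grid = np.linspace(X_rand.min(), X_rand.max(), 200)
# m_nw = local_polynomial(grid, X_rand, Y_rand, h=0.3, p=0)
# m_ll = local_polynomial(grid, X_rand, Y_rand, h=0.3, p=1)
```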
Another graphical difference will appear when we compare local linear
and Nadaraya-Watson estimates with optimized bandwidths (see Section
4.3). Then we will see that the local linear fit is
influenced less by outliers such as those which cause the ``bump'' in the
right part of both Engel curves.
Here we can discuss this effect
by looking at the asymptotic MSE of the local linear regression
estimator:
$$AMSE\{\widehat m_{1,h}(x)\} = \frac{1}{nh}\,\frac{\sigma^2(x)}{f_X(x)}\,\|K\|_2^2 + \frac{h^4}{4}\,\mu_2^2(K)\,\{m''(x)\}^2.$$
Note that, in contrast to the Nadaraya-Watson case, the bias term no longer involves $m'(x)$ or the derivative of the design density $f_X$.
The local linear estimator achieves further improvement in the boundary regions. In the case of Nadaraya-Watson estimates we typically observe problems due to the one-sided neighborhoods at the boundaries. The reason is that in local constant modeling, more or less the same points are used to estimate the curve near the boundary. Local polynomial regression overcomes this by fitting a higher degree polynomial here.
For estimating regression functions, the order $p$ is usually taken
to be one (local linear) or three (local cubic regression).
As we have seen, the local linear fit performs (asymptotically)
better than the Nadaraya-Watson estimator (local constant).
This holds generally:
odd order fits outperform even order fits.
Some additional remarks should be made in summary:
A further advantage of the local polynomial approach is that it
provides an easy way of estimating derivatives of the function $m(\cdot)$.
The natural approach would be to estimate $m(\cdot)$ by $\widehat m_{p,h}(\cdot)$
and then to compute the derivative of this estimate.
But an alternative and more efficient
method is obtained by comparing (4.14) and
(4.15) again. From this we get the local polynomial
derivative estimator
$$\widehat m^{(\nu)}_{p,h}(x) = \nu!\; \widehat\beta_\nu(x), \qquad (4.22)$$
i.e. the estimate of the $\nu$-th derivative $m^{(\nu)}(x)$ is $\nu!$ times the $\nu$-th coefficient of the local polynomial fit.
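In code, the derivative estimator simply reads off a higher coefficient of the same local fit. A sketch along the lines of the local polynomial function above (again an illustration with a Gaussian kernel):

```python
import numpy as np
from math import factorial

def local_polynomial_derivative(x_grid, X, Y, h, nu=1, p=2):
    """Estimate the nu-th derivative of m on x_grid as nu! * beta_nu of a degree-p local fit."""
    est = np.empty(len(x_grid))
    for k, x in enumerate(x_grid):
        u = (X - x) / h
        w = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)   # kernel weights K_h(x - X_i)
        D = np.vander(X - x, N=p + 1, increasing=True)       # columns (X_i - x)^j, j = 0..p
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(D * sw[:, None], Y * sw, rcond=None)
        est[k] = factorial(nu) * beta[nu]                    # nu-th derivative estimate
    return est

# usage (hypothetical): slope of the regression curve from a local quadratic fit
# m_prime = local_polynomial_derivative(grid, X_rand, Y_rand, h=0.4, nu=1, p=2)
```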