# 4.1 Univariate Kernel Regression

An important question in many fields of science is the relationship between two variables, say X and Y. Regression analysis is concerned with the question of how Y (the dependent variable) can be explained by X (the independent or explanatory or regressor variable). This means a relation of the form

$$Y = m(X),$$

where m(·) is a function in the mathematical sense. In many cases theory does not put any restrictions on the form of m, i.e. theory does not say whether m is linear, quadratic, increasing in X, etc. Hence, it is up to empirical analysis to use data to find out more about m.

## 4.1.1 Introduction

Let us consider an example from Economics. Suppose Y is expenditure on potatoes and X is net-income. If we draw a graph with quantity of potatoes on the vertical axis and income on the horizontal axis then we have drawn an Engel curve. Apparently, Engel curves relate optimal quantities of a commodity to income, holding prices constant. If we derive the Engel curve analytically, then it takes the form Y = m(X), where m(X) denotes the quantity of potatoes bought at income level X. Depending on individual preferences several possibilities arise:

• The Engel curve slopes upward, i.e. m is an increasing function of X. As income increases the consumer is buying more of the good. In this case the good is said to be normal. A special case of an upward sloping Engel curve is a straight line through the origin, i.e. m is a linear function.
• The Engel curve slopes downward or eventually slopes downward (after sloping upward first) as income grows. If the Engel curve slopes downward the good is said to be inferior.
Are potatoes inferior goods? There is just one way to find out: collect appropriate data and estimate an Engel curve for potatoes. We can interpret the statement "potatoes are inferior" in the sense that, on average, consumers will buy fewer potatoes if their income grows while prices are held constant. The principle that theoretic laws usually do not hold in every individual case but merely on average can be formalized as

$$Y_i = m(X_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{4.1}$$
$$E(Y \mid X = X_i) = m(X_i). \tag{4.2}$$

Equation (4.1) says that the relationship does not need to hold exactly for the i-th observation (household) but is "disturbed" by the random variable ε_i. Yet, (4.2) says that the relationship holds on average, i.e. the expectation of Y on the condition that X = X_i is given by m(X_i). The goal of the empirical analysis is to use a finite set of observations (X_i, Y_i), i = 1, …, n, to estimate m(·).
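To fix ideas, the sampling scheme behind (4.1) and (4.2) can be simulated. The regression curve m and the noise level below are hypothetical choices for illustration, not taken from the text; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def m(x):
    # hypothetical "true" regression curve (illustrative choice)
    return np.sin(2 * np.pi * x)

n = 200
X = rng.uniform(0.0, 1.0, n)     # random design: the X_i are drawn at random
eps = rng.normal(0.0, 0.3, n)    # disturbances with E(eps_i | X_i) = 0
Y = m(X) + eps                   # model (4.1): Y_i = m(X_i) + eps_i
```

Averaging the Y_i whose X_i lie near a fixed x then approximates m(x), which is precisely the idea the kernel estimators of this chapter formalize.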

EXAMPLE 4.1
In Figure 4.1, we have n observations of net-income and expenditure on food (not only potatoes), taken from the Family Expenditure Survey of British households in 1973. Graphically, we try to fit an (Engel) curve to the scatterplot of food expenditure versus net-income. Clearly, the graph of the estimate of m will not run through every point in the scatterplot, i.e. we will not be able to use this graph to perfectly predict food consumption of every household, given that we know the household's income. But this does not constitute a serious problem (or any problem at all) if you recall that our theoretical statement refers to average behavior.

Let us point out that, in a parametric approach, it is often assumed that m(x) = α + βx, and the problem of estimating m is reduced to the problem of estimating α and β. But note that this approach is not useful in our example. After all, the alleged shape of the Engel curve for potatoes, upward sloping for smaller income levels but eventually downward sloping as income is increased, is ruled out by the linear specification m(x) = α + βx. The nonparametric approach does not put such prior restrictions on m. However, as we will see below, there is a price to pay for this flexibility.

### 4.1.1.1 Conditional Expectation

In this section we will recall two concepts that you should already be familiar with: conditional expectation and the conditional expectation function. However, these concepts are central to regression analysis and deserve to be treated accordingly. Let X and Y be two random variables with joint probability density function f(x, y). The conditional expectation of Y given that X = x is defined as

$$E(Y \mid X = x) = \int y\, f(y \mid x)\, dy = \int y\, \frac{f(x, y)}{f_X(x)}\, dy,$$

where f(y | x) is the conditional probability density function (conditional pdf) of Y given X = x, and f_X(x) is the marginal pdf of X. The mean function might be quite nonlinear even for simple-looking densities.

EXAMPLE 4.2
Consider the roof distribution with joint pdf

$$f(x, y) = x + y \quad \text{for } 0 \le x \le 1,\; 0 \le y \le 1,$$

with f(x, y) = 0 elsewhere, and marginal pdf

$$f_X(x) = x + \frac{1}{2} \quad \text{for } 0 \le x \le 1,$$

with f_X(x) = 0 elsewhere. Hence we get

$$E(Y \mid X = x) = \int_0^1 y\, \frac{x + y}{x + 1/2}\, dy = \frac{x/2 + 1/3}{x + 1/2} = \frac{3x + 2}{6x + 3},$$

which is an obviously nonlinear function.
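Taking the roof density to be f(x, y) = x + y on the unit square (an assumption of this sketch), a short numerical integration confirms the closed form (3x + 2)/(6x + 3) and its nonlinearity:

```python
import numpy as np

def cond_mean(x, ngrid=100_000):
    """E(Y | X = x) for the roof density f(x, y) = x + y on [0, 1]^2,
    computed by midpoint-rule integration of y * f(x, y) / f_X(x)."""
    y = (np.arange(ngrid) + 0.5) / ngrid   # midpoints of a fine grid on [0, 1]
    fx = x + 0.5                           # marginal f_X(x) = x + 1/2
    return float(np.mean(y * (x + y)) / fx)

# compare with the closed form (3x + 2) / (6x + 3) at a few points
for x in (0.0, 0.5, 1.0):
    print(x, cond_mean(x), (3 * x + 2) / (6 * x + 3))
```

The conditional mean moves from 2/3 at x = 0 to 5/9 at x = 1, but not along a straight line.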

Note that E(Y | X = x) is a function of x alone. Consequently, we may abbreviate this term as m(x). If we vary x we get a set of conditional expectations. This mapping from x to m(x) is called the conditional expectation function and is often denoted as m(·). It tells us how X and Y are related "on average". Therefore, it is of immediate interest to estimate m(·).

### 4.1.1.2 Fixed and Random Design

We started the discussion in the preceding section by assuming that both X and Y are random variables with joint pdf f(x, y). The natural sampling scheme in this setup is to draw a random sample from the bivariate distribution that is characterized by f(x, y). That is, we randomly draw observations of the form (X_i, Y_i), i = 1, …, n. Before the sample is drawn, we can view the n pairs (X_i, Y_i) as independently and identically distributed pairs of random variables. This sampling scheme will be referred to as the random design.

We will concentrate on the random design in the following derivations. However, there are applications (especially in the natural sciences) where the researcher is able to control the values of the predictor variable X, and Y is the sole random variable. As an example, imagine an experiment that is supposed to provide evidence for the link between a person's beer consumption (X) and his or her reaction time (Y) in a traffic incident. Here the researcher will be able to specify the amount of beer the testee is given before the experiment is conducted. Hence X will no longer be a random variable, while Y still will be. This setup is usually referred to as the fixed design. In repeated sampling, in the fixed design case the density f_X(x) is known (it is induced by the researcher). This additional knowledge (relative to the random design case, where f_X(x) is unknown) will simplify the estimation of m(·), as well as the derivation of statistical properties of the estimator used, as we shall see below. A special case of the fixed design model is the equispaced sequence x_i = i/n, i = 1, …, n, on [0, 1].

## 4.1.2 Kernel Regression

As we just mentioned, kernel regression estimators depend on the type of the design.

### 4.1.2.1 Random Design

The derivation of the estimator in the random design case starts with the definition of conditional expectation:

$$m(x) = E(Y \mid X = x) = \int y\, \frac{f(x, y)}{f_X(x)}\, dy = \frac{\int y\, f(x, y)\, dy}{f_X(x)}. \tag{4.3}$$

Given that we have observations of the form (X_i, Y_i), i = 1, …, n, the only unknown quantities on the right hand side of (4.3) are f(x, y) and f_X(x). From our discussion of kernel density estimation we know how to estimate probability density functions. Consequently, we plug in kernel estimates for f_X(x) and f(x, y) in (4.3). Estimating f_X(x) is straightforward. To estimate f(x, y) we employ the multiplicative kernel density estimator (with product kernel) of Section 3.6

$$\hat f_{h,g}(x, y) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i)\, K_g(y - Y_i). \tag{4.4}$$

Hence, for the numerator of (4.3) we get

$$\int y\, \hat f_{h,g}(x, y)\, dy = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i) \int y\, K_g(y - Y_i)\, dy = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i)\, Y_i, \tag{4.5}$$

where we used the facts that kernel functions integrate to 1 and are symmetric around zero. Plugging the kernel density estimate $\hat f_h(x)$ into the denominator of (4.3) leads to the Nadaraya-Watson estimator introduced by Nadaraya (1964) and Watson (1964)

$$\hat m_h(x) = \frac{n^{-1} \sum_{i=1}^{n} K_h(x - X_i)\, Y_i}{n^{-1} \sum_{j=1}^{n} K_h(x - X_j)}, \tag{4.6}$$

which is the natural extension of kernel estimation to the problem of estimating an unknown conditional expectation function. Several points are noteworthy:
• Rewriting (4.6) as

$$\hat m_h(x) = \frac{1}{n} \sum_{i=1}^{n} W_{hi}(x)\, Y_i, \quad \text{with } W_{hi}(x) = \frac{K_h(x - X_i)}{\hat f_h(x)}, \tag{4.7}$$

reveals that the Nadaraya-Watson estimator can be seen as a weighted (local) average of the response variables Y_i (note that $n^{-1} \sum_{i=1}^{n} W_{hi}(x) = 1$). In fact, the Nadaraya-Watson estimator shares this weighted local average property with several other smoothing techniques, e.g. k-nearest-neighbor and spline smoothing, see Subsections 4.2.1 and 4.2.3.
• Note that just as in kernel density estimation the bandwidth h determines the degree of smoothness of $\hat m_h$, see Figure 4.2. To motivate this, let h go to either extreme. If h → 0, then W_{hi}(x) → n for x = X_i and is not defined elsewhere. Hence, at an observation X_i, $\hat m_h(X_i)$ converges to Y_i, i.e. we get an interpolation of the data. On the other hand, if h → ∞, then W_{hi}(x) → 1 for all values of x, and $\hat m_h(x) \to \bar Y$, i.e. the estimator is a constant function that assigns the sample mean of Y to each x. Choosing h so that a good compromise between over- and undersmoothing is achieved is once again a crucial problem.
• You may wonder what happens if the denominator of $\hat m_h(x)$ is equal to zero. In this case, the numerator is also equal to zero, and the estimate is not defined. This can happen in regions of sparse data.
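A direct implementation of (4.6) makes the weighted-average reading concrete. The Gaussian kernel and the bandwidth below are illustrative assumptions, not prescribed by the text; a minimal sketch:

```python
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """Nadaraya-Watson estimate at x, formula (4.6), with a Gaussian kernel."""
    u = (x - X) / h
    w = np.exp(-0.5 * u**2)          # kernel weights K_h(x - X_i), up to a constant
    if w.sum() == 0.0:               # sparse-data region: estimator undefined
        return float("nan")
    return float(np.sum(w * Y) / np.sum(w))   # local weighted average of the Y_i

# sanity check: constant responses give back the constant at any x
X = np.linspace(0.0, 1.0, 50)
Y = np.full(50, 3.0)
print(nadaraya_watson(0.5, X, Y, h=0.2))   # close to 3.0
```

The kernel's normalizing constant cancels in the ratio, so it is omitted; for h → ∞ the weights become equal and the estimate collapses to the sample mean of Y, matching the remark above.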

### 4.1.2.2 Fixed Design

In the fixed design model, the design density f_X is assumed to be known and a possible kernel estimator for this sampling scheme employs weights of the form

$$W_{hi}(x) = \frac{K_h(x - x_i)}{f_X(x_i)}. \tag{4.8}$$

Thus, estimators for the fixed design case are of simpler structure, and their statistical properties are easier to analyze.

Since our main interest is the random design case, we will only mention a very particular fixed design kernel regression estimator: For the case of ordered design points x_{(i)}, i = 1, …, n, from some interval [a, b], Gasser & Müller (1984) suggested the following weight sequence

$$W_{hi}(x) = n \int_{s_{i-1}}^{s_i} K_h(x - u)\, du, \tag{4.9}$$

where $s_i = (x_{(i)} + x_{(i+1)})/2$, $s_0 = a$, $s_n = b$. Note that, as for the Nadaraya-Watson estimator, the weights $n^{-1} W_{hi}(x)$ sum to 1.

To show how the weights (4.9) are related to the intuitively appealing formula (4.8), note that by the mean value theorem

$$\int_{s_{i-1}}^{s_i} K_h(x - u)\, du = (s_i - s_{i-1})\, K_h(x - \xi) \tag{4.10}$$

for some ξ between $s_{i-1}$ and $s_i$. Moreover,

$$s_i - s_{i-1} = \frac{x_{(i+1)} - x_{(i-1)}}{2} \approx \frac{1}{n\, f_X(x_{(i)})}. \tag{4.11}$$

Plugging (4.10) and (4.11) into (4.9) gives

$$W_{hi}(x) \approx n\, \frac{1}{n\, f_X(x_{(i)})}\, K_h(x - x_{(i)}) = \frac{K_h(x - x_{(i)})}{f_X(x_{(i)})},$$

which is just the weight (4.8).

We will meet the Gasser-Müller estimator again in the following section where the statistical properties of kernel regression estimators are discussed.

### 4.1.2.3 Statistical Properties

Are kernel regression estimators consistent? In the previous chapters we showed that an estimator is consistent by deriving its mean squared error (MSE), showing that the MSE converges to zero, and appealing to the fact that convergence in mean square implies convergence in probability (the latter being the condition stated in the definition of consistency).

Moreover, the MSE helped in assessing the speed with which convergence is attained. In the random design case it is very difficult to derive the MSE of the Nadaraya-Watson estimator since it is the ratio (and not the sum) of two estimators. It turns out that one can show that the Nadaraya-Watson estimator is consistent in the random design case without explicit recourse to the MSE of this estimator. The conditions under which this result holds are summarized in the following theorem:

THEOREM 4.1
Assume the univariate random design model and the regularity conditions $\int |K(u)|\, du < \infty$, $|u| K(u) \to 0$ for $|u| \to \infty$, and $E(Y^2) < \infty$. Suppose also h → 0, nh → ∞, then

$$\frac{1}{n} \sum_{i=1}^{n} W_{hi}(x)\, Y_i \;\xrightarrow{P}\; m(x),$$

where for x both $f_X(x) > 0$ holds and x is a point of continuity of m(x), f_X(x), and σ²(x).

The proof involves showing that -- considered separately -- both the numerator and the denominator of $\hat m_h(x)$ converge. Then, as a consequence of Slutsky's theorem, it can be shown that $\hat m_h(x)$ converges. For more details see Härdle (1990, p. 39ff).

Certainly, we would like to know the speed with which the estimator converges, but we have already pointed out that the MSE of the Nadaraya-Watson estimator in the random design case is very hard to derive. For the fixed design case, Gasser & Müller (1984) have derived the MSE of the estimator named after them:

THEOREM 4.2
Assume the univariate fixed design model and the conditions: K has support [-1, 1] with K(-1) = K(1) = 0, m is twice continuously differentiable, and $\operatorname{Var}(\varepsilon_i) = \sigma^2$, i = 1, …, n. Then, under h → 0, nh → ∞, it holds

$$MSE\{\hat m_h(x)\} \approx \frac{1}{nh}\, \sigma^2\, \|K\|_2^2 + \frac{h^4}{4}\, \mu_2(K)^2\, \{m''(x)\}^2.$$

As usual, the (asymptotic) MSE has two components, the variance term $\sigma^2 \|K\|_2^2 / (nh)$ and the squared bias term $h^4 \mu_2(K)^2 \{m''(x)\}^2 / 4$. Hence, if we increase the bandwidth h we face the familiar trade-off between decreasing the variance while increasing the squared bias.

To get a similar result for the random design case, we linearize the Nadaraya-Watson estimator as follows. Write $\hat m_h(x) = \hat r_h(x) / \hat f_h(x)$ with $\hat r_h(x) = n^{-1} \sum_{i=1}^{n} K_h(x - X_i)\, Y_i$, thus

$$\hat m_h(x) - m(x) = \frac{\hat r_h(x) - m(x)\, \hat f_h(x)}{f_X(x)} + \left\{ \hat m_h(x) - m(x) \right\} \frac{f_X(x) - \hat f_h(x)}{f_X(x)}. \tag{4.12}$$

It can be shown that of the two terms on the right hand side, the first term is the leading term in the distribution of $\hat m_h(x) - m(x)$, whereas the second term can be neglected. Hence, the MSE of $\hat m_h(x)$ can be approximated by calculating

$$E\left[ \frac{\{\hat r_h(x) - m(x)\, \hat f_h(x)\}^2}{f_X(x)^2} \right].$$

The following theorem can be derived this way:

THEOREM 4.3
Assume the univariate random design model and the conditions $\int |K(u)|\, du < \infty$, $|u| K(u) \to 0$ for $|u| \to \infty$, and $E(Y^2) < \infty$. Suppose h → 0, nh → ∞, then

$$MSE\{\hat m_h(x)\} \approx \frac{1}{nh}\, \frac{\sigma^2(x)}{f_X(x)}\, \|K\|_2^2 + \frac{h^4}{4}\, \left\{ m''(x) + 2\, \frac{m'(x)\, f_X'(x)}{f_X(x)} \right\}^2 \mu_2(K)^2, \tag{4.13}$$

where for x both $f_X(x) > 0$ holds and x is a point of continuity of m'(x), m''(x), f_X'(x), and σ²(x).

Let AMSE denote the asymptotic MSE. Most components of this formula are constants w.r.t. n and h, and we may write, denoting the constant terms by C₁ and C₂ respectively,

$$AMSE(n, h) = \frac{1}{nh}\, C_1 + h^4\, C_2.$$

Minimizing this expression with respect to h gives the optimal bandwidth $h_{opt} \sim n^{-1/5}$. If you plug a bandwidth $h \sim n^{-1/5}$ into (4.13), you will find that the AMSE is of order $O(n^{-4/5})$, a rate of convergence that is slower than the rate $O(n^{-1})$ obtained by the LS estimator in linear regression but is the same as for estimating a density function (cf. Section 3.2).
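The bandwidth trade-off can be verified numerically: setting the derivative of $C_1/(nh) + C_2 h^4$ to zero gives $h_{opt} = \{C_1/(4 C_2 n)\}^{1/5}$, and plugging this back in yields the $n^{-4/5}$ rate. A small check with illustrative constants C₁ = C₂ = 1 (the constants themselves are hypothetical):

```python
import numpy as np

def amse(h, n, C1=1.0, C2=1.0):
    # AMSE(n, h) = C1 / (n h) + C2 h^4
    return C1 / (n * h) + C2 * h**4

def h_opt(n, C1=1.0, C2=1.0):
    # solve d AMSE / dh = -C1 / (n h^2) + 4 C2 h^3 = 0 for h
    return (C1 / (4.0 * C2 * n)) ** 0.2

n = 1_000
hs = np.linspace(0.01, 0.5, 2_000)            # grid search over bandwidths
h_grid = float(hs[np.argmin(amse(hs, n))])
print(h_opt(n), h_grid)                        # grid minimum matches the closed form
print(amse(h_opt(10 * n), 10 * n) / amse(h_opt(n), n))  # AMSE shrinks by 10^(-4/5)
```

At the optimal bandwidth the variance and squared-bias terms balance, which is why both scale like $n^{-4/5}$.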

As in the density estimation case, $h_{opt}$ depends on unknown quantities like m''(x) or σ²(x). Once more, we are faced with the problem of finding a bandwidth-selection rule that has desirable theoretical properties and is applicable in practice. We have displayed Nadaraya-Watson kernel regression estimates with different bandwidths in Figure 4.2. The issue of bandwidth selection will be discussed later on in Section 4.3.

## 4.1.3 Local Polynomial Regression and Derivative Estimation

The Nadaraya-Watson estimator can be seen as a special case of a larger class of kernel regression estimators: Nadaraya-Watson regression corresponds to a local constant least squares fit. To motivate local linear and higher order local polynomial fits, let us first consider a Taylor expansion of the unknown conditional expectation function m(·):

$$m(t) \approx m(x) + m'(x)\,(t - x) + \frac{m''(x)}{2}\,(t - x)^2 + \cdots + \frac{m^{(p)}(x)}{p!}\,(t - x)^p \tag{4.14}$$

for t in a neighborhood of the point x. This suggests local polynomial regression, namely to fit a polynomial in a neighborhood of x. The neighborhood is realized by including kernel weights into the minimization problem

$$\min_{\beta} \sum_{i=1}^{n} \left\{ Y_i - \sum_{j=0}^{p} \beta_j\, (X_i - x)^j \right\}^2 K_h(x - X_i), \tag{4.15}$$

where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top$ denotes the vector of coefficients. The result is therefore a weighted least squares estimator with weights $K_h(x - X_i)$. Using the notations

$$\mathbf{X} = \begin{pmatrix} 1 & X_1 - x & \cdots & (X_1 - x)^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_n - x & \cdots & (X_n - x)^p \end{pmatrix}, \quad \mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad \mathbf{W} = \operatorname{diag}\{K_h(x - X_1), \ldots, K_h(x - X_n)\},$$

we can compute $\hat\beta$ which minimizes (4.15) by the usual formula for a weighted least squares estimator

$$\hat\beta(x) = (\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1}\, \mathbf{X}^\top \mathbf{W} \mathbf{Y}. \tag{4.16}$$

It is important to note that -- in contrast to parametric least squares -- this estimator varies with x. Hence, this is really a local regression at the point x. Denote the components of $\hat\beta(x)$ by $\hat\beta_0(x)$, ..., $\hat\beta_p(x)$. The local polynomial estimator of the regression function m is

$$\hat m_{p,h}(x) = \hat\beta_0(x), \tag{4.17}$$

due to the fact that we have $\beta_0 = m(x)$ by comparing (4.14) and (4.15). The whole curve $\hat m_{p,h}(\cdot)$ is obtained by running the above local polynomial regression with varying x. We have included the parameter h in the notation since the final estimator obviously depends on the bandwidth parameter h, as does the Nadaraya-Watson estimator.
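The weighted least squares formula (4.16) translates directly into code. The Gaussian kernel below is an illustrative choice; a minimal sketch:

```python
import numpy as np

def local_poly_beta(x, X, Y, h, p=1):
    """Coefficients beta_hat(x) from (4.16) with a Gaussian kernel.

    beta_hat[0] is the local polynomial regression estimate (4.17)."""
    D = np.vander(X - x, N=p + 1, increasing=True)  # columns 1, (X_i - x), ..., (X_i - x)^p
    w = np.exp(-0.5 * ((x - X) / h) ** 2)           # kernel weights K_h(x - X_i)
    WD = D * w[:, None]                             # rows scaled by the weights
    return np.linalg.solve(WD.T @ D, WD.T @ Y)      # (X'WX)^{-1} X'WY

# check: a local linear fit (p = 1) reproduces exactly linear data
X = np.linspace(0.0, 1.0, 40)
Y = 2.0 + 3.0 * X
beta = local_poly_beta(0.3, X, Y, h=0.1, p=1)
print(beta)   # close to [2.9, 3.0]: m(0.3) = 2.9 and slope 3
```

Exact reproduction of linear data for p = 1 is no accident: when the true m lies in the space of fitted polynomials, weighted least squares recovers it regardless of the weights, which is the intuition behind the vanishing bias of the local linear fit for linear m.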

Let us gain some more insight into this by computing the estimators for special values of p. For p = 0, $\hat\beta(x)$ reduces to $\hat\beta_0(x)$, which means that the local constant estimator is nothing else than our well known Nadaraya-Watson estimator, i.e.

$$\hat m_{0,h}(x) = \hat m_h(x) = \frac{\sum_{i=1}^{n} K_h(x - X_i)\, Y_i}{\sum_{j=1}^{n} K_h(x - X_j)}.$$

Now turn to p = 1. Denote

$$S_{h,j}(x) = \sum_{i=1}^{n} K_h(x - X_i)\,(X_i - x)^j, \quad T_{h,j}(x) = \sum_{i=1}^{n} K_h(x - X_i)\,(X_i - x)^j\, Y_i,$$

then we can write

$$\hat\beta(x) = \begin{pmatrix} S_{h,0}(x) & S_{h,1}(x) \\ S_{h,1}(x) & S_{h,2}(x) \end{pmatrix}^{-1} \begin{pmatrix} T_{h,0}(x) \\ T_{h,1}(x) \end{pmatrix}, \tag{4.18}$$

which yields the local linear estimator

$$\hat m_{1,h}(x) = \hat\beta_0(x) = \frac{T_{h,0}(x)\, S_{h,2}(x) - T_{h,1}(x)\, S_{h,1}(x)}{S_{h,0}(x)\, S_{h,2}(x) - S_{h,1}(x)^2}. \tag{4.19}$$

Here we used the usual inversion formula for 2×2 matrices. Of course, (4.18) can be generalized to arbitrary order p. The general formula is

$$\hat\beta(x) = \begin{pmatrix} S_{h,0}(x) & \cdots & S_{h,p}(x) \\ \vdots & \ddots & \vdots \\ S_{h,p}(x) & \cdots & S_{h,2p}(x) \end{pmatrix}^{-1} \begin{pmatrix} T_{h,0}(x) \\ \vdots \\ T_{h,p}(x) \end{pmatrix}. \tag{4.20}$$

Introducing the notation $\mathbf{e}_0 = (1, 0, \ldots, 0)^\top$ for the first unit vector in $\mathbb{R}^{p+1}$, we can write the local linear estimator as

$$\hat m_{1,h}(x) = \mathbf{e}_0^\top\, (\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1}\, \mathbf{X}^\top \mathbf{W} \mathbf{Y}.$$

Note that the Nadaraya-Watson estimator could also be written as

$$\hat m_h(x) = (\mathbf{1}^\top \mathbf{W} \mathbf{1})^{-1}\, \mathbf{1}^\top \mathbf{W} \mathbf{Y}, \quad \mathbf{1} = (1, \ldots, 1)^\top.$$

EXAMPLE 4.3
The local linear estimator for our running example is displayed in Figure 4.3. What can we conclude from comparing this fit with the Nadaraya-Watson fit in Figure 4.1? The most visible difference is that the local linear fit reacts more sensitively at the boundaries of the fit.

Another graphical difference will appear when we compare local linear and Nadaraya-Watson estimates with optimized bandwidths (see Section 4.3). Then we will see that the local linear fit is influenced less by outliers like those which cause the "bump" in the right part of both Engel curves.

Here we can discuss this effect by looking at the asymptotic MSE of the local linear regression estimator:

$$MSE\{\hat m_{1,h}(x)\} \approx \frac{1}{nh}\, \frac{\sigma^2(x)}{f_X(x)}\, \|K\|_2^2 + \frac{h^4}{4}\, \{m''(x)\}^2\, \mu_2(K)^2. \tag{4.21}$$

This formula is dealt with in more detail when we come to multivariate regression, see Section 4.5. The MSE in the local linear case differs from that for the Nadaraya-Watson estimator (4.13) only with regard to the bias. It is easy to see that the bias of the local linear fit is design-independent and vanishes when m is linear. Thus, a local linear fit can improve the function estimation in regions with sparse observations, for instance in the high net-income region in our Engel curve example. Let us also mention that the bias of the local linear estimator has the same form as that of the Gasser-Müller estimator, i.e. the bias in the fixed design case.

The local linear estimator achieves a further improvement in the boundary regions. In the case of Nadaraya-Watson estimates we typically observe problems due to the one-sided neighborhoods at the boundaries: in local constant modeling, more or less the same points are used to estimate the curve near the boundary. Local polynomial regression overcomes this by fitting a higher degree polynomial there.

For estimating regression functions, the order p is usually taken to be p = 1 (local linear) or p = 3 (local cubic regression). As we have seen, the local linear fit performs (asymptotically) better than the Nadaraya-Watson estimator (local constant, p = 0). This holds generally: Odd order fits outperform even order fits. Some additional remarks should be made in summary:

• As with the Nadaraya-Watson estimator, the local polynomial estimator $\hat m_{p,h}(x)$ is a weighted (local) average of the response variables Y_i.
• As for all other kernel methods, the bandwidth h determines the degree of smoothness of $\hat m_{p,h}$. For h → 0 we observe the same result as for the Nadaraya-Watson estimator, namely that at an observation X_i, $\hat m_{p,h}(X_i)$ converges to Y_i. The behavior is different for h → ∞. An infinitely large h makes all weights equal, thus we obtain a parametric p-th order polynomial fit in that case.

A further advantage of the local polynomial approach is that it provides an easy way of estimating derivatives of the function m. The natural approach would be to estimate m by $\hat m_{p,h}$ and then to compute the derivative $\hat m'_{p,h}$. But an alternative and more efficient method is obtained by comparing (4.14) and (4.15) again. From this we get the local polynomial derivative estimator

$$\hat m^{(\nu)}_{p,h}(x) = \nu!\, \hat\beta_\nu(x) \tag{4.22}$$

for the νth derivative of m. Usually the order of the polynomial is p = ν + 1 or p = ν + 3, in analogy to the regression case (recall that the zeroth derivative of a function is always the function itself). Also in analogy, the "odd" order p = ν + 1 outperforms the "even" order p = ν.

EXAMPLE 4.4
To estimate the first (ν = 1) derivative of our Engel curve we could take p = 2 (local quadratic derivative estimator). This was done to obtain Figure 4.4. Note that we have used a rule-of-thumb bandwidth here; see Fan & Müller (1995, p. 92) and Fan & Gijbels (1996, p. 111).
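A local quadratic derivative estimator in the spirit of (4.22) can be sketched as follows; the Gaussian kernel and the test function are illustrative assumptions, not the Engel curve data:

```python
import numpy as np
from math import factorial

def local_poly_deriv(x, X, Y, h, nu=1, p=2):
    """Estimate m^(nu)(x) as nu! * beta_nu(x), formula (4.22), Gaussian kernel."""
    D = np.vander(X - x, N=p + 1, increasing=True)   # local polynomial design matrix
    w = np.exp(-0.5 * ((x - X) / h) ** 2)            # kernel weights K_h(x - X_i)
    beta = np.linalg.solve((D * w[:, None]).T @ D, (D * w[:, None]).T @ Y)
    return factorial(nu) * float(beta[nu])

# check on data from m(x) = x^2, whose first derivative is 2x
X = np.linspace(0.0, 1.0, 60)
Y = X**2
print(local_poly_deriv(0.4, X, Y, h=0.1, nu=1, p=2))   # close to 2 * 0.4 = 0.8
```

With p = 2 the quadratic test function lies in the fitted polynomial space, so the derivative is recovered essentially exactly; on noisy data the choice of h again governs the smoothness of the estimated derivative.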