3.1 Motivation and Derivation

3.1.1 Introduction

Contrary to the treatment of the histogram in statistics textbooks, we have shown that the histogram is more than just a convenient tool for the graphical representation of an empirical frequency distribution. It is a serious and widely used method for estimating an unknown pdf. Yet the histogram has some shortcomings, and this chapter will hopefully persuade you that the method of kernel density estimation is in many respects preferable to the histogram.

Recall that the shape of the histogram is governed by two parameters: the binwidth $ h$ and the origin of the bin grid, $ x_{0}$. We have seen that the averaged shifted histogram is a slick way to free the histogram from the dependence on the choice of an origin. You may recall that we have not had similar success in providing a convincing and applicable rule for choosing the binwidth $ h$. There is no choice-of-an-origin problem in kernel density estimation, but you will soon discover that we will run into the binwidth-selection problem again. Hopefully, the second time around we will be able to give a better answer to this challenge.

Even though the ASH seemed to solve the choice-of-an-origin problem, the histogram retains some undesirable properties.

3.1.2 Derivation

Recall that our derivation of the histogram was based on the intuitively plausible idea that

$\displaystyle \frac{1}{n\cdot\textrm{interval length}}\,\#\{\textrm{observations that fall into a small interval \underline{containing} } x\}$

is a reasonable way of estimating $ f(x)$. The kernel density estimator can be viewed as being based on an idea that is even more plausible and has the added advantage of freeing the estimator from the problem of choosing the origin of a bin grid. The idea goes as follows: a reasonable way to estimate $ f(x)$ is to calculate

$\displaystyle \frac{1}{n\cdot\textrm{interval length}}\,\#\{\textrm{observations that fall into a small interval \underline{around} } x\}$ (3.1)

Note the subtle but important difference from the construction of the histogram: this time the calculation is based on an interval placed around $ x$, not on an interval containing $ x$ which is placed around some bin center $ m_j$ determined by the choice of the origin $ x_{0}$. In another deviation from the construction of the histogram, we will take the interval length to be $ 2h$. That is, we consider intervals of the form $ [x-h,x+h)$. (Recall that in Chapter 2 we had intervals of length $ h$ only.) Hence, we can write

$\displaystyle \widehat f_{h}(x)=\frac{1}{2hn}\,\#\left\{ X_{i} \in [x-h,x+h)\right\}$ (3.2)

This formula can be rewritten if we use a weighting function, called the uniform kernel function

$\displaystyle K(u)=\frac{1}{2}\Ind(\vert u\vert \le 1),$ (3.3)

and let $ u=(x-X_{i})/h$. That is, the uniform kernel function assigns weight 1/2 to each observation $ X_{i}$ whose distance from $ x$ (the point at which we want to estimate the pdf) is not bigger than $ h$. Points farther away from $ x$ get zero weight because the indicator function $ \Ind(\vert u\vert \le 1)$ is by definition equal to 0 whenever the absolute value of the scaled distance $ u=(x-X_{i})/h$ is bigger than 1.

Then we can write (3.2) as

$\displaystyle \widehat f_{h}(x)$ $\displaystyle =$ $\displaystyle \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x-X_{i}}{h}\right)$ (3.4)
  $\displaystyle =$ $\displaystyle \frac{1}{nh}\sum_{i=1}^{n} \frac{1}{2}\Ind\left(\left\vert \frac{x-X_{i}}{h}\right\vert \le 1\right)$ (3.5)

It is especially apparent from (3.5) that all we have done so far is to formalize (3.1).
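
To see the counting idea of (3.1) and (3.5) at work, the following minimal Python sketch evaluates the uniform-kernel estimator (3.2) at a single point. The function name, the simulated sample, and the chosen bandwidth are merely illustrative, not taken from the text.

    import numpy as np

    def fh_uniform(x, data, h):
        # Uniform-kernel density estimate at the point x, following (3.2)/(3.5)
        u = (x - data) / h                       # scaled distances to all observations
        weights = 0.5 * (np.abs(u) <= 1)         # uniform kernel K(u) = 1/2 * 1{|u| <= 1}
        return weights.sum() / (len(data) * h)   # (1/(nh)) * sum of kernel weights

    # Illustrative usage on a simulated sample
    rng = np.random.default_rng(0)
    sample = rng.standard_normal(200)
    print(fh_uniform(0.0, sample, h=0.5))        # roughly the N(0,1) density at 0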

Note from (3.5) that for each observation that falls into the interval $ [x-h,x+h)$ the indicator function takes on the value $ 1$, and we get a contribution to our frequency count. But each contribution is weighted equally (namely by a factor of one), no matter how close the observation $ X_{i}$ is to $ x$ (provided that it is within $ h$ of $ x$). Maybe we should give more weight to contributions from observations very close to $ x$ than to those coming from observations that are more distant.

For instance, consider the formula

$\displaystyle \widehat f_{h}(x)$ $\displaystyle =$ $\displaystyle \frac{1}{2nh}\sum_{i=1}^{n} \frac{3}{2}\left\{1-\left(\frac{x-X_{i}}{h}\right)^{2}\right\} \Ind\left(\left\vert \frac{x-X_{i}}{h}\right\vert \le 1\right)$ (3.6)
  $\displaystyle =$ $\displaystyle \frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-X_{i}}{h}\right),$ (3.7)

where $ K(\bullet)$ is a shorthand for a different weighting function, the Epanechnikov kernel

$\displaystyle K(u)=\frac{3}{4}(1-u^{2})\Ind(\vert u \vert \le 1).$


Table 3.1: Kernel functions
Kernel $ K(u)$
Uniform $ \frac{1}{2}\Ind(\vert u \vert \le 1)$
Triangle $ (1-\vert u \vert)\Ind(\vert u \vert \le 1)$
Epanechnikov $ \frac{3}{4}(1-u^{2})\Ind(\vert u \vert \le 1)$
Quartic (Biweight) $ \frac{15}{16}(1-u^{2})^{2}\Ind(\vert u \vert \le 1)$
Triweight $ \frac{35}{32}(1-u^{2})^{3}\Ind(\vert u \vert \le 1)$
Gaussian $ \frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}u^2)$
Cosine $ \frac{\pi}{4}\cos(\frac{\pi}{2}u)\Ind(\vert u \vert \le 1)$

If you look at (3.6) it will be clear that one can think of the procedure as a slick way of counting the number of observations that fall into the interval around $ x$, where contributions from $ X_{i}$ that are close to $ x$ are weighted more than those that are farther away. The Epanechnikov kernel shares this property with many other kernels, some of which we introduce in Table 3.1. Figure 3.1 displays some of the kernel functions.
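
For illustration, the kernel functions of Table 3.1 are easily written in code. The following Python sketch (all names are illustrative) defines each kernel as a function of the scaled distance $ u$ and checks numerically that each integrates to one.

    import numpy as np

    # The kernel functions of Table 3.1; all except the Gaussian vanish outside |u| <= 1
    kernels = {
        "Uniform":      lambda u: 0.5 * (np.abs(u) <= 1),
        "Triangle":     lambda u: (1 - np.abs(u)) * (np.abs(u) <= 1),
        "Epanechnikov": lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1),
        "Quartic":      lambda u: 15/16 * (1 - u**2)**2 * (np.abs(u) <= 1),
        "Triweight":    lambda u: 35/32 * (1 - u**2)**3 * (np.abs(u) <= 1),
        "Gaussian":     lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi),
        "Cosine":       lambda u: np.pi / 4 * np.cos(np.pi / 2 * u) * (np.abs(u) <= 1),
    }

    # Numerical check: every kernel integrates to one
    u = np.linspace(-10, 10, 200001)
    for name, K in kernels.items():
        print(name, np.trapz(K(u), u))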

Figure 3.1: Some kernel functions: Uniform (top left), Triangle (bottom left), Epanechnikov (top right), Quartic (bottom right). [SPMkernel]

Now we can give the following general form of the kernel density estimator of a probability density $ f$, based on a random sample $ X_{1},X_{2},\ldots,X_{n}$ from $ f$:

$\displaystyle \widehat f_{h}(x)=\frac{1}{n}\sum_{i=1}^{n}K_{h}(x-X_{i}),$ (3.8)

where

$\displaystyle K_{h}(\bullet)=\frac{1}{h}K(\bullet/h).$ (3.9)

$ K(\bullet)$ is some kernel function like those given in Table 3.1 and $ h$ denotes the bandwidth. Note that the term kernel function refers to the weighting function $ K$, whereas the term kernel density estimator refers to formula (3.8).
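
A direct implementation of (3.8) and (3.9) might look like the following Python sketch, where the choice of the Quartic kernel and all names are merely illustrative.

    import numpy as np

    def quartic(u):
        # Quartic (Biweight) kernel from Table 3.1
        return 15/16 * (1 - u**2)**2 * (np.abs(u) <= 1)

    def kde(x_grid, data, h, K=quartic):
        # Kernel density estimator (3.8), with K_h(u) = K(u/h)/h as in (3.9)
        u = (np.asarray(x_grid)[:, None] - np.asarray(data)) / h   # scaled distances
        return np.mean(K(u) / h, axis=1)                           # average of rescaled kernels

    # Illustrative usage on simulated data
    rng = np.random.default_rng(1)
    sample = rng.standard_normal(500)
    grid = np.linspace(-4, 4, 201)
    fhat = kde(grid, sample, h=0.4)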

3.1.3 Varying the Bandwidth

As with the histogram, $ h$ controls the smoothness of the estimate, and the choice of $ h$ is a crucial problem. Figure 3.2 shows density estimates for the stock returns data using the Quartic kernel and different bandwidths.

Figure 3.2: Four kernel density estimates for the stock returns data with bandwidths $ h=0.004$, $ h=0.008$, $ h=0.015$, and $ h=0.05$. [SPMdensity]

Again, it is hard to determine which value of $ h$ provides the optimal degree of smoothness without some formal criterion at hand. This problem will be handled further below, especially in Section 3.3.
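
To reproduce the effect displayed in Figure 3.2 with your own data, one possibility is to evaluate the estimator on a grid for several bandwidths, as in this Python sketch (the simulated data merely stand in for the stock returns; all names and values are illustrative).

    import numpy as np

    def quartic(u):
        return 15/16 * (1 - u**2)**2 * (np.abs(u) <= 1)

    def kde(x_grid, data, h, K=quartic):
        u = (np.asarray(x_grid)[:, None] - np.asarray(data)) / h
        return np.mean(K(u) / h, axis=1)

    rng = np.random.default_rng(2)
    returns = rng.normal(0.0, 0.01, size=1000)      # stand-in for the stock returns data
    grid = np.linspace(returns.min(), returns.max(), 300)

    # Small h gives a wiggly estimate, large h an oversmoothed one
    for h in (0.004, 0.008, 0.015, 0.05):
        fhat = kde(grid, returns, h)
        print(f"h = {h}: maximum of the estimate = {fhat.max():.1f}")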


3.1.4 Varying the Kernel Function

Kernel functions are usually probability density functions, i.e. they integrate to one and $ K(u)\ge 0$ for all $ u$ in the domain of $ K$. An immediate consequence of $ \int K(u)du=1$ is $ \int\widehat f_{h}(x)dx=1$, i.e. the kernel density estimator is a pdf, too. Moreover, $ \widehat f_{h}$ will inherit all the continuity and differentiability properties of $ K$. For instance, if $ K$ is $ \nu$ times continuously differentiable then the same will hold true for $ \widehat f_{h}$. On a more intuitive level this ``inheritance property'' of $ \widehat f_{h}$ is reflected in the smoothness of its graph. Consider Figure 3.3 where, for the same data set (stock returns) and a given value of $ h=0.018$, kernel density estimates have been graphed using different kernel functions.
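
The fact that $ \int\widehat f_{h}(x)dx=1$ whenever $ \int K(u)du=1$ can also be verified numerically, as in this small Python sketch (with illustrative names and simulated data).

    import numpy as np

    def epanechnikov(u):
        return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

    def kde(x_grid, data, h, K=epanechnikov):
        u = (np.asarray(x_grid)[:, None] - np.asarray(data)) / h
        return np.mean(K(u) / h, axis=1)

    rng = np.random.default_rng(3)
    sample = rng.standard_normal(400)

    # Integrate the estimate over a grid that covers the support of every bump
    grid = np.linspace(sample.min() - 1, sample.max() + 1, 2000)
    fhat = kde(grid, sample, h=0.3)
    print(np.trapz(fhat, grid))    # close to 1, since the Epanechnikov kernel integrates to 1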

Figure 3.3: Different kernels for estimation. [SPMdenquauni]

Note how the ragged behavior of the estimate based on the Uniform kernel (right) reflects the box shape of the underlying kernel function. The estimate employing the smooth Quartic kernel (left), on the other hand, gives a smooth and continuous picture.

Figure 3.4: Different continuous kernels for estimation. [SPMdenepatri]

Differences are not confined to the contrast between estimates based on continuous and discontinuous kernel functions. Even among estimates based on continuous kernel functions there are considerable differences in smoothness (for the same value of $ h$), as you can confirm by looking at Figure 3.4. Here, for $ h=0.18$, density estimates are graphed for income data from the Family Expenditure Survey, using the Epanechnikov kernel (left) and the Triweight kernel (right), respectively. There is quite a difference in the smoothness of the graphs of the two estimates.

You might wonder how we will ever solve this dilemma: on the one hand we are trying to find an optimal bandwidth $ h$, yet a given value of $ h$ obviously does not guarantee the same degree of smoothness when used with different kernel functions. We will come back to this problem in Section 3.4.2.

3.1.5 Kernel Density Estimation as a Sum of Bumps

Before we turn to the statistical properties of kernel density estimators, let us present another view on kernel density estimation that provides both further motivation and insight into how the procedure works. Look at Figure 3.5, where the kernel density estimate for an artificial data set is shown along with the individual rescaled kernel functions.

Figure 3.5: Kernel density estimate as a sum of bumps. [SPMkdeconstruct]

What do we mean by a rescaled kernel function? The rescaled kernel function is simply

$\displaystyle \frac{1}{nh}K\left(\frac{x-X_{i}}{h}\right) =\frac{1}{n}K_{h}\left(x-X_{i}\right).$

Note that while the area under the density estimate is equal to one, the area under each rescaled kernel function is equal to (using integration by substitution)

$\displaystyle \int \frac{1}{nh} K\left(\frac{x-X_{i}}{h}\right)\, dx =\frac{1}{nh}\int K(u)\,h\, du =\frac{1}{nh}\, h\int K(u)\, du =\frac{1}{n}.$

Let us rewrite (3.8):

$\displaystyle \widehat f_{h}(x)=\frac{1}{n}\sum_{i=1}^{n}K_{h} \left(x-X_{i}\right) =\sum_{i=1}^{n}\frac{1}{nh}K\left(\frac{x-X_{i}}{h}\right).$ (3.10)

Obviously, $ \widehat f_{h}(x)$ can be written as the sum over the rescaled kernel functions. Figure 3.5 gives a nice graphical representation of this summation over the rescaled kernels. Graphically, the rescaled kernels are the little ``bumps'', where each bump is centered at one of the observations. At a given $ x$ we find $ \widehat f_{h}(x)$ by vertically summing over the bumps. Obviously, different values of $ h$ change the appearance of the bumps and, as a consequence, the appearance of their sum $ \widehat f_{h}$.
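
The sum-of-bumps construction in (3.10) translates directly into code. The following Python sketch (with a small artificial sample; all names are illustrative) computes one rescaled kernel per observation, confirms that each has area $ 1/n$, and sums them vertically to obtain $ \widehat f_{h}$.

    import numpy as np

    def epanechnikov(u):
        return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

    # Small artificial sample, bandwidth, and evaluation grid
    data = np.array([0.2, 0.5, 0.55, 0.9, 1.4])
    h = 0.25
    grid = np.linspace(-0.5, 2.0, 500)
    n = len(data)

    # One rescaled kernel ("bump") per observation: (1/(nh)) K((x - X_i)/h)
    bumps = np.array([epanechnikov((grid - xi) / h) / (n * h) for xi in data])

    print(np.trapz(bumps, grid, axis=1))   # each bump has area 1/n = 0.2
    fhat = bumps.sum(axis=0)               # the estimate is the vertical sum of the bumps, cf. (3.10)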