1.3 Kernel Densities

The major difficulties of histogram estimation may be summarized in four critiques:

$\ast$
determination of the binwidth $h$, which controls the shape of the histogram,
$\ast$
choice of the bin origin $x_0$, which also influences to some extent the shape,
$\ast$
loss of information since observations are replaced by the central point of the interval in which they fall,
$\ast$
the underlying density function is often assumed to be smooth, but the histogram is not smooth.

Rosenblatt (1956), Whittle (1958), and Parzen (1962) developed an approach which avoids the last three difficulties. First, a smooth kernel function rather than a box is used as the basic building block. Second, the smooth function is centered directly over each observation. Let us study this refinement by supposing that $x$ is the center value of a bin. The histogram can in fact be rewritten as

\begin{displaymath}
\widehat f_h(x) = n^{-1}h^{-1}\sum ^n_{i=1}{\boldsymbol{I}}(\vert x-x_i\vert\le \frac{h}{2}).
\end{displaymath} (1.8)

If we define $K(u)={\boldsymbol{I}}(\vert u\vert\le \frac{1 }{ 2})$, then (1.8) changes to
\begin{displaymath}
\widehat f_h(x) = n^{-1}h^{-1}\sum ^n_{i=1}K\left (\frac{x-x_i }{h
}\right ).
\end{displaymath} (1.9)

This is the general form of the kernel estimator. Allowing smoother kernel functions like the quartic kernel,

\begin{displaymath}K(u) = \frac{15 }{16 }(1-u^2)^2\ {\boldsymbol{I}}(\vert u\vert\le 1),\end{displaymath}

and evaluating the estimate at points $x$ other than the bin centers gives us the kernel density estimator. Kernel estimators can also be derived via weighted averaging of rounded points (WARPing) or by averaging histograms with different origins; see Scott (1985). Table 1.3 introduces some commonly used kernels.


Table 1.3: Kernel functions.
\begin{table}\begin{displaymath}
\begin{array}{ll}
\hline\hline
K(u)=\frac{1}{2}\,{\boldsymbol{I}}(\vert u\vert\le 1) & \quad \textrm{Uniform}\\
K(u)=(1-\vert u\vert)\,{\boldsymbol{I}}(\vert u\vert\le 1) & \quad \textrm{Triangle}\\
K(u)=\frac{3}{4}(1-u^{2})\,{\boldsymbol{I}}(\vert u\vert\le 1) & \quad \textrm{Epanechnikov}\\
K(u)=\frac{15}{16}(1-u^{2})^{2}\,{\boldsymbol{I}}(\vert u\vert\le 1) & \quad \textrm{Quartic (Biweight)}\\
K(u)=\frac{1}{\sqrt{2\pi}}\exp(-u^{2}/2)=\varphi(u) & \quad \textrm{Gaussian}\\
\hline\hline
\end{array}\end{displaymath}\end{table}
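
As an illustration, the following Python sketch implements (1.9) directly with the quartic kernel from Table 1.3. It is a minimal re-implementation, not the XploRe quantlet used for the figures below; the simulated data and the evaluation grid are placeholders.

\begin{verbatim}
import numpy as np

def quartic(u):
    # Quartic (biweight) kernel: K(u) = 15/16 (1 - u^2)^2 for |u| <= 1
    return np.where(np.abs(u) <= 1, 15.0 / 16.0 * (1.0 - u**2) ** 2, 0.0)

def kde(x_grid, data, h, kernel=quartic):
    # Kernel density estimator (1.9): f_h(x) = (n h)^{-1} sum_i K((x - x_i)/h)
    n = len(data)
    u = (x_grid[:, None] - data[None, :]) / h   # shape (grid points, n)
    return kernel(u).sum(axis=1) / (n * h)

# Placeholder data standing in for, e.g., the bank notes diagonals
rng = np.random.default_rng(0)
data = rng.normal(loc=139.5, scale=0.4, size=100)
x_grid = np.linspace(138.0, 141.0, 200)
f_hat = kde(x_grid, data, h=0.3)
\end{verbatim}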


Figure 1.9: Densities of the diagonals of genuine and counterfeit bank notes. Automatic density estimates. MVAdenbank.xpl
\includegraphics[width=1\defpicwidth]{denbank.ps}

Figure 1.10: Contours of the density of $X_{4}$ and $X_{6}$ of genuine and counterfeit bank notes. MVAcontbank2.xpl
\includegraphics[width=1\defpicwidth]{contbank2.ps}

Different kernels generate different shapes of the estimated density. The most important parameter, however, is the so-called bandwidth $h$, which can be optimized, for example, by cross-validation; see Härdle (1991) for details. The cross-validation method minimizes the integrated squared error. This measure of discrepancy is based on the squared differences $\left\{\hat{f}_h(x) -f(x)\right\}^2$. Averaging these squared deviations over a grid of points $\{x_l\}_{l=1}^L$ leads to

\begin{displaymath}
L^{-1}\sum_{l=1}^L \left\{\hat{f}_h(x_l) -f(x_l)\right\}^2.
\end{displaymath}

Asymptotically, as the grid spacing tends to zero, we obtain the integrated squared error:

\begin{displaymath}\int\left\{\hat{f}_h(x) -f(x)\right\}^2 dx.\end{displaymath}

In practice, the method consists of selecting the bandwidth that minimizes the cross-validation function

\begin{displaymath}\int \hat{f}_h^2(x)\,dx - \frac{2}{n}\sum_{i=1}^n \hat{f}_{h,i}(x_i),\end{displaymath}

where $\hat{f}_{h,i}$ is the density estimate obtained using all data points except the $i$-th observation. Both terms in the above function involve double sums, so the computation may be slow. There are many other bandwidth selection methods. Probably the fastest one is to refer to some reasonable reference distribution. The idea of using the Normal distribution as a reference, for example, goes back to Silverman (1986). The resulting choice of $h$ is called the rule of thumb.
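
A minimal sketch of this cross-validation search is given below, assuming a Gaussian kernel; the first term is approximated by a Riemann sum on a grid, and the leave-one-out estimates $\hat{f}_{h,i}(x_i)$ are obtained by zeroing the diagonal of the kernel matrix. The function names and the candidate bandwidth grid are illustrative choices, not part of the original text.

\begin{verbatim}
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def cv_score(h, data, grid_size=400):
    # Cross-validation criterion: int f_h^2 - (2/n) sum_i f_{h,i}(x_i)
    n = len(data)
    # First term: Riemann sum approximation of the integral of f_h^2
    x = np.linspace(data.min() - 3 * h, data.max() + 3 * h, grid_size)
    f = gauss((x[:, None] - data[None, :]) / h).sum(axis=1) / (n * h)
    int_f2 = (f**2).sum() * (x[1] - x[0])
    # Second term: leave-one-out estimates f_{h,i}(x_i)
    K = gauss((data[:, None] - data[None, :]) / h)
    np.fill_diagonal(K, 0.0)            # drop the i-th observation itself
    loo = K.sum(axis=1) / ((n - 1) * h)
    return int_f2 - 2.0 * loo.mean()

# Select h by minimizing the criterion over a grid of candidate bandwidths
data = np.random.default_rng(1).normal(size=200)
hs = np.linspace(0.05, 1.0, 40)
h_cv = hs[np.argmin([cv_score(h, data) for h in hs])]
\end{verbatim}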

For the Gaussian kernel from Table 1.3 and a Normal reference distribution, the rule of thumb is to choose

\begin{displaymath}
h_G=1.06 \, \widehat\sigma \, n^{-1/5}
\end{displaymath} (1.10)

where $\widehat \sigma = \sqrt{n^{-1}\sum_{i=1}^n (x_{i}-\overline x)^2}$ denotes the sample standard deviation. This choice of $h_G$ minimizes the integrated squared distance between the estimator and the true density when the underlying density is in fact normal. For the quartic kernel, we need to rescale (1.10). The modified rule of thumb is:
\begin{displaymath}
h_Q =2.62\cdot h_G.
\end{displaymath} (1.11)
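
In code, both rules of thumb reduce to a couple of lines. The following sketch is illustrative and expects a one-dimensional NumPy array of observations.

\begin{verbatim}
import numpy as np

def rule_of_thumb(data):
    # data: one-dimensional NumPy array of observations
    n = len(data)
    sigma_hat = data.std()                   # 1/n convention, as in the text
    h_g = 1.06 * sigma_hat * n ** (-1 / 5)   # Gaussian kernel, (1.10)
    h_q = 2.62 * h_g                         # quartic kernel, (1.11)
    return h_g, h_q
\end{verbatim}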

Figure 1.9 shows the automatic density estimates for the diagonals of the counterfeit and genuine bank notes. The density on the left corresponds to the diagonal of the counterfeit data. The separation is clearly visible, but there is also an overlap. The problem of distinguishing between the counterfeit and genuine bank notes is not solved by just looking at the diagonals of the notes! The question arises whether a better separation could be achieved using not only the diagonals but one or two more variables of the data set. The estimation of higher dimensional densities is analogous to the one-dimensional case. We show a two dimensional density estimate for $X_{4}$ and $X_{6}$ in Figure 1.10. The contour lines indicate the height of the density. One sees two separate distributions in this higher dimensional space, but they still overlap to some extent.
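
One standard way to carry (1.9) over to two dimensions is a product kernel: one-dimensional kernels are multiplied coordinate-wise, with a separate bandwidth per coordinate. The sketch below assumes Gaussian kernels and is an illustration only, not the code behind Figure 1.10 (the figures were produced with the XploRe quantlets named in the captions).

\begin{verbatim}
import numpy as np

def kde_2d(grid_x, grid_y, data, h1, h2):
    # Bivariate product-kernel estimate on the grid_x x grid_y mesh.
    # data: (n, 2) array; h1, h2: bandwidths for the two coordinates.
    def phi(u):
        return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

    n = data.shape[0]
    kx = phi((grid_x[:, None] - data[None, :, 0]) / h1)  # (len(grid_x), n)
    ky = phi((grid_y[:, None] - data[None, :, 1]) / h2)  # (len(grid_y), n)
    # Entry (a, b) sums K((x_a - x_i1)/h1) K((y_b - x_i2)/h2) over i
    return kx @ ky.T / (n * h1 * h2)
\end{verbatim}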

Figure 1.11: Contours of the density of $X_{4}, X_{5}, X_{6}$ of genuine and counterfeit bank notes. MVAcontbank3.xpl
\includegraphics[width=1.1\defpicwidth]{contbank3.ps}

We can add one more dimension and give a graphical representation of a three dimensional density estimate, or more precisely an estimate of the joint distribution of $X_{4}$, $X_{5}$ and $X_{6}$. Figure 1.11 shows the contour areas at 3 different levels of the density: $0.2$ (light grey), $0.4$ (grey), and $0.6$ (black) of this three dimensional density estimate. One can clearly recognize two ``ellipsoids'' (at each level), but as before, they overlap. In Chapter 12 we will learn how to separate the two ellipsoids and how to develop a discrimination rule to distinguish between these data points.

Summary
$\ast$
Kernel density estimates are obtained by centering a smooth kernel function over each observation.
$\ast$
The bandwidth $h$ determines the degree of smoothness of the estimate $\widehat f$.
$\ast$
Kernel densities are smooth functions and they can graphically represent distributions (up to 3 dimensions).
$\ast$
A simple (but not necessarily optimal) way to find a good bandwidth is to compute the rule of thumb bandwidth $h_{G}=1.06 \widehat\sigma\, n^{-1/5}$. This bandwidth is to be used only in combination with a Gaussian kernel $\varphi$.
$\ast$
Kernel density estimates are a good descriptive tool for seeing modes, location, skewness, tails, asymmetry, etc.