# 6.1 Kernel Density Estimation

The goal of density estimation is to approximate the probability density function $f$ of a random variable $X$. Assume we have $n$ independent observations $X_1,\ldots,X_n$ from the random variable $X$. The kernel density estimator $\widehat{f}_h(x)$ for the estimation of the density value $f(x)$ at point $x$ is defined as

$$\widehat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right) \qquad (6.1)$$

with $K$ denoting a so-called kernel function and $h$ denoting the bandwidth. A number of possible kernel functions is listed in the following table.

Table 6.1: Kernel functions.

| Kernel | $K(u)$ | XploRe function |
|---|---|---|
| Uniform | $\frac{1}{2}\,I(|u|\le 1)$ | `uni` |
| Triangle | $(1-|u|)\,I(|u|\le 1)$ | `trian` |
| Epanechnikov | $\frac{3}{4}(1-u^2)\,I(|u|\le 1)$ | `epa` |
| Quartic | $\frac{15}{16}(1-u^2)^2\,I(|u|\le 1)$ | `qua` |
| Triweight | $\frac{35}{32}(1-u^2)^3\,I(|u|\le 1)$ | `tri` |
| Gaussian | $\frac{1}{\sqrt{2\pi}}\exp(-u^2/2)$ | `gau` |
| Cosine | $\frac{\pi}{4}\cos\!\left(\frac{\pi}{2}u\right)\,I(|u|\le 1)$ | `cosi` |

In order to illustrate how the kernel density estimator works, let us rewrite equation (6.1) for the Uniform kernel:

$$\widehat{f}_h(x) = \frac{1}{2nh}\sum_{i=1}^{n} I(|x - X_i| \le h) = \frac{1}{2nh}\,\#\{i : X_i \in [x-h,\,x+h]\}$$

It is easy to see that for estimating the density at point $x$, the relative frequency of all observations falling into the interval $[x-h,\,x+h]$ around $x$ is counted. The factor $\frac{1}{2h}$ is needed to ensure that the resulting density estimate has integral $1$.

The Uniform kernel implies that all observations in the neighborhood of $x$ are given the same weight. Since observations close to $x$ carry more information about $f(x)$ than distant ones, it is reasonable to attribute more weight to observations that are near to $x$ and less weight to distant observations. All other kernels in Table 6.1 have this property. Figure 6.1 visualizes the construction of a kernel density estimate of 10 data points (red +) using the Gaussian kernel.
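To make the construction concrete, equation (6.1) can be sketched in a few lines of Python with NumPy. This is an illustration only, not the XploRe implementation; the default kernel below is the Quartic kernel of Table 6.1:

```python
import numpy as np

def kde(x, data, h, kernel=None):
    """Kernel density estimate fh(x) = 1/(n*h) * sum_i K((x - X_i) / h)."""
    if kernel is None:
        # Quartic kernel: 15/16 * (1 - u^2)^2 on [-1, 1], zero outside
        kernel = lambda u: np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)
    data = np.asarray(data, dtype=float)
    # u has one row per evaluation point x, one column per observation X_i
    u = (np.asarray(x, dtype=float)[..., None] - data) / h
    return kernel(u).sum(axis=-1) / (len(data) * h)
```

Evaluating `kde` on a fine grid and summing the values times the grid spacing confirms that the estimate integrates to 1, as required.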

## 6.1.1 Computational Aspects

To understand the structure of the smoother library, we need to explain briefly how kernel estimates can be computed in practice. Consider equation (6.1) again. To obtain the kernel density estimates for all observations, we must compute

$$\widehat{f}_h(X_j) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{X_j - X_i}{h}\right), \quad j = 1,\ldots,n.$$

In the case of an exact computation of these expressions, the kernel function must be evaluated $n \cdot n$ times. The expression $O(n^2)$ indicates a computational cost proportional to $n^2$ (the squared sample size). This increases the computation time considerably if the sample size $n$ is large.

For exploratory purposes, i.e. for graphing the density estimate, it is not necessary to evaluate $\widehat{f}_h$ at all observations $X_j$. Instead, the estimate can be computed, for example, on an equidistant grid $x_1,\ldots,x_g$:

$$\widehat{f}_h(x_j) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x_j - X_i}{h}\right), \quad j = 1,\ldots,g.$$

The evaluation of $\widehat{f}_h(x_j)$, $j = 1,\ldots,g$, then requires only $g \cdot n$ steps.

The number $g \cdot n$ of evaluations of the kernel function is, however, still time consuming if the sample size is large. An alternative and faster way is to approximate the kernel density estimate by the WARPing method (Härdle and Scott; 1992).

The basic idea of WARPing (Weighted Average of Rounded Points) is the "binning" of the data in bins of length $d$ starting at the origin $0$. Each observation $X_i$ is then replaced by the bincenter of the corresponding bin (which means, in fact, that each point is rounded to the precision given by $d$). A usual choice is $d = \{\max(X_i) - \min(X_i)\}/100$, in which case the effective sample size for the computation (the number of nonempty bins) can be at most 101.

For the WARPing method, the kernel function needs to be evaluated only at the points $l\,d/h$, $l = 0, \pm 1, \ldots, \pm s$, where $s$ is the number of bins covered by the support of the (scaled) kernel function. The calculation of the estimated density then reduces to

$$\widehat{f}_h(b_j) = \frac{1}{nh}\sum_{l=-s}^{s} K\!\left(\frac{l\,d}{h}\right) n_{j+l},$$

computed on the grid of bincenters $b_j = (j + \frac{1}{2})\,d$ ($j$ integer), with $n_j$, $n_{j+l}$ denoting the numbers of observations in the $j$-th and $(j+l)$-th bins, respectively. The WARPing approximation requires only $O(s)$ evaluations of the kernel function and $O(n)$ steps in total. This is considerably faster than the exact computation when the sample size is large.
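The binning and weighting steps described above can be sketched in Python with NumPy. This illustrates the WARPing idea only and is not the smoother library's code; the function name `warping_kde` and the convolution-based smoothing are our own choices:

```python
import numpy as np

def warping_kde(data, h, d, kernel):
    """Approximate KDE via WARPing: bin the data (bin width d, origin 0)
    and evaluate the kernel only at the points l*d/h."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    idx = np.floor(data / d).astype(int)       # bin index of each observation
    lo, hi = idx.min(), idx.max()
    counts = np.bincount(idx - lo)             # n_j: observations per bin
    s = int(np.ceil(h / d))                    # bins covering the kernel support
    l = np.arange(-s, s + 1)
    w = kernel(l * d / h)                      # the only kernel evaluations needed
    # bincenters of all bins where the estimate can be nonzero
    centers = (np.arange(lo - s, hi + s + 1) + 0.5) * d
    # weighted average of the rounded points: a discrete convolution
    fh = np.convolve(counts, w, mode="full") / (n * h)
    return centers, fh
```

Binning costs $O(n)$ and the smoothing step is a short convolution of the bin counts with the precomputed kernel weights, which is where the speedup over the exact $g \cdot n$ computation comes from.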

To have a choice between both the exact and the WARPing computation of kernel estimates and related functions, the smoother library offers most functionality in two functions:

| Functionality | Exact | WARPing |
|---|---|---|
| density estimate | `denxest` | `denest` |
| density confidence intervals | `denxci` | `denci` |
| density confidence bands | `denxcb` | `dencb` |
| density bandwidth selection | - | `denbwsel` |

## 6.1.2 Computing Kernel Density Estimates

- `fh = denest(x {,h {,K} {,d}})` computes the kernel density estimate on a grid using the WARPing method
- `fh = denxest(x {,h {,K} {,v}})` computes the kernel density estimate for all observations or on a grid v by exact computation

The easiest way to compute kernel density estimates is the function denest . This function is based on the WARPing method which makes the computation very fast. Let us apply the denest routine to the nicfoo data which contain observations on household netincome in the first column (see Data Sets (B.1)). denest requires the data as a minimum input. Optionally, the bandwidth h, the kernel function K and the discretization parameter d which is used for the WARPing can be specified. For example,

  netinc=read("nicfoo")
netinc=netinc[,1]
h=(max(netinc)-min(netinc))*0.15
fh=denest(netinc,h)


calculates the density estimate using 15% of the range of the data as bandwidth h. If h is not given, the rule of thumb bandwidth (6.6) is used, see Subsection 6.1.4 for more information. Note that we did not specify the kernel function in the above code. Therefore the default Quartic kernel "qua" and the default binsize d are used.

To display the estimated function you may use these lines of code:

  fh=setmask(fh,"line")
plot(fh)
tl="Density Estimate"
xl="netincome"
yl="density fh"
setgopt(plotdisplay,1,1,"title",tl,"xlabel",xl,"ylabel",yl)

The resulting density is plotted in Figure 6.2.

As an alternative to denest , the function denxest can be used for obtaining a density estimate by exact computation. As with denest the data are the minimum input. Optionally, the bandwidth h, the kernel function K and a grid v for the computation can be specified. The following code computes the density estimate for all observations:

  h=(max(netinc)-min(netinc))*0.15
fh=denxest(netinc,h)

To compute the density on a grid of 30 equidistant points within the range of x, the previous example can be modified as follows:
  h=(max(netinc)-min(netinc))*0.15
v=grid( min(netinc), (max(netinc)-min(netinc))/29, 30)
fh=denxest(netinc,h,"qua",v)


Finally, let us calculate the density estimate for a family of different bandwidths in order to see the dependence on the bandwidth parameter. We modify our first code in the following way:

  hfac=(max(netinc)-min(netinc))
fh1=setmask(denest(netinc,0.05*hfac),"line")
fh2=setmask(denest(netinc,0.10*hfac),"line")
fh3=setmask(denest(netinc,0.20*hfac),"line")
fh4=setmask(denest(netinc,0.25*hfac),"line")
fh=setmask(denest(netinc,0.15*hfac),"line")
plot(fh1,fh2,fh3,fh4,fh)
tl="Family of Density Estimates"
xl="netincome"
yl="density fh"
setgopt(plotdisplay,1,1,"title",tl,"xlabel",xl,"ylabel",yl)


As you can see from the resulting graph (Figure 6.3), smaller bandwidths reveal more structure of the data, whereas larger bandwidths give smoother functions.

## 6.1.3 Kernel Choice

- `h2 = canker(h1, K1, K2)` does the canonical bandwidth transformation of a bandwidth value h1 for kernel K1 into an equivalent bandwidth h2 for kernel K2

Kernel density estimation requires two parameters: the kernel function $K$ and the bandwidth $h$. In practice, the choice of the kernel is not nearly as important as the choice of the bandwidth. The theoretical background of this observation is that kernel functions can be rescaled such that the difference between two kernel density estimates using two different kernels is almost negligible (Marron and Nolan; 1988).

Instead of rescaling kernel functions, we can also rescale the bandwidths. This rescaling works as follows: suppose we have estimated an unknown density using some kernel $K_1$ and bandwidth $h_1$, and we are considering estimating with a different kernel, say $K_2$. Then the appropriate bandwidth to use with $K_2$ is

$$h_2 = h_1\,\frac{\delta_0^{K_2}}{\delta_0^{K_1}} \qquad (6.2)$$

where the scale factors $\delta_0^{K}$ (the so-called canonical bandwidths) can be found in Table 6.2 for some selected kernel functions.

Table 6.2: Canonical bandwidths $\delta_0^{K}$ for different kernels.

| Kernel | XploRe function | $\delta_0^{K}$ |
|---|---|---|
| Uniform | `uni` | 1.3510 |
| Epanechnikov | `epa` | 1.7188 |
| Quartic | `qua` | 2.0362 |
| Gaussian | `gau` | 0.7764 |

XploRe offers the function canker to do this recomputation of bandwidths:

  hqua=canker(1.0,"gau","qua")
hqua

computes the bandwidth
  Contents of hqua
[1,]   2.6226

that can be used with the Quartic kernel "qua" in order to find the kernel density estimate that is equivalent to using the Gaussian kernel "gau" and bandwidth 1.0.
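The same recomputation is easy to reproduce outside XploRe. The following Python sketch hard-codes the canonical bandwidths from Table 6.2; the function name `canker` mirrors the XploRe quantlet, but this is an illustrative reimplementation, not its code:

```python
# Canonical bandwidths delta_0 from Table 6.2
DELTA0 = {"uni": 1.3510, "epa": 1.7188, "qua": 2.0362, "gau": 0.7764}

def canker(h1, k1, k2):
    """Convert bandwidth h1 for kernel k1 into the equivalent
    bandwidth for kernel k2, following equation (6.2)."""
    return h1 * DELTA0[k2] / DELTA0[k1]
```

For example, `canker(1.0, "gau", "qua")` reproduces the value 2.6226 computed above.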

## 6.1.4 Bandwidth Selection

- `hrot = denrot(x {,K {,opt}})` computes a rule-of-thumb bandwidth for density estimation
- `{hcrit,crit} = denbwsel(x {,h {,K {,d}}})` starts an interactive tool for kernel density bandwidth selection using the WARPing method

Now we can consider the problem of bandwidth selection. The optimal bandwidth for a kernel density estimate is typically calculated on the basis of an estimate for the integrated squared error

$$ISE(h) = \int \left\{\widehat{f}_h(x) - f(x)\right\}^2 dx$$

or the mean integrated squared error

$$MISE(h) = E\left\{ISE(h)\right\}.$$

Both criteria should be minimized to obtain a good approximation of the unknown density $f$. For $ISE$, it is easy to see that

$$ISE(h) = \int \widehat{f}_h^2(x)\,dx - 2\int \widehat{f}_h(x)\,f(x)\,dx + \int f^2(x)\,dx. \qquad (6.3)$$

Since the third term does not depend on the bandwidth $h$, the classical least squares cross-validation criterion ($LSCV$) is an estimate for the first and second terms of (6.3). Cross-validation variants that are based on $MISE$ are the biased cross-validation criterion ($BCV$), the smoothed cross-validation criterion ($SCV$) and the Jones, Marron and Park cross-validation criterion ($CV_{JMP}$).
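For a Gaussian kernel, the least squares cross-validation criterion can be written down compactly, because the integral of $\widehat{f}_h^2$ has a closed form (the convolution of two Gaussian kernels with bandwidth $h$ is a Gaussian kernel with bandwidth $h\sqrt{2}$). The following Python sketch is an illustration of the idea, not XploRe's implementation; it estimates the first two terms of (6.3):

```python
import numpy as np

def lscv(data, h):
    """Least squares cross-validation criterion for a Gaussian kernel:
    integral of fh^2 minus twice the average leave-one-out estimate,
    i.e. an estimate of ISE(h) up to the h-independent integral of f^2."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    phi = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    diff = x[:, None] - x
    # closed form for the integral of fh^2 (Gaussian convolved with Gaussian)
    term1 = phi(diff / (h * np.sqrt(2.0))).sum() / (n**2 * h * np.sqrt(2.0))
    # leave-one-out density estimates at the observations themselves
    loo = (phi(diff / h).sum(axis=1) - phi(0.0)) / ((n - 1) * h)
    return term1 - 2.0 * loo.mean()
```

Minimizing `lscv` over a grid of bandwidths gives the cross-validated bandwidth; very small and very large bandwidths both yield a worse (larger) criterion value.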

Plug-in approaches attempt to estimate $MISE$. Under the asymptotic conditions $h \to 0$ and $nh \to \infty$, it is possible to approximate $MISE$ by

$$AMISE(h) = \frac{1}{nh}\,\|K\|_2^2 + \frac{h^4}{4}\left\{\mu_2(K)\right\}^2 \|f''\|_2^2 \qquad (6.4)$$

The terms $\|K\|_2^2 = \int K^2(u)\,du$ and $\mu_2(K) = \int u^2 K(u)\,du$ are constants depending on the kernel function $K$. Analogously, $\|f''\|_2^2 = \int \{f''(x)\}^2\,dx$ denotes a constant depending on the second derivative of the unknown density $f$. Optimizing (6.4) with respect to $h$ yields the following optimal bandwidth

$$h_{opt} = \left(\frac{\|K\|_2^2}{\|f''\|_2^2\,\{\mu_2(K)\}^2\,n}\right)^{1/5} \qquad (6.5)$$

$\|f''\|_2^2$ is the only unknown term in (6.5). The idea behind plug-in estimates is to replace $\|f''\|_2^2$ by an estimate. Silverman's rule of thumb computes $\|f''\|_2^2$ as if $f$ were the density of the normal distribution $N(\mu,\sigma^2)$. This gives

$$\widehat{h}_{ROT} = 1.06\,\widehat{\sigma}\,n^{-1/5}$$

if the kernel function $K$ is also assumed to be the Gaussian kernel.

If a different kernel function is used, the bandwidth can be rescaled by using equation (6.2). For example, the rule-of-thumb bandwidth for the Quartic kernel computes as

$$\widehat{h}_{ROT,qua} = \frac{\delta_0^{qua}}{\delta_0^{gau}}\,\widehat{h}_{ROT} = \frac{2.0362}{0.7764}\cdot 1.06\,\widehat{\sigma}\,n^{-1/5} \approx 2.78\,\widehat{\sigma}\,n^{-1/5}. \qquad (6.6)$$

This rule of thumb is implemented in denrot .
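The rule of thumb is simple enough to restate as code. The Python sketch below computes Silverman's Gaussian-kernel value $1.06\,\widehat{\sigma}\,n^{-1/5}$ and rescales it with the canonical bandwidths of Table 6.2. Note that the actual denrot quantlet may use a more robust scale estimate, so this only approximates its behavior:

```python
import numpy as np

def denrot(data, kernel="qua"):
    """Silverman's rule-of-thumb bandwidth: the Gaussian-kernel value
    1.06 * sigma_hat * n^(-1/5), rescaled for other kernels via (6.2)."""
    delta0 = {"uni": 1.3510, "epa": 1.7188, "qua": 2.0362, "gau": 0.7764}
    x = np.asarray(data, dtype=float)
    h_gau = 1.06 * x.std(ddof=1) * len(x) ** (-1 / 5)
    return h_gau * delta0[kernel] / delta0["gau"]
```

By construction, the Quartic-kernel bandwidth is always the Gaussian-kernel bandwidth times 2.0362/0.7764, the same factor used by canker.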

Of course, $\widehat{h}_{ROT}$ is a very crude estimate, in particular if the true density $f$ is far from looking like a normal density. Refined plug-in methods such as the Park and Marron plug-in and the Sheather and Jones plug-in use nonparametric estimates for $\|f''\|_2^2$.

The quantlet denbwsel needs the data vector as input:

  netinc=read("nicfoo")
netinc=netinc[,1]
tmp=denbwsel(netinc)


This opens a selection box which offers you the choice between the different bandwidth selectors:

Choose for instance the Least Squares CV item to use the least squares cross-validation criterion. A graphical display appears which shows the criterion in the upper left, the chosen optimal bandwidth in the upper right, the resulting kernel density estimate in the lower left, and some information about the search grid and the kernel in the lower right. This graphical display is shown in Figure 6.4.

You can play around now with the different bandwidth selectors. Try out the Park and Marron plug-in for example. Here, in contrast to the cross-validation methods, the aim is to find the root of the bandwidth selection criterion. The graphical display for the Park and Marron selector is shown in Figure 6.5.

During the use of denbwsel , all currently obtained results are displayed in the output window. Note that for some selectors, it will be necessary to increase the search grid for the bandwidth $h$. This can be done by using the menu item (Change h range or K).

It is difficult to give a general recommendation as to which bandwidth selector is the best among all. As mentioned above, Silverman's rule of thumb works well when the true density is approximately normal. It may oversmooth for bimodal or multimodal densities. A nice feature of the cross-validation methods is that the selected bandwidth automatically adapts to the smoothness of $f$. From a theoretical point of view, some of the refined selectors attain the optimal asymptotic rate of convergence, but they either need a very large sample size or pay with a large variance. As a consequence, we recommend determining bandwidths by different selection methods and comparing the resulting density estimates.

For detailed overviews on plug-in and cross-validation methods of bandwidth selection, see Härdle (1991), Park and Turlach (1992), and Turlach (1993). Practical experiences are available from the simulation studies in Marron (1989), Jones et al. (1992), Park and Turlach (1992), and Cao et al. (1992).

## 6.1.5 Confidence Intervals and Bands

- `{fh, fhl, fhu} = denci(x {,h {,alpha {,K} {,d}}})` computes the kernel density estimate and pointwise confidence intervals on a grid using the WARPing method
- `{fh, fhl, fhu} = dencb(x {,h {,alpha {,K} {,d}}})` computes the kernel density estimate and uniform confidence bands on a grid using the WARPing method
- `{fh, fhl, fhu} = denxci(x {,h {,alpha {,K} {,v}}})` computes the kernel density estimate and pointwise confidence intervals for all observations or on a grid v by exact computation
- `{fh, fhl, fhu} = denxcb(x {,h {,alpha {,K} {,v}}})` computes the kernel density estimate and uniform confidence bands for all observations or on a grid v by exact computation

To derive confidence intervals or confidence bands for $\widehat{f}_h(x)$, one has to know its sampling distribution. The following result about the asymptotic distribution of $\widehat{f}_h(x)$ is known from the literature (Härdle; 1991, p. 62). Suppose that $f''$ exists and $h = c\,n^{-1/5}$. Then

$$n^{2/5}\left\{\widehat{f}_h(x) - f(x)\right\} \xrightarrow{\;L\;} N\!\left(\frac{c^2}{2}\,f''(x)\,\mu_2(K),\ \frac{1}{c}\,f(x)\,\|K\|_2^2\right), \qquad (6.7)$$

which can be recalculated to

$$\widehat{f}_h(x) \approx N\!\left(f(x) + \frac{h^2}{2}\,f''(x)\,\mu_2(K),\ \frac{1}{nh}\,f(x)\,\|K\|_2^2\right). \qquad (6.8)$$

Assuming that the bias term $\frac{h^2}{2} f''(x)\,\mu_2(K)$ is negligible with respect to the standard deviation, the resulting confidence interval for $f(x)$ can be written as

$$\left[\widehat{f}_h(x) - z_{1-\frac{\alpha}{2}}\sqrt{\frac{\widehat{f}_h(x)\,\|K\|_2^2}{nh}},\ \ \widehat{f}_h(x) + z_{1-\frac{\alpha}{2}}\sqrt{\frac{\widehat{f}_h(x)\,\|K\|_2^2}{nh}}\right] \qquad (6.9)$$

The value $z_{1-\frac{\alpha}{2}}$ is the $(1-\frac{\alpha}{2})$-quantile of the standard normal distribution.
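The pointwise interval (6.9) is straightforward to compute once the density estimate is available. Below is a self-contained Python sketch (not the denci/denxci code); it uses $\|K\|_2^2 = 1/(2\sqrt{\pi})$ for the Gaussian kernel and a simple bisection for the normal quantile to avoid external dependencies:

```python
import numpy as np
from math import erf, sqrt

def norm_ppf(p, lo=-10.0, hi=10.0):
    """Standard normal quantile by bisection on the closed-form CDF."""
    cdf = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

def den_ci(fh, n, h, alpha=0.05, K2=1.0 / (2.0 * np.sqrt(np.pi))):
    """Pointwise interval: fh +- z_{1-alpha/2} * sqrt(fh * ||K||_2^2 / (n*h)).
    K2 defaults to ||K||_2^2 for the Gaussian kernel, 1/(2*sqrt(pi))."""
    z = norm_ppf(1.0 - alpha / 2.0)
    half = z * np.sqrt(np.asarray(fh, dtype=float) * K2 / (n * h))
    return fh - half, fh + half
```

The interval is symmetric around $\widehat{f}_h(x)$ and shrinks at the rate $(nh)^{-1/2}$, in line with the variance term of (6.8).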

The functions denci and denxci compute pointwise confidence intervals on a grid of points ( denci using WARPing) or at all observations ( denxci using exact computation). The following quantlet code computes the confidence intervals for bandwidth 0.2 (Gaussian kernel) and significance level $\alpha = 0.05$. For graphical exploration, the pointwise intervals are drawn as curves and displayed in a plot. The resulting graphics is shown in Figure 6.6.

  {fh,fl,fu}=denci(netinc,0.2,0.05,"gau")
plot(fl,fu,fh)
setgopt(plotdisplay,1,1,"title","Density Confidence Intervals")


Uniform confidence bands for $f$ can only be derived under some rather restrictive assumptions, see Härdle (1991, p. 65). Suppose that $f$ is a density on $[0,1]$ and $h = n^{-\delta}$, $\delta \in (\frac{1}{5},\frac{1}{2})$. Then the following formula holds under some regularity conditions for all $x \in [0,1]$ (Bickel and Rosenblatt; 1973):

$$P\!\left(f(x) \in \left[\widehat{f}_h(x) \pm \left\{\frac{\widehat{f}_h(x)\,\|K\|_2^2}{nh}\right\}^{1/2}\left(\frac{z}{\sqrt{2\delta\log n}} + d_n\right)\right] \text{ for all } x\right) \longrightarrow 1-\alpha, \qquad (6.10)$$

where

$$d_n = \sqrt{2\delta\log n} + \frac{1}{\sqrt{2\delta\log n}}\,\log\!\left(\frac{1}{2\pi}\,\frac{\|K'\|_2}{\|K\|_2}\right)$$

and $z$ is chosen such that

$$\exp\{-2\exp(-z)\} = 1-\alpha.$$
In practice, the data are transformed to the interval $[0,1]$, then the confidence bands are computed and rescaled to the original scale of the data. The functions dencb and denxcb can hence be directly applied to the original data $X_1,\ldots,X_n$. The following quantlet code computes the confidence bands for the netinc data and displays them in a plot:
  {fh,fl,fu}=dencb(netinc,0.2,0.05,"gau")