6.1 Kernel Density Estimation

The goal of density estimation is to approximate the probability density function $ f(\bullet)$ of a random variable $ X$. Assume we have $ n$ independent observations $ x_1,\ldots,x_n$ from the random variable $ X$. The kernel density estimator $ \widehat{f}_{h}(x)$ for the estimation of the density value $ f(x)$ at point $ x$ is defined as

$\displaystyle \widehat{f}_{h}(x)= \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x_i-x}{h}\right),\ $ (6.1)

with $ K(\bullet)$ denoting a so-called kernel function and $ h$ denoting the bandwidth. Several common kernel functions are listed in the following table.

Table 6.1: Kernel functions.

Kernel         $ K(u)$                                                                    XploRe function
Uniform        $ \frac{1}{2} {\boldsymbol{I}}(\vert u \vert \le 1)$                       uni
Triangle       $ (1-\vert u \vert) {\boldsymbol{I}}(\vert u \vert \le 1)$                 trian
Epanechnikov   $ \frac{3}{4}(1-u^{2}) {\boldsymbol{I}}(\vert u \vert \le 1)$              epa
Quartic        $ \frac{15}{16}(1-u^{2})^{2} {\boldsymbol{I}}(\vert u \vert \le 1)$        qua
Triweight      $ \frac{35}{32}(1-u^{2})^{3} {\boldsymbol{I}}(\vert u \vert \le 1)$        tri
Gaussian       $ \frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}u^2)$                             gau
Cosinus        $ \frac{\pi}{4}\cos(\frac{\pi}{2}u) {\boldsymbol{I}}(\vert u \vert \le 1)$ cosi

In order to illustrate how the kernel density estimator works, let us rewrite equation (6.1) for the Uniform kernel:

$\displaystyle \widehat{f}_{h}(x) = \frac{1}{2nh}\,\sum_{i=1}^{n} {\boldsymbol{I}}\left(\vert x_i-x\vert\le h\right) = \frac{1}{2nh}\,\{\textrm{number of observations } x_i \textrm{ that fall in } [x-h,x+h]\}.$

Hence, to estimate the density at a point $ x$, we count the observations $ x_i$ falling into an interval around $ x$. The factor $ 1/(2nh)$ ensures that the resulting density estimate integrates to one, $ \int \widehat{f}_{h}(x)\,dx=1$.
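To make the counting interpretation concrete, here is a minimal Python sketch (not XploRe code); the function names and the toy data are made up for illustration:

```python
import numpy as np

# Uniform-kernel density estimate written as a count, as in the display above
def kde_uniform(x, data, h):
    n = len(data)
    count = np.sum(np.abs(data - x) <= h)  # observations in [x - h, x + h]
    return count / (2.0 * n * h)

# the same estimate written as the general kernel sum (6.1)
def kde(x, data, h, K):
    return np.mean(K((data - x) / h)) / h

uniform = lambda u: 0.5 * (np.abs(u) <= 1)

data = np.array([0.1, 0.3, 0.35, 0.8])  # toy sample
assert np.isclose(kde_uniform(0.3, data, 0.2), kde(0.3, data, 0.2, uniform))
```

Both forms give the same value; the counting form simply exploits the special shape of the Uniform kernel.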

The Uniform kernel assigns the same weight to all observations $ x_i$ in the neighborhood of $ x$, although observations close to $ x$ carry more information about $ f(x)$ than distant ones. It is therefore reasonable to give more weight to observations near $ x$ and less weight to distant observations. All other kernels in Table 6.1 have this property. Figure 6.1 visualizes the construction of a kernel density estimate from 10 data points (red +) using the Gaussian kernel.

Figure 6.1: Kernel density estimate as a sum of bumps. XLGsmoo01.xpl

6.1.1 Computational Aspects

To understand the structure of the smoother library, we briefly explain how kernel estimates are computed in practice. Consider equation (6.1) again. To obtain the kernel density estimates at all observations $ x_1,\ldots,x_n$, we must compute

$\displaystyle \widehat{f}_{h}(x_j)= \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x_i-x_j}{h}\right)\,, \quad j=1,\ldots,n.$

In the case of an exact computation of these expressions, the kernel function $ K(\bullet)$ must be evaluated $ O(h\cdotp n^2)$ times. The notation $ O(h\cdotp n^2)$ indicates a value proportional to $ h\cdotp n^{2}$, i.e. growing with the squared sample size. The computation therefore becomes slow if the sample size $ n$ is large.

For exploratory purposes, i.e. for graphing the density estimate, it is not necessary to evaluate $ \widehat{f}_h(\bullet)$ at all observations $ x_1,\ldots,x_n$. Instead, the estimate can be computed, for example, on an equidistant grid $ v_1,\ldots,v_m$:

$\displaystyle v_k = x_{\min} + \frac{k}{m} \left(x_{\max}-x_{\min}\right), \quad k=1,\ldots,m, \qquad m \ll n.$

The evaluation of $ \widehat{f}_h(v_k)$, $ k=1,\ldots,m$, then requires only $ O(h\cdotp n\cdotp m)$ steps.

Even $ O(h\cdotp n\cdotp m)$ evaluations of the kernel function $ K(\bullet)$ can, however, be time consuming if the sample size is large. An alternative and faster way is to approximate the kernel density estimate by the WARPing method (Härdle and Scott; 1992).
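The grid evaluation can be sketched in Python (illustrative only; the grid size m = 30 and the Gaussian kernel are arbitrary choices here):

```python
import numpy as np

def kde_on_grid(data, h, m=30):
    """Evaluate a Gaussian-kernel density estimate on an equidistant grid
    v_1, ..., v_m spanning the range of the data (m << n)."""
    n = len(data)
    v = np.linspace(data.min(), data.max(), m)      # equidistant grid
    u = (data[None, :] - v[:, None]) / h            # (m, n) scaled distances
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel
    return v, K.sum(axis=1) / (n * h)               # O(n * m) kernel evaluations

rng = np.random.default_rng(0)
data = rng.normal(size=500)
v, fh = kde_on_grid(data, h=0.4)
```

The vectorized distance matrix makes the $ O(n\cdotp m)$ cost explicit: one kernel evaluation per (grid point, observation) pair.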

The basic idea of WARPing (Weighted Average of Rounded Points) is the ``binning'' of the data in bins of length $ d$ starting at the origin $ x_0=0$. Each observation point is then replaced by the bincenter of the corresponding bin (that means in fact that each point is rounded to the precision given by $ d$). A usual choice for $ d$ is to use $ h/5$ or $ (x_{\max}-x_{\min})/100$. In the latter case, the effective sample size $ r$ for the computation (the number of nonempty bins) can be at most 101.

For the WARPing method, the kernel function needs to be evaluated only at the points $ \ell\cdotp d / h$, $ \ell=1,\ldots,s$, where $ s$ is the number of bins covered by the support of the kernel function. The calculation of the estimated density then reduces to

$\displaystyle \widetilde{f}_h(w_j) = \frac{1}{nh}\, \sum_{i=1}^{r} n_i\, K\left\{\frac{(i-j)\cdotp d}{h}\right\}\,, \quad j=1,\ldots,r,$

computed on the grid $ w_j = (j+0.5)\cdotp d$ ($ j$ integer), with $ n_i$ denoting the number of observations in the $ i$-th bin. The WARPing approximation requires $ O(h\cdotp r/d)$ evaluations of the kernel function and $ O(n) + O(h\cdotp r/d)$ steps in total. This is considerably faster than the exact computation when the sample size is large.
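A rough Python sketch of the binning-and-smoothing idea (a simplified WARPing variant with a Gaussian kernel truncated at five bandwidths; details such as the bin origin and truncation point are chosen for illustration):

```python
import numpy as np

def kde_warping(data, h, d):
    """Approximate a Gaussian-kernel density estimate by binning the data
    in bins of width d and smoothing the bin counts n_i."""
    n = len(data)
    idx = np.floor(data / d).astype(int)  # bin number of each observation
    lo = idx.min()
    counts = np.bincount(idx - lo)        # n_i: observations per bin
    s = int(np.ceil(5.0 * h / d))         # bins covered by the (truncated) kernel
    ell = np.arange(-s, s + 1)
    Kvals = np.exp(-0.5 * (ell * d / h) ** 2) / np.sqrt(2.0 * np.pi)
    fh = np.convolve(counts, Kvals, mode="same") / (n * h)
    centers = (np.arange(lo, lo + len(counts)) + 0.5) * d  # grid w_j = (j + 0.5) d
    return centers, fh

rng = np.random.default_rng(0)
data = rng.normal(size=2000)
h = 0.3
d = h / 5.0
centers, fh = kde_warping(data, h, d)
```

The kernel is evaluated only on the coarse grid of bin distances (the `Kvals` array), and the sum over bins is a discrete convolution of the bin counts with those values.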

To offer a choice between the exact and the WARPing computation of kernel estimates and related quantities, the smoother library provides most of its functionality in two variants:

Functionality                   Exact     WARPing
density estimate                denxest   denest
density confidence intervals    denxci    denci
density confidence bands        denxcb    dencb
density bandwidth selection     -         denbwsel

6.1.2 Computing Kernel Density Estimates

fh = denest (x {,h {,K} {,d}})
computes the kernel density estimate on a grid using the WARPing method
fh = denxest (x {,h {,K} {,v}})
computes the kernel density estimate for all observations or on a grid v by exact computation

The easiest way to compute kernel density estimates is the function denest . This function is based on the WARPing method, which makes the computation very fast. Let us apply the denest routine to the nicfoo data, which contain observations on household net income in the first column (see Data Sets (B.1)). denest requires the data as a minimum input. Optionally, the bandwidth h, the kernel function K and the discretization parameter d used for the WARPing can be specified. For example,

XLGsmoo02.xpl

calculates the density estimate using 15% of the range of the data as bandwidth h. If h is not given, the rule of thumb bandwidth (6.6) is used, see Subsection 6.1.4 for more information. Note that we did not specify the kernel function in the above code. Therefore the default Quartic kernel "qua" and the default binsize d are used.

To display the estimated function you may use these lines of code:

  tl="Density Estimate"
  yl="density fh"
  plot(setmask(fh, "line"))
  setgopt(plotdisplay, 1, 1, "title", tl, "ylabel", yl)
The resulting density is plotted in Figure 6.2.

Figure 6.2: Kernel density estimate for the Netincome data.

As an alternative to denest , the function denxest can be used to obtain a density estimate by exact computation. As with denest , the data are the minimum input. Optionally, the bandwidth h, the kernel function K and a grid v for the computation can be specified. The following code computes the density estimate at all observations:

  fh = denxest(netinc)

To compute the density on a grid of 30 equidistant points within the range of x, the previous example can be modified as follows:
  v=grid( min(netinc), (max(netinc)-min(netinc))/29, 30)
XLGsmoo03.xpl

Finally, let us calculate the density estimate for a family of different bandwidths in order to see the dependence on the bandwidth parameter. We modify our first code in the following way:

  fh =setmask( denest(netinc,0.15*hfac) ,"line")
  fh1=setmask( denest(netinc,0.05*hfac) ,"line","thin","magenta")
  fh2=setmask( denest(netinc,0.10*hfac) ,"line","thin","red")
  fh3=setmask( denest(netinc,0.20*hfac) ,"line","thin","blue")
  fh4=setmask( denest(netinc,0.25*hfac) ,"line","thin","green")
  tl="Family of Density Estimates"
  yl="density fh"
XLGsmoo04.xpl

As you can see from the resulting graph (Figure 6.3), smaller bandwidths reveal more structure of the data, whereas larger bandwidths give smoother functions.

Figure 6.3: Family of kernel density estimates for the Netincome data.

6.1.3 Kernel Choice

h2 = canker (h1, K1, K2)
does the canonical bandwidth transformation of a bandwidth value for kernel K1 into an equivalent bandwidth for kernel K2

Kernel density estimation requires two parameters: the kernel function $ K$ and the bandwidth parameter $ h$. In practice, the choice of the kernel is not nearly as important as the choice of the bandwidth. The theoretical background of this observation is that kernel functions can be rescaled such that the difference between two kernel density estimates using two different kernels is almost negligible (Marron and Nolan; 1988).

Instead of rescaling kernel functions, we can also rescale the bandwidths. This rescaling works as follows: Suppose we have estimated an unknown density $ f$ using some kernel $ K_{A}$ and bandwidth $ h_{A}$ and we are considering estimating $ f$ with a different kernel, say $ K_{B}$. Then the appropriate bandwidth $ h_B$ to use with $ K_B$ is

$\displaystyle h_B=\frac{h_{A}}{\delta_{0}^{A}}\,\delta_{0}^{B}\,, \ $ (6.2)

where the scale factors $ \delta_{0}^{A}$ and $ \delta_{0}^{B}$ (the so-called canonical bandwidths) can be found in Table 6.2 for some selected kernel functions.

Table 6.2: $ \delta _{0}$ for different kernels.

Kernel         XploRe function   $ \delta _{0}$
Uniform        uni               $ \left(\frac{9}{2}\right)^{1/5} \cong 1.3510$
Epanechnikov   epa               $ 15^{1/5} \cong 1.7188$
Quartic        qua               $ 35^{1/5} \cong 2.0362$
Gaussian       gau               $ \left(\frac{1}{4\pi}\right)^{1/10} \cong 0.7764$

XploRe offers the function canker to do this recomputation of bandwidths. For example,

  hqua = canker(1, "gau", "qua")
  hqua

computes the bandwidth

  Contents of hqua
  [1,]   2.6226

that can be used with the Quartic kernel "qua" in order to find the kernel density estimate that is equivalent to using the Gaussian kernel "gau" and bandwidth 1.0.
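For illustration, the transformation (6.2) with the constants from Table 6.2 can be written in a few lines of Python (a hypothetical re-implementation of canker, not the XploRe quantlet itself):

```python
import numpy as np

# canonical bandwidths delta_0 from Table 6.2
delta0 = {
    "uni": (9.0 / 2.0) ** (1.0 / 5.0),
    "epa": 15.0 ** (1.0 / 5.0),
    "qua": 35.0 ** (1.0 / 5.0),
    "gau": (1.0 / (4.0 * np.pi)) ** (1.0 / 10.0),
}

def canker(h1, K1, K2):
    """Bandwidth h1 for kernel K1 -> equivalent bandwidth for kernel K2, as in (6.2)."""
    return h1 / delta0[K1] * delta0[K2]

print(round(canker(1.0, "gau", "qua"), 4))  # prints 2.6226, matching the XploRe output
```

The transformation is its own inverse when applied back and forth, since only the ratio of the two canonical bandwidths enters.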

6.1.4 Bandwidth Selection

hrot = denrot (x {,K {,opt}})
computes a rule-of-thumb bandwidth for density estimation
{hcrit,crit} = denbwsel (x {,h {,K {,d}}})
starts an interactive tool for kernel density bandwidth selection using the WARPing method

Now we can consider the problem of bandwidth selection. The optimal bandwidth for a kernel density estimate is typically calculated on the basis of an estimate for the integrated squared error

$\displaystyle ISE(\widehat{f}_h) = \int \{\widehat{f}_h(x)-f(x)\}^2\,dx$

or the mean integrated squared error

$\displaystyle MISE(\widehat{f}_h) = \int E \{\widehat{f}_h(x)-f(x)\}^2\,dx.$

Both criteria should be minimized to obtain a good approximation of the unknown density $ f$. For $ ISE$, it is easy to see that

$\displaystyle ISE(\widehat f_{h}) = \int \widehat{f}^{2}_{h}(x)\,dx - 2 \int \widehat{f}_{h}(x)\,f(x)\,dx + \int f^{2}(x)\,dx.\ $ (6.3)

Since the third term does not depend on $ h$, the classical least squares cross-validation criterion ( $ \textrm{LSCV}$) is an estimate for the first and second term of (6.3). Cross-validation variants that are based on $ MISE$ are the biased cross-validation criterion ( $ \textrm{BCV}$), the smoothed cross-validation criterion ( $ \textrm{SCV}$) and the Jones, Marron and Park cross-validation criterion ( $ \textrm{JMP}$).
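To illustrate the idea, here is a standard textbook form of the LSCV criterion for the Gaussian kernel in Python (a sketch only; XploRe's denbwsel may implement the criterion differently). It uses the fact that the convolution of two Gaussian kernels has a closed form:

```python
import numpy as np

def lscv(data, h):
    """Least squares cross-validation score for a Gaussian-kernel density
    estimate: int fh^2 dx - (2/n) * sum_i fh_{-i}(x_i), i.e. an estimate
    of the first two terms of (6.3)."""
    n = len(data)
    u = (data[:, None] - data[None, :]) / h
    # int fh^2 dx in closed form: the convolution of two standard Gaussian
    # kernels is a N(0, 2) density
    conv = np.exp(-0.25 * u**2) / np.sqrt(4.0 * np.pi)
    term1 = conv.sum() / (n**2 * h)
    # leave-one-out estimates fh_{-i}(x_i): drop the diagonal i = j
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    loo = (K.sum(axis=1) - np.diag(K)) / ((n - 1) * h)
    return term1 - 2.0 * loo.mean()

rng = np.random.default_rng(1)
data = rng.normal(size=200)
scores = [lscv(data, h) for h in np.linspace(0.1, 1.0, 10)]
# the minimizing h over a search grid is the LSCV bandwidth
```

In practice the score is evaluated over a grid of candidate bandwidths and the minimizer is selected.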

Plug-in approaches attempt to estimate $ MISE$. Under the asymptotic conditions $ h\to 0$ and $ nh\to\infty$, it is possible to approximate $ MISE$ by

$\displaystyle MISE(\widehat f_{h}) \ \mathrel{\mathop{\approx}\limits_{\textrm{\small as.}}^{}} \ \frac{1}{nh}\,\Vert K \Vert ^{2}_{2} + \ \frac{h^{4}}{4} \{\mu_{2}(K)\}^{2} \, \Vert f'' \Vert^{2}_{2}\,.$ (6.4)

The terms $ \Vert K \Vert ^{2}_{2}$ and $ \{\mu_{2}(K)\}^{2}$ are constants depending on the kernel function $ K$. Analogously, $ \Vert f'' \Vert^{2}_{2}$ denotes a constant depending on the second derivative of the unknown density $ f$. Optimizing (6.4) with respect to $ h$ yields the following optimal bandwidth

$\displaystyle h_{\textrm{opt}}=\left( \frac{\Vert K \Vert^{2}_{2}}{\Vert f'' \Vert^{2}_{2}\,\{\mu_{2}(K)\}^{2}n}\right)^{1/5}\,. \ $ (6.5)

$ \Vert f'' \Vert^{2}_{2}$ is the only unknown term in (6.5). The idea behind plug-in estimates is to replace $ f''$ by an estimate. Silverman's rule of thumb computes $ f''$ as if $ f$ were the density of the normal distribution $ N(\mu,\sigma^2)$. This gives

$\displaystyle \widehat{h}_0 = \left( \frac{4 \widehat{\sigma}^5}{3n} \right)^{1/5} \cong 1.06\,\widehat{\sigma}\,n^{-1/5}$

if the kernel function $ K$ is also assumed to be the Gaussian kernel.

If a different kernel function is used, the bandwidth $ \widehat{h}_0$ can be rescaled using equation (6.2). For example, the rule-of-thumb bandwidth for the Quartic kernel is

$\displaystyle \widehat{h}_0 = 2.78\,\widehat{\sigma}\,n^{-1/5}.$ (6.6)

This rule of thumb is implemented in denrot .
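Silverman's rule is simple enough to state in Python (an illustrative stand-in for denrot, not the XploRe quantlet; the Quartic factor 2.78 comes from rescaling 1.06 via (6.2)):

```python
import numpy as np

def denrot(x, kernel="qua"):
    """Rule-of-thumb bandwidth h0 = c * sigma_hat * n^(-1/5), with c = 1.06
    for the Gaussian kernel and c = 2.78 for the Quartic kernel, as in (6.6)."""
    factor = {"gau": 1.06, "qua": 2.78}[kernel]
    return factor * np.std(x, ddof=1) * len(x) ** (-1.0 / 5.0)

x = np.random.default_rng(2).normal(size=1000)
h = denrot(x, "qua")  # rule-of-thumb bandwidth for the Quartic kernel
```

For standard normal data of size 1000, the Quartic bandwidth comes out at roughly 0.7, shrinking at rate $ n^{-1/5}$ as the sample grows.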

Of course, $ \widehat{h}_0$ is a very crude estimate, in particular if the true density $ f$ is far from looking like a normal density. Refined plug-in methods such as the Park and Marron plug-in and the Sheather and Jones plug-in use nonparametric estimates for $ f''$.

The quantlet denbwsel needs the data vector as input:

XLGsmoo05.xpl

This opens a selection box which offers you the choice between the different bandwidth selectors:


Figure 6.4: Least Squares CV bandwidth selector for Netincome data.

Choose for instance the Least Squares CV item to use $ \textrm{LSCV}$. Here, a graphical display appears which shows the $ \textrm{LSCV}$ criterion in the upper left, the chosen optimal bandwidth in the upper right, the resulting kernel density estimate in the lower left, and some information about the search grid and the kernel in the lower right. This graphical display is shown in Figure 6.4.

You can play around now with the different bandwidth selectors. Try out the Park and Marron plug-in for example. Here, in contrast to the cross-validation methods, the aim is to find the root of the bandwidth selection criterion. The graphical display for the Park and Marron selector is shown in Figure 6.5.

Figure 6.5: Park and Marron plug-in for Netincome data.

During the use of denbwsel , all results obtained so far are displayed in the output window. Note that for some selectors, it may be necessary to enlarge the search grid for $ h$. This can be done via the menu item (Change h range or K).

It is difficult to say in general which bandwidth selector is best. As mentioned above, Silverman's rule of thumb works well when the true density is approximately normal, but it may oversmooth bimodal or multimodal densities. A nice feature of the cross-validation methods is that the selected bandwidth automatically adapts to the smoothness of $ f$. From a theoretical point of view, the $ BCV$ and $ SCV$ bandwidth selectors have the optimal asymptotic rate of convergence, but they either require a very large sample size or pay for the fast rate with a large variance. As a consequence, we recommend determining bandwidths by several selection methods and comparing the resulting density estimates.

For detailed overviews on plug-in and cross-validation methods of bandwidth selection, see Härdle (1991), Park and Turlach (1992), and Turlach (1993). Practical experiences are available from the simulation studies in Marron (1989), Jones et al. (1992), Park and Turlach (1992), and Cao et al. (1992).

6.1.5 Confidence Intervals and Bands

{fh, fhl, fhu} = denci (x {,h {,alpha {,K} {,d}}})
computes the kernel density estimate and pointwise confidence intervals on a grid using the WARPing method
{fh, fhl, fhu} = dencb (x {,h {,alpha {,K} {,d}}})
computes the kernel density estimate and uniform confidence bands on a grid using the WARPing method
{fh, fhl, fhu} = denxci (x {,h {,alpha {,K} {,v}}})
computes the kernel density estimate and pointwise confidence intervals for all observations or on a grid v by exact computation
{fh, fhl, fhu} = denxcb (x {,h {,alpha {,K} {,v}}})
computes the kernel density estimate and uniform confidence bands for all observations or on a grid v by exact computation

To derive confidence intervals or confidence bands for $ \widehat{f}_{h}(x)$ one has to know its sampling distribution. The following result about the asymptotic distribution of $ \widehat{f}_{h}(x)$ is known from the literature (Härdle; 1991, p. 62). Suppose that $ f''$ exists and $ h=cn^{-1/5}$. Then

$\displaystyle n^{2/5} \left\{\widehat{f}_{h}(x)-f(x)\right\}\ \mathrel{\mathop{\sim}\limits_{\textrm{\small as.}}^{}}\ N\left( \frac{ c^{2} }{2} f''(x)\mu_{2}(K)\,,\, \frac{1}{c}f(x)\Vert K \Vert^{2}_{2} \right),$ (6.7)

which can be rewritten as

$\displaystyle \left\{\widehat{f}_{h}(x)-f(x)\right\}\ \mathrel{\mathop{\sim}\limits_{\textrm{\small as.}}^{}}\ N\left( \frac{ h^{2} }{2} f''(x)\mu_{2}(K)\,,\, \frac{1}{nh} f(x)\Vert K \Vert^{2}_{2} \right).$ (6.8)

Assuming that the bias term $ \frac{ h^{2} }{2} f''(x)\mu_{2}(K)$ is negligible with respect to the standard deviation, the resulting confidence interval for $ f(x)$ can be written as

$\displaystyle \left[\;\widehat{f}_{h}(x)-z_{1-\frac{\alpha}{2}}\sqrt{\frac{\widehat{f}_{h}(x)\Vert K \Vert _{2}^{2}}{nh}}\;,\ \ \widehat{f}_{h}(x)+z_{1-\frac{\alpha}{2}} \sqrt{\frac{\widehat{f}_{h}(x)\Vert K \Vert _{2}^{2}}{nh}}\;\right].$ (6.9)

The value $ z_{1-\frac{\alpha}{2}}$ is the $ (1-\frac{\alpha}{2})$-quantile of the standard normal distribution.
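The interval (6.9) is straightforward to compute once $ \widehat{f}_h$ is available. A Python sketch (a hypothetical helper, with the kernel constant passed in explicitly):

```python
import numpy as np

def den_ci(fh, n, h, K_norm2):
    """Pointwise 95% confidence intervals (6.9) around density estimates fh.
    K_norm2 is ||K||_2^2, e.g. 3/5 for the Epanechnikov kernel and
    1/(2 sqrt(pi)) for the Gaussian kernel."""
    z = 1.959964  # the 0.975-quantile of N(0,1), i.e. alpha = 0.05
    half = z * np.sqrt(fh * K_norm2 / (n * h))
    return fh - half, fh + half

fh = np.array([0.10, 0.40, 0.30])  # density estimates on a small grid
lo, hi = den_ci(fh, n=500, h=0.2, K_norm2=3.0 / 5.0)
```

Note that the interval widens where $ \widehat{f}_h(x)$ is large, since the asymptotic variance in (6.8) is proportional to the density itself.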

The functions denci and denxci compute pointwise confidence intervals on a grid of points ( denci , using WARPing) or at all observations ( denxci , using exact computation). The following quantlet code computes the confidence intervals for bandwidth 0.2 (Gaussian kernel) and significance level $ \alpha =0.05$. For graphical exploration, the pointwise intervals are drawn as curves and displayed in a plot. The result is shown in Figure 6.6.

  setgopt(plotdisplay,1,1,"title","Density Confidence Intervals")
XLGsmoo06.xpl

Figure 6.6: Pointwise confidence intervals.

Uniform confidence bands for $ f$ can only be derived under some rather restrictive assumptions, see Härdle (1991, p. 65). Suppose that $ f$ is a density on $ [0,1]$ and $ h=n^{-\delta}$, $ \delta \in (\frac{1}{5},\frac{1}{2})$. Then the following formula holds under some regularity conditions for all $ x \in [0,1]$ (Bickel and Rosenblatt; 1973):

$\displaystyle P\left(\widehat{f}_{h}(x)- z_{n,\alpha} \sqrt{\frac{\widehat{f}_{h}(x)\Vert K \Vert^{2}_{2}}{nh}} \,\le\, f(x) \,\le\, \widehat{f}_{h}(x)+ z_{n,\alpha} \sqrt{\frac{\widehat{f}_{h}(x)\Vert K \Vert^{2}_{2}}{nh}} \ \textrm{ for all } x\in[0,1] \right) \mathrel{\mathop{\sim}\limits_{\textrm{\small as.}}^{}} 1-\alpha, \ $ (6.10)

where

$\displaystyle z_{n,\alpha}=\left\{\frac{-\log\{-\frac12 \log(1-\alpha)\}}{(2\delta\log{n})^{1/2}}+d_{n}\right\}$

and

$\displaystyle d_{n}=(2\delta\log{n})^{1/2}+(2\delta\log{n})^{-1/2} \log{\left(\frac{1}{2\pi}\frac{\Vert K'\Vert _{2}}{\Vert K \Vert _{2}}\right)}^{1/2}.$

In practice, the data $ x_1,\ldots,x_n$ are transformed to the interval $ [0,1]$; the confidence bands are computed there and then rescaled to the original scale of the data. The functions dencb and denxcb can hence be applied directly to the original data $ x_1,\ldots,x_n$. The following quantlet code computes the confidence bands for the netinc data and displays them in a plot:
  setgopt(plotdisplay,1,1,"title","Density Confidence Bands")
XLGsmoo07.xpl

The resulting graph is shown in Figure 6.7. It can be clearly seen that the confidence bands are wider than the confidence intervals (Figure 6.6). This is explained by the fact that here the whole function $ f(\bullet)$ lies within the bands with probability 95%.

Figure 6.7: Uniform confidence bands.