6.1 Kernel Density Estimation

The goal of density estimation is to approximate the probability density function $ f(\bullet)$ of a random variable $ X$. Assume we have $ n$ independent observations $ x_1,\ldots,x_n$ from the random variable $ X$. The kernel density estimator $ \widehat{f}_{h}(x)$ for the estimation of the density value $ f(x)$ at point $ x$ is defined as

$\displaystyle \widehat{f}_{h}(x)= \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x_i-x}{h}\right),\ $ (6.1)

with $ K(\bullet)$ denoting a so-called kernel function and $ h$ denoting the bandwidth. Several common kernel functions are listed in the following table.

Table 6.1: Kernel functions.

Kernel         $ K(u)$                                                                    XploRe function
Uniform        $ \frac{1}{2} {\boldsymbol{I}}(\vert u \vert \le 1)$                       uni
Triangle       $ (1-\vert u \vert) {\boldsymbol{I}}(\vert u \vert \le 1)$                 trian
Epanechnikov   $ \frac{3}{4}(1-u^{2}) {\boldsymbol{I}}(\vert u \vert \le 1)$              epa
Quartic        $ \frac{15}{16}(1-u^{2})^{2} {\boldsymbol{I}}(\vert u \vert \le 1)$        qua
Triweight      $ \frac{35}{32}(1-u^{2})^{3} {\boldsymbol{I}}(\vert u \vert \le 1)$        tri
Gaussian       $ \frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}u^2)$                             gau
Cosinus        $ \frac{\pi}{4}\cos(\frac{\pi}{2}u) {\boldsymbol{I}}(\vert u \vert \le 1)$ cosi

In order to illustrate how the kernel density estimator works, let us rewrite equation (6.1) for the Uniform kernel:

$\displaystyle \widehat{f}_{h}(x) = \frac{1}{2nh}\,\sum_{i=1}^{n} {\boldsymbol{I}}\left(\vert x_i-x\vert\le h\right) = \frac{1}{2nh}\,\{\textrm{number of observations } x_i \textrm{ that fall in } [x-h,x+h]\}.$

Hence, to estimate the density at a point $ x$, we count the observations $ x_i$ falling into an interval around $ x$. The factor $ 1/(2nh)$ ensures that the resulting density estimate integrates to one, $ \int \widehat{f}_{h}(x)\,dx=1$.
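To make the counting interpretation concrete, here is a minimal Python sketch (not XploRe code); the function names and the toy data are made up for illustration:

```python
import numpy as np

# Uniform-kernel density estimate written as a count, as in the display above
def kde_uniform(x, data, h):
    n = len(data)
    count = np.sum(np.abs(data - x) <= h)  # observations in [x - h, x + h]
    return count / (2.0 * n * h)

# the same estimate written as the general kernel sum (6.1)
def kde(x, data, h, K):
    return np.mean(K((data - x) / h)) / h

uniform = lambda u: 0.5 * (np.abs(u) <= 1)

data = np.array([0.1, 0.3, 0.35, 0.8])  # toy sample
assert np.isclose(kde_uniform(0.3, data, 0.2), kde(0.3, data, 0.2, uniform))
```

Both forms give the same value; the counting form simply exploits the special shape of the Uniform kernel.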

The Uniform kernel assigns the same weight to all observations $ x_i$ in the neighborhood of $ x$, although observations close to $ x$ carry more information about $ f(x)$ than distant ones. It is therefore reasonable to give more weight to observations near $ x$ and less weight to distant observations. All other kernels in Table 6.1 have this property. Figure 6.1 visualizes the construction of a kernel density estimate from 10 data points (red +) using the Gaussian kernel.

Figure 6.1: Kernel density estimate as a sum of bumps. XLGsmoo01.xpl

6.1.1 Computational Aspects

To understand the structure of the smoother library, we briefly explain how kernel estimates are computed in practice. Consider equation (6.1) again. To obtain the kernel density estimates at all observations $ x_1,\ldots,x_n$, we must compute

$\displaystyle \widehat{f}_{h}(x_j)= \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x_i-x_j}{h}\right)\,, \quad j=1,\ldots,n.$

In the case of an exact computation of these expressions, the kernel function $ K(\bullet)$ must be evaluated $ O(h\cdotp n^2)$ times. The notation $ O(h\cdotp n^2)$ indicates a value proportional to $ h\cdotp n^{2}$, i.e. growing with the squared sample size. The computation therefore becomes slow if the sample size $ n$ is large.

For exploratory purposes, i.e. for graphing the density estimate, it is not necessary to evaluate $ \widehat{f}_h(\bullet)$ at all observations $ x_1,\ldots,x_n$. Instead, the estimate can be computed, for example, on an equidistant grid $ v_1,\ldots,v_m$:

$\displaystyle v_k = x_{\min} + \frac{k}{m} \left(x_{\max}-x_{\min}\right), \quad k=1,\ldots,m, \qquad m \ll n.$

The evaluation of $ \widehat{f}_h(v_k)$, $ k=1,\ldots,m$, then requires only $ O(h\cdotp n\cdotp m)$ steps.

Even $ O(h\cdotp n\cdotp m)$ evaluations of the kernel function $ K(\bullet)$ can, however, be time consuming if the sample size is large. An alternative and faster way is to approximate the kernel density estimate by the WARPing method (Härdle and Scott; 1992).
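The grid evaluation can be sketched in Python (illustrative only; the grid size m = 30 and the Gaussian kernel are arbitrary choices here):

```python
import numpy as np

def kde_on_grid(data, h, m=30):
    """Evaluate a Gaussian-kernel density estimate on an equidistant grid
    v_1, ..., v_m spanning the range of the data (m << n)."""
    n = len(data)
    v = np.linspace(data.min(), data.max(), m)      # equidistant grid
    u = (data[None, :] - v[:, None]) / h            # (m, n) scaled distances
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel
    return v, K.sum(axis=1) / (n * h)               # O(n * m) kernel evaluations

rng = np.random.default_rng(0)
data = rng.normal(size=500)
v, fh = kde_on_grid(data, h=0.4)
```

The vectorized distance matrix makes the $ O(n\cdotp m)$ cost explicit: one kernel evaluation per (grid point, observation) pair.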

The basic idea of WARPing (Weighted Average of Rounded Points) is the ``binning'' of the data in bins of length $ d$ starting at the origin $ x_0=0$. Each observation point is then replaced by the bincenter of the corresponding bin (that means in fact that each point is rounded to the precision given by $ d$). A usual choice for $ d$ is to use $ h/5$ or $ (x_{\max}-x_{\min})/100$. In the latter case, the effective sample size $ r$ for the computation (the number of nonempty bins) can be at most 101.

For the WARPing method, the kernel function needs to be evaluated only at the points $ \ell\cdotp d / h$, $ \ell=1,\ldots,s$, where $ s$ is the number of bins covered by the support of the kernel function. The calculation of the estimated density then reduces to

$\displaystyle \widetilde{f}_h(w_j) = \frac{1}{nh}\, \sum_{i=1}^{r} n_i\, K\left\{\frac{(i-j)\cdotp d}{h}\right\}\,, \quad j=1,\ldots,r,$

computed on the grid $ w_j = (j+0.5)\cdotp d$ ($ j$ integer), with $ n_i$ denoting the number of observations in the $ i$-th bin. The WARPing approximation requires $ O(h\cdotp r/d)$ evaluations of the kernel function and $ O(n) + O(h\cdotp r/d)$ steps in total. This is considerably faster than the exact computation when the sample size is large.
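A rough Python sketch of the binning-and-smoothing idea (a simplified WARPing variant with a Gaussian kernel truncated at five bandwidths; details such as the bin origin and truncation point are chosen for illustration):

```python
import numpy as np

def kde_warping(data, h, d):
    """Approximate a Gaussian-kernel density estimate by binning the data
    in bins of width d and smoothing the bin counts n_i."""
    n = len(data)
    idx = np.floor(data / d).astype(int)  # bin number of each observation
    lo = idx.min()
    counts = np.bincount(idx - lo)        # n_i: observations per bin
    s = int(np.ceil(5.0 * h / d))         # bins covered by the (truncated) kernel
    ell = np.arange(-s, s + 1)
    Kvals = np.exp(-0.5 * (ell * d / h) ** 2) / np.sqrt(2.0 * np.pi)
    fh = np.convolve(counts, Kvals, mode="same") / (n * h)
    centers = (np.arange(lo, lo + len(counts)) + 0.5) * d  # grid w_j = (j + 0.5) d
    return centers, fh

rng = np.random.default_rng(0)
data = rng.normal(size=2000)
h = 0.3
d = h / 5.0
centers, fh = kde_warping(data, h, d)
```

The kernel is evaluated only on the coarse grid of bin distances (the `Kvals` array), and the sum over bins is a discrete convolution of the bin counts with those values.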

To offer a choice between the exact and the WARPing computation of kernel estimates and related quantities, the smoother library provides most of its functionality in two variants:

Functionality                   Exact     WARPing
density estimate                denxest   denest
density confidence intervals    denxci    denci
density confidence bands        denxcb    dencb
density bandwidth selection     -         denbwsel

6.1.2 Computing Kernel Density Estimates

fh = denest (x {,h {,K} {,d}})
computes the kernel density estimate on a grid using the WARPing method
fh = denxest (x {,h {,K} {,v}})
computes the kernel density estimate for all observations or on a grid v by exact computation

The easiest way to compute kernel density estimates is the function denest . This function is based on the WARPing method, which makes the computation very fast. Let us apply the denest routine to the nicfoo data, which contain observations on household net income in the first column (see Data Sets (B.1)). denest requires the data as a minimum input. Optionally, the bandwidth h, the kernel function K and the discretization parameter d used for the WARPing can be specified. For example,

XLGsmoo02.xpl

calculates the density estimate using 15% of the range of the data as bandwidth h. If h is not given, the rule of thumb bandwidth (6.6) is used, see Subsection 6.1.4 for more information. Note that we did not specify the kernel function in the above code. Therefore the default Quartic kernel "qua" and the default binsize d are used.

To display the estimated function you may use these lines of code:

  tl="Density Estimate"
  yl="density fh"
  plot(setmask(fh, "line"))
  setgopt(plotdisplay, 1, 1, "title", tl, "ylabel", yl)
The resulting density is plotted in Figure 6.2.

Figure 6.2: Kernel density estimate for the Netincome data.

As an alternative to denest , the function denxest can be used to obtain a density estimate by exact computation. As with denest , the data are the minimum input. Optionally, the bandwidth h, the kernel function K and a grid v for the computation can be specified. The following code computes the density estimate at all observations:

  fh = denxest(netinc)

To compute the density on a grid of 30 equidistant points within the range of x, the previous example can be modified as follows:
  v=grid( min(netinc), (max(netinc)-min(netinc))/29, 30)
XLGsmoo03.xpl

Finally, let us calculate the density estimate for a family of different bandwidths in order to see the dependence on the bandwidth parameter. We modify our first code in the following way:

  fh =setmask( denest(netinc,0.15*hfac) ,"line")
  fh1=setmask( denest(netinc,0.05*hfac) ,"line","thin","magenta")
  fh2=setmask( denest(netinc,0.10*hfac) ,"line","thin","red")
  fh3=setmask( denest(netinc,0.20*hfac) ,"line","thin","blue")
  fh4=setmask( denest(netinc,0.25*hfac) ,"line","thin","green")
  tl="Family of Density Estimates"
  yl="density fh"
XLGsmoo04.xpl

As you can see from the resulting graph (Figure 6.3), smaller bandwidths reveal more structure of the data, whereas larger bandwidths give smoother functions.

Figure 6.3: Family of kernel density estimates for the Netincome data.

6.1.3 Kernel Choice

h2 = canker (h1, K1, K2)
does the canonical bandwidth transformation of a bandwidth value for kernel K1 into an equivalent bandwidth for kernel K2

Kernel density estimation requires two parameters: the kernel function $ K$ and the bandwidth parameter $ h$. In practice, the choice of the kernel is not nearly as important as the choice of the bandwidth. The theoretical background of this observation is that kernel functions can be rescaled such that the difference between two kernel density estimates using two different kernels is almost negligible (Marron and Nolan; 1988).

Instead of rescaling kernel functions, we can also rescale the bandwidths. This rescaling works as follows: Suppose we have estimated an unknown density $ f$ using some kernel $ K_{A}$ and bandwidth $ h_{A}$ and we are considering estimating $ f$ with a different kernel, say $ K_{B}$. Then the appropriate bandwidth $ h_B$ to use with $ K_B$ is

$\displaystyle h_B=\frac{h_{A}}{\delta_{0}^{A}}\,\delta_{0}^{B}\,, \ $ (6.2)

where the scale factors $ \delta_{0}^{A}$ and $ \delta_{0}^{B}$ (the so-called canonical bandwidths) can be found in Table 6.2 for some selected kernel functions.

Table 6.2: $ \delta _{0}$ for different kernels.

Kernel         XploRe function   $ \delta _{0}$
Uniform        uni               $ \left(\frac{9}{2}\right)^{1/5} \cong 1.3510$
Epanechnikov   epa               $ 15^{1/5} \cong 1.7188$
Quartic        qua               $ 35^{1/5} \cong 2.0362$
Gaussian       gau               $ \left(\frac{1}{4\pi}\right)^{1/10} \cong 0.7764$

XploRe offers the function canker to do this recomputation of bandwidths. For example,

  hqua = canker(1, "gau", "qua")
  hqua

computes the bandwidth

  Contents of hqua
  [1,]   2.6226

that can be used with the Quartic kernel "qua" in order to find the kernel density estimate that is equivalent to using the Gaussian kernel "gau" and bandwidth 1.0.
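For illustration, the transformation (6.2) with the constants from Table 6.2 can be written in a few lines of Python (a hypothetical re-implementation of canker, not the XploRe quantlet itself):

```python
import numpy as np

# canonical bandwidths delta_0 from Table 6.2
delta0 = {
    "uni": (9.0 / 2.0) ** (1.0 / 5.0),
    "epa": 15.0 ** (1.0 / 5.0),
    "qua": 35.0 ** (1.0 / 5.0),
    "gau": (1.0 / (4.0 * np.pi)) ** (1.0 / 10.0),
}

def canker(h1, K1, K2):
    """Bandwidth h1 for kernel K1 -> equivalent bandwidth for kernel K2, as in (6.2)."""
    return h1 / delta0[K1] * delta0[K2]

print(round(canker(1.0, "gau", "qua"), 4))  # prints 2.6226, matching the XploRe output
```

The transformation is its own inverse when applied back and forth, since only the ratio of the two canonical bandwidths enters.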

6.1.4 Bandwidth Selection

hrot = denrot (x {,K {,opt}})
computes a rule-of-thumb bandwidth for density estimation
{hcrit,crit} = denbwsel (x {,h {,K {,d}}})
starts an interactive tool for kernel density bandwidth selection using the WARPing method

Now we can consider the problem of bandwidth selection. The optimal bandwidth for a kernel density estimate is typically calculated on the basis of an estimate for the integrated squared error

$\displaystyle ISE(\widehat{f}_h) = \int \{\widehat{f}_h(x)-f(x)\}^2\,dx$

or the mean integrated squared error

$\displaystyle MISE(\widehat{f}_h) = \int E \{\widehat{f}_h(x)-f(x)\}^2\,dx.$

Both criteria should be minimized to obtain a good approximation of the unknown density $ f$. For $ ISE$, it is easy to see that

$\displaystyle ISE(\widehat f_{h}) = \int \widehat{f}^{2}_{h}(x)\,dx - 2 \int \widehat{f}_{h}(x)\,f(x)\,dx + \int f^{2}(x)\,dx.\ $ (6.3)

Since the third term does not depend on $ h$, the classical least squares cross-validation criterion ( $ \textrm{LSCV}$) is an estimate for the first and second term of (6.3). Cross-validation variants that are based on $ MISE$ are the biased cross-validation criterion ( $ \textrm{BCV}$), the smoothed cross-validation criterion ( $ \textrm{SCV}$) and the Jones, Marron and Park cross-validation criterion ( $ \textrm{JMP}$).
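To illustrate the idea, here is a standard textbook form of the LSCV criterion for the Gaussian kernel in Python (a sketch only; XploRe's denbwsel may implement the criterion differently). It uses the fact that the convolution of two Gaussian kernels has a closed form:

```python
import numpy as np

def lscv(data, h):
    """Least squares cross-validation score for a Gaussian-kernel density
    estimate: int fh^2 dx - (2/n) * sum_i fh_{-i}(x_i), i.e. an estimate
    of the first two terms of (6.3)."""
    n = len(data)
    u = (data[:, None] - data[None, :]) / h
    # int fh^2 dx in closed form: the convolution of two standard Gaussian
    # kernels is a N(0, 2) density
    conv = np.exp(-0.25 * u**2) / np.sqrt(4.0 * np.pi)
    term1 = conv.sum() / (n**2 * h)
    # leave-one-out estimates fh_{-i}(x_i): drop the diagonal i = j
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    loo = (K.sum(axis=1) - np.diag(K)) / ((n - 1) * h)
    return term1 - 2.0 * loo.mean()

rng = np.random.default_rng(1)
data = rng.normal(size=200)
scores = [lscv(data, h) for h in np.linspace(0.1, 1.0, 10)]
# the minimizing h over a search grid is the LSCV bandwidth
```

In practice the score is evaluated over a grid of candidate bandwidths and the minimizer is selected.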

Plug-in approaches attempt to estimate $ MISE$. Under the asymptotic conditions $ h\to 0$ and $ nh\to\infty$, it is possible to approximate $ MISE$ by

$\displaystyle MISE(\widehat f_{h}) \ \mathrel{\mathop{\approx}\limits_{\textrm{\small as.}}^{}} \ \frac{1}{nh}\,\Vert K \Vert ^{2}_{2} + \ \frac{h^{4}}{4} \{\mu_{2}(K)\}^{2} \, \Vert f'' \Vert^{2}_{2}\,.$ (6.4)

The terms $ \Vert K \Vert ^{2}_{2}$ and $ \{\mu_{2}(K)\}^{2}$ are constants depending on the kernel function $ K$. Analogously, $ \Vert f'' \Vert^{2}_{2}$ denotes a constant depending on the second derivative of the unknown density $ f$. Optimizing (6.4) with respect to $ h$ yields the following optimal bandwidth

$\displaystyle h_{\textrm{opt}}=\left( \frac{\Vert K \Vert^{2}_{2}}{\Vert f'' \Vert^{2}_{2}\,\{\mu_{2}(K)\}^{2}n}\right)^{1/5}\,. \ $ (6.5)

$ \Vert f'' \Vert^{2}_{2}$ is the only unknown term in (6.5). The idea behind plug-in estimates is to replace $ f''$ by an estimate. Silverman's rule of thumb computes $ f''$ as if $ f$ were the density of the normal distribution $ N(\mu,\sigma^2)$. This gives

$\displaystyle \widehat{h}_0 = \left( \frac{4 \widehat{\sigma}^5}{3n} \right)^{1/5} \cong 1.06\,\widehat{\sigma}\,n^{-1/5}$

if the kernel function $ K$ is also assumed to be the Gaussian kernel.

If a different kernel function is used, the bandwidth $ \widehat{h}_0$ can be rescaled using equation (6.2). For example, the rule-of-thumb bandwidth for the Quartic kernel is

$\displaystyle \widehat{h}_0 = 2.78\,\widehat{\sigma}\,n^{-1/5}.$ (6.6)

This rule of thumb is implemented in denrot .
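Silverman's rule is simple enough to state in Python (an illustrative stand-in for denrot, not the XploRe quantlet; the Quartic factor 2.78 comes from rescaling 1.06 via (6.2)):

```python
import numpy as np

def denrot(x, kernel="qua"):
    """Rule-of-thumb bandwidth h0 = c * sigma_hat * n^(-1/5), with c = 1.06
    for the Gaussian kernel and c = 2.78 for the Quartic kernel, as in (6.6)."""
    factor = {"gau": 1.06, "qua": 2.78}[kernel]
    return factor * np.std(x, ddof=1) * len(x) ** (-1.0 / 5.0)

x = np.random.default_rng(2).normal(size=1000)
h = denrot(x, "qua")  # rule-of-thumb bandwidth for the Quartic kernel
```

For standard normal data of size 1000, the Quartic bandwidth comes out at roughly 0.7, shrinking at rate $ n^{-1/5}$ as the sample grows.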

Of course, $ \widehat{h}_0$ is a very crude estimate, in particular if the true density $ f$ is far from looking like a normal density. Refined plug-in methods such as the Park and Marron plug-in and the Sheather and Jones plug-in use nonparametric estimates for $ f''$.

The quantlet denbwsel needs the data vector as input:

XLGsmoo05.xpl

This opens a selection box which offers you the choice between the different bandwidth selectors:


Figure 6.4: Least Squares CV bandwidth selector for Netincome data.

Choose for instance the Least Squares CV item to use $ \textrm{LSCV}$. Here, a graphical display appears which shows the $ \textrm{LSCV}$ criterion in the upper left, the chosen optimal bandwidth in the upper right, the resulting kernel density estimate in the lower left, and some information about the search grid and the kernel in the lower right. This graphical display is shown in Figure 6.4.

You can play around now with the different bandwidth selectors. Try out the Park and Marron plug-in for example. Here, in contrast to the cross-validation methods, the aim is to find the root of the bandwidth selection criterion. The graphical display for the Park and Marron selector is shown in Figure 6.5.

Figure 6.5: Park and Marron plug-in for Netincome data.

During the use of denbwsel , all results obtained so far are displayed in the output window. Note that for some selectors, it may be necessary to enlarge the search grid for $ h$. This can be done via the menu item (Change h range or K).

It is difficult to say in general which bandwidth selector is best. As mentioned above, Silverman's rule of thumb works well when the true density is approximately normal, but it may oversmooth bimodal or multimodal densities. A nice feature of the cross-validation methods is that the selected bandwidth automatically adapts to the smoothness of $ f$. From a theoretical point of view, the $ BCV$ and $ SCV$ bandwidth selectors have the optimal asymptotic rate of convergence, but they either require a very large sample size or pay for the fast rate with a large variance. As a consequence, we recommend determining bandwidths by several selection methods and comparing the resulting density estimates.

For detailed overviews on plug-in and cross-validation methods of bandwidth selection, see Härdle (1991), Park and Turlach (1992), and Turlach (1993). Practical experiences are available from the simulation studies in Marron (1989), Jones et al. (1992), Park and Turlach (1992), and Cao et al. (1992).

6.1.5 Confidence Intervals and Bands

{fh, fhl, fhu} = denci (x {,h {,alpha {,K} {,d}}})
computes the kernel density estimate and pointwise confidence intervals on a grid using the WARPing method
{fh, fhl, fhu} = dencb (x {,h {,alpha {,K} {,d}}})
computes the kernel density estimate and uniform confidence bands on a grid using the WARPing method
{fh, fhl, fhu} = denxci (x {,h {,alpha {,K} {,v}}})
computes the kernel density estimate and pointwise confidence intervals for all observations or on a grid v by exact computation
{fh, fhl, fhu} = denxcb (x {,h {,alpha {,K} {,v}}})
computes the kernel density estimate and uniform confidence bands for all observations or on a grid v by exact computation

To derive confidence intervals or confidence bands for $ \widehat{f}_{h}(x)$ one has to know its sampling distribution. The following result about the asymptotic distribution of $ \widehat{f}_{h}(x)$ is known from the literature (Härdle; 1991, p. 62). Suppose that $ f''$ exists and $ h=cn^{-1/5}$. Then

$\displaystyle n^{2/5} \left\{\widehat{f}_{h}(x)-f(x)\right\}\ \mathrel{\mathop{\sim}\limits_{\textrm{\small as.}}^{}}\ N\left( \frac{ c^{2} }{2} f''(x)\mu_{2}(K)\,,\, \frac{1}{c}f(x)\Vert K \Vert^{2}_{2} \right),$ (6.7)

which can be rewritten as

$\displaystyle \left\{\widehat{f}_{h}(x)-f(x)\right\}\ \mathrel{\mathop{\sim}\limits_{\textrm{\small as.}}^{}}\ N\left( \frac{ h^{2} }{2} f''(x)\mu_{2}(K)\,,\, \frac{1}{nh} f(x)\Vert K \Vert^{2}_{2} \right).$ (6.8)

Assuming that the bias term $ \frac{ h^{2} }{2} f''(x)\mu_{2}(K)$ is negligible with respect to the standard deviation, the resulting confidence interval for $ f(x)$ can be written as

$\displaystyle \left[\;\widehat{f}_{h}(x)-z_{1-\frac{\alpha}{2}}\sqrt{\frac{\widehat{f}_{h}(x)\Vert K \Vert _{2}^{2}}{nh}}\;,\ \ \widehat{f}_{h}(x)+z_{1-\frac{\alpha}{2}} \sqrt{\frac{\widehat{f}_{h}(x)\Vert K \Vert _{2}^{2}}{nh}}\;\right].$ (6.9)

The value $ z_{1-\frac{\alpha}{2}}$ is the $ (1-\frac{\alpha}{2})$-quantile of the standard normal distribution.
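The interval (6.9) is straightforward to compute once $ \widehat{f}_h$ is available. A Python sketch (a hypothetical helper, with the kernel constant passed in explicitly):

```python
import numpy as np

def den_ci(fh, n, h, K_norm2):
    """Pointwise 95% confidence intervals (6.9) around density estimates fh.
    K_norm2 is ||K||_2^2, e.g. 3/5 for the Epanechnikov kernel and
    1/(2 sqrt(pi)) for the Gaussian kernel."""
    z = 1.959964  # the 0.975-quantile of N(0,1), i.e. alpha = 0.05
    half = z * np.sqrt(fh * K_norm2 / (n * h))
    return fh - half, fh + half

fh = np.array([0.10, 0.40, 0.30])  # density estimates on a small grid
lo, hi = den_ci(fh, n=500, h=0.2, K_norm2=3.0 / 5.0)
```

Note that the interval widens where $ \widehat{f}_h(x)$ is large, since the asymptotic variance in (6.8) is proportional to the density itself.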

The functions denci and denxci compute pointwise confidence intervals on a grid of points ( denci , using WARPing) or at all observations ( denxci , using exact computation). The following quantlet code computes the confidence intervals for bandwidth 0.2 (Gaussian kernel) and significance level $ \alpha =0.05$. For graphical exploration, the pointwise intervals are drawn as curves and displayed in a plot. The result is shown in Figure 6.6.

  setgopt(plotdisplay,1,1,"title","Density Confidence Intervals")
XLGsmoo06.xpl

Figure 6.6: Pointwise confidence intervals.

Uniform confidence bands for $ f$ can only be derived under some rather restrictive assumptions, see Härdle (1991, p. 65). Suppose that $ f$ is a density on $ [0,1]$ and $ h=n^{-\delta}$, $ \delta \in (\frac{1}{5},\frac{1}{2})$. Then the following formula holds under some regularity conditions for all $ x \in [0,1]$ (Bickel and Rosenblatt; 1973):

$\displaystyle P\left(\widehat{f}_{h}(x)- z_{n,\alpha} \sqrt{\frac{\widehat{f}_{h}(x)\Vert K \Vert^{2}_{2}}{nh}} \,\le\, f(x) \,\le\, \widehat{f}_{h}(x)+ z_{n,\alpha} \sqrt{\frac{\widehat{f}_{h}(x)\Vert K \Vert^{2}_{2}}{nh}} \ \textrm{ for all } x\in[0,1] \right) \mathrel{\mathop{\sim}\limits_{\textrm{\small as.}}^{}} 1-\alpha, \ $ (6.10)

where

$\displaystyle z_{n,\alpha}=\left\{\frac{-\log\{-\frac12 \log(1-\alpha)\}}{(2\delta\log{n})^{1/2}}+d_{n}\right\}$

and

$\displaystyle d_{n}=(2\delta\log{n})^{1/2}+(2\delta\log{n})^{-1/2} \log{\left(\frac{1}{2\pi}\frac{\Vert K'\Vert _{2}}{\Vert K \Vert _{2}}\right)}^{1/2}.$

In practice, the data $ x_1,\ldots,x_n$ are transformed to the interval $ [0,1]$; the confidence bands are computed there and then rescaled to the original scale of the data. The functions dencb and denxcb can hence be applied directly to the original data $ x_1,\ldots,x_n$. The following quantlet code computes the confidence bands for the netinc data and displays them in a plot:
  setgopt(plotdisplay,1,1,"title","Density Confidence Bands")
XLGsmoo07.xpl

The resulting graph is shown in Figure 6.7. It can be clearly seen that the confidence bands are wider than the confidence intervals (Figure 6.6). This is explained by the fact that here the whole function $ f(\bullet)$ lies within the bands with probability 95%.

Figure 6.7: Uniform confidence bands.