4.3 Density Estimation Algorithms and Theory

This section includes enough algorithms and results to give a basic understanding of the opportunities and issues. Fortunately, a number of readable monographs are available for the reader interested in pursuing this subject in depth. In rough chronological order, excluding books primarily dealing with nonparametric regression, the list includes [47,65,25,11,42,10,23,14,37,49,60,44,3], and [48].

The purpose of the next section is to provide a survey of important results without delving into the theoretical underpinnings and details. The references cited above are well-suited for readers who wish to pursue those details.


4.3.1 A High-Level View of Density Theory


Smoothing Parameters

Every algorithm for nonparametric density estimation has one or more design parameters which are called the smoothing parameter(s) or bandwidth(s) of the procedure. The smoothing parameter controls the final appearance of the estimate. For an equally-spaced histogram, the bin width plays the primary role of a smoothing parameter. Of course, the bins may be shifted and the location of the bin edges is controlled by the bin origin, which plays the role of a secondary smoothing parameter. For a kernel estimator, the scale or width of the kernel serves as the smoothing parameter. For an orthogonal series estimator, the number of basis functions serves as the smoothing parameter. The smoothing parameters of a spline estimator also include the location of the knots. Similarly, a histogram with completely flexible bin widths has many more than two smoothing parameters.

No Unbiased Density Estimators

[27] proved that every nonparametric density estimator $ \widehat{f}(x)$ is biased as a point estimator of $ f(x)$. However, it is usually true that the pointwise biases integrate to 0. Thus mean squared error (MSE) and integrated mean squared error (IMSE) are the appropriate criteria for optimizing the tradeoff between pointwise/integrated variance and squared bias.

Nonparametric density estimators always underestimate peaks and overestimate valleys in the true density function. Intuitively, the bias is driven by the degree of curvature in the true density. However, since the bias function is continuous and integrates to 0, there must be a few points where the bias does vanish. In fact, letting the smoothing parameter vary pointwise, there are entire intervals where the bias vanishes, including the difficult-to-estimate tail region. This fact has been studied by [15] and [31]. Since the bias of a kernel estimator does not depend on the sample size, these zero-bias or bias-annihilating estimates have more than a theoretical interest. However, there is much more work required for practical application. Alternatively, in higher dimensions away from peaks and valleys, one can annihilate pointwise bias by balancing directions of positive curvature against directions of negative curvature; see [54]. An even more intriguing idea literally adjusts the raw data points towards peaks and away from valleys to reduce bias; see [6].

Rates of Convergence

The rate of convergence of a nonparametric density estimator to the true density is much slower than in the parametric setting, assuming in the latter case that the correct parametric model is known. If the correct parametric model is not known, then the parametric estimates will converge but the bias will not vanish. The convergence is slower still in high dimensions, a fact which is often referred to as the curse of dimensionality. Estimating the derivative of a density function is even harder than coping with an additional dimension of data.

If the $ k$-th derivative of a density is known to be smooth, then it is theoretically possible to construct an order-$ k$ nonparametric density estimation algorithm. The pointwise bias is driven by the $ k$-th derivative at $ x$, $ f^{(k)}(x)$. However, if $ k>2$, then the density estimate will take on negative values for some points, $ x$. It is possible to define higher-order algorithms which are non-negative, but these estimates do not integrate to $ 1$; see [52]. Thus higher-order density estimation algorithms violate one of the two conditions for a true density: $ f(x)\ge 0$ and $ \int_{-\infty}^{\infty} f(x)\,{\text{d}}x=1$. Of course, there are cases where the first condition is violated for lower-order estimators. Two such cases include orthogonal series estimators ([18,62]) and boundary-corrected kernel estimators ([26]). Note that positive kernel estimators correspond to $ k = 2$. [57] studied the efficacy of higher-order procedures and suggested $ k=3$ often provided superior estimates. [37] also studied this question and found some improvement when $ k=3$, which must be traded off against the disadvantages of negative estimates.


Choosing Bandwidths in Practice

Picking the best smoothing parameter from data is an important task in practice. If the smoothing parameter is too small, the estimate is too noisy, exhibiting high variance and extraneous wiggles. If the smoothing parameter is too large, then the estimate may miss key features due to oversmoothing, washing out small details. Such estimates have low variance but high bias in many regions. In practice, bandwidths within 10-15% of the optimal bandwidth are usually satisfactory.

A statistician experienced in EDA is likely to find informative all estimates with bandwidths ranging from undersmoothed to oversmoothed. With a complicated density function, no single choice of bandwidth may properly represent the density for all values of $ x$. Thus the same bandwidth may result in undersmoothing over some intervals of $ x$, oversmoothing over others, and near-optimal smoothing elsewhere. However, the practical difficulty of constructing locally adaptive estimators makes the single-bandwidth case of most importance. Simple transformations of the data scale can often be an effective strategy ([61]). This strategy was used with the lipid data in Fig. 4.1, which were transformed to a $ \log _{10}$ scale.

Consider the $ 21640$ values of $ x$ shown in frame (b) of Fig. 4.4. Histograms of these data with various numbers of bins are shown in Fig. 4.7. With so much data, the oversmoothed histogram in Fig. 4.7a captures the major features, but seems biased downwards at the peaks. The final frame shows a histogram that is more useful for finding data anomalies than for serving as a good density estimate.

Figure 4.7: Histograms of $ x$ variable in Fig. 4.4b with $ 15$, $ 35$, and $ 100$ bins
\includegraphics[clip]{text/3-4/fig7.eps}

The differences are more apparent with a smaller sample size. Consider the $ 320$ $ \log_{10}$-cholesterol levels shown in Fig. 4.1. Three histograms are shown in Fig. 4.8. The extra one or two modes are at least suggested in the middle panel, while the histogram in the first panel only suggests a rather unusual non-normal appearance. The third panel has many large spurious peaks. We conclude from these two figures that while an oversmoothed estimator may have a large error relative to the optimal estimator, the absolute error may still be reasonably small for very large data samples.

Figure 4.8: Histograms of $ \log _{10}$-cholesterol variable in Fig. 4.1 with $ 9$, $ 19$, and $ 39$ bins
\includegraphics[clip]{text/3-4/fig8.eps}

Oversmoothed Bandwidths

While there is no limit on how complicated a density may be (for which $ \int f^{(k)}(x)^2\,{\text{d}}x$ may grow without bound), the converse is not true. [53] and [51] show that for a particular scale of a density (for example, the range, standard deviation, or interquartile range), there is in fact a lower bound among continuous densities for the roughness quantity

$\displaystyle R\left(f^{(k)}\right)=\int_{-\infty}^{\infty}f^{(k)}(x)^2\,{\text{d}}x\,.$ (4.1)

In a real sense, such densities are the smoothest possible and are the easiest to estimate. The optimal bandwidth for these ''oversmoothed densities'' serves as an upper bound. Specifically, any other density with the same scale will have more complicated structure and will require a smaller bandwidth to estimate those features accurately. Since oversmoothed bandwidths (and reference bandwidths as well) use the data only to estimate the scale (the variance, for example), these data-based estimates are quite stable. Obtaining similarly stable data-based, nearly optimal bandwidth estimators requires very sophisticated estimates of the roughness functional given in (4.1). One algorithm by [13] is often highly rated in practice ([17]) and seems closely related to the oversmoothed bandwidths. These approaches all rely on asymptotic expansions of the IMSE rather than an unbiased risk estimate, which underlies the least-squares or unbiased cross-validation algorithm introduced by [28] and [2]. However, the unbiased risk approach has numerous extensions; see [30] and [39]. Another algorithm that should be mentioned is the bootstrap bandwidth. For a Gaussian kernel, the bootstrap with an infinite number of repetitions has a closed-form expression; see [50]. Multivariate extensions are discussed by [29].
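To make the unbiased risk idea concrete, the following minimal sketch (Python with numpy; the Gaussian kernel, the simulated sample, and the bandwidth grid are illustrative assumptions rather than the published algorithms cited above) evaluates the least-squares cross-validation criterion $ \mathrm{UCV}(h)=\int \widehat{f}_h(x)^2\,{\text{d}}x - \frac{2}{n}\sum_{i=1}^n \widehat{f}_{h,-i}(x_i)$, where $ \widehat{f}_{h,-i}$ denotes the estimate computed without $ x_i$, and selects the minimizing bandwidth.

\begin{verbatim}
import numpy as np

def lscv_gaussian(x, h):
    # Least-squares (unbiased) cross-validation score for a Gaussian kernel
    # density estimate with bandwidth h; smaller values are better.
    n = len(x)
    d = x[:, None] - x[None, :]                       # pairwise differences
    phi = lambda t, s: np.exp(-0.5 * (t / s) ** 2) / (s * np.sqrt(2 * np.pi))
    int_f2 = phi(d, np.sqrt(2) * h).sum() / n**2      # closed form for integral of fhat^2
    loo = (phi(d, h).sum() - n * phi(0.0, h)) / (n * (n - 1))  # mean leave-one-out fhat(x_i)
    return int_f2 - 2 * loo

rng = np.random.default_rng(0)
x = rng.normal(size=200)                              # made-up sample
bandwidths = np.linspace(0.1, 1.0, 19)
scores = [lscv_gaussian(x, h) for h in bandwidths]
print("UCV-selected bandwidth:", bandwidths[int(np.argmin(scores))])
\end{verbatim}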

Many details of these ideas may be found in the literature and in the many textbooks available. The following section provides some indication of this research.


4.3.2 The Theory of Histograms

The basic results of density estimation are perhaps most easily understood with the ordinary histogram. Thus more time will be spent on the histogram with only an outline of results for more sophisticated and more modern algorithms.

Given an equally spaced mesh $ \{t_k\}$ over the entire real line with $ t_{k+1}-t_k=h$, the density histogram is given by

$\displaystyle \widehat{f}(x) = \frac{\nu_k}{nh}$   for$\displaystyle \; t_k<x<t_{k+1}\,,$ (4.2)

where $ \nu_k$ is the number of data points falling in the $ k$-th bin. Clearly, $ \sum_{k}\,\nu_k=n$ and $ \nu_k$ is a Binomial random variable with parameters $ n$ and $ p_k=\int_{t_k}^{t_{k+1}} f(x)\,{\mathrm{d}}x$; hence, $ \mathrm{E}\nu_k=np_k$ and $ \mathrm{Var}\nu_k=np_k(1-p_k)$. Thus the pointwise variance of the histogram (4.2) is $ np_k(1-p_k)/(nh)^2$, which is constant for all $ x$ in the $ k$-th bin. The integrated variance (IV) over $ (-\infty,\infty)$ may therefore be found by integrating the pointwise variance over the $ k$-th bin (i.e., multiplying by the bin width $ h$) and summing over all bins:

$\displaystyle \mathrm{IV}=\sum_{k=-\infty}^{\infty} \frac{np_k(1-p_k)}{(nh)^2}\, h = \sum_{k=-\infty}^{\infty} \frac{p_k(1-p_k)}{nh} = \frac{1}{nh}-\sum_k \frac{p_k^2}{nh}\,,$ (4.3)

since $ \sum p_k=\int_{-\infty}^{\infty}f(x)\,{\mathrm{d}}x=1$. The final term may be shown to approximate $ n^{-1}\int f(x)^2\,{\mathrm{d}}x$, which is asymptotically negligible. Thus the global integrated variance of the histogram can be controlled by collecting more data or choosing a wider bin width.
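As a simple illustration, the estimator (4.2) is easy to compute directly. The sketch below (Python/numpy; the normal sample, the bin width, and the bin origin are made up for the example) forms the bin counts $ \nu_k$ and verifies that $ \widehat{f}$ integrates to one.

\begin{verbatim}
import numpy as np

def density_histogram(x, h, origin=0.0):
    # Equally spaced density histogram, fhat = nu_k / (n h), as in (4.2);
    # `origin` is the bin origin (the secondary smoothing parameter).
    n = len(x)
    k = np.floor((x - origin) / h).astype(int)        # bin index of each observation
    counts = np.bincount(k - k.min())                 # the bin counts nu_k
    edges = origin + np.arange(k.min(), k.max() + 2) * h
    return edges, counts / (n * h)

rng = np.random.default_rng(1)
x = rng.normal(size=1000)                             # made-up sample
h = 0.4
edges, fhat = density_histogram(x, h)
print("integrates to", fhat.sum() * h)                # equals 1 by construction
\end{verbatim}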

Next consider the bias of $ \widehat{f}$ at a fixed point, $ x$, which is located in the $ k$-th bin. Note that $ \mathrm{E}\,\widehat{f}(x)=np_k/(nh)=p_k/h$. A useful approximation to the bin probability is

$\displaystyle p_k = \int_{t_k}^{t_{k+1}} f(y)\,{\mathrm{d}}y= h\,f(x) + h^2 \left( \frac{1}{2}-\frac{x-t_k}{h} \right) f^{\prime}(x) + \ldots\,,$ (4.4)

replacing the integrand $ f(y)$ by $ f(x)+(y-x)f^{\prime}(x)+\ldots$. Thus the pointwise bias may be approximated by

$\displaystyle \mathrm{Bias}\,\widehat{f}(x)= \mathrm{E}\,\widehat{f}(x)-f(x) = \frac{p_k}{h}-f(x) = h \left( \frac{1}{2}-\frac{x-t_k}{h} \right) f^{\prime}(x) + \ldots\,.$ (4.5)

Therefore, the bias is controlled by the first derivative of the unknown density at $ x$. Since $ t_k<x<t_{k+1}$, the factor $ (1/2-(x-t_k)/h)$ in (4.5) varies from $ -1/2$ to $ 1/2$. Thus the bias is also directly proportional to the bandwidth, $ h$. To control the bias of the histogram estimate, the bandwidth $ h$ should be small. Comparing (4.3) and (4.5), the global consistency of the histogram can be guaranteed if, as $ n\rightarrow \infty $, $ h\rightarrow 0$ while the product $ nh\rightarrow\infty$ as well, for example, if $ h=1/\sqrt{n}$ (see [12]).
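A small simulation sketch (Python/numpy; the standard normal target, the evaluation grid, and the choice $ h=1/\sqrt{n}$ from the example just given are all illustrative) shows the integrated squared error of the histogram shrinking as $ n$ grows while $ h\rightarrow 0$ and $ nh\rightarrow\infty$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
grid = np.linspace(-5, 5, 2001)
dx = grid[1] - grid[0]
true_f = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)    # standard normal density

for n in (100, 1000, 10000, 100000):
    x = rng.normal(size=n)
    h = 1 / np.sqrt(n)                                 # h -> 0 while n h -> infinity
    edges = np.arange(-5, 5 + h, h)                    # data beyond +-5 are ignored (negligible)
    counts, edges = np.histogram(x, bins=edges)
    bin_f = counts / (n * h)                           # histogram heights nu_k / (n h)
    idx = np.clip(np.searchsorted(edges, grid, side="right") - 1, 0, len(bin_f) - 1)
    ise = np.sum((bin_f[idx] - true_f) ** 2) * dx      # integrated squared error on the grid
    print(f"n = {n:6d}   h = {h:.4f}   ISE ~ {ise:.5f}")
\end{verbatim}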

A more complete analysis of the bias ([32]) shows that the integrated squared bias is approximately $ h^2\,{R}(\,f^{\prime})/12$, where $ {R}(\,f^{\prime})=\int f^{\prime}(x)^2\,{\mathrm{d}}x$, so that the IMSE is given by

$\displaystyle \mathrm{IMSE}\, \left[ \,\widehat{f}_k \right] = \frac{1}{nh} + \frac{1}{12} h^2 {R}(f^{\prime}) + O(n^{-1})\,.$ (4.6)

From this equation, the optimal bandwidth is seen to be

$\displaystyle h^{\ast} = \left[ \frac{6}{n{R}(f^{\prime})} \right]^{1/3}$   and$\displaystyle \quad \mathrm{IMSE}^{\ast} = \left(\frac{9}{16}\right)^{1/3}\, {R}(f^{\prime})^{1/3}\, n^{-2/3}\,.$ (4.7)

Thus the optimal bandwidth approaches zero at the rate $ O(n^{-1/3})$, not the rate $ O(n^{-1/2})$ suggested by [12] nor the rate $ O({1}/{\log n})$ suggested by [45]. With regard to IMSE, the best rate a histogram can achieve is of order $ O(n^{-2/3})$, which falls well short of the parametric rate of $ O(n^{-1})$. From (4.7), the larger the roughness $ {R}(f^{\prime})$ of the true density, the smaller the optimal bandwidth and the larger the average error.

Finally, the smoothest density with variance $ \sigma ^2$ is

$\displaystyle g(x)=\frac{15}{16\sqrt{7}\sigma} \left( 1 - \frac{x^2}{7\sigma^2} \right) ^2 \qquad -\sqrt{7}\sigma < x < \sqrt{7}\sigma$ (4.8)

and zero elsewhere, for which $ R(g')=15\sqrt{7}/(343\sigma^3)$. Since $ {R}(f^{\prime})\ge {R}(g')$ for any other continuous density, $ f$,

$\displaystyle h^{\ast} = \left[ \frac{6}{n{R}(f^{\prime})} \right]^{1/3} \le \left[ \frac{6}{n{R}(g^{\prime})} \right]^{1/3} = \left[ \frac{686\,\sigma^3}{5\sqrt{7}\,n} \right]^{1/3} = 3.73\,\sigma\,n^{-1/3}\,,$ (4.9)

which is the ''oversmoothed bandwidth'' rule. Consider the normal reference rule, $ f=\phi=N(\mu,\sigma^2)$, for which $ {R}(\phi')=1/(4\sqrt{\pi}\sigma^3)$, which when substituted into (4.7) gives $ h^{\ast}=3.49\,\sigma\,n^{-1/3}$, a value that is only $ 6.4\,{\%}$ narrower than the oversmoothed bandwidth.

The oversmoothing rule (4.9) may be inverted when the scale is the range of the density to obtain a lower bound of $ \sqrt[3]{2n}$ on the number of bins in the optimal histogram. This formula should be compared to Sturges' rule of $ 1+\log_2 n$, which is in common use in many statistical packages ([45]). In fact, the histograms in the first frames of Figs. 4.7 and 4.8 correspond to Sturges' rule, while the second frames of these figures correspond to the oversmoothed bandwidths. Presumably the optimal bandwidth would occur somewhere between the second and third frames of these figures. Clearly Sturges' rule results in oversmoothed graphs since the optimal number of bins is severely underestimated.
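These closed-form rules are trivial to compute. The sketch below (Python; rounding to whole bins is an illustrative choice) evaluates the oversmoothed and normal reference bandwidths from (4.7)-(4.9) together with the two bin-count rules, using the sample sizes of Figs. 4.7 and 4.8.

\begin{verbatim}
import numpy as np

def oversmoothed_bandwidth(sigma, n):
    return 3.73 * sigma * n ** (-1 / 3)      # upper bound on h*, from (4.9)

def normal_reference_bandwidth(sigma, n):
    return 3.49 * sigma * n ** (-1 / 3)      # R(phi') substituted into (4.7)

def oversmoothed_bins(n):
    return int(round((2 * n) ** (1 / 3)))    # lower bound on the number of bins

def sturges_bins(n):
    return int(round(1 + np.log2(n)))        # Sturges' rule

for n in (320, 21640):                       # sample sizes of Figs. 4.8 and 4.7
    print(f"n = {n:5d}:  Sturges {sturges_bins(n):2d} bins,  "
          f"oversmoothed bound {oversmoothed_bins(n):2d} bins,  "
          f"h_NR = {normal_reference_bandwidth(1.0, n):.3f} sigma,  "
          f"h_OS = {oversmoothed_bandwidth(1.0, n):.3f} sigma")
\end{verbatim}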


4.3.3 ASH and Kernel Estimators

The histogram is an order-one density estimator, since the bias is determined by the first derivative. The estimators used most in practice are order-two estimators. (Recall that order-three estimators are not non-negative.) Perhaps the most unexpected member of the order-two class is the frequency polygon (FP), which is the piecewise linear interpolant of the midpoints of a histogram. [33] showed that

$\displaystyle \mathrm{IMSE}\, \Bigl[\, \widehat{f}_{\mathrm{FP}} \Bigr] = \frac{2}{3nh} + \frac{49}{2880} h^4 {R}(f^{\prime\prime}) + O\left(n^{-1}\right).$ (4.10)

Compared to (4.6), the integrated variance of the FP is $ 33\,{\%}$ smaller and the integrated squared bias is of order $ h^4$ rather than $ h^2$. Clearly, $ h^{\ast}=O(n^{-1/5})$ and $ \mathrm{IMSE}^{\ast}=O(n^{-4/5})$. Thus the error converges at a faster rate than for the histogram, by using wider bins and an estimator that is continuous. Examples and other results such as oversmoothing may be found in [37].
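A minimal sketch of the frequency polygon (Python/numpy; the normal sample, the bin width, and the device of extending the mesh by one bin on each side so the interpolant descends toward zero are illustrative choices): build a density histogram and linearly interpolate the values at its bin midpoints.

\begin{verbatim}
import numpy as np

def frequency_polygon(x, h, grid):
    # Piecewise linear interpolant of the bin-midpoint values of a density
    # histogram with bin width h, evaluated on `grid`.
    lo, hi = x.min() - h, x.max() + h                 # extend the mesh at both ends
    edges = np.arange(lo, hi + h, h)
    counts, edges = np.histogram(x, bins=edges)
    fhat = counts / (len(x) * h)
    mids = 0.5 * (edges[:-1] + edges[1:])
    return np.interp(grid, mids, fhat, left=0.0, right=0.0)

rng = np.random.default_rng(3)
x = rng.normal(size=500)                              # made-up sample
grid = np.linspace(-4, 4, 801)
fp = frequency_polygon(x, h=0.6, grid=grid)
print("approximate integral:", fp.sum() * (grid[1] - grid[0]))
\end{verbatim}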

The use of wider bins means that the choice of the bin origin has a larger impact, at least subjectively. Given a set of $ m$ shifted histograms, $ \widehat{f}_1(x),\ldots,\widehat{f}_m(x)$, one might use cross-validation to try to pick the best one. Alternatively, Scott ([34]) suggested the averaged shifted histogram (ASH), which is literally defined as the average:

$\displaystyle \widehat{f}_{\mathrm{ASH}}(x) = \frac{1}{m}\sum_{k=1}^m \widehat{f}_k(x)\,.$ (4.11)

To be specific, suppose the collection of $ m$ histograms has meshes shifted by an amount $ \delta=h/m$ from each other. Recompute the bin counts, $ \nu_k$, on the finer mesh, $ t^{\prime}_k=k\delta, -\infty<k<\infty$. Then a bin count for one of the histograms with bin width $ h$ may be computed by adding $ m$ of the bin counts on the finer mesh. For $ x$ in the $ \ell$-th (narrow) bin, there are $ m$ shifted histograms that include the (narrow) bin count,  $ \nu_{\ell}$. Adding these $ m$ shifted histograms together and averaging gives:

$\displaystyle \frac{ \nu_{\ell+1-m} + 2 \nu_{\ell+2-m} + \ldots + m\,\nu_{\ell} + \ldots + 2 \nu_{\ell+m-2} + \nu_{\ell+m-1} }{m\times nh}\,,$ (4.12)

or after re-arranging

$\displaystyle \widehat{f}_{\mathrm{ASH}}(x) = \frac{1}{nh}\sum_{k=1-m}^{m-1} \left( 1 - \frac{\vert k\vert}{m} \right) \, \nu_{\ell+k}\,.$ (4.13)

As the number of shifted histograms $ m\rightarrow\infty$, the weights on the bin counts approach the triangular kernel given by $ K(t)=1-\vert t\vert$ for $ \vert t\vert<1$ and zero elsewhere. The ASH may be generalized to handle general weights by taking the weights proportional to samples of an arbitrary kernel function, $ K(t)$, which is any symmetric probability density defined on the interval $ [-1,1]$. In this case,

$\displaystyle \widehat{f}_{\mathrm{ASH}}(x) = \frac{1}{nh}\sum_{k=1-m}^{m-1} w_m(k)\, \nu_{\ell+k}$   where$\displaystyle \quad w_m(k) \propto K(k/m)\,.$ (4.14)

Like the FP, the ASH is an order-two algorithm, but more efficient in the statistical sense.
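The weighted form (4.13)-(4.14) is a discrete convolution of the fine-mesh bin counts with the kernel weights, which the sketch below implements (Python/numpy; the normal sample, the default $ m=32$, and the normalization of the weights so that they sum to $ m$, which makes the estimate integrate to one, are illustrative choices; a symmetric kernel is assumed, as in the text).

\begin{verbatim}
import numpy as np

def ash1d(x, h, m=32, kernel=lambda t: 1 - np.abs(t)):
    # Univariate averaged shifted histogram, following (4.13)-(4.14).
    # Returns the fine-mesh bin centers and the ASH values on that mesh.
    n = len(x)
    delta = h / m                                     # width of the fine mesh
    lo = x.min() - h                                  # pad by one wide bin on the left
    k_idx = np.floor((x - lo) / delta).astype(int)    # fine-mesh bin of each point
    nbin = k_idx.max() + m + 1                        # pad on the right as well
    nu = np.bincount(k_idx, minlength=nbin)           # fine-mesh bin counts
    w = kernel(np.arange(1 - m, m) / m)               # kernel weights K(k/m)
    w = m * w / w.sum()                               # normalize: weights sum to m
    fhat = np.convolve(nu, w, mode="same") / (n * h)  # symmetric kernel assumed
    centers = lo + (np.arange(nbin) + 0.5) * delta
    return centers, fhat

rng = np.random.default_rng(4)
x = rng.normal(size=1000)                             # made-up sample
centers, fhat = ash1d(x, h=0.5)
print("integrates to about", fhat.sum() * (centers[1] - centers[0]))
\end{verbatim}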

In Fig. 4.9, two ASHs of the $ \log _{10}$-cholesterol data are shown. The bin edge effects and discontinuities apparent in the ordinary histogram in Fig. 4.8 are removed. The extra features in the distribution are hinted at.

Figure 4.9: Averaged shifted histograms of the $ \log _{10}$-cholesterol data
\includegraphics[width=75mm,clip]{text/3-4/fig9.eps}

The extension of the ASH to bivariate (and multivariate) data is straightforward. A number of bivariate (multivariate) histograms are constructed with equal shifts along the coordinate axes and then averaged together. Figure 4.10 displays a bivariate ASH of the same lipid data displayed in Fig. 4.1. The strong bimodal and weak trimodal features are evident. The third mode is perhaps more clearly represented in a perspective plot; see Fig. 4.11. (Note that for convenience, the data were rescaled to the interval $ (0,1)$ in each coordinate for these plots, unlike Fig. 4.1.) The precise location of the third mode above (and between) the two primary modes results in the masking of the multiple modes when viewed along the cholesterol axis alone. This masking feature is commonplace and a primary reason for trying to extend the number of dimensions available for visualization of the density function.

Figure 4.10: Bivariate ASH of the $ \log _{10}$-cholesterol and $ \log _{10}$-triglyceride data
\includegraphics[width=88mm,clip]{text/3-4/fig10.eps}

Figure 4.11: Perspective view of the bivariate ASH in Fig. 4.10
\includegraphics[width=70mm,clip]{text/3-4/fig11.eps}


4.3.4 Kernel and Other Estimators

The ASH is a discretized representation of a kernel estimator. Binned kernel estimators are of great interest to reduce the computational burden. An alternative to the ASH is the fast Fourier transform approach of [41]. Kernel methods were introduced by [27] and [24] with earlier work by Evelyn Fix and Joe Hodges completed by 1951 in San Antonio, Texas (see [43]).

Given a kernel function, $ K(t)$, which is generally taken to be a symmetric probability density function, the kernel density estimate is defined by

$\displaystyle \widehat{f}_K(x) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{x-x_i}{h} \right) = \frac{1}{n} \sum_{i=1}^n K_h(x-x_i)\,,$ (4.15)

letting $ K_h$ denote the kernel density transformed by the scale factor, $ h$; that is, $ K_h(t)=K(t/h)/h$. Among kernels with finite support, Beta densities shifted to the interval $ (-1,1)$ are popular choices. Among kernels with infinite support, the normal density is by far the most common choice. An important paper by [40] showed that the normal kernel has the unique property that the number of modes in the kernel estimate monotonically decreases as the smoothing parameter increases. For many exploratory purposes, this property alone is reason to use only the normal kernel. [22] proposed graphing the locations of all modes at all bandwidths in the ''mode tree.'' [21] proposed an extension of Silverman's bootstrap test ([40]) for the number of modes to test individual modes. Software for the ASH, kernel estimates, and the various mode tests may be found on the web; see statlib at www.stat.cmu.edu, for example.
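A short sketch of (4.15) with the normal kernel (Python/numpy; the bimodal sample, the grid, and the bandwidths are made up), together with a crude grid-based mode count that illustrates, but in no way proves, the monotone decrease in the number of modes as $ h$ grows.

\begin{verbatim}
import numpy as np

def kde(x, h, grid):
    # Gaussian kernel density estimate, (4.15), evaluated on `grid`.
    z = (grid[:, None] - x[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))

def count_modes(f):
    # Number of strict local maxima of the values f on the grid.
    return int(np.sum((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])))

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-2.0, 1.0, 150),       # made-up bimodal sample
                    rng.normal(2.0, 0.5, 50)])
grid = np.linspace(-6, 5, 1101)
for h in (0.1, 0.3, 0.6, 1.2):
    print(f"h = {h:.1f}: {count_modes(kde(x, h, grid))} mode(s)")
\end{verbatim}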

Multivariate extensions of the kernel approach generally rely upon the product kernel. For example, with bivariate data $ \{(x_i,y_i),\ i=1,\ldots,n\}$, the bivariate (product) kernel estimator is

$\displaystyle \widehat{f}_K(x,y) = \frac{1}{n} \sum_{i=1}^n K_{h_x}(x-x_i) K_{h_y}(y-y_i)\,.$ (4.16)

A different smoothing parameter for each variable generally gives sufficient control. A full bivariate normal kernel may be used in special circumstances, effectively adding one additional smoothing parameter in the form of the correlation coefficient. However, an equivalent estimate may be obtained by rotating the data so that the correlation in the kernel vanishes, after which the product kernel may be applied to the transformed data.
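A minimal sketch of the product-kernel estimator (4.16) with normal kernels and a separate bandwidth for each coordinate (Python/numpy; the correlated sample, the grid, and the bandwidths are illustrative).

\begin{verbatim}
import numpy as np

def product_kde(x, y, hx, hy, gx, gy):
    # Bivariate product-kernel estimate (4.16) with Gaussian kernels,
    # evaluated on the grid with gx along columns and gy along rows.
    n = len(x)
    kx = np.exp(-0.5 * ((gx[:, None] - x[None, :]) / hx) ** 2) / (hx * np.sqrt(2 * np.pi))
    ky = np.exp(-0.5 * ((gy[:, None] - y[None, :]) / hy) ** 2) / (hy * np.sqrt(2 * np.pi))
    return ky @ kx.T / n       # fhat[j, i] = (1/n) sum_l K_hx(gx[i]-x_l) K_hy(gy[j]-y_l)

rng = np.random.default_rng(6)
x = rng.normal(size=400)                              # made-up correlated data
y = 0.6 * x + rng.normal(scale=0.8, size=400)
gx = np.linspace(-4, 4, 81)
gy = np.linspace(-4, 4, 81)
fhat = product_kde(x, y, hx=0.4, hy=0.4, gx=gx, gy=gy)
print("integrates to about", fhat.sum() * (gx[1] - gx[0]) * (gy[1] - gy[0]))
\end{verbatim}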

In higher dimensions, some care must be exercised to minimize the effects of the curse of dimensionality. First, marginal variable transformations should be explored to avoid a heavily skewed appearance or heavy tails. Second, a principal components analysis should be performed to determine whether the data are of full rank. If not, the data should be projected into an appropriate subspace, since no nonparametric procedure works well if the data are not of full rank. Finally, if the data do not have many significant digits, the data should be carefully blurred. Otherwise the data may have many repeated values, and cross-validation algorithms may believe the data are discrete and suggest using $ h=0$. Next, several kernel or ASH estimates may be calculated and explored to gain an understanding of the data, as a preliminary step towards further analyses.
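These preprocessing steps can be sketched in a few lines. Everything in the sketch below, including the function name, the choice of a $ \log_{10}$ marginal transformation, the rank tolerance, and the amount of blurring, is an illustrative assumption rather than a prescribed recipe.

\begin{verbatim}
import numpy as np

def preprocess(X, log_cols=(), digits=3, seed=0):
    # Illustrative preprocessing before multivariate density estimation:
    # (1) log10-transform chosen skewed positive columns, (2) project onto a
    # full-rank principal-component subspace, (3) add a tiny blur so that
    # coarsely recorded values are not exactly repeated.
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)
    for j in log_cols:                                # (1) marginal transformations
        X[:, j] = np.log10(X[:, j])
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False) # (2) principal components
    keep = s > 1e-8 * s[0]                            #     drop near-null directions
    Z = Xc @ Vt[keep].T
    blur = 0.5 * 10.0 ** (-digits) * Z.std(axis=0)    # (3) blur proportional to scale
    return Z + rng.normal(scale=blur, size=Z.shape)

rng = np.random.default_rng(7)
raw = np.column_stack([10 ** rng.normal(2.0, 0.3, 500),   # made-up skewed variables
                       10 ** rng.normal(1.8, 0.2, 500)])
Z = preprocess(raw, log_cols=(0, 1))
print(Z.shape)
\end{verbatim}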

An extensive body of work also exists for orthogonal series density estimators. Originally, the Fourier basis was studied, but more modern choices for the basis functions include wavelets. Such estimators can be re-expressed as kernel estimators, so we do not pursue them further. In fact, a number of authors have shown how almost any nonparametric density algorithm can be put into the form of a kernel estimator; see [59] and [54], for example. More recent work on local likelihood algorithms for density estimation further shows how closely related parametric and nonparametric thinking really is; see [19] for details and literature.

