# 3.6 Multivariate Kernel Density Estimation

Often one is interested in estimating not only one-dimensional densities but also multivariate ones. Recall e.g. the U.K. Family Expenditure data, where we in fact have observations on net income and on expenditures for different goods, such as housing, fuel, food, clothing, durables, transport, alcohol & tobacco, etc.

Consider a $d$-dimensional random vector $\boldsymbol{X} = (X_1, \ldots, X_d)^\top$ where $X_1, \ldots, X_d$ are one-dimensional random variables. Drawing a random sample of size $n$ in this setting means that we have $n$ observations for each of the $d$ random variables. Suppose that we collect the $i$th observation of each of the $d$ random variables in the vector $\boldsymbol{X}_i$:

$$\boldsymbol{X}_i = (X_{i1}, \ldots, X_{id})^\top, \qquad i = 1, \ldots, n,$$

where $X_{ij}$ is the $i$th observation of the random variable $X_j$. Our goal now is to estimate the probability density of $\boldsymbol{X}$, which is just the joint pdf of the random variables $X_1, \ldots, X_d$:

$$f(\boldsymbol{x}) = f(x_1, x_2, \ldots, x_d). \qquad (3.53)$$

From our previous experience with the one-dimensional case we might consider adapting the kernel density estimator to the $d$-dimensional case and write

$$\widehat{f}_h(\boldsymbol{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d}\, K\!\left(\frac{x_1 - X_{i1}}{h}, \ldots, \frac{x_d - X_{id}}{h}\right), \qquad (3.54)$$

with $K(\bullet)$ denoting a multivariate kernel function operating on $d$ arguments. Note that (3.54) assumes that the bandwidth $h$ is the same for each component. If we relax this assumption, then we have a vector of bandwidths $\boldsymbol{h} = (h_1, \ldots, h_d)^\top$ and the multivariate kernel density estimator becomes

$$\widehat{f}_{\boldsymbol{h}}(\boldsymbol{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_1 \cdots h_d}\, K\!\left(\frac{x_1 - X_{i1}}{h_1}, \ldots, \frac{x_d - X_{id}}{h_d}\right). \qquad (3.55)$$

What form should the multidimensional kernel $K(u_1, \ldots, u_d)$ take? The easiest solution is to use a multiplicative (product) kernel

$$K(\boldsymbol{u}) = \mathcal{K}(u_1) \cdot \ldots \cdot \mathcal{K}(u_d), \qquad (3.56)$$

where $\mathcal{K}$ denotes a univariate kernel function. In this case (3.55) becomes

$$\widehat{f}_{\boldsymbol{h}}(\boldsymbol{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_1 \cdots h_d}\, \mathcal{K}\!\left(\frac{x_1 - X_{i1}}{h_1}\right) \cdot \ldots \cdot \mathcal{K}\!\left(\frac{x_d - X_{id}}{h_d}\right). \qquad (3.57)$$

EXAMPLE 3.3
To get a better understanding of what is going on here, let us consider the two-dimensional case where $d = 2$. In this case (3.55) becomes

$$\widehat{f}_{\boldsymbol{h}}(x_1, x_2) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_1 h_2}\, K\!\left(\frac{x_1 - X_{i1}}{h_1}, \frac{x_2 - X_{i2}}{h_2}\right). \qquad (3.58)$$

Each of the $n$ observations is of the form $\boldsymbol{X}_i = (X_{i1}, X_{i2})^\top$, where the first component gives the value that the random variable $X_1$ takes on at the $i$th observation, and the second component does the same for $X_2$. For illustrative purposes, let us take $K$ to be the product Epanechnikov kernel. Then we get

$$\widehat{f}_{\boldsymbol{h}}(\boldsymbol{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_1 h_2} \left(\frac{3}{4}\right)^{\!2} \left\{1 - \left(\frac{x_1 - X_{i1}}{h_1}\right)^{\!2}\right\} \left\{1 - \left(\frac{x_2 - X_{i2}}{h_2}\right)^{\!2}\right\} \mathbf{1}\!\left(\left|\frac{x_1 - X_{i1}}{h_1}\right| \le 1\right) \mathbf{1}\!\left(\left|\frac{x_2 - X_{i2}}{h_2}\right| \le 1\right).$$

Note that we get a contribution to the sum for observation $i$ only if $X_{i1}$ falls into the interval $[x_1 - h_1, x_1 + h_1]$ and $X_{i2}$ falls into the interval $[x_2 - h_2, x_2 + h_2]$. If even one of the two components fails to fall into its respective interval, then one of the indicator functions takes the value 0 and consequently the observation does not enter the sum.
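The product-kernel estimator (3.57) with Epanechnikov kernels takes only a few lines of code. Below is a minimal sketch (the function names and the random test data are our own, not from the text):

```python
import numpy as np

def epanechnikov(u):
    """Univariate Epanechnikov kernel K(u) = 3/4 (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1)

def kde_product(x, data, h):
    """Product-kernel density estimate at the point x, cf. (3.57).

    x    : evaluation point, array of shape (d,)
    data : observations X_1, ..., X_n, array of shape (n, d)
    h    : bandwidth vector (h_1, ..., h_d), array of shape (d,)
    """
    u = (x - data) / h                          # (x_j - X_ij) / h_j
    weights = np.prod(epanechnikov(u), axis=1)  # product over the d components
    return weights.sum() / (len(data) * np.prod(h))
```

Observation $i$ contributes to the sum only if it falls into the cube $\prod_j [x_j - h_j,\, x_j + h_j]$, exactly as the indicator functions above describe; evaluating the estimate far away from all observations therefore returns exactly zero.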

Note that for kernels with support $[-1, 1]$ (such as the Epanechnikov kernel) observations in a cube around $\boldsymbol{x}$ are used to estimate the density at the point $\boldsymbol{x}$. An alternative is to use a genuinely multivariate kernel function $K(\boldsymbol{u})$, as e.g. the multivariate Epanechnikov kernel

$$K(\boldsymbol{u}) \propto (1 - \boldsymbol{u}^\top \boldsymbol{u})\, \mathbf{1}(\boldsymbol{u}^\top \boldsymbol{u} \le 1),$$

where $\propto$ denotes proportionality. Such multivariate kernels can be obtained from univariate kernel functions by taking

$$K(\boldsymbol{u}) \propto \mathcal{K}(\|\boldsymbol{u}\|), \qquad (3.59)$$

where $\|\boldsymbol{u}\| = \sqrt{\boldsymbol{u}^\top \boldsymbol{u}}$ denotes the Euclidean norm of the vector $\boldsymbol{u}$. Kernels of the form (3.59) use observations from a ball around $\boldsymbol{x}$ to estimate the pdf at $\boldsymbol{x}$. This type of kernel is usually called spherical or radial-symmetric, since $K(\boldsymbol{u})$ takes the same value for all $\boldsymbol{u}$ on a sphere around zero.
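For instance, in $d = 2$ dimensions, normalizing the radial-symmetric Epanechnikov kernel to integrate to one gives the constant $2/\pi$ (the constant for general $d$ is the subject of Exercise 3.13). A quick numerical check of this constant, as a sketch:

```python
import numpy as np

def epanechnikov_radial_2d(u):
    """Radial-symmetric Epanechnikov kernel for d = 2, cf. (3.59):
    K(u) = (2/pi) (1 - ||u||^2) for ||u|| <= 1, and 0 otherwise."""
    r2 = np.sum(np.asarray(u, dtype=float) ** 2, axis=-1)
    return (2.0 / np.pi) * (1.0 - r2) * (r2 <= 1)

# Midpoint-rule check that the kernel integrates to (approximately) one
step = 0.005
grid = np.arange(-1.0 + step / 2, 1.0, step)    # midpoints covering [-1, 1]
xx, yy = np.meshgrid(grid, grid)
vals = epanechnikov_radial_2d(np.stack([xx, yy], axis=-1))
integral = vals.sum() * step ** 2               # should be close to 1
```

The analytic check is elementary: in polar coordinates $\int_{\|\boldsymbol{u}\| \le 1} (1 - \|\boldsymbol{u}\|^2)\, d\boldsymbol{u} = 2\pi \int_0^1 (1 - r^2)\, r\, dr = \pi/2$, so the normalizing constant is indeed $2/\pi$.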

EXAMPLE 3.4
Figure 3.11 shows the contour lines of a bivariate product Epanechnikov kernel and a bivariate radial-symmetric Epanechnikov kernel, on the left and right hand sides respectively.

Note that the kernel weights in Figure 3.11 correspond to equal bandwidths in each direction, i.e. $h_1 = h_2$. When we use different bandwidths, the observations around $\boldsymbol{x}$ enter the density estimate with different weights in the two dimensions.

EXAMPLE 3.5
The contour plots of product and radial-symmetric Epanechnikov weights with different bandwidths, i.e. $h_1 \ne h_2$, are shown in Figure 3.12. Here we used a smaller bandwidth in the second dimension ($h_2 < h_1$), which naturally includes fewer observations in the second dimension.

A very general approach is to use a nonsingular bandwidth matrix $\mathbf{H}$. The general form of the multivariate kernel density estimator is then

$$\widehat{f}_{\mathbf{H}}(\boldsymbol{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\det(\mathbf{H})}\, K\!\left\{\mathbf{H}^{-1} (\boldsymbol{x} - \boldsymbol{X}_i)\right\} = \frac{1}{n} \sum_{i=1}^{n} K_{\mathbf{H}}(\boldsymbol{x} - \boldsymbol{X}_i), \qquad (3.60)$$

see Silverman (1986) and Scott (1992). Here we use the short notation

$$K_{\mathbf{H}}(\boldsymbol{u}) = \frac{1}{\det(\mathbf{H})}\, K(\mathbf{H}^{-1} \boldsymbol{u}),$$

analogous to $K_h(u) = \frac{1}{h} K\!\left(\frac{u}{h}\right)$ in the one-dimensional case. A bandwidth matrix includes all simpler cases as special cases: an equal bandwidth in all dimensions as in (3.54) corresponds to $\mathbf{H} = h\, \mathbf{I}_d$, where $\mathbf{I}_d$ denotes the $d \times d$ identity matrix, and different bandwidths as in (3.55) are equivalent to $\mathbf{H} = \operatorname{diag}(h_1, \ldots, h_d)$, the diagonal matrix with diagonal elements $h_1, \ldots, h_d$.
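Formula (3.60) translates directly into code. Here is a sketch using the standard multivariate Gaussian kernel (our choice for illustration; the function name is ours):

```python
import numpy as np

def kde_matrix(x, data, H):
    """Density estimate with a full bandwidth matrix H, cf. (3.60),
    using the standard multivariate Gaussian kernel.

    x    : evaluation point, shape (d,)
    data : observations, shape (n, d)
    H    : nonsingular bandwidth matrix, shape (d, d)
    """
    n, d = data.shape
    u = (x - data) @ np.linalg.inv(H).T          # rows are H^{-1}(x - X_i)
    kernel_vals = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return kernel_vals.sum() / (n * abs(np.linalg.det(H)))
```

With $\mathbf{H} = h\, \mathbf{I}_d$ this reduces to (3.54), and with $\mathbf{H} = \operatorname{diag}(h_1, \ldots, h_d)$ to the product-Gaussian version of (3.55).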

What effect does the inclusion of off-diagonal elements have? We will see in Subsection 3.6.2 that a good rule of thumb is to use a bandwidth matrix proportional to $\widehat{\Sigma}^{1/2}$, where $\widehat{\Sigma}$ is the covariance matrix of the data. Using such a bandwidth matrix corresponds to transforming the data so that they have an identity covariance matrix. As a consequence, we can use bandwidth matrices to adjust for correlation between the components of $\boldsymbol{X}$. In Figure 3.13 we have plotted the contour curves of product and radial-symmetric Epanechnikov weights for a bandwidth matrix with nonzero off-diagonal elements.

In the following subsections we consider the statistical properties of this estimator (bias and variance), the issue of bandwidth selection, and applications.

## 3.6.1 Bias, Variance and Asymptotics

First let us mention that, as a consequence of the standard assumption

$$\int_{\mathbb{R}^d} K(\boldsymbol{u})\, d\boldsymbol{u} = 1, \qquad (3.61)$$

the estimate $\widehat{f}_{\mathbf{H}}$ is itself a density function, i.e. $\int \widehat{f}_{\mathbf{H}}(\boldsymbol{x})\, d\boldsymbol{x} = 1$. Also, the estimate is consistent at any continuity point $\boldsymbol{x}$ of $f$:

$$\widehat{f}_{\mathbf{H}}(\boldsymbol{x}) \stackrel{P}{\longrightarrow} f(\boldsymbol{x}), \qquad (3.62)$$

see e.g. Ruppert & Wand (1994). The derivation of the bias and the variance is principally analogous to the one-dimensional case. We will only sketch the asymptotic expansions and hence move directly to the derivation of the AMISE.

A detailed derivation of the components of the AMISE can be found in Scott (1992) or Wand & Jones (1995) and the references therein. As in the univariate case we use a second-order Taylor expansion. Here and in the following formulae we denote by $\nabla f$ the gradient and by $\mathcal{H}_f$ the Hessian matrix of second partial derivatives of the function $f$. The Taylor expansion of $f(\boldsymbol{x} + \mathbf{H}\boldsymbol{u})$ around $\boldsymbol{x}$ is then

$$f(\boldsymbol{x} + \mathbf{H}\boldsymbol{u}) \approx f(\boldsymbol{x}) + (\mathbf{H}\boldsymbol{u})^\top \nabla f(\boldsymbol{x}) + \frac{1}{2}\, (\mathbf{H}\boldsymbol{u})^\top \mathcal{H}_f(\boldsymbol{x})\, (\mathbf{H}\boldsymbol{u}),$$

see Wand & Jones (1995, p. 94). This leads to the expectation

$$E\{\widehat{f}_{\mathbf{H}}(\boldsymbol{x})\} = \int K(\boldsymbol{u})\, f(\boldsymbol{x} + \mathbf{H}\boldsymbol{u})\, d\boldsymbol{u} \approx f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^\top \mathbf{H} \int \boldsymbol{u}\, K(\boldsymbol{u})\, d\boldsymbol{u} + \frac{1}{2} \int \boldsymbol{u}^\top \mathbf{H}^\top \mathcal{H}_f(\boldsymbol{x})\, \mathbf{H} \boldsymbol{u}\, K(\boldsymbol{u})\, d\boldsymbol{u}. \qquad (3.63)$$

If we assume, in addition to (3.61), that

$$\int_{\mathbb{R}^d} \boldsymbol{u}\, K(\boldsymbol{u})\, d\boldsymbol{u} = \boldsymbol{0}_d, \qquad (3.64)$$

$$\int_{\mathbb{R}^d} \boldsymbol{u} \boldsymbol{u}^\top K(\boldsymbol{u})\, d\boldsymbol{u} = \mu_2(K)\, \mathbf{I}_d, \qquad (3.65)$$

then (3.63) yields

$$E\{\widehat{f}_{\mathbf{H}}(\boldsymbol{x})\} \approx f(\boldsymbol{x}) + \frac{1}{2}\, \mu_2(K)\, \operatorname{tr}\{\mathbf{H}^\top \mathcal{H}_f(\boldsymbol{x})\, \mathbf{H}\},$$

hence

$$\operatorname{Bias}\{\widehat{f}_{\mathbf{H}}(\boldsymbol{x})\} \approx \frac{1}{2}\, \mu_2(K)\, \operatorname{tr}\{\mathbf{H}^\top \mathcal{H}_f(\boldsymbol{x})\, \mathbf{H}\}. \qquad (3.66)$$

For the variance we find

$$\operatorname{Var}\{\widehat{f}_{\mathbf{H}}(\boldsymbol{x})\} \approx \frac{1}{n \det(\mathbf{H})}\, \|K\|_2^2\, f(\boldsymbol{x}), \qquad (3.67)$$

with $\|K\|_2^2 = \int K(\boldsymbol{u})^2\, d\boldsymbol{u}$ denoting the $d$-dimensional squared $L_2$-norm of $K$. Therefore we have the following AMISE formula for the multivariate kernel density estimator:

$$\operatorname{AMISE}(\mathbf{H}) = \frac{1}{4}\, \mu_2(K)^2 \int \operatorname{tr}^2\{\mathbf{H}^\top \mathcal{H}_f(\boldsymbol{x})\, \mathbf{H}\}\, d\boldsymbol{x} + \frac{1}{n \det(\mathbf{H})}\, \|K\|_2^2. \qquad (3.68)$$

Let us now turn to the problem of how to choose the optimal bandwidth, i.e. the bandwidth which balances the bias-variance trade-off in the AMISE. Denote by $h$ a scalar such that $\mathbf{H} = h\, \mathbf{H}_0$ and $\det(\mathbf{H}_0) = 1$. Then the AMISE can be written as

$$\operatorname{AMISE}(h) = \frac{1}{4}\, h^4\, \mu_2(K)^2 \int \operatorname{tr}^2\{\mathbf{H}_0^\top \mathcal{H}_f(\boldsymbol{x})\, \mathbf{H}_0\}\, d\boldsymbol{x} + \frac{1}{n h^d}\, \|K\|_2^2.$$

If we only allow changes in $h$, the optimal orders for the smoothing parameter and the AMISE are

$$h_{\mathrm{opt}} \sim n^{-1/(d+4)}, \qquad \operatorname{AMISE}(h_{\mathrm{opt}}) \sim n^{-4/(d+4)}.$$

Hence, the multivariate density estimator has a slower rate of convergence than the univariate one, in particular when the dimension $d$ is large.

If we consider $\mathbf{H} = h\, \mathbf{I}_d$ (the same bandwidth $h$ in all dimensions) and we fix the sample size $n$, then the optimal bandwidth has to be considerably larger than in the one-dimensional case to make sure that the estimate is reasonably smooth. Some ideas on the sample sizes needed to reach a comparable quality of the density estimates across different dimensions can be found in Silverman (1986, p. 94) and Scott & Wand (1991). Moreover, the computational effort of this technique increases with the number of dimensions $d$. Therefore, multidimensional density estimation is usually not applied beyond a handful of dimensions.
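The optimal rate $n^{-4/(d+4)}$ makes the curse of dimensionality concrete: ignoring constants, matching the accuracy that $n$ observations achieve in one dimension requires roughly $n^{(d+4)/5}$ observations in $d$ dimensions. A tiny sketch of this back-of-the-envelope calculation (our own illustration; the function name is hypothetical):

```python
def equivalent_sample_size(n, d):
    """Sample size N with N^(-4/(d+4)) = n^(-4/5), i.e. the size needed in
    d dimensions to match the optimal univariate AMISE rate attained with
    n observations (constants are ignored)."""
    return n ** ((d + 4) / 5.0)

# Rate-wise, 100 univariate observations correspond to about
# 100^(6/5) ~ 251 observations in d = 2 and to 100^2 = 10000 in d = 6.
```

The rapid growth of this equivalent sample size is one reason why multivariate kernel density estimation is rarely used in high dimensions.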

## 3.6.2 Bandwidth Selection

The problem of an automatic, data-driven choice of the bandwidth is actually more important in the multivariate than in the univariate case. In one or two dimensions it is easy to choose an appropriate bandwidth interactively, just by looking at plots of the density estimate for different bandwidths. But how can this be done in three, four or more dimensions? Here arises the problem of graphical representation, which we address in the next subsection. As in the one-dimensional case,

• plug-in bandwidths, in particular rule-of-thumb bandwidths, and
• cross-validation bandwidths
are used. We will introduce generalizations for Silverman's rule-of-thumb and least squares cross-validation to show the analogy with the one-dimensional bandwidth selectors.

### 3.6.2.1 Rule-of-thumb Bandwidth

Rule-of-thumb bandwidth selection gives a formula arising from the optimal bandwidth for a reference distribution. Obviously, the pdf of a multivariate normal distribution is a good candidate for a reference distribution in the multivariate case. Suppose that the kernel $K$ is multivariate Gaussian, i.e. the pdf of $N_d(\boldsymbol{0}, \mathbf{I}_d)$. Note that in this case $\mu_2(K) = 1$ and $\|K\|_2^2 = 2^{-d} \pi^{-d/2}$. Hence, from (3.68) and an explicit formula for $\int \operatorname{tr}^2\{\mathbf{H}^\top \mathcal{H}_f(\boldsymbol{x})\, \mathbf{H}\}\, d\boldsymbol{x}$ in the case that $f$ is the pdf of $N_d(\boldsymbol{\mu}, \Sigma)$, cf. Wand & Jones (1995, p. 98), we can easily derive rule-of-thumb formulae for different assumptions on $\Sigma$ and $\mathbf{H}$.

In the simplest case, i.e. when both $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ and $\mathbf{H} = \operatorname{diag}(h_1, \ldots, h_d)$ are diagonal matrices, this leads to

$$h_j = \left(\frac{4}{d+2}\right)^{1/(d+4)} n^{-1/(d+4)}\, \sigma_j, \qquad j = 1, \ldots, d. \qquad (3.69)$$

Note that this formula coincides with Silverman's rule of thumb in the case $d = 1$, see (3.24) and Silverman (1986, p. 45). Replacing the $\sigma_j$ with estimates $\widehat{\sigma}_j$ and noting that the first factor $\left(\frac{4}{d+2}\right)^{1/(d+4)}$ always lies between 0.924 and 1.059, we arrive at Scott's rule:

$$\widehat{h}_j = \widehat{\sigma}_j\, n^{-1/(d+4)}, \qquad (3.70)$$

see Scott (1992, p. 152).
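Scott's rule (3.70) is trivial to compute; a minimal sketch (our own function name):

```python
import numpy as np

def scott_bandwidths(data):
    """Diagonal-bandwidth Scott's rule (3.70): h_j = sigma_hat_j * n^(-1/(d+4)).

    data : observations, array of shape (n, d); returns the d bandwidths."""
    n, d = data.shape
    return data.std(axis=0, ddof=1) * n ** (-1.0 / (d + 4))
```

Note that each bandwidth simply scales the sample standard deviation of the corresponding component by the common factor $n^{-1/(d+4)}$.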

It is not possible to derive a rule-of-thumb formula for general $\Sigma$ and $\mathbf{H}$. However, (3.69) shows that it might be a good idea to choose the bandwidth matrix proportional to $\widehat{\Sigma}^{1/2}$. In this case we get the following generalization of Scott's rule:

$$\widehat{\mathbf{H}} = n^{-1/(d+4)}\, \widehat{\Sigma}^{1/2}. \qquad (3.71)$$

We remark that this rule is equivalent to applying a Mahalanobis transformation to the data (to transform the estimated covariance matrix to identity), then computing the kernel estimate with Scott's rule (3.70) and finally retransforming the estimated pdf back to the original scale.
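The generalized rule (3.71) requires a matrix square root of $\widehat{\Sigma}$. A sketch using the symmetric square root via the eigendecomposition (one of several possible choices of square root; the function name is ours):

```python
import numpy as np

def scott_bandwidth_matrix(data):
    """Generalized Scott's rule (3.71): H_hat = n^(-1/(d+4)) Sigma_hat^(1/2),
    using the symmetric square root of the sample covariance matrix."""
    n, d = data.shape
    Sigma = np.cov(data, rowvar=False)
    eigval, eigvec = np.linalg.eigh(Sigma)       # Sigma is symmetric PSD
    Sigma_sqrt = eigvec @ np.diag(np.sqrt(eigval)) @ eigvec.T
    return n ** (-1.0 / (d + 4)) * Sigma_sqrt
```

For uncorrelated data this matrix is diagonal and reduces to Scott's rule (3.70); in general it satisfies $\widehat{\mathbf{H}} \widehat{\mathbf{H}} = n^{-2/(d+4)} \widehat{\Sigma}$, so the implied data transformation whitens the estimated covariance.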

In principle, all plug-in methods for one-dimensional kernel density estimation can be extended to the multivariate case. In practice, however, this is cumbersome, since the derivation of the asymptotics involves multivariate derivatives and higher-order Taylor expansions.

### 3.6.2.2 Cross-validation

As we mentioned before, the cross-validation method is fairly independent of the particular structure of the parameter or function estimate. Concerning the bandwidth choice problem, cross-validation techniques allow us to adapt to a wider class of density functions than the rule-of-thumb approach. (Remember that the rule-of-thumb bandwidth is optimal for the reference pdf; hence it will fail, for instance, for multimodal densities.)

Recall that, in contrast to the rule-of-thumb approach, least squares cross-validation for density estimation does not estimate the AMISE-optimal but the ISE-optimal bandwidth. Here we approximate the integrated squared error

$$\operatorname{ISE}(\mathbf{H}) = \int \left\{\widehat{f}_{\mathbf{H}}(\boldsymbol{x}) - f(\boldsymbol{x})\right\}^2 d\boldsymbol{x} = \int \widehat{f}_{\mathbf{H}}^{\,2}(\boldsymbol{x})\, d\boldsymbol{x} - 2 \int \widehat{f}_{\mathbf{H}}(\boldsymbol{x})\, f(\boldsymbol{x})\, d\boldsymbol{x} + \int f^2(\boldsymbol{x})\, d\boldsymbol{x}. \qquad (3.72)$$

Apparently, this is the same formula as in the one-dimensional case, and by the same arguments the last term of (3.72) can be ignored, since it does not depend on the bandwidth. The first term can again be calculated from the data. Hence, only the second term of (3.72) is unknown and must be estimated. Observe that $\int \widehat{f}_{\mathbf{H}}(\boldsymbol{x})\, f(\boldsymbol{x})\, d\boldsymbol{x} = E\{\widehat{f}_{\mathbf{H}}(\boldsymbol{X})\}$, where the only new aspect is that $\boldsymbol{X}$ is now $d$-dimensional. The resulting expectation, though, is a scalar. As in (3.32) we estimate this term by the leave-one-out estimator

$$\frac{1}{n} \sum_{i=1}^{n} \widehat{f}_{\mathbf{H},-i}(\boldsymbol{X}_i),$$

where

$$\widehat{f}_{\mathbf{H},-i}(\boldsymbol{X}_i) = \frac{1}{n-1} \sum_{j \ne i} K_{\mathbf{H}}(\boldsymbol{X}_i - \boldsymbol{X}_j)$$

is simply the multivariate version of (3.33). Also, the multivariate generalization of (3.37) is straightforward, which yields the multivariate cross-validation criterion as a direct generalization of the one-dimensional case:

$$CV(\mathbf{H}) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} (K_{\mathbf{H}} \star K_{\mathbf{H}})(\boldsymbol{X}_i - \boldsymbol{X}_j) - \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j \ne i} K_{\mathbf{H}}(\boldsymbol{X}_i - \boldsymbol{X}_j),$$

where $\star$ denotes convolution.
The difficulty lies in the fact that the bandwidth is now a matrix $\mathbf{H}$. In the most general case of a symmetric bandwidth matrix, this means minimizing over $\frac{1}{2} d(d+1)$ parameters. Even if we restrict $\mathbf{H}$ to be a diagonal matrix, this remains a $d$-dimensional optimization problem. This holds for other cross-validation approaches, too.
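For a concrete case, here is the criterion for the multivariate Gaussian kernel with equal bandwidth in all dimensions, $\mathbf{H} = h\, \mathbf{I}_d$ (a sketch; it exploits the fact that the convolution of two Gaussian kernels with variance $h^2$ is again Gaussian, with variance $2h^2$):

```python
import numpy as np

def cv_gaussian(h, data):
    """Least squares cross-validation criterion CV(h) for the multivariate
    Gaussian kernel with H = h * I_d:
    CV(h) = int fhat^2 dx - (2/n) sum_i fhat_{-i}(X_i)."""
    n, d = data.shape
    diff = data[:, None, :] - data[None, :, :]
    ss = np.sum(diff ** 2, axis=2)               # ||X_i - X_j||^2 for all pairs
    # first term: K_H star K_H is a Gaussian kernel with variance 2 h^2
    conv = np.exp(-ss / (4 * h ** 2)) / (4 * np.pi * h ** 2) ** (d / 2)
    term1 = conv.sum() / n ** 2
    # second term: leave-one-out estimate of E{fhat(X)}
    k = np.exp(-ss / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (d / 2)
    np.fill_diagonal(k, 0.0)
    term2 = k.sum() / (n * (n - 1))
    return term1 - 2.0 * term2
```

In practice one simply minimizes this criterion over a grid of bandwidths: for a sensible $h$ the criterion is negative (roughly $-\int f^2$, since the omitted last term of (3.72) is dropped), while undersmoothing drives it to $+\infty$ because only the diagonal convolution terms survive.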

## 3.6.3 Computation and Graphical Representation

Consider now the problem of graphically displaying our multivariate density estimates. Assume first that $d = 2$. Here we are still able to show the density estimate in a three-dimensional plot. This is particularly useful if the estimated function can be rotated interactively on the computer screen. For a two-dimensional presentation, a contour plot often gives more insight into the structure of the data.

EXAMPLE 3.6
Figures 3.14 and 3.15 display such a two-dimensional density estimate $\widehat{f}_{\boldsymbol{h}}(x_1, x_2)$ for two explanatory variables of the data on East-West German migration intention in Spring 1991, see Example 1.4. We use the subscript $\boldsymbol{h}$ to indicate that we used a diagonal bandwidth matrix $\mathbf{H} = \operatorname{diag}(h_1, h_2)$. Aside from some categorical variables on educational level, professional status, existence of relations to Western Germany, and regional dummies, our data set contains observations on age, household income and environmental satisfaction. In Figure 3.14 we plot the joint density estimate for age and household income. Additionally, Figure 3.15 gives a contour plot of this density estimate. It is easily observed that the age distribution is considerably left skewed.

Here and in the following plots the bandwidths were chosen according to the general rule of thumb (3.71), which tends to oversmooth bimodal structures in the data. The kernel function is always the product Quartic kernel.

Consider now how to display three- or even higher-dimensional density estimates. One possible approach is to hold one variable fixed and to plot the density as a function of the remaining variables. For three-dimensional data this gives three plots: $\widehat{f}$ as a function of $(x_1, x_2)$ with $x_3$ fixed, of $(x_1, x_3)$ with $x_2$ fixed, and of $(x_2, x_3)$ with $x_1$ fixed.

EXAMPLE 3.7
We display this technique in Figure 3.16 for data from a credit scoring sample, using duration of the credit, household income and age as variables (Fahrmeir & Tutz, 1994). The title of each panel indicates which variable is held fixed at which level.

EXAMPLE 3.8
Alternatively, we can plot contours of the density estimate, now in three dimensions, which means three-dimensional surfaces. Figure 3.17 shows this for the credit scoring data. In the original version of this plot, red, green and blue surfaces show the values of the density estimate at the levels (in percent) indicated on the right. Colors and the possibility of rotating the contours on the computer screen ease the exploration of the data structures significantly.

For alternative texts on kernel density estimation we refer to the monographs by Silverman (1986), Härdle (1990), Scott (1992) and Wand & Jones (1995).

A particular field of interest and ongoing research is the matter of bandwidth selection. In addition to what we have presented, a variety of other cross-validation approaches and refined plug-in bandwidth selectors have been proposed.

In particular, the following methods are based on the cross-validation idea: pseudo-likelihood cross-validation (Duin, 1976; Habbema et al., 1974), biased cross-validation (Scott & Terrell, 1987; Cao et al., 1994) and smoothed cross-validation (Hall et al., 1992). The latter two approaches also attempt to find the optimal bandwidth, hence their performance is also assessed by the rate of relative convergence to this optimal bandwidth. A detailed treatment of many cross-validation procedures can be found in Park & Turlach (1992).

Regarding other refined plug-in bandwidth selectors, the methods of Sheather and Jones (Sheather & Jones, 1991) and of Hall, Sheather, Jones and Marron (Hall et al., 1991) should be mentioned, as they have good asymptotic convergence properties. A number of authors provide extensive simulation studies on smoothing parameter selection; we want to mention in particular Marron (1989), Jones et al. (1996), Park & Turlach (1992), and Cao et al. (1994).

An alternative approach is the SiZer (SIgnificant ZERo crossings of derivatives) method of Chaudhuri & Marron (1999), which tries to find features of a curve directly, such as bumps and valleys. At the same time it is a tool for visualizing the estimated curve at a range of different bandwidth values. SiZer thus provides a way around the issue of smoothing parameter selection.

EXERCISE 3.1   Calculate the exact values of

for the Gaussian, Epanechnikov and Quartic kernels.

EXERCISE 3.2   Show that for the normal density holds

EXERCISE 3.3   Show that the kernel density estimate $\widehat{f}_h$ is itself a pdf if the kernel $K$ is one (i.e. if $K \ge 0$ and $\int K(u)\, du = 1$).

EXERCISE 3.4   The statistical bureau of Borduria questions households about their income. The young econometrician Tintin proposes to use these data $X_1, \ldots, X_n$ for a nonparametric estimate of the income density $f$. Tintin suggests computing a confidence interval for his kernel density estimate

$$\widehat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)$$

with kernel function $K$. Explain how this can be done. Simulate this on the basis of a lognormal sample with given parameters $\mu$ and $\sigma^2$.

EXERCISE 3.5   Explain the constant in the "better rule of thumb" (3.29).

EXERCISE 3.6   Show that for a rescaled kernel

EXERCISE 3.7   Derive formula (3.44) for the canonical bandwidth.

EXERCISE 3.8   Derive the formula for in the case (Gaussian Kernel).

EXERCISE 3.9   Show that

EXERCISE 3.10   Simulate a mixture of normal densities with pdf

and plot the density and its estimate with a cross-validated bandwidth.

EXERCISE 3.11   Show how to get from equation (3.12) to (3.13).

EXERCISE 3.12   Compute for the Quartic kernel.

EXERCISE 3.13   One possible way to construct a multivariate kernel $K(\boldsymbol{u})$ is to use a one-dimensional kernel $\mathcal{K}$. This relationship is given by formula (3.59), i.e.

$$K(\boldsymbol{u}) = c_d\, \mathcal{K}(\|\boldsymbol{u}\|),$$

where $\|\boldsymbol{u}\| = \sqrt{\boldsymbol{u}^\top \boldsymbol{u}}$ and $c_d$ is a normalizing constant. Find the appropriate constant $c_d$ for a two-dimensional a) Gaussian, b) Epanechnikov, and c) Triangle kernel.

EXERCISE 3.14   Show that $\operatorname{Bias}\{\widehat{f}_h(x)\} \approx \frac{h^2}{2}\, \mu_2(K)\, f''(x)$. Assume that $f$ possesses a second derivative and that $K$ is a symmetric kernel with $\mu_2(K) < \infty$.

EXERCISE 3.15   Explain why averaging over the leave-one-out estimator (3.32) is the appropriate way to estimate the expected value of $\widehat{f}_h$ w.r.t. an independent random variable $X$.

EXERCISE 3.16   Show that

## Summary

Kernel density estimation is a generalization of the histogram. The kernel density estimate at a point $x$,

$$\widehat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right),$$

corresponds to the histogram bar height for a bin of the form $[x - h, x + h)$ if we use the uniform kernel.
The bias and variance of the kernel density estimator are

$$\operatorname{Bias}\{\widehat{f}_h(x)\} \approx \frac{h^2}{2}\, \mu_2(K)\, f''(x), \qquad \operatorname{Var}\{\widehat{f}_h(x)\} \approx \frac{1}{nh}\, \|K\|_2^2\, f(x).$$

The AMISE of the kernel density estimator is

$$\operatorname{AMISE}(h) = \frac{h^4}{4}\, \mu_2(K)^2\, \|f''\|_2^2 + \frac{1}{nh}\, \|K\|_2^2.$$
By using the normal distribution as a reference distribution for calculating $\|f''\|_2^2$, we get Silverman's rule-of-thumb bandwidth

$$\widehat{h}_{\mathrm{ROT}} = 1.06\, \widehat{\sigma}\, n^{-1/5},$$

which assumes the kernel $K$ to be Gaussian. Other plug-in bandwidths can be found by using more sophisticated estimates of $\|f''\|_2^2$.
When using the integrated squared error as a goodness-of-fit criterion for $\widehat{f}_h$, we can derive the least squares cross-validation criterion for bandwidth selection:

$$CV(h) = \frac{1}{n^2 h} \sum_{i=1}^{n} \sum_{j=1}^{n} (K \star K)\!\left(\frac{X_j - X_i}{h}\right) - \frac{2}{n(n-1)h} \sum_{i=1}^{n} \sum_{j \ne i} K\!\left(\frac{X_j - X_i}{h}\right).$$
The concept of canonical kernels allows us to separate the kernel choice from the bandwidth choice. We find a canonical bandwidth $\delta_0^K$ for each kernel function $K$, which gives us the equivalent degree of smoothing. This equivalence allows one to adjust bandwidths from different kernel functions to obtain approximately the same value of the AMISE. For bandwidth $h^A$ and kernel $K^A$ the bandwidth

$$h^B = \frac{\delta_0^B}{\delta_0^A}\, h^A$$

is the equivalent bandwidth for kernel $K^B$. So, for instance, Silverman's rule-of-thumb bandwidth has to be adjusted by a factor of $\delta_0^{\mathrm{Qua}}/\delta_0^{\mathrm{Gau}} \approx 2.62$ for use with the Quartic kernel.
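The adjustment factor from the Gaussian to the Quartic kernel is a quick computation from the canonical bandwidth $\delta_0^K = \{\|K\|_2^2 / \mu_2(K)^2\}^{1/5}$ (a sketch, using the standard kernel constants $\|K\|_2^2$ and $\mu_2(K)$):

```python
import numpy as np

def canonical_bandwidth(norm_sq, mu2):
    """Canonical bandwidth delta_0 = (||K||_2^2 / mu_2(K)^2)^(1/5)."""
    return (norm_sq / mu2 ** 2) ** 0.2

# Gaussian: ||K||_2^2 = 1/(2 sqrt(pi)), mu_2 = 1; Quartic: ||K||_2^2 = 5/7, mu_2 = 1/7
delta_gauss = canonical_bandwidth(1.0 / (2.0 * np.sqrt(np.pi)), 1.0)
delta_quartic = canonical_bandwidth(5.0 / 7.0, 1.0 / 7.0)   # = 35^(1/5)
factor = delta_quartic / delta_gauss                        # approx. 2.62
```

This reproduces the factor quoted above for converting a Gaussian-kernel bandwidth into an equivalent Quartic-kernel bandwidth.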
The asymptotic normality of the kernel density estimator,

$$\sqrt{nh}\, \left\{\widehat{f}_h(x) - f(x)\right\} \stackrel{L}{\longrightarrow} N\!\left(b_x,\; f(x)\, \|K\|_2^2\right),$$

where $b_x$ denotes the asymptotic bias term, allows us to compute confidence intervals for $f(x)$. Confidence bands can be computed as well, although under more restrictive assumptions on $f$.
The kernel density estimator for univariate data can be easily generalized to the multivariate case:

$$\widehat{f}_{\mathbf{H}}(\boldsymbol{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\det(\mathbf{H})}\, K\!\left\{\mathbf{H}^{-1}(\boldsymbol{x} - \boldsymbol{X}_i)\right\},$$

where the bandwidth matrix $\mathbf{H}$ now replaces the bandwidth parameter $h$. The multivariate kernel $K$ is typically chosen to be a product or radial-symmetric kernel function. Asymptotic properties and bandwidth selection are analogous, but more cumbersome. Canonical bandwidths can be used here as well to adjust between different kernel functions.
A special problem is the graphical display of multivariate density estimates. Lower-dimensional intersections, projections or contour plots may display only part of the features of a density function.