3.6 Multivariate Kernel Density Estimation

Often one is interested in estimating not only one-dimensional but also multivariate densities. Recall e.g. the U.K. Family Expenditure data, where we in fact have observations of net-income and of expenditures on different goods, such as housing, fuel, food, clothing, durables, transport, alcohol & tobacco, etc.

Consider a $ d$-dimensional random vector $ {\boldsymbol{X}}=(X_{1},\ldots, X_{d})^\top$ where $ X_{1}$, ..., $ X_{d}$ are one-dimensional random variables. Drawing a random sample of size $ n$ in this setting means that we have $ n$ observations for each of the $ d$ random variables, $ X_{1},\ldots,X_{d}$. Suppose that we collect the $ i$th observation of each of the $ d$ random variables in the vector $ {{\boldsymbol{X}}_{i}}$:

$\displaystyle {{\boldsymbol{X}}_{i}}= \left(\begin{array}{c} X_{i1} \\ \vdots \\ X_{id}
\end{array}\right), \quad i=1,\ldots,n,$

where $ X_{ij}$ is the $ i$th observation of the random variable $ X_{j}$. Our goal now is to estimate the probability density of $ {\boldsymbol{X}}=(X_{1},\ldots, X_{d})^\top$, which is just the joint pdf $ f$ of the random variables $ X_{1},\ldots,X_{d}$

$\displaystyle f({\boldsymbol{x}})=f(x_{1},\ldots,x_{d}).$ (3.53)

From our previous experience with the one-dimensional case we might consider adapting the kernel density estimator to the $ d$-dimensional case, and write
$\displaystyle \widehat{f}_{h}({{\boldsymbol{x}}})$ $\displaystyle =$ $\displaystyle \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^{d}}\,{\mathcal{K}}\left(\frac{{{\boldsymbol{x}}-{\boldsymbol{X}}_{i}}}{h}\right)$ (3.54)
  $\displaystyle =$ $\displaystyle \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^{d}}\,{\mathcal{K}}\left(\frac{x_{1}-X_{i1}}{h},\ldots,\frac{x_{d}-X_{id}}{h}\right),$

with $ {\mathcal{K}}$ denoting a multivariate kernel function operating on $ d$ arguments. Note that (3.54) assumes that the bandwidth $ h$ is the same for each component. If we relax this assumption then we have a vector of bandwidths $ {\boldsymbol{h}}=(h_{1},\ldots, h_{d})^\top$ and the multivariate kernel density estimator becomes

$\displaystyle \widehat{f}_{{\boldsymbol{h}}}({\boldsymbol{x}})= \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_{1}\cdots h_{d}}\,{\mathcal{K}}\left(\frac{x_1-X_{i1}}{h_{1}},\ldots, \frac{x_d-X_{id}}{h_{d}}\right).$ (3.55)

What form should the multivariate kernel $ {\mathcal{K}}({\boldsymbol{u}})={\mathcal{K}}(u_{1},\dots,u_{d})$ take? The easiest solution is to use a multiplicative kernel

$\displaystyle {\mathcal{K}}({\boldsymbol{u}})=K(u_{1})\cdotp \ldots \cdotp K(u_{d}),$ (3.56)

where $ K$ denotes a univariate kernel function. In this case (3.55) becomes

$\displaystyle \widehat{f}_{{\boldsymbol{h}}}({\boldsymbol{x}})= \frac{1}{n}\sum_{i=1}^{n} \left\{ \prod_{j=1}^{d}h_{j}^{-1}\, K \left(\frac{x_{j}-X_{ij}}{h_{j}}\right)\right\}.$ (3.57)

To get a better understanding of what is going on here let us consider the two-dimensional case where $ {\boldsymbol{X}}=\left(X_{1}, X_{2}\right)^\top$. In this case (3.55) becomes
$\displaystyle \widehat{f}_{{\boldsymbol{h}}}({\boldsymbol{x}})$ $\displaystyle =$ $\displaystyle \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_{1}}\frac{1}{h_{2}}\,{\mathcal{K}}\left( \frac{x_{1}-X_{i1}}{h_{1}},\frac{x_{2}-X_{i2}}{h_{2}}\right)$
  $\displaystyle =$ $\displaystyle \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_{1}}\frac{1}{h_{2}}\, K\left(\frac{x_{1}-X_{i1}}{h_{1}}\right) K\left(\frac{x_{2}-X_{i2}}{h_{2}}\right).$ (3.58)

Each of the $ n$ observations is of the form $ (X_{i1},X_{i2})$, where the first component gives the value that the random variable $ X_{1}$ takes on at the $ i$th observation and the second component does the same for $ X_{2}$. For illustration, let us take $ K$ to be the Epanechnikov kernel. Then we get
$\displaystyle \widehat{f}_{{\boldsymbol{h}}}({\boldsymbol{x}})$ $\displaystyle =$ $\displaystyle \widehat{f}_{h_{1},h_{2}}(x_{1},x_{2}) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_{1}}\frac{1}{h_{2}}\, K\left( \frac{x_{1}-X_{i1}}{h_{1}}\right) K\left( \frac{x_{2}-X_{i2}}{h_{2}}\right)$
  $\displaystyle =$ $\displaystyle \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_{1}}\frac{1}{h_{2}}\; \frac{3}{4} \left\{1-\left(\frac{x_{1}-X_{i1}}{h_{1}}\right)^{2}\right\} \Ind\left(\left\vert\frac{x_{1}-X_{i1}}{h_{1}}\right\vert\leq 1\right)$
    $\displaystyle \quad\quad \cdotp \frac{3}{4} \left\{1-\left(\frac{x_{2}-X_{i2}}{h_{2}}\right)^{2}\right\} \Ind\left(\left\vert\frac{x_{2}-X_{i2}}{h_{2}}\right\vert\leq 1\right).$

Note that we get a contribution to the sum for observation $ i$ only if $ X_{i1}$ falls into the interval $ [x_{1}-h_{1},x_{1}+h_{1}]$ and $ X_{i2}$ falls into the interval $ [x_{2}-h_{2},x_{2}+h_{2}]$. If even one of the two components fails to fall into its respective interval, then one of the indicator functions takes the value 0 and consequently the observation does not enter the frequency count. $ \Box$
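The two-dimensional product Epanechnikov estimator (3.58) is short to implement. The following is a minimal sketch, assuming a simulated $ N(0,{\mathbf{I}}_2)$ sample; the function names, the sample size and the bandwidths are our own illustrative choices:

```python
import numpy as np

def epanechnikov(u):
    """Univariate Epanechnikov kernel K(u) = 3/4 (1 - u^2) for |u| <= 1."""
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u ** 2), 0.0)

def kde_product_2d(x, data, h1, h2):
    """Bivariate product-kernel density estimate, formula (3.58)."""
    u1 = (x[0] - data[:, 0]) / h1
    u2 = (x[1] - data[:, 1]) / h2
    # an observation contributes only if BOTH indicator conditions hold
    return np.mean(epanechnikov(u1) * epanechnikov(u2)) / (h1 * h2)

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 2))   # n = 500 draws from N(0, I_2)
est = kde_product_2d(np.array([0.0, 0.0]), data, h1=0.8, h2=0.8)
print(est)
```

For this sample the estimate at the origin should be of the order of the true $ N(0,{\mathbf{I}}_2)$ density $ 1/(2\pi)\approx 0.159$ there, up to bias and sampling noise.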

Note that for kernels with support $ [-1,1]$ (such as the Epanechnikov kernel), observations in a cube around $ {\boldsymbol{x}}$ are used to estimate the density at the point $ {\boldsymbol{x}}$. An alternative is to use a genuinely multivariate kernel function $ {\mathcal{K}}({\boldsymbol{u}})$, e.g. the multivariate Epanechnikov

$\displaystyle {\mathcal{K}}({\boldsymbol{u}}) \propto (1-{\boldsymbol{u}}^\top{\boldsymbol{u}})\;\textrm{I}
({\boldsymbol{u}}^\top{\boldsymbol{u}}\le 1),$

where $ \propto$ denotes proportionality. These multivariate kernels can be obtained from univariate kernel functions by taking

$\displaystyle {\mathcal{K}}({\boldsymbol{u}}) \propto K(\Vert{\boldsymbol{u}}\Vert),$ (3.59)

where $ \Vert{\boldsymbol{u}}\Vert = \sqrt{{\boldsymbol{u}}^\top{\boldsymbol{u}}}$ denotes the Euclidean norm of the vector $ {\boldsymbol{u}}$. Kernels of the form (3.59) use observations from a circle around $ {\boldsymbol{x}}$ to estimate the pdf at $ {\boldsymbol{x}}$. This type of kernel is usually called spherical or radial-symmetric since $ {\mathcal{K}}({\boldsymbol{u}})$ has the same value for all $ {\boldsymbol{u}}$ on a sphere around zero.
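The difference in support between the product and the radial-symmetric kernel can be checked directly: the point $ {\boldsymbol{u}}=(0.9, 0.9)^\top$ lies inside the unit cube but outside the unit ball. A small sketch (the normalizing constant $ 2/\pi$ for the bivariate radial Epanechnikov follows from integrating $ c\,(1-{\boldsymbol{u}}^\top{\boldsymbol{u}})$ over the unit disk):

```python
import numpy as np

def product_epanechnikov(u):
    """Product kernel (3.56): positive exactly on the cube max_j |u_j| <= 1."""
    u = np.asarray(u, dtype=float)
    return float(np.prod(np.where(np.abs(u) <= 1, 0.75 * (1.0 - u ** 2), 0.0)))

def spherical_epanechnikov(u):
    """Radial-symmetric bivariate Epanechnikov: positive exactly on u'u <= 1.
    The constant 2/pi makes the kernel integrate to one over the unit disk."""
    u = np.asarray(u, dtype=float)
    r2 = float(u @ u)
    return (2.0 / np.pi) * (1.0 - r2) if r2 <= 1.0 else 0.0

corner = [0.9, 0.9]   # inside the cube, outside the ball: u'u = 1.62 > 1
print(product_epanechnikov(corner) > 0.0)    # True
print(spherical_epanechnikov(corner) == 0.0) # True
```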

Figure 3.11 shows the contour lines from a bivariate product and a bivariate radial-symmetric Epanechnikov kernel, on the left and right hand side respectively. $ \Box$

Figure: Contours from bivariate product (left) and bivariate radial-symmetric (right) Epanechnikov kernel with equal bandwidths

Note that the kernel weights in Figure 3.11 correspond to an equal bandwidth in each direction, i.e. $ {\boldsymbol{h}}=(h_1,h_2)^\top=(1,1)^\top$. When we use different bandwidths, the observations around $ {\boldsymbol{x}}$ enter the density estimate $ \widehat{f}_{{\boldsymbol{h}}}({\boldsymbol{x}})$ with different weights in the two dimensions.

The contour plots of product and radial-symmetric Epanechnikov weights with different bandwidths, i.e. $ {\mathcal{K}}_{{\boldsymbol{h}}}({\boldsymbol{u}})
= {\mathcal{K}}({u_1}/{h_1},{u_2}/{h_2})/(h_1h_2)$, are shown in Figure 3.12. Here we used $ h_1=1$ and $ h_2=0.5$ which naturally includes fewer observations in the second dimension. $ \Box$

Figure: Contours from bivariate product (left) and bivariate radial-symmetric (right) Epanechnikov kernel with different bandwidths

A very general approach is to use a bandwidth matrix $ {\mathbf{H}}$ (nonsingular). The general form for the multivariate density estimator is then

$\displaystyle \widehat{f}_{{\mathbf{H}}}({\boldsymbol{x}}) = \frac{1}{n}\sum_{i=1}^{n} {\mathcal{K}}_{{\mathbf{H}}}\left({\boldsymbol{x}}-{\boldsymbol{X}}_{i}\right),$ (3.60)

see Silverman (1986) and Scott (1992). Here we used the short notation

$\displaystyle {\mathcal{K}}_{{\mathbf{H}}}(\bullet) = \frac{1}{\mathop{\rm {det}}({\mathbf{H}})}\, {\mathcal{K}}({\mathbf{H}}^{-1}\bullet)$
analogously to $ K_h$ in the one-dimensional case. A bandwidth matrix includes all simpler cases as special cases. An equal bandwidth $ h$ in all dimensions as in (3.54) corresponds to $ {\mathbf{H}}=h{\mathbf{I}}_d$ where $ {\mathbf{I}}_d$ denotes the $ d\times d$ identity matrix. Different bandwidths as in (3.55) are equivalent to $ {\mathbf{H}}=\mathop{\hbox{diag}}(h_1, \ldots, h_d)$, the diagonal matrix with elements $ h_1,\ldots, h_d$.
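A direct implementation of (3.60) with the short notation $ {\mathcal{K}}_{{\mathbf{H}}}$ needs only a matrix inverse and a determinant. Below is a sketch assuming a Gaussian kernel; the function names and the simulated data are illustrative only:

```python
import numpy as np

def gaussian_kernel(u):
    """Multivariate Gaussian kernel: the pdf of N(0, I_d) evaluated at u."""
    d = u.size
    return np.exp(-0.5 * float(u @ u)) / (2.0 * np.pi) ** (d / 2.0)

def kde_H(x, data, H):
    """Bandwidth-matrix estimator (3.60): K_H(u) = K(H^{-1} u) / det(H)."""
    U = (x - data) @ np.linalg.inv(H).T       # rows are H^{-1}(x - X_i)
    vals = np.array([gaussian_kernel(u) for u in U])
    return vals.mean() / np.linalg.det(H)

rng = np.random.default_rng(1)
data = rng.standard_normal((400, 2))
# H = h I_2 recovers the equal-bandwidth estimator (3.54);
# H = diag(h_1, h_2) gives (3.55) with a product Gaussian kernel
est = kde_H(np.zeros(2), data, 0.5 * np.eye(2))
print(est)
```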

What effect does the inclusion of off-diagonal elements have? We will see in Subsection 3.6.2 that a good rule of thumb is to use a bandwidth matrix proportional to $ \widehat{{\mathbf{\Sigma}}}^{1/2}$, where $ \widehat{{\mathbf{\Sigma}}}$ is the estimated covariance matrix of the data. Using such a bandwidth corresponds to transforming the data so that they have an identity covariance matrix. As a consequence, we can use bandwidth matrices to adjust for correlation between the components of $ {\boldsymbol{X}}$. We have plotted the contour curves of product and radial-symmetric Epanechnikov weights with bandwidth matrix

\begin{displaymath}{\mathbf{H}}= \left(\begin{array}{cc} 1 & 0.5\\ 0.5 & 1 \end{array}\right),\end{displaymath}
i.e. $ {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{u}})= {\mathcal{K}}({\mathbf{H}}^{-1}{\boldsymbol{u}})
/\mathop{\rm {det}}({\mathbf{H}})$, in Figure 3.13.

Figure: Contours from bivariate product (left) and bivariate radial-symmetric (right) Epanechnikov kernel with bandwidth matrix

In the following subsection we will consider statistical properties of bias, variance, the issue of bandwidth selection and applications for this estimator.

3.6.1 Bias, Variance and Asymptotics

First let us mention that as a consequence of the standard assumption

$\displaystyle \int {\mathcal{K}}({\boldsymbol{u}})\;d{\boldsymbol{u}}= 1$ (3.61)

the estimate $ \widehat{f}_{{\mathbf{H}}}$ is a density function, i.e. $ \int \widehat{f}_{{\mathbf{H}}}({\boldsymbol{x}})\;d{\boldsymbol{x}}= 1$. Also, the estimate is consistent in any point $ {\boldsymbol{x}}$:

$\displaystyle \widehat{f}_{{\mathbf{H}}}({\boldsymbol{x}}) = \frac{1}{n}\sum_{i=1}^{n} {\mathcal{K}}_{{\mathbf{H}}}\left({\boldsymbol{x}}-{\boldsymbol{X}}_{i}\right) \mathrel{\mathop{\longrightarrow}\limits_{}^{P}} f({\boldsymbol{x}}),$ (3.62)

see e.g. Ruppert & Wand (1994). The derivation of $ \mse$ and $ \mise$ is essentially analogous to the one-dimensional case. We will only sketch the asymptotic expansions and move directly to the derivation of $ \amise$.

A detailed derivation of the components of $ \amise$ can be found in Scott (1992) or Wand & Jones (1995) and the references therein. As in the univariate case we use a second order Taylor expansion. Here and in the following formulae we denote with $ \gradi_{f}$ the gradient and with $ {\mathcal{H}}_{f}$ the Hessian matrix of second partial derivatives of a function (here $ f$). Then the Taylor expansion of $ f(\bullet)$ around $ {\boldsymbol{x}}$ is

$\displaystyle f({\boldsymbol{x}}+{\boldsymbol{t}}) = f({\boldsymbol{x}}) + {\boldsymbol{t}}^\top {\gradi}_{f}({\boldsymbol{x}}) + \frac{1}{2}\,{\boldsymbol{t}}^\top {\mathcal{H}}_{f}({\boldsymbol{x}})\,{\boldsymbol{t}} + o({\boldsymbol{t}}^\top{\boldsymbol{t}}),$

see Wand & Jones (1995, p. 94). This leads to the expectation
$\displaystyle E\widehat{f}_{{\mathbf{H}}}({\boldsymbol{x}})$ $\displaystyle =$ $\displaystyle \int {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{u}}-{\boldsymbol{x}})\,f({\boldsymbol{u}})\,d{\boldsymbol{u}}$
  $\displaystyle =$ $\displaystyle \int {\mathcal{K}}({\boldsymbol{s}})\,f({\boldsymbol{x}}+{\mathbf{H}}{\boldsymbol{s}})\,d{\boldsymbol{s}}$
  $\displaystyle \approx$ $\displaystyle \int {\mathcal{K}}({\boldsymbol{s}})\,\big\{ f({\boldsymbol{x}}) + {\boldsymbol{s}}^\top{\mathbf{H}}^\top {\gradi}_{f}({\boldsymbol{x}}) + \frac{1}{2}\,{\boldsymbol{s}}^\top{\mathbf{H}}^\top {\mathcal{H}}_{f}({\boldsymbol{x}})\, {\mathbf{H}}{\boldsymbol{s}} \big\}\,d{\boldsymbol{s}}.$ (3.63)

If we assume additionally to (3.61)
    $\displaystyle \int {\boldsymbol{u}}\,{\mathcal{K}}({\boldsymbol{u}})\;d{\boldsymbol{u}}= 0_{d},$ (3.64)
    $\displaystyle \int {\boldsymbol{u}}{\boldsymbol{u}}^\top {\mathcal{K}}({\boldsymbol{u}})\;d{\boldsymbol{u}}= \mu_{2}({\mathcal{K}}){\mathbf{I}}_{d},$ (3.65)

then (3.63) yields $ E\widehat{f}_{{\mathbf{H}}}({\boldsymbol{x}}) - f({\boldsymbol{x}}) \approx \frac{1}{2}\,\mu_{2}({\mathcal{K}})\mathop{\hbox{tr}}\{{\mathbf{H}}^\top{\mathcal{H}}_{f}({\boldsymbol{x}}){\mathbf{H}}\}$, hence

$\displaystyle \bias\{\widehat{f}_{{\mathbf{H}}}({\boldsymbol{x}})\} \approx \frac{\mu_{2}({\mathcal{K}})}{2}\, \mathop{\hbox{tr}}\{{\mathbf{H}}^\top{\mathcal{H}}_{f}({\boldsymbol{x}}){\mathbf{H}}\},$ (3.66)

such that the integrated squared bias is approximately $ \frac{\mu_{2}^{2}({\mathcal{K}})}{4} \int \left[ \mathop{\hbox{tr}}\{{\mathbf{H}}^\top{\mathcal{H}}_{f}({\boldsymbol{x}}){\mathbf{H}}\} \right]^{2}\;d{\boldsymbol{x}}$.

For the variance we find
$\displaystyle \mathop{\mathit{Var}}\left\{ \widehat{f}_{{\mathbf{H}}}({\boldsymbol{x}})\right\}$ $\displaystyle =$ $\displaystyle \frac{1}{n} \int \left\{ {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{u}}-{\boldsymbol{x}})\right\}^{2} f({\boldsymbol{u}})\,d{\boldsymbol{u}} - \frac{1}{n} \left\{E\widehat{f}_{{\mathbf{H}}}({\boldsymbol{x}})\right\}^{2}$
  $\displaystyle \approx$ $\displaystyle \int \frac{1}{n\mathop{\rm {det}}({\mathbf{H}})}\,{\mathcal{K}}^{2}({\boldsymbol{s}})\, f({\boldsymbol{x}}+{\mathbf{H}}{\boldsymbol{s}})\,d{\boldsymbol{s}}$
  $\displaystyle \approx$ $\displaystyle \int \frac{1}{n\mathop{\rm {det}}({\mathbf{H}})}\,{\mathcal{K}}^{2}({\boldsymbol{s}}) \left\{ f({\boldsymbol{x}}) + {\boldsymbol{s}}^\top{\mathbf{H}}^\top {\gradi}_{f}({\boldsymbol{x}}) \right\} d{\boldsymbol{s}}$
  $\displaystyle \approx$ $\displaystyle \frac{1}{n\mathop{\rm {det}}({\mathbf{H}})} \Vert{\mathcal{K}}\Vert _2^{2}\,f({\boldsymbol{x}}),$ (3.67)

with $ \Vert{\mathcal{K}}\Vert _2^2$ denoting the $ d$-dimensional squared $ L_{2}$-norm of $ {\mathcal{K}}$. Therefore we have the following $ \amise$ formula for the multivariate kernel density estimator

$\displaystyle \amise({\mathbf{H}}) =\frac{1}{4}\, \mu_{2}^{2}({\mathcal{K}}) \int \left[ \mathop{\hbox{tr}}\{{\mathbf{H}}^\top{\mathcal{H}}_{f}({\boldsymbol{x}}){\mathbf{H}}\}\right]^{2} d{\boldsymbol{x}} + \frac{1}{n\mathop{\rm {det}}({\mathbf{H}})} \Vert{\mathcal{K}}\Vert _2^{2}.$ (3.68)

Let us now turn to the problem of choosing the $ \amise$ optimal bandwidth. Again this is the bandwidth that balances the bias-variance trade-off in $ \amise$. Let $ h$ denote a scalar such that $ {\mathbf{H}}=h {\mathbf{H}}_{0}$ with $ \mathop{\rm {det}}({\mathbf{H}}_{0})=1$. Then $ \amise$ can be written as

$\displaystyle \amise({\mathbf{H}}) =\frac{1}{4}\,h^{4}\,\mu_{2}^{2}({\mathcal{K}}) \int \left[ \mathop{\hbox{tr}}\{{\mathbf{H}}_{0}^\top{\mathcal{H}}_{f}({\boldsymbol{x}}){\mathbf{H}}_{0}\}\right]^{2} d{\boldsymbol{x}} + \frac{1}{nh^{d}} \Vert{\mathcal{K}}\Vert^{2}_{2}.$

If we only allow changes in $ h$ the optimal orders for the smoothing parameter $ h$ and $ \amise$ are

$\displaystyle h_{opt}\sim n^{-1/(4+d)}, \quad
\amise(h_{opt}{\mathbf{H}}_{0}) \sim n^{-4/(4+d)}.$

Hence, the multivariate density estimator has a slower rate of convergence compared to the univariate one, in particular when $ d$ is large.

If we consider $ {\mathbf{H}}=h{\mathbf{I}}_{d}$ (the same bandwidth in all $ d$ dimensions) and we fix the sample size $ n$, then the $ \amise$ optimal bandwidth has to be considerably larger than in the one-dimensional case to make sure that the estimate is reasonably smooth. Some ideas of comparable sample sizes to reach the same quality of the density estimates over different dimensions can be found in Silverman (1986, p. 94) and Scott & Wand (1991). Moreover, the computational effort of this technique increases with the number of dimensions $ d$. Therefore, multidimensional density estimation is usually not applied if $ d\geq 5$.
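The curse of dimensionality visible in the rate $ n^{-4/(4+d)}$ can be made concrete: to match the accuracy order that $ n$ observations give in one dimension, a $ d$-dimensional estimate needs roughly $ n^{(4+d)/5}$ observations. This is a rate-only comparison that ignores all constants, so the numbers below indicate orders of magnitude, not exact sample sizes:

```python
# Sample size n_d with n_d^(-4/(4+d)) equal to the one-dimensional
# rate 100^(-4/5); constants are ignored, only the rate is matched.
n = 100
for d in range(1, 6):
    n_d = round(n ** ((4 + d) / 5))
    print(d, n_d)   # d = 1..5 -> 100, 251, 631, 1585, 3981
```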

3.6.2 Bandwidth Selection

The problem of an automatic, data-driven choice of the bandwidth $ {\mathbf{H}}$ is actually more important in the multivariate than in the univariate case. In one or two dimensions it is easy to choose an appropriate bandwidth interactively, just by looking at the plot of density estimates for different bandwidths. But how can this be done in three, four or more dimensions? Here arises the problem of graphical representation, which we address in the next subsection. As in the one-dimensional case, rule-of-thumb and cross-validation bandwidth selectors are used. We will introduce generalizations of Silverman's rule-of-thumb and of least squares cross-validation to show the analogy with the one-dimensional bandwidth selectors.

Rule-of-thumb Bandwidth

Rule-of-thumb bandwidth selection gives a formula arising from the optimal bandwidth for a reference distribution. Obviously, the pdf of a multivariate normal distribution $ N_{d}({\boldsymbol{\mu}},{\mathbf{\Sigma}})$ is a good candidate for a reference distribution in the multivariate case. Suppose that the kernel $ {\mathcal{K}}$ is multivariate Gaussian, i.e. the pdf of $ N_{d}(0,{\mathbf{I}})$. Note that $ \mu_{2}({\mathcal{K}})=1$ and $ \Vert{\mathcal{K}}\Vert _2^{2}=2^{-d}\pi^{-d/2}$ in this case. Hence, from (3.68) and the fact that

$\displaystyle \int \left[\mathop{\hbox{tr}}\{{\mathbf{H}}^\top{\mathcal{H}}_{f}({\boldsymbol{x}}){\mathbf{H}}\}\right]^{2}\;d{\boldsymbol{x}}$ $\displaystyle =$ $\displaystyle \frac{1}{2^{d+2}{\pi}^{d/2}\mathop{\rm {det}}({\mathbf{\Sigma}})^{1/2}} \left[ 2\mathop{\hbox{tr}}\{({\mathbf{H}}^\top{\mathbf{\Sigma}}^{-1}{\mathbf{H}})^{2}\} + \{\mathop{\hbox{tr}}({\mathbf{H}}^\top{\mathbf{\Sigma}}^{-1}{\mathbf{H}})\}^{2} \right],$

cf. Wand & Jones (1995, p. 98), we can easily derive rule-of-thumb formulae for different assumptions on $ {\mathbf{H}}$ and $ {\mathbf{\Sigma}}$.

In the simplest case, i.e. when $ {\mathbf{H}}$ and $ {\mathbf{\Sigma}}$ are diagonal matrices $ {\mathbf{H}}=\mathop{\hbox{diag}}(h_{1},\ldots,h_{d})$ and $ {\mathbf{\Sigma}}=\mathop{\hbox{diag}}(\sigma_{1}^{2},\ldots,\sigma_{d}^{2})$, this leads to

$\displaystyle \widetilde{h}_{j}=\left(\frac{4}{d+2}\right)^{1/(d+4)} n^{-1/(d+4)} \sigma_{j}.$ (3.69)

Note that this formula coincides with Silverman's rule of thumb in the case $ d=1$, see (3.24) and Silverman (1986, p. 45). Replacing the $ \sigma_{j}$s with estimates and noting that the first factor is always between 0.924 and 1.059, we arrive at Scott's rule:

$\displaystyle \widehat{h}_{j}= n^{-1/(d+4)} \widehat\sigma_{j},$ (3.70)

see Scott (1992, p. 152).
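Scott's rule (3.70) is a one-liner per dimension. A sketch, with data simulated at different scales per component to show that each $ \widehat{h}_{j}$ adapts to its own $ \widehat\sigma_{j}$ (all names are our own):

```python
import numpy as np

def scott_bandwidths(data):
    """Scott's rule (3.70): h_j = n^(-1/(d+4)) * sigma_hat_j, per dimension."""
    n, d = data.shape
    return n ** (-1.0 / (d + 4)) * data.std(axis=0, ddof=1)

rng = np.random.default_rng(2)
data = rng.standard_normal((1000, 2)) * np.array([1.0, 5.0])  # sd 1 and sd 5
h = scott_bandwidths(data)
print(h)   # the second bandwidth is about five times the first
```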

It is not possible to derive the rule-of-thumb for general $ {\mathbf{H}}$ and $ {\mathbf{\Sigma}}$. However, (3.69) shows that it might be a good idea to choose the bandwidth matrix $ {\mathbf{H}}$ proportional to $ {\mathbf{\Sigma}}^{1/2}$. In this case we get as a generalization of Scott's rule:

$\displaystyle \widehat{{\mathbf{H}}}= n^{-1/(d+4)} \widehat{{\mathbf{\Sigma}}}^{1/2}.$ (3.71)

We remark that this rule is equivalent to applying a Mahalanobis transformation to the data (to transform the estimated covariance matrix to identity), then computing the kernel estimate with Scott's rule (3.70) and finally retransforming the estimated pdf back to the original scale.
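This equivalence can be verified numerically. The sketch below (Gaussian kernel; the symmetric matrix square root is computed by an eigendecomposition; all function names are our own) evaluates the estimate once directly with $ \widehat{{\mathbf{H}}}= n^{-1/(d+4)}\widehat{{\mathbf{\Sigma}}}^{1/2}$ and once via the Mahalanobis transformation; the two results agree up to floating point error:

```python
import numpy as np

def sqrtm_sym(S):
    """Symmetric square root of a positive definite matrix (eigendecomposition)."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(w)) @ V.T

def gaussian_kernel(u):
    d = u.size
    return np.exp(-0.5 * float(u @ u)) / (2.0 * np.pi) ** (d / 2.0)

def kde_H(x, data, H):
    """Estimator (3.60) with K_H(u) = K(H^{-1} u) / det(H)."""
    U = (x - data) @ np.linalg.inv(H).T
    return np.mean([gaussian_kernel(u) for u in U]) / np.linalg.det(H)

rng = np.random.default_rng(3)
data = rng.standard_normal((500, 2)) @ np.array([[1.0, 0.0], [0.8, 0.6]])  # correlated
n, d = data.shape
S_half = sqrtm_sym(np.cov(data, rowvar=False))
scale = n ** (-1.0 / (d + 4))

x = np.array([0.5, -0.2])
direct = kde_H(x, data, scale * S_half)        # rule (3.71) used directly

# Mahalanobis route: whiten the data, estimate with the scalar factor on
# identity covariance, then rescale by the Jacobian det(Sigma_hat^(1/2))
W = np.linalg.inv(S_half)
via_transform = kde_H(x @ W.T, data @ W.T, scale * np.eye(d)) / np.linalg.det(S_half)
print(direct, via_transform)
```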

In principle, all plug-in methods for one-dimensional kernel density estimation can be extended to the multivariate case. In practice, however, this is cumbersome, since the derivation of the asymptotics involves multivariate derivatives and higher order Taylor expansions.

Cross-validation

As we mentioned before, the cross-validation method is fairly independent of the special structure of the parameter or function estimate. Considering the bandwidth choice problem, cross-validation techniques allow us to adapt to a wider class of density functions $ f$ than the rule-of-thumb approach. (Remember that the rule-of-thumb bandwidth is optimal for the reference pdf, hence it will fail for multimodal densities for instance.)

Recall that, in contrast to the rule-of-thumb approach, least squares cross-validation for density estimation estimates the $ \ise$ optimal rather than the $ \mise$ optimal bandwidth. Here we approximate the integrated squared error

$\displaystyle \ise({\mathbf{H}})$ $\displaystyle =$ $\displaystyle \int \{ \widehat{f}_{{\mathbf{H}}}({\boldsymbol{x}}) - f({\boldsymbol{x}})\}^{2}\,d{\boldsymbol{x}}$
  $\displaystyle =$ $\displaystyle \int \widehat{f}_{{\mathbf{H}}}^{2}({\boldsymbol{x}})\,d{\boldsymbol{x}} - 2\int \widehat{f}_{{\mathbf{H}}}({\boldsymbol{x}})\,f({\boldsymbol{x}})\,d{\boldsymbol{x}} + \int f^{2}({\boldsymbol{x}})\,d{\boldsymbol{x}}.$ (3.72)

This is the same formula as in the one-dimensional case, and by the same arguments the last term of (3.72) can be ignored. The first term can again be easily calculated from the data. Hence, only the second term of (3.72) is unknown and must be estimated. Observe, however, that $ \int \widehat{f}_{{\mathbf{H}}}({\boldsymbol{x}}) f({\boldsymbol{x}})\,d{\boldsymbol{x}}= E \widehat{f}_{{\mathbf{H}}}({\boldsymbol{X}})$, where the only new aspect is that $ {\boldsymbol{X}}$ is now $ d$-dimensional. The resulting expectation, though, is a scalar. As in (3.32) we estimate this term by a leave-one-out estimator

$\displaystyle \widehat{E \widehat{f}_{{\mathbf{H}}}({\boldsymbol{X}})} = \frac{1}{n} \sum_{i=1}^{n} \widehat{f}_{{\mathbf{H}},-i}({\boldsymbol{X}}_{i}),$

where

$\displaystyle \widehat{f}_{{\mathbf{H}},-i}({\boldsymbol{x}})= \frac{1}{n-1} \sum_{\scriptstyle j\neq i \atop\scriptstyle j=1}^{n} {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{x}}-{\boldsymbol{X}}_{j})$

is simply the multivariate version of (3.33). Also, the multivariate generalization of (3.37) is straightforward, which yields the multivariate cross-validation criterion as a perfect generalization of $ CV$ in the one-dimensional case:
$\displaystyle CV({\mathbf{H}})$ $\displaystyle =$ $\displaystyle \frac{1}{n^{2}\mathop{\rm {det}}({\mathbf{H}})}\sum_{i=1}^{n}\sum_{j=1}^{n} ({\mathcal{K}}\star{\mathcal{K}}) \left\{ {\mathbf{H}}^{-1} ({\boldsymbol{X}}_{j}-{\boldsymbol{X}}_{i}) \right\}$
    $\displaystyle \quad\quad- \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{\scriptstyle j\neq i \atop\scriptstyle j=1}^{n} {\mathcal{K}}_{{\mathbf{H}}}({\boldsymbol{X}}_{i}-{\boldsymbol{X}}_{j}).$
The difficulty lies in the fact that the bandwidth is now a $ d\times d$ matrix $ {\mathbf{H}}$. In the most general case this means minimizing over $ d(d+1)/2$ parameters. Even if we restrict $ {\mathbf{H}}$ to be a diagonal matrix, a $ d$-dimensional optimization problem remains. This holds for other cross-validation approaches, too.
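For the Gaussian product kernel the convolution $ {\mathcal{K}}\star{\mathcal{K}}$ is the $ N(0, 2{\mathbf{I}}_{d})$ density, so the criterion has a closed form. Below is a sketch for a diagonal bandwidth matrix $ {\mathbf{H}}=\mathop{\hbox{diag}}(h_{1},\ldots,h_{d})$; the grid, data and function names are illustrative, and a real implementation would use a proper optimizer rather than a coarse grid:

```python
import numpy as np

def cv_gaussian(data, h):
    """Least squares cross-validation criterion CV(H) for H = diag(h) and a
    Gaussian product kernel; K*K is then the N(0, 2 I_d) density."""
    n, d = data.shape
    h = np.asarray(h, dtype=float)
    det_H = float(np.prod(h))
    U = (data[:, None, :] - data[None, :, :]) / h   # H^{-1}(X_i - X_j), n x n x d
    sq = (U ** 2).sum(axis=2)
    KK = np.exp(-sq / 4.0) / (4.0 * np.pi) ** (d / 2.0)  # (K * K)(u)
    K = np.exp(-sq / 2.0) / (2.0 * np.pi) ** (d / 2.0)   # K(u)
    term1 = KK.sum() / (n ** 2 * det_H)                  # includes i = j terms
    term2 = (K.sum() - np.trace(K)) / (n * (n - 1) * det_H)  # leave-one-out sum
    return term1 - 2.0 * term2

rng = np.random.default_rng(4)
data = rng.standard_normal((200, 2))
grid = [0.2, 0.3, 0.4, 0.5, 0.7, 1.0]
scores = [cv_gaussian(data, [h, h]) for h in grid]
print(grid[int(np.argmin(scores))])   # CV choice of a common bandwidth h
```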

3.6.3 Computation and Graphical Representation

Consider now the problem of graphically displaying our multivariate density estimates. Assume first $ d=2$. Here we are still able to show the density estimate in a three-dimensional plot. This is particularly useful if the estimated function can be rotated interactively on the computer screen. For a two-dimensional presentation, a contour plot often gives more insight into the structure of the data.

Figure: Two-dimensional density estimate for age and household income from East German SOEP 1991

Figures 3.14 and 3.15 display such a two-dimensional density estimate

$\displaystyle \widehat{f}_{{\boldsymbol{h}}}({\boldsymbol{x}})
= \widehat{f}_{{\boldsymbol{h}}}(x_{1},x_{2})$

for two explanatory variables on East-West German migration intention in Spring 1991, see Example 1.4. We use the subscript $ {\boldsymbol{h}}$ to indicate that we used a diagonal bandwidth matrix $ {\mathbf{H}}= \mathop{\hbox{diag}}({\boldsymbol{h}})$. Aside from some categorical variables on educational level, professional status, existence of relations to Western Germany, and regional dummies, our data set contains observations on age, household income and environmental satisfaction. In Figure 3.14 we plotted the joint density estimate for age and household income. Additionally, Figure 3.15 gives a contour plot of this density estimate. It is easily observed that the age distribution is considerably left-skewed. $ \Box$

Figure: Contour plot for the two-dimensional density estimate for age and household income from East German SOEP 1991

Here and in the following plots the bandwidth was chosen according to the general rule of thumb (3.71), which tends to oversmooth bimodal structures of the data. The kernel function is always the product Quartic kernel.

Consider now how to display three- or even higher-dimensional density estimates. One possible approach is to hold one variable fixed and to plot the density function in dependence of the remaining variables only. For three-dimensional data this gives three plots: $ x_{1},x_{2}$ vs. $ \widehat f_{{\boldsymbol{h}}}(x_{1},x_{2},x_{3})$, $ x_{1},x_{3}$ vs. $ \widehat f_{{\boldsymbol{h}}}(x_{1},x_{2},x_{3})$ and $ x_{2},x_{3}$ vs. $ \widehat f_{{\boldsymbol{h}}}(x_{1},x_{2},x_{3})$, in each case with the omitted variable held fixed.
We display this technique in Figure 3.16 for data from a credit scoring sample, using duration of the credit, household income and age as variables (Fahrmeir & Tutz, 1994). The title of each panel indicates which variable is held fixed at which level. $ \Box$

Figure: Two-dimensional intersections for the three-dimensional density estimate for credit duration, household income and age

Figure: Graphical representation by contour plot for the three-dimensional density estimation for credit duration, household income and age

Alternatively, we can plot contours of the density estimate, now in three dimensions, which means three-dimensional surfaces. Figure 3.17 shows this for the credit scoring data. In the original version of this plot, red, green and blue surfaces show the values of the density estimate at the levels (in percent) indicated on the right. Colors and the possibility of rotating the contours on the computer screen ease the exploration of the data structures significantly. $ \Box$

For alternative texts on kernel density estimation we refer to the monographs by Silverman (1986), Härdle (1990), Scott (1992) and Wand & Jones (1995).

A particular field of interest and ongoing research is the matter of bandwidth selection. In addition to what we have presented, a variety of other cross-validation approaches and refined plug-in bandwidth selectors have been proposed.

In particular, the following methods are based on the cross-validation idea: Pseudo-likelihood cross-validation (Duin, 1976; Habbema et al., 1974), biased cross-validation (Scott & Terrell, 1987; Cao et al., 1994) and smoothed cross-validation (Hall et al., 1992). The latter two approaches also attempt to find $ \mise$ optimal bandwidths. Hence their performance is also assessed by relative convergence to the $ \mise$ optimal bandwidth $ h_{opt}$. A detailed treatment of many cross-validation procedures can be found in Park & Turlach (1992).

Regarding other refined plug-in bandwidth selectors, the methods of Sheather and Jones (Sheather & Jones, 1991) and Hall, Sheather, Jones, and Marron (Hall et al., 1991) should be mentioned, as they have good asymptotic properties ($ \sqrt{n}$-convergence). A number of authors provide extensive simulation studies for smoothing parameter selection; we want to mention in particular Marron (1989), Jones et al. (1996), Park & Turlach (1992), and Cao et al. (1994).

An alternative approach is the SiZer (significant zero crossings of derivatives) method of Chaudhuri & Marron (1999), which tries to find features of a curve, such as bumps and valleys, directly. At the same time it is a tool for visualizing the estimated curve at a range of different bandwidth values. SiZer thus provides a way around the issue of smoothing parameter selection.

EXERCISE 3.1   Calculate the exact values of

$\displaystyle \int K^2(u)\;du \quad \textrm{ and } \quad \int u^2 K(u)\;du$

for the Gaussian, Epanechnikov and Quartic kernels.

EXERCISE 3.2   Show that for the $ N\left( 0, \sigma^{2} \right)$ normal density holds

$\displaystyle \Vert f'' \Vert _{2}^{2} \approx 0.212 \cdot \sigma^{-5}.$

EXERCISE 3.3   Show that the density estimate $ \widehat f_{h}(x)=\frac{1}{n}\sum_{i=1}^n
K_{h}(x-X_{i})$ is itself a pdf if the kernel $ K$ is one (i.e. if $ \int K(u)\,du =1$).

EXERCISE 3.4   The statistical bureau of Borduria questions $ n=100$ households about their income. The young econometrician Tintin proposes to use these data $ x_1$, ...$ x_n$ for a nonparametric estimate of the income density $ f(x)$. Tintin suggests computing a confidence interval for his kernel density estimate

$\displaystyle \widehat f_{h}(x)=\frac{1}{n}\sum_{i=1}^n K_{h}(x-X_{i})$

with kernel function $ K$. Explain how this can be done. Simulate this on the basis of a lognormal sample with parameters $ \mu=2$ and $ \sigma^2=0.5$.

EXERCISE 3.5   Explain the constant $ 1.34$ in the ``better rule of thumb" (3.29).

EXERCISE 3.6   Show that for a rescaled kernel $ K_{\delta}(\bullet)=\delta^{-1}K(\bullet/\delta)$

$\displaystyle \int K_{\delta}(u)\,du=1.$

EXERCISE 3.7   Derive formula (3.44) for the canonical bandwidth.

EXERCISE 3.8   Derive the formula for $ CV \left( h \right)$ in the case $ K = \varphi$ (Gaussian Kernel).

EXERCISE 3.9   Show that

$\displaystyle \mathop{\mathit{Var}}\left( \widehat{f}''_{h} \left( x \right) \right) \sim n^{-1} h^{-5}
\quad\textrm{as }nh \rightarrow \infty.$

EXERCISE 3.10   Simulate a mixture of normal densities with pdf

$\displaystyle 0.3\, \varphi(x) + \frac{0.7}{0.3}\,\varphi\left(\frac{x-1}{0.3}\right)$

and plot the density and its estimate with a cross-validated bandwidth.

EXERCISE 3.11   Show how to get from equation (3.12) to (3.13).

EXERCISE 3.12   Compute $ \delta_0$ for the Quartic kernel.

EXERCISE 3.13   One possible way to construct a multivariate kernel $ \mathcal{K}(\mathbf u)$ is to use a one-dimensional kernel $ K(u)$. This relationship is given by the formula (3.59), i.e.

$\displaystyle {\mathcal{K}}({\boldsymbol{u}}) = c K(\Vert{\boldsymbol{u}}\Vert),$

where $ \Vert{\boldsymbol{u}}\Vert = \sqrt{{\boldsymbol{u}}^\top{\boldsymbol{u}}}$. Find an appropriate constant $ c$ for a two-dimensional a) Gaussian, b) Epanechnikov, and c) Triangle kernel.

EXERCISE 3.14   Show that $ \widehat{f}_h(x)\mathrel{\mathop{\longrightarrow}\limits_{}^{a.s.}}f(x)$. Assume that $ f$ possesses a second derivative and $ \Vert K\Vert < \infty$.

EXERCISE 3.15   Explain why averaging over the leave-one-out estimator (3.32) is the appropriate way to estimate the expected value of $ \widehat{f}_{h}(X)$ w.r.t. an independent random variable $ X$.

EXERCISE 3.16   Show that

$\displaystyle \int y \sum_{i=1}^{n} K_{h}
\left( x - X_{i} \right) K_{h}
\left( y - Y_{i} \right) dy =
\sum_{i=1}^{n} K_{h} \left( x - X_{i} \right) Y_{i}.$

$ \ast$
Kernel density estimation is a generalization of the histogram. The kernel density estimate at point $ x$

$\displaystyle \widehat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n K\left(\frac{x-X_i}{h}\right)$

corresponds to the histogram bar height for the bin $ [x-h/2,x+h/2)$ if we use the uniform kernel.
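This correspondence can be checked numerically: with the uniform kernel matched to the half-open bin, the kernel estimate at $ x$ equals the histogram density bar for the bin $ [x-h/2, x+h/2)$ exactly. A short sketch with simulated data (names and seed are our own):

```python
import numpy as np

rng = np.random.default_rng(5)
x_data = rng.standard_normal(300)
x0, h = 0.25, 0.5

# Uniform kernel matched to the half-open bin: X_i in [x0 - h/2, x0 + h/2)
# is equivalent to -1/2 < (x0 - X_i)/h <= 1/2
u = (x0 - x_data) / h
kde_val = np.mean((u > -0.5) & (u <= 0.5)) / h

# Histogram bar height (density scale) for the same bin
count = np.sum((x_data >= x0 - h / 2) & (x_data < x0 + h / 2))
hist_val = count / (len(x_data) * h)
print(kde_val == hist_val)   # True
```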
$ \ast$
The bias and variance of the kernel density estimator are

$\displaystyle \bias\{\widehat f_{h}(x)\} \approx \frac{h^{2}}{2} f''(x)\;\mu_{2}(K), \qquad \mathop{\mathit{Var}}\{\widehat f_{h}(x)\} \approx \frac{1}{nh} \Vert K \Vert^{2}_{2}\, f(x).$

$ \ast$
The $ \amise$ of the kernel density estimator is

$\displaystyle \amise(\widehat f_{h})=\frac{1}{nh}\Vert K \Vert^{2}_{2}+\frac{h^{4}}{4} \{\mu_{2}(K)\}^{2} \Vert f''\Vert^{2}_{2}.$

$ \ast$
By using the normal distribution as a reference distribution for calculating $ \Vert f''\Vert^2_2$ we get Silverman's rule-of-thumb bandwidth

$\displaystyle \widehat{h}= 1.06 \cdotp \widehat{\sigma} \cdotp n^{-1/5},$

which assumes the kernel to be Gaussian. Other plug-in bandwidths can be found by using more sophisticated replacements for $ \Vert f''\Vert^2_2$.
$ \ast$
When using $ \ise$ as a goodness-of-fit criterion for $ \widehat{f}_h$ we can derive the least squares cross-validation criterion for bandwidth selection:

$\displaystyle CV(h) = \frac{1}{n^{2}h}\sum_{i=1}^{n}\sum_{j=1}^{n} K \star K \left( \frac{X_{j}-X_{i}}{h}\right) -\,\frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} K_{h}(X_{i}-X_{j}).$
$ \ast$
The concept of canonical kernels allows us to separate the kernel choice from the bandwidth choice. We find a canonical bandwidth $ \delta_0$ for each kernel function $ K$ which gives us the equivalent degree of smoothing. This equivalence allows one to adjust bandwidths from different kernel functions to obtain approximately the same value of $ \amise$. For bandwidth $ h_{A}$ and kernel $ K^{A}$ the bandwidth

$\displaystyle h_{B}=h_{A}\frac{\delta_{0}^{B}}{\delta_{0}^{A}}$

is the equivalent bandwidth for kernel $ K^{B}$. So for instance, Silverman's rule-of-thumb bandwidth has to be adjusted by a factor of $ 2.623$ for using it with the Quartic kernel.
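The factor 2.623 follows from the canonical bandwidths $ \delta_{0}^{K} = \left(\Vert K\Vert_{2}^{2}/\mu_{2}^{2}(K)\right)^{1/5}$. With the exact kernel constants ($ \Vert K\Vert_{2}^{2}=1/(2\sqrt{\pi})$, $ \mu_{2}=1$ for the Gaussian; $ \Vert K\Vert_{2}^{2}=5/7$, $ \mu_{2}=1/7$ for the Quartic) this is a two-line check:

```python
import math

# Canonical bandwidths delta_0 = (||K||_2^2 / mu_2(K)^2)^(1/5)
delta_gauss = (1.0 / (2.0 * math.sqrt(math.pi))) ** 0.2   # about 0.776
delta_quartic = ((5.0 / 7.0) / (1.0 / 7.0) ** 2) ** 0.2   # = 35^(1/5), about 2.036
print(delta_quartic / delta_gauss)                        # about 2.623
```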
$ \ast$
The asymptotic normality of the kernel density estimator

$\displaystyle n^{2/5} \left(\widehat{f}_{h_{n}}(x)-f(x)\right) \mathrel{\mathop{\longrightarrow}\limits_{}^{L}} N\left( {\frac{c^{2}}{2} f''(x)\mu_{2}(K) },\; {\frac{1}{c}f(x)\Vert K \Vert^{2}_{2} } \right), \quad h_{n}=c\, n^{-1/5},$

allows us to compute confidence intervals for $ f$. Confidence bands can be computed as well, although under more restrictive assumptions on $ f$.
$ \ast$
The kernel density estimator for univariate data can be easily generalized to the multivariate case

$\displaystyle \widehat{f}_{{\mathbf{H}}}({\boldsymbol{x}}) = \frac{1}{n}\sum_{i=1}^{n} {\mathcal{K}}_{{\mathbf{H}}}\left({\boldsymbol{x}}-{\boldsymbol{X}}_{i}\right),$
where the bandwidth matrix $ {\mathbf{H}}$ now replaces the bandwidth parameter. The multivariate kernel is typically chosen to be a product or radial-symmetric kernel function. Asymptotic properties and bandwidth selection are analogous, but more cumbersome. Canonical bandwidths can be used as well to adjust between different kernel functions.
A special problem is the graphical display of multivariate density estimates. Lower-dimensional intersections, projections or contour plots may display only part of the features of a density function.