3.2 Statistical Properties

Let us now turn to the statistical properties of kernel density estimators. We are interested in the mean squared error since it combines squared bias and variance.

3.2.1 Bias

For the bias we have

\begin{align}
\bias\{\widehat f_{h}(x)\} &= E\{\widehat f_{h}(x)\}-f(x) \tag{3.11}\\
&= \frac{1}{n}\sum_{i=1}^{n}E\{K_{h}(x-X_{i})\} \ -\ f(x) \nonumber\\
&= E\{K_{h}(x-X)\} \ -\ f(x) \nonumber
\end{align}

such that

$\displaystyle \bias\{\widehat f_{h}(x)\} = \int \frac{1}{h} K\left(\frac{x-u}{h}\right)f(u)\; du \ -\ f(x)\,.$ (3.12)

Using the variable $ s=\frac{u-x}{h}$, the symmetry of the kernel, i.e. $ K(-s)=K(s)$, and a second-order Taylor expansion of $ f(u)$ around $ x$ it can be shown (see Exercise 3.11) that

$\displaystyle \bias\{\widehat f_{h}(x)\} = \frac{h^{2}}{2} f''(x)\;\mu_{2}(K)+o(h^{2}), \quad\textrm{as }h \rightarrow 0.$ (3.13)

Here we denote $ \mu_{2}(K)=\int s^{2} K(s) \, ds.$
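
To see where (3.13) comes from, here is a sketch of the argument, assuming $f$ is twice continuously differentiable at $x$ (the details are left to Exercise 3.11):

\begin{align*}
\bias\{\widehat f_{h}(x)\}
 &= \int \frac{1}{h} K\!\left(\frac{x-u}{h}\right) f(u)\;du \ -\ f(x)
  = \int K(s)\, f(x+sh)\;ds \ -\ f(x)\\
 &= \int K(s)\left\{f(x) + sh\, f'(x) + \frac{s^{2}h^{2}}{2}\, f''(x) + o(h^{2})\right\} ds \ -\ f(x)\\
 &= \frac{h^{2}}{2}\, f''(x)\, \mu_{2}(K) + o(h^{2}),
\end{align*}

using $\int K(s)\,ds=1$, $\int s\,K(s)\,ds=0$ (by the symmetry of $K$) and $\mu_{2}(K)=\int s^{2}K(s)\,ds$ in the last step.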

Observe from (3.13) that the bias is proportional to $ h^{2}$. Hence, we have to choose a small $ h$ to reduce the bias. Moreover, $ \bias\{\widehat f_{h}(x)\}$ depends on $ f''(x)$, the curvature of the density at $ x$. The effects of this dependence are illustrated in Figure 3.6, where the dashed lines mark $ E\{\widehat f_{h}(\bullet)\}$ and the solid line marks the true density $ f(\bullet)$. The bias is thus given by the vertical difference between the dashed and the solid line.

Figure 3.6: Bias effects. [Quantlet SPMkdebias]

Note that in ``valleys'' of $ f$ the bias is positive, since $ f''>0$ around a local minimum of $ f$. Consequently, in these regions the dashed line lies above the solid line. Near peaks of $ f$ the opposite is true. The magnitude of the bias depends on the curvature of $ f$, reflected in the absolute value of $ f''$: large values of $ \vert f''\vert$ imply large values of $ \bias\{\widehat f_{h}(x)\}$.
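
As a numerical illustration of this sign pattern, the following Python sketch approximates $ E\{\widehat f_{h}(x)\}$ by Monte Carlo for a hypothetical bimodal normal mixture (not the density underlying Figure 3.6; all parameter values are illustrative) and compares it with $ f(x)$ near a peak and in the valley between the peaks:

\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

# Hypothetical example density: equal mixture of N(-1.5, 1) and N(1.5, 1)
def f(x):
    return 0.5 * norm.pdf(x, -1.5, 1.0) + 0.5 * norm.pdf(x, 1.5, 1.0)

def sample(n):
    means = np.where(rng.integers(0, 2, size=n) == 0, -1.5, 1.5)
    return rng.normal(means, 1.0)

def kde(x, data, h):
    # Gaussian-kernel density estimate evaluated at the points x
    return norm.pdf((x - data[:, None]) / h).mean(axis=0) / h

n, h, reps = 200, 0.6, 2000
points = np.array([-1.5, 0.0])            # near a peak, and the central valley
estimates = np.array([kde(points, sample(n), h) for _ in range(reps)])
bias = estimates.mean(axis=0) - f(points)

print(f"near peak (x=-1.5): bias ~ {bias[0]:+.4f}  (f'' < 0, negative bias)")
print(f"in valley (x= 0.0): bias ~ {bias[1]:+.4f}  (f'' > 0, positive bias)")
\end{verbatim}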

3.2.2 Variance

For the variance we calculate

\begin{align}
\mathop{\mathit{Var}}\{\widehat f_{h}(x)\} &= \mathop{\mathit{Var}}\left\{\frac{1}{n}\sum_{i=1}^{n}K_{h}(x-X_{i})\right\} \tag{3.14}\\
&= \frac{1}{n^{2}}\sum_{i=1}^{n}\mathop{\mathit{Var}}\left\{K_{h}(x-X_{i})\right\} \nonumber\\
&= \frac{1}{n}\mathop{\mathit{Var}}\left\{K_{h}(x-X)\right\} \nonumber\\
&= \frac{1}{n} \left(E\{K^{2}_{h}(x-X)\}-[E\{K_{h}(x-X)\}]^{2}\right). \nonumber
\end{align}

Using

$\displaystyle \frac{1}{n} E\{K^{2}_{h}(x-X)\}= \frac{1}{n} \frac{1}{h^{2}}\int K^{2}\left( \frac{x-u}{h} \right) f(u)\;du,$

$\displaystyle E\{K_{h}(x-X)\}=f(x)+o(h)$

and similar variable substitution and Taylor expansion arguments as in the derivation of the bias, it can be shown (see Exercise 3.13) that

$\displaystyle \mathop{\mathit{Var}}\{\widehat f_{h}(x)\} = \frac{1}{nh} \Vert K \Vert^{2}_{2}\, f(x)+o\left(\frac{1}{nh}\right), \quad\textrm{as }nh \rightarrow \infty.$ (3.15)

Here, $ \Vert K\Vert^{2}_{2}$ is shorthand for $ \int K^{2}(s) \, ds$, the squared $ L_2$ norm of $ K$.
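
For completeness, a sketch of the calculation behind (3.15), assuming $ f$ is continuous at $ x$ (the details are the subject of Exercise 3.13): with the substitution $ s=\frac{u-x}{h}$ and the symmetry of $ K$,

\begin{align*}
\frac{1}{n}\, E\{K^{2}_{h}(x-X)\}
 &= \frac{1}{nh^{2}}\int K^{2}\!\left(\frac{x-u}{h}\right) f(u)\;du
  = \frac{1}{nh}\int K^{2}(s)\, f(x+sh)\;ds\\
 &= \frac{1}{nh}\int K^{2}(s)\,\{f(x)+o(1)\}\;ds
  = \frac{1}{nh}\,\Vert K \Vert^{2}_{2}\, f(x) + o\!\left(\frac{1}{nh}\right),
\end{align*}

while the squared-mean term $ \frac{1}{n}\,[E\{K_{h}(x-X)\}]^{2}=\frac{1}{n}\{f(x)+o(h)\}^{2}$ is of order $ \frac{1}{n}=o\left(\frac{1}{nh}\right)$ as $ h\rightarrow 0$ and can be absorbed into the remainder.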

Notice that the variance of the kernel density estimator is nearly proportional to $ (nh)^{-1}$. Hence, in order to make the variance small we have to choose a fairly large $ h$. Large values of $ h$ mean bigger intervals $ [x-h,x+h)$, more observations in each interval and hence more observations that get non-zero weight in the sum $ \sum_{i=1}^{n}K_{h}(x-X_{i})$. As you may recall from the analysis of the sample mean in basic statistics, averaging over more observations produces a less variable estimate.

Similarly, for a given value of $ h$ (be it large or small), increasing the sample size $ n$ will decrease $ \frac{1}{nh}$ and therefore reduce the variance. But this makes sense because having a greater total number of observations means that, on average, there will be more observations in each interval $ [x-h,x+h).$

Also observe that the variance is increasing in $ \Vert K\Vert^{2}_{2}$. This term will be rather small for flat kernels such as the Uniform kernel. Intuitively speaking, we might say that smooth and flat kernels will produce less volatile estimates in repeated sampling since in each sample all realizations are given roughly equal weight.
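
A small simulation illustrates how the variance shrinks with both $ h$ and $ n$. The sketch below (standard normal data, Gaussian kernel, evaluation at $ x=0$; all parameter values are illustrative) estimates the sampling variance of $ \widehat f_{h}(0)$ over repeated samples:

\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def kde_at(x, data, h):
    # Kernel density estimate (Gaussian kernel) evaluated at a single point x
    return norm.pdf((x - data) / h).mean() / h

def sampling_variance(n, h, reps=2000):
    # Variance of f_hat_h(0) over repeated standard normal samples of size n
    return np.var([kde_at(0.0, rng.standard_normal(n), h) for _ in range(reps)])

for n in (100, 400):
    for h in (0.2, 0.5):
        v = sampling_variance(n, h)
        print(f"n={n:4d}, h={h:.1f}: sampling variance ~ {v:.5f},"
              f"  1/(nh) = {1/(n*h):.4f}")
\end{verbatim}

In line with (3.15), the simulated variances decrease roughly in proportion to $ 1/(nh)$.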

3.2.3 Mean Squared Error

We have already seen for the histogram that choosing the bandwidth $ h$ is a crucial problem in nonparametric (density) estimation. The kernel density estimator is no exception. Looking at formulae (3.13) and (3.15), we face the familiar trade-off between variance and bias: we would like to keep both small, but increasing $ h$ lowers the variance while raising the bias (and decreasing $ h$ does the opposite). Minimizing the $ \mse$, i.e. the sum of variance and squared bias (cf. (2.20)), represents a compromise between over- and undersmoothing. Figure 3.7 puts variance, bias, and $ \mse$ together in one graph.

Figure 3.7: Squared bias part (thin solid), variance part (thin dashed) and $ \mse$ (thick solid) for the kernel density estimate. [Quantlet SPMkdemse]
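
The following Python sketch reproduces the qualitative picture of Figure 3.7 from the leading terms of (3.13) and (3.15) for a concrete, purely illustrative setting: a standard normal density $ f$, a Gaussian kernel, $ n=100$ and the point $ x=0$ (none of these choices are those underlying the figure):

\begin{verbatim}
import numpy as np
from scipy.stats import norm

n, x = 100, 0.0
h = np.linspace(0.05, 1.5, 200)

# Ingredients for the Gaussian kernel and a standard normal density f
mu2_K = 1.0                                  # mu_2(K) for the Gaussian kernel
K_norm_sq = 1.0 / (2.0 * np.sqrt(np.pi))     # ||K||_2^2 for the Gaussian kernel
f_x = norm.pdf(x)
fpp_x = (x**2 - 1.0) * norm.pdf(x)           # f''(x) for the standard normal

sq_bias = (h**2 / 2.0 * fpp_x * mu2_K) ** 2  # leading squared-bias term, cf. (3.13)
variance = K_norm_sq * f_x / (n * h)         # leading variance term, cf. (3.15)
mse = sq_bias + variance

print(f"approximate MSE-minimising bandwidth at x=0: h ~ {h[np.argmin(mse)]:.2f}")
\end{verbatim}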

Moreover, looking at the $ \mse$ provides a way of assessing whether the kernel density estimator is consistent. Recall that convergence in mean square implies convergence in probability, i.e. consistency. Combining (3.13) and (3.15) yields

$\displaystyle \mse\{\widehat f_{h}(x)\}=\frac{h^{4}}{4} f''(x)^{2} \mu_{2}(K)^{2}+\frac{1}{nh} \Vert K \Vert ^{2}_{2} f(x)+o(h^{4})+o\left(\frac{1}{nh}\right).$ (3.16)

Looking at (3.16) we see that the $ \mse$ of the kernel density estimator goes to zero as $ h\to 0$ and $ nh\to\infty$. Hence, the kernel density estimator is indeed consistent. Unfortunately, (3.16) also shows that the $ \mse$ depends on $ f$ and $ f''$, both of which are unknown in practice. If you derive the value of $ h$ that minimizes the $ \mse$ (call it $ h_{opt}(x)$), you will discover that neither $ f(x)$ nor $ f''(x)$ drops out in the process of deriving $ h_{opt}(x)$. Consequently, $ h_{opt}(x)$ is not applicable in practice unless we find a way of obtaining suitable substitutes for $ f(x)$ and $ f''(x)$. Note further that $ h_{opt}(x)$ depends on $ x$ and is thus a local bandwidth.
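
For reference, here is the calculation behind that statement, sketched under the assumption $ f''(x)\neq 0$: dropping the remainder terms in (3.16) and setting the derivative with respect to $ h$ to zero gives

\begin{displaymath}
h^{3}\,\{f''(x)\}^{2}\,\{\mu_{2}(K)\}^{2} - \frac{1}{nh^{2}}\,\Vert K \Vert^{2}_{2}\, f(x) = 0
\quad\Longrightarrow\quad
h_{opt}(x)=\left( \frac{\Vert K \Vert^{2}_{2}\, f(x)}{\{f''(x)\}^{2}\,\{\mu_{2}(K)\}^{2}\, n}\right)^{1/5},
\end{displaymath}

which indeed involves both the unknown $ f(x)$ and the unknown $ f''(x)$.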

In the case of the histogram we were able to reduce the dimensionality of the problem (in some sense) by using the $ \mise$ instead of the $ \mse$, the former having the added advantage of being a global rather than a local measure of estimation accuracy. Hence, in the following subsections we will turn our attention to the $ \mise$ and derive the $ \mise$-optimal bandwidth.

3.2.4 Mean Integrated Squared Error

For the kernel density estimator the $ \mise$ is given by

\begin{align}
\mise(\widehat f_{h}) &= \int \mse\{\widehat f_{h}(x)\}\, dx \tag{3.17}\\
&= \frac{1}{nh} \Vert K \Vert ^{2}_{2} \int f(x)\,dx
 + \frac{h^{4}}{4} \{\mu_{2}(K)\}^{2} \int \{f''(x)\}^{2}\,dx
 + o\left(\frac{1}{nh}\right) + o(h^{4}) \nonumber\\
&= \frac{1}{nh} \Vert K \Vert ^{2}_{2}
 + \frac{h^{4}}{4} \{\mu_{2}(K)\}^{2} \, \Vert f'' \Vert^{2}_{2}
 + o\left(\frac{1}{nh}\right) + o(h^{4}),
 \quad \textrm{as } h \rightarrow 0,\; nh \rightarrow \infty, \nonumber
\end{align}

where the last step uses $\int f(x)\,dx=1$ and the notation $\Vert f''\Vert^{2}_{2}=\int \{f''(x)\}^{2}\,dx$.

Ignoring the higher-order terms, an approximate formula for the $ \mise$, called $ \amise$, can be given as

$\displaystyle \amise(\widehat f_{h})=\frac{1}{nh}\Vert K \Vert^{2}_{2}+\frac{h^{4}}{4} \{\mu_{2}(K)\}^{2} \Vert f''\Vert^{2}_{2}.$ (3.18)

Differentiating the $ \amise$ with respect to $ h$ and solving the first-order condition for $ h$ yields the $ \amise$ optimal bandwidth

$\displaystyle h_{opt}=\left( \frac{\Vert K \Vert^{2}_{2}}{\Vert f'' \Vert^{2}_{2}\,\{\mu_{2}(K)\}^{2}n}\right)^{1/5} \sim n^{-1/5}.$ (3.19)
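
Spelled out, this first-order condition is

\begin{displaymath}
\frac{\partial}{\partial h}\,\amise(\widehat f_{h})
= -\frac{1}{nh^{2}}\,\Vert K \Vert^{2}_{2} + h^{3}\,\{\mu_{2}(K)\}^{2}\,\Vert f''\Vert^{2}_{2} = 0
\quad\Longleftrightarrow\quad
h^{5} = \frac{\Vert K \Vert^{2}_{2}}{\{\mu_{2}(K)\}^{2}\,\Vert f''\Vert^{2}_{2}\; n}\,,
\end{displaymath}

whose unique positive solution is $ h_{opt}$ in (3.19).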

Apparently, the problem of having to deal with unknown quantities has not been solved completely as $ h_{opt}$ still depends on $ \Vert f''\Vert^{2}_{2}$. At least we can use $ h_{opt}$ to get a further theoretical result regarding the statistical properties of the kernel density estimator. Inserting $ h_{opt}$ into (3.18) gives

$\displaystyle \amise(\widehat f_{h_{opt}})= \frac{5}{4}\left(\Vert K \Vert^{2}_{2}\right)^{4/5} \left(\mu_{2}(K)\Vert f'' \Vert _{2}\right)^{2/5} n^{-4/5},$ (3.20)

where the terms preceding $ n^{-4/5}$ are constant with respect to $ n$. Hence, as the sample size grows, the $ \amise$ converges to zero at the rate $ n^{-4/5}$. If you take the $ \amise$-optimal bandwidth of the histogram (2.25) and plug it into (2.24), you will find that the histogram's $ \amise$ converges at the slower rate $ n^{-2/3}$, giving yet another reason why the kernel density estimator is superior to the histogram.
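
Coming back to the fact that $ h_{opt}$ in (3.19) depends on the unknown $ \Vert f''\Vert^{2}_{2}$: one way around this, shown here purely as an illustration, is to evaluate (3.19) for a reference density. If $ f$ is taken to be $ N(0,\sigma^{2})$ and $ K$ is the Gaussian kernel, then $ \Vert K\Vert^{2}_{2}=1/(2\sqrt{\pi})$, $ \mu_{2}(K)=1$ and $ \Vert f''\Vert^{2}_{2}=3/(8\sqrt{\pi}\,\sigma^{5})$, so that (3.19) reduces to $ h_{opt}=\left(\frac{4\sigma^{5}}{3n}\right)^{1/5}\approx 1.06\,\sigma\, n^{-1/5}$. A minimal Python sketch (the sample, seed and function name are illustrative):

\begin{verbatim}
import numpy as np

def normal_reference_bandwidth(data):
    # (3.19) evaluated for a Gaussian kernel under a normal reference
    # density N(0, sigma^2): h = (4 sigma^5 / (3 n))^(1/5)
    sigma = np.std(data, ddof=1)
    n = len(data)
    return (4.0 * sigma**5 / (3.0 * n)) ** 0.2

rng = np.random.default_rng(1)
x = rng.normal(size=500)                   # illustrative sample
print(f"normal-reference bandwidth: h ~ {normal_reference_bandwidth(x):.3f}")
# roughly 1.06 * sigma * n**(-1/5), i.e. about 0.3 for sigma ~ 1, n = 500
\end{verbatim}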