Let us now turn to the statistical properties of kernel density estimators. We are interested in the mean squared error ($MSE$) since it combines squared bias and variance.
For the bias we have
$$
\mathrm{Bias}\{\hat{f}_h(x)\} = \frac{h^2}{2}\,\mu_2(K)\,f''(x) + o(h^2), \qquad h \to 0, \tag{3.13}
$$
where $\mu_2(K) = \int u^2 K(u)\,du$.
Observe from (3.13) that the bias is proportional to $h^2$. Hence, we have to choose a small $h$ to reduce the bias. Moreover, the bias depends on $f''(x)$, the curvature of the density at $x$. The effects of this dependence are illustrated in Figure 3.6, where the dashed lines mark $E\{\hat{f}_h(x)\}$ and the solid line the true density $f(x)$. The bias is thus given by the vertical difference between the dashed and the solid line.
Note that in ``valleys'' of $f$ the bias is positive since $f''(x) > 0$ around a local minimum of $f$. Consequently, the dashed line is always above the solid line. Near peaks of $f$ the opposite is true. The magnitude of the bias depends on the curvature of $f$, reflected in the absolute value of $f''(x)$. Obviously, large values of $|f''(x)|$ imply large values of the bias.
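To make the bias approximation concrete, here is a minimal Monte Carlo sketch (not from the text): it assumes standard normal data and a Gaussian kernel, for which $\mu_2(K) = 1$ and $f''(x) = (x^2 - 1)\varphi(x)$, and compares the simulated bias of $\hat{f}_h(0)$ with the asymptotic expression in (3.13); the sample size, bandwidth, and number of replications are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, h, x, reps = 500, 0.5, 0.0, 2000   # illustrative choices, not from the text

# Monte Carlo estimate of E{f_hat_h(x)} for N(0,1) data and a Gaussian kernel
est = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal(n)
    est[r] = norm.pdf((x - X) / h).mean() / h   # (1/(nh)) * sum K((x - X_i)/h)

mc_bias = est.mean() - norm.pdf(x)
# Asymptotic bias from (3.13): (h^2/2) * mu_2(K) * f''(x); for the standard
# normal density f''(x) = (x^2 - 1) * phi(x), negative at the peak x = 0
asy_bias = 0.5 * h**2 * (x**2 - 1) * norm.pdf(x)
print(f"Monte Carlo bias: {mc_bias:.4f}, asymptotic bias: {asy_bias:.4f}")
```

Since $x = 0$ is a peak of the density, both numbers come out negative, matching the discussion above.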
For the variance we calculate
$$
\mathrm{Var}\{\hat{f}_h(x)\} = \frac{1}{nh}\,\|K\|_2^2\, f(x) + o\!\left(\frac{1}{nh}\right), \qquad nh \to \infty, \tag{3.15}
$$
where $\|K\|_2^2 = \int K(u)^2\,du$.
Notice that the variance of the kernel density estimator is nearly proportional to $\frac{1}{nh}$. Hence, in order to make the variance small we have to choose a fairly large $h$. Large values of $h$ mean bigger intervals, more observations in each interval, and hence more observations that get non-zero weight in the sum $\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)$. As you may recall from the analysis of the properties of the sample mean in basic statistics, averaging over more observations produces averages with less variability.
Similarly, for a given value of $h$ (be it large or small), increasing the sample size $n$ will decrease $\frac{1}{nh}$ and therefore reduce the variance. This makes sense because a greater total number of observations means that, on average, there will be more observations in each interval.
Also observe that the variance is increasing in $\|K\|_2^2$. This term will be rather small for flat kernels such as the Uniform kernel. Intuitively speaking, we might say that smooth and flat kernels will produce less volatile estimates in repeated sampling since in each sample all realizations are given roughly equal weight.
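Analogously, a small simulation can be used to check the variance approximation (3.15); again a sketch under the same assumptions as before: standard normal data, Gaussian kernel with $\|K\|_2^2 = 1/(2\sqrt{\pi})$, and arbitrary illustrative values of $n$, $h$, and the replication count.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, h, x, reps = 500, 0.5, 0.0, 2000   # illustrative choices, not from the text

# Sampling distribution of f_hat_h(x) across repeated samples
est = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal(n)
    est[r] = norm.pdf((x - X) / h).mean() / h

mc_var = est.var()
# Asymptotic variance from (3.15): f(x) * ||K||_2^2 / (nh); for the Gaussian
# kernel ||K||_2^2 = 1 / (2 * sqrt(pi))
asy_var = norm.pdf(x) / (2 * np.sqrt(np.pi)) / (n * h)
print(f"Monte Carlo variance: {mc_var:.2e}, asymptotic variance: {asy_var:.2e}")
```

Doubling $n$ or $h$ in this sketch roughly halves both numbers, in line with the $\frac{1}{nh}$ rate.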
We have already seen for the histogram that choosing the bandwidth is a crucial problem in nonparametric (density) estimation. The kernel density estimator is no exception. If we look at formulae (3.13) and (3.15) we can see that we face the familiar trade-off between variance and bias: we would surely like to keep both small, but increasing $h$ will lower the variance while it will raise the bias (decreasing $h$ will do the opposite). Minimizing the $MSE$, i.e. the sum of variance and squared bias (cf. (2.20)), represents a compromise between over- and undersmoothing. Figure 3.7 puts variance, bias, and $MSE$ onto one graph.
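The trade-off is easy to visualize numerically. The following sketch (assumptions as in the previous snippets; the grid of bandwidths is arbitrary) evaluates the two leading $MSE$ terms at $x = 0$ over a range of bandwidths:

```python
import numpy as np
from scipy.stats import norm

n, x = 500, 0.0                            # illustrative choices
f_x = norm.pdf(x)                          # f(0)
fpp_x = (x**2 - 1) * norm.pdf(x)           # f''(0) for the N(0,1) density

h = np.linspace(0.05, 1.5, 200)
bias2 = (0.5 * h**2 * fpp_x) ** 2          # squared bias: increasing in h
var = f_x / (2 * np.sqrt(np.pi)) / (n * h) # variance: decreasing in h
mse = bias2 + var                          # asymptotic MSE, cf. (3.13), (3.15)

print(f"grid minimizer of the asymptotic MSE: {h[np.argmin(mse)]:.3f}")
```

Plotting `bias2`, `var`, and `mse` against `h` reproduces the qualitative picture of Figure 3.7.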
Moreover, looking at the $MSE$ provides a way of assessing whether the kernel density estimator is consistent. Recall that convergence in mean square implies convergence in probability, which is consistency. (3.13) and (3.15) yield
$$
MSE\{\hat{f}_h(x)\} = \frac{1}{nh}\,\|K\|_2^2\, f(x) + \frac{h^4}{4}\,\{f''(x)\}^2\,\{\mu_2(K)\}^2 + o\!\left(\frac{1}{nh}\right) + o(h^4). \tag{3.16}
$$
We observe that $MSE\{\hat{f}_h(x)\} \to 0$ as $h \to 0$ and $nh \to \infty$; hence the kernel density estimator is indeed consistent.
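A quick numerical illustration of this consistency argument (a sketch; the bandwidth sequence $h_n = n^{-1/3}$ is an arbitrary choice satisfying $h \to 0$ and $nh \to \infty$, with kernel and density as in the earlier snippets):

```python
import numpy as np
from scipy.stats import norm

x = 0.0
f_x = norm.pdf(x)                    # f(0)
fpp_x = (x**2 - 1) * norm.pdf(x)     # f''(0)

# Along any sequence with h -> 0 and n*h -> infinity, the asymptotic MSE
# in (3.16) vanishes; here h_n = n**(-1/3) is one such (arbitrary) choice
for n in [100, 1_000, 10_000, 100_000]:
    h = n ** (-1 / 3)
    mse = (0.5 * h**2 * fpp_x) ** 2 + f_x / (2 * np.sqrt(np.pi)) / (n * h)
    print(f"n = {n:6d}, h = {h:.4f}, asymptotic MSE = {mse:.2e}")
```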
In the case of the histogram we were able to reduce the dimensionality of the problem (in some sense) by using the $MISE$ instead of the $MSE$, the former having the added advantage of being a global rather than a local measure of estimation accuracy. Hence, in the following subsections we will turn our attention to the $MISE$ and derive the $MISE$-optimal bandwidth.
For the kernel density estimator the $MISE$ is given by
$$
MISE(\hat{f}_h) = \frac{1}{nh}\,\|K\|_2^2 + \frac{h^4}{4}\,\{\mu_2(K)\}^2\,\|f''\|_2^2 + o\!\left(\frac{1}{nh}\right) + o(h^4), \tag{3.17}
$$
where $\|f''\|_2^2 = \int \{f''(x)\}^2\,dx$.
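As a preview of the bandwidth choice problem, the leading terms of (3.17) are easy to evaluate numerically. The sketch below assumes a Gaussian kernel and takes $\|f''\|_2^2 = 3/(8\sqrt{\pi})$, the value for the standard normal density, purely for illustration (in practice $f''$ is unknown):

```python
import numpy as np

def amise(h, n, K_norm_sq, mu2_K, fpp_norm_sq):
    """Leading terms of (3.17): variance part plus squared-bias part."""
    return K_norm_sq / (n * h) + 0.25 * h**4 * mu2_K**2 * fpp_norm_sq

# Gaussian kernel constants; ||f''||_2^2 = 3/(8*sqrt(pi)) for the N(0,1)
# density -- an illustrative stand-in, since f'' is unknown in practice
K_norm_sq = 1 / (2 * np.sqrt(np.pi))
mu2_K = 1.0
fpp_norm_sq = 3 / (8 * np.sqrt(np.pi))

h_grid = np.linspace(0.05, 2.0, 400)
vals = amise(h_grid, n=500, K_norm_sq=K_norm_sq, mu2_K=mu2_K,
             fpp_norm_sq=fpp_norm_sq)
print(f"grid minimizer of the asymptotic MISE: {h_grid[np.argmin(vals)]:.3f}")
```

For $n = 500$ the grid minimizer lands near $0.31$, in line with the closed-form minimizer of the two leading terms of (3.17) under these assumptions.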