Let us investigate some of the properties of the histogram as an
estimator of the unknown pdf $f$.
Suppose that the origin is $x_0 = 0$ and that we want to estimate
the density at some point $x \in B_j = [(j-1)h, jh)$. The density
estimate assigned by the histogram to $x$ is

$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} I(X_i \in B_j) \tag{2.6}$$
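For illustration, here is a minimal Python sketch of (2.6), the histogram estimate at a single point; the function name and the normal test data are illustrative choices, not fixed by the text.

```python
import numpy as np

def hist_estimate(x, data, h):
    """Histogram density estimate at x with origin 0, cf. (2.6)."""
    lo = np.floor(x / h) * h                 # left edge of the bin B_j containing x
    in_bin = (data >= lo) & (data < lo + h)  # indicator I(X_i in B_j)
    return in_bin.sum() / (len(data) * h)    # (1/(nh)) * number of X_i in B_j

rng = np.random.default_rng(0)
sample = rng.standard_normal(1000)
print(hist_estimate(0.5, sample, h=0.4))     # compare with phi(0.5) ~ 0.3521
```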
Is $E\{\hat f_h(x)\} = f(x)$, i.e. is the histogram an
unbiased estimator? Let us calculate $E\{\hat f_h(x)\}$ to
find out:

$$E\{\hat f_h(x)\} = \frac{1}{nh} \sum_{i=1}^{n} E\{I(X_i \in B_j)\} \tag{2.7}$$
Since the $X_1, \ldots, X_n$ are i.i.d. random variables,
it follows that the indicator functions $I(X_i \in B_j)$
are also i.i.d. random variables,
and we can write

$$E\{\hat f_h(x)\} = \frac{1}{nh}\, n\, E\{I(X \in B_j)\} = \frac{1}{h}\, E\{I(X \in B_j)\} \tag{2.8}$$
It remains to find $E\{I(X \in B_j)\}$. It is
straightforward to
derive the pdf of the random variable $I(X \in B_j)$:

$$P\{I(X \in B_j) = 1\} = \int_{B_j} f(u)\,du, \qquad
P\{I(X \in B_j) = 0\} = 1 - \int_{B_j} f(u)\,du \tag{2.9}$$
Hence it is Bernoulli distributed and

$$E\{I(X \in B_j)\} = \int_{B_j} f(u)\,du \tag{2.10}$$

therefore

$$E\{\hat f_h(x)\} = \frac{1}{h} \int_{B_j} f(u)\,du \tag{2.11}$$
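As a hedged numerical check of (2.11), the following sketch approximates $E\{\hat f_h(x)\}$ by Monte Carlo and compares it with $\frac{1}{h}\int_{B_j} f(u)\,du$; the setup (standard normal $f$, $x = 0.5$, $h = 0.4$) is an illustrative assumption.

```python
import numpy as np
from scipy.stats import norm

x, h, n, reps = 0.5, 0.4, 200, 5000
lo = np.floor(x / h) * h                      # left edge of the bin B_j containing x
rng = np.random.default_rng(1)

estimates = []
for _ in range(reps):
    data = rng.standard_normal(n)
    count = np.sum((data >= lo) & (data < lo + h))
    estimates.append(count / (n * h))         # f_hat_h(x) for this sample

print(np.mean(estimates))                     # Monte Carlo estimate of E{f_hat_h(x)}
print((norm.cdf(lo + h) - norm.cdf(lo)) / h)  # (1/h) * integral of f over B_j, cf. (2.11)
print(norm.pdf(x))                            # f(x); the difference is the bias (2.12)
```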
The last term is in general not equal to $f(x)$. Consequently,
the histogram is in general not an unbiased estimator of $f(x)$:

$$\mathrm{Bias}\{\hat f_h(x)\} = E\{\hat f_h(x)\} - f(x) = \frac{1}{h} \int_{B_j} f(u)\,du - f(x) \tag{2.12}$$
The precise value of the bias depends on the shape of the true
density $f$. We can, however, derive an approximate bias formula
that allows us to make some general remarks about the
situations that lead to a small or large bias.
Using

$$\mathrm{Bias}\{\hat f_h(x)\} = \frac{1}{h} \int_{B_j} \{f(u) - f(x)\}\,du \tag{2.13}$$

and a first-order Taylor approximation
of $f(u) - f(x)$ around the center
$m_j = \left(j - \tfrac{1}{2}\right)h$ of $B_j$, i.e.
$f(u) - f(x) \approx f'(m_j)\,(u - x)$,
yields

$$\mathrm{Bias}\{\hat f_h(x)\} \approx f'(m_j)\,(m_j - x) \tag{2.14}$$

Note that the absolute value of the approximate bias is increasing in the slope of the true density at the bin center $m_j$, and that the approximate bias is zero if $f$ has zero slope at $m_j$ or if $x$ lies exactly at the bin center $m_j$.
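The following sketch compares the exact bias (2.12) with the approximation (2.14); the standard normal density, the binwidth and the evaluation points are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm

h = 0.4
for x in (0.45, 0.55, 0.6):                 # points inside the bin [0.4, 0.8)
    lo = np.floor(x / h) * h                # left bin edge
    m = lo + h / 2                          # bin center m_j
    exact = (norm.cdf(lo + h) - norm.cdf(lo)) / h - norm.pdf(x)   # (2.12)
    approx = -m * norm.pdf(m) * (m - x)     # f'(m_j)(m_j - x), using phi'(u) = -u phi(u)
    print(x, round(exact, 4), round(approx, 4))
```

Note that the approximation vanishes at $x = m = 0.6$, as (2.14) predicts.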
Let us calculate the variance of the histogram to see how its
volatility depends on the binwidth $h$:

$$\mathrm{Var}\{\hat f_h(x)\} = \mathrm{Var}\left\{\frac{1}{nh} \sum_{i=1}^{n} I(X_i \in B_j)\right\} \tag{2.15}$$
Since the $I(X_i \in B_j)$ are i.i.d. we can write

$$\mathrm{Var}\{\hat f_h(x)\} = \frac{1}{n^2 h^2}\, n\, \mathrm{Var}\{I(X \in B_j)\} \tag{2.16}$$
From (2.9) we know
that $I(X \in B_j)$ is
Bernoulli distributed with parameter
$p = \int_{B_j} f(u)\,du$. Since
the variance of any Bernoulli random variable is given by
$p(1-p)$, we have

$$\mathrm{Var}\{\hat f_h(x)\} = \frac{1}{nh^2} \int_{B_j} f(u)\,du \left\{1 - \int_{B_j} f(u)\,du\right\} \tag{2.17}$$

You are asked in Exercise 2.3 to show that the right-hand side of (2.17) can be written as the sum of $\frac{1}{nh}\,f(x)$ and terms that are of smaller magnitude. That is, you are asked to show that there also exists an approximate formula for the variance of $\hat f_h(x)$:

$$\mathrm{Var}\{\hat f_h(x)\} \approx \frac{1}{nh}\, f(x)$$
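As a sketch (with illustrative values, again assuming a standard normal $f$), one can check by simulation that the exact variance (2.17) and the approximation $f(x)/(nh)$ are close for small $h$.

```python
import numpy as np
from scipy.stats import norm

x, h, n, reps = 0.5, 0.2, 500, 4000
lo = np.floor(x / h) * h                 # left edge of the bin containing x
rng = np.random.default_rng(2)

est = []
for _ in range(reps):
    data = rng.standard_normal(n)
    est.append(np.mean((data >= lo) & (data < lo + h)) / h)

print(np.var(est))                       # simulated Var{f_hat_h(x)}
p = norm.cdf(lo + h) - norm.cdf(lo)      # P(X in B_j)
print(p * (1 - p) / (n * h ** 2))        # exact variance (2.17)
print(norm.pdf(x) / (n * h))             # approximation f(x)/(nh)
```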
We observe that the variance of the histogram is proportional to
$\frac{1}{nh}$ and decreases when $nh$ increases. This implies that increasing
$h$ will reduce the variance.
We know from (2.14)
that increasing $h$ will do the opposite to the bias.
So how should we choose $h$ if we want to have a small
variance and a small bias?
Since variance and bias vary in opposite directions with $h$, we have to
settle for finding the value of $h$ that yields (in some sense)
the optimal compromise between variance and bias reduction.
Consider the mean squared error ($\mathrm{MSE}$) of the histogram

$$\mathrm{MSE}\{\hat f_h(x)\} = E\!\left[\left\{\hat f_h(x) - f(x)\right\}^2\right] \tag{2.18}$$
which can be written as the sum of the variance and the squared
bias (this is a general result,
not limited to this particular problem)

$$\mathrm{MSE}\{\hat f_h(x)\} = \mathrm{Var}\{\hat f_h(x)\} + \left[\mathrm{Bias}\{\hat f_h(x)\}\right]^2 \tag{2.19}$$
Hence, finding the binwidth that minimizes the $\mathrm{MSE}$ might
yield a histogram that is neither oversmoothed (i.e. one that
overemphasizes variance reduction by employing a relatively large
value of $h$) nor undersmoothed (i.e. one that overemphasizes
bias reduction by using a small binwidth).
It can be shown (see Exercise 2.4) that

$$\mathrm{MSE}\{\hat f_h(x)\} = \frac{1}{nh}\, f(x) + \{f'(m_j)\}^2\,(m_j - x)^2 + o(h) + o\!\left(\frac{1}{nh}\right) \tag{2.21}$$

where $o(h)$ and $o\!\left(\frac{1}{nh}\right)$ denote
terms not explicitly written down which are of lower
order than $h$ and
$\frac{1}{nh}$, respectively.
From (2.21) we can conclude that the histogram converges in mean
square to $f(x)$
if we let $n \to \infty$, $h \to 0$ and $nh \to \infty$.
That is, if we use more and
more observations ($n \to \infty$) and smaller and smaller binwidths ($h \to 0$),
but do not shrink the binwidth too quickly ($nh \to \infty$), then the
$\mathrm{MSE}$ of $\hat f_h(x)$
goes to zero. Since convergence in mean square
implies convergence in probability, it follows that $\hat f_h(x)$
converges to $f(x)$ in probability or, in other words, that $\hat f_h(x)$
is a consistent estimator of $f(x)$.
Figure 2.4 shows the $\mathrm{MSE}$ (at a fixed point $x$) for estimating
the density function given in Exercise 2.9 as a function of
the binwidth $h$.
Figure 2.4: Squared bias (thin solid line), variance (dashed line)
and $\mathrm{MSE}$ (thick line) for the histogram.
SPMhistmse
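The ingredients of such a plot can be computed exactly from (2.11), (2.12) and (2.17). Since the density of Exercise 2.9 is not restated here, the sketch below assumes a standard normal density instead; the point $x$, the sample size and the grid of binwidths are illustrative.

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

x, n = 0.5, 500
hs = np.linspace(0.05, 1.0, 200)
bias2, var = [], []
for h in hs:
    lo = np.floor(x / h) * h                  # left edge of the bin containing x
    p = norm.cdf(lo + h) - norm.cdf(lo)       # P(X in B_j)
    bias2.append((p / h - norm.pdf(x)) ** 2)  # squared bias, cf. (2.12)
    var.append(p * (1 - p) / (n * h ** 2))    # exact variance (2.17)

plt.plot(hs, bias2, lw=1, label="squared bias")
plt.plot(hs, var, ls="--", label="variance")
plt.plot(hs, np.array(bias2) + np.array(var), lw=2, label="MSE")
plt.xlabel("binwidth h")
plt.legend()
plt.show()
```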
The application of the $\mathrm{MSE}$ formula is difficult in practice,
since the derived formula for the $\mathrm{MSE}$ depends on
the unknown
density function $f$ both in the variance and the squared bias term.
Instead of looking at the accuracy of $\hat f_h(x)$ as an
estimator of $f$ at a single point, it might be
worthwhile to have a global measure of accuracy. The most widely
used global measure of estimation accuracy is the mean integrated
squared error ($\mathrm{MISE}$):

$$\mathrm{MISE}(\hat f_h) = \int_{-\infty}^{\infty} \mathrm{MSE}\{\hat f_h(x)\}\,dx \tag{2.22}$$
Using (2.21) we can write for any $x \in B_j$

$$\mathrm{MISE}(\hat f_h)
= \sum_{j} \int_{B_j} \left[\frac{1}{nh}\,f(x) + \{f'(m_j)\}^2\,(m_j - x)^2\right] dx
+ o(h^2) + o\!\left(\frac{1}{nh}\right)
= \frac{1}{nh} + \frac{h^2}{12}\,\|f'\|_2^2 + o(h^2) + o\!\left(\frac{1}{nh}\right)$$

using $\int_{B_j} (m_j - x)^2\,dx = \frac{h^3}{12}$ and approximating the Riemann sum $\sum_j \{f'(m_j)\}^2\, h$ by $\|f'\|_2^2$. Here,
$\|f'\|_2^2 = \int_{-\infty}^{\infty} \{f'(x)\}^2\,dx$ denotes the so-called
squared $L_2$-norm of the function $f'$.
Now, the asymptotic $\mathrm{MISE}$,
denoted $\mathrm{AMISE}$, is given by

$$\mathrm{AMISE}(\hat f_h) = \frac{1}{nh} + \frac{h^2}{12}\,\|f'\|_2^2 \tag{2.23}$$
We are now in a position to employ a precise criterion for
selecting an optimal binwidth
$h$: select the binwidth that minimizes
$\mathrm{AMISE}(\hat f_h)$!
Differentiating $\mathrm{AMISE}(\hat f_h)$ with respect to $h$ gives

$$\frac{\partial\, \mathrm{AMISE}(\hat f_h)}{\partial h} = -\frac{1}{nh^2} + \frac{h}{6}\,\|f'\|_2^2 \overset{!}{=} 0,$$

hence

$$h_{opt} = \left(\frac{6}{n\,\|f'\|_2^2}\right)^{1/3} \tag{2.24}$$

where $h_{opt}$ denotes the optimal binwidth.
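For illustration, (2.24) is a one-liner once $\|f'\|_2^2$ is known (the function name is illustrative); the standard normal value of $\|f'\|_2^2$ used below is derived in (2.25).

```python
import numpy as np

def h_opt(n, f_prime_norm_sq):
    """AMISE-optimal binwidth (2.24): (6 / (n * ||f'||_2^2))^(1/3)."""
    return (6.0 / (n * f_prime_norm_sq)) ** (1.0 / 3.0)

# Standard normal: ||f'||_2^2 = 1/(4*sqrt(pi)), cf. (2.25) below.
print(h_opt(1000, 1.0 / (4.0 * np.sqrt(np.pi))))   # ~ 0.35, i.e. 3.5 * n^(-1/3)
```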
Looking at (2.24) it becomes clear that we run into a
problem if
we want to calculate $h_{opt}$, since $\|f'\|_2^2$
is unknown.
A way out of this dilemma will be described later, when we deal
with kernel density estimation. For the moment, assume that we
know the true density $f$. More specifically, assume a standard
normal distribution, i.e. $f(x) = \varphi(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right)$.
In order to calculate $h_{opt}$ we have to find
$\|f'\|_2^2$. Using the fact that for the standard normal distribution
$\varphi'(x) = -x\,\varphi(x)$, it can be shown that

$$\|f'\|_2^2 = \int_{-\infty}^{\infty} x^2\, \varphi(x)^2\,dx
= \frac{1}{2\sqrt{\pi}} \int_{-\infty}^{\infty} x^2\, \frac{1}{\sqrt{\pi}}\, \exp(-x^2)\,dx$$
Since the term inside the integral is just the formula for
computing the variance of a random variable that follows a normal
distribution with mean $0$ and variance
$\frac{1}{2}$,
we can write

$$\|f'\|_2^2 = \frac{1}{2\sqrt{\pi}} \cdot \frac{1}{2} = \frac{1}{4\sqrt{\pi}} \tag{2.25}$$
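As a quick hedged check of (2.25), one can integrate $\{\varphi'(x)\}^2$ numerically:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# phi'(x) = -x * phi(x), so ||f'||_2^2 is the integral of (x * phi(x))^2
val, _ = quad(lambda x: (x * norm.pdf(x)) ** 2, -np.inf, np.inf)
print(val, 1.0 / (4.0 * np.sqrt(np.pi)))   # both ~ 0.141047
```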
Using this result we can calculate $h_{opt}$ for this application:

$$h_{opt} = \left(\frac{6}{n\, \frac{1}{4\sqrt{\pi}}}\right)^{1/3}
= \left(\frac{24\sqrt{\pi}}{n}\right)^{1/3} \approx 3.5\, n^{-1/3} \tag{2.26}$$
Unfortunately, in practice we do not know $f$ (if we
did there would be no point in estimating it). However,
(2.26) can serve as a rule-of-thumb binwidth
(Scott, 1992, Subsection 3.2.3).
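In the spirit of this rule of thumb, one can plug a scale estimate into the normal-reference formula (2.26); the sketch below assumes the sample standard deviation as scale estimate, which is one common but not the only possible choice.

```python
import numpy as np

def rule_of_thumb_binwidth(data):
    """Normal-reference binwidth h = 3.5 * sigma_hat * n^(-1/3), cf. (2.26)."""
    sigma_hat = np.std(data, ddof=1)        # sample standard deviation as scale estimate
    return 3.5 * sigma_hat * len(data) ** (-1.0 / 3.0)

rng = np.random.default_rng(3)
print(rule_of_thumb_binwidth(rng.standard_normal(1000)))   # ~ 0.35 for N(0,1) data
```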