2.2 Statistical Properties

Let us investigate some of the properties of the histogram as an estimator of the unknown pdf $ f(x)$. Suppose that the origin $ x_{0}=0$ and that we want to estimate the density at some point $ x\in B_{j}=[(j-1)h,jh)$. The density estimate assigned by the histogram to $ x$ is

$\displaystyle \widehat f_{h}(x)=\frac{1}{nh} \sum_{i=1}^n \Ind(X_{i}\in B_{j}).$ (2.6)
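For concreteness, here is a minimal numerical sketch of (2.6). The function name, the simulated standard normal sample, and all parameter values are illustrative choices, not part of the text:

```python
import numpy as np

def hist_density(x, data, h, x0=0.0):
    """Histogram density estimate (2.6) at the point x, with origin x0 and binwidth h."""
    j = np.floor((x - x0) / h)                 # zero-based bin index: x in [x0 + j*h, x0 + (j+1)*h)
    left, right = x0 + j * h, x0 + (j + 1) * h
    return np.mean((data >= left) & (data < right)) / h   # (1/(nh)) * sum of indicators

# Illustration: 500 draws from a standard normal, binwidth h = 0.5
rng = np.random.default_rng(42)
sample = rng.standard_normal(500)
print(hist_density(0.25, sample, h=0.5))       # estimate of f(0.25)
```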

2.2.1 Bias

Is $ E\{\widehat f_{h}(x)\}=f(x)$, i.e. is the histogram an unbiased estimator? Let us calculate $ E\{\widehat f_{h}(x)\}$ to find out:

$\displaystyle E\{\widehat f_{h}(x)\}=\frac{1}{nh} \sum_{i=1}^n E\{\Ind(X_{i}\in B_{j})\}.$ (2.7)

Since the $ X_{i} \, (i=1,2,\ldots,n)$ are i.i.d. random variables it follows that the indicator functions $ \Ind(X_{i}\in B_{j})$ are also i.i.d. random variables, and we can write

$\displaystyle E\{\widehat f_{h}(x)\}=\frac{1}{nh} n E\{\Ind(X_{i}\in B_{j})\}.$ (2.8)

It remains to find $ E\{\Ind(X_{i}\in B_{j})\}$. It is straightforward to derive the pdf of the random variable $ \Ind(X_{i}\in B_{j})$:

$\displaystyle \Ind(X_{i}\in B_{j})=\left\{ \begin{array}{ll} 1&\textrm{with probability} \quad \int^{jh}_{(j-1)h} f(u)\,du,\\[2mm] 0&\textrm{with probability} \quad 1-\int^{jh}_{(j-1)h} f(u)\,du. \end{array} \right.$ (2.9)

Hence it is Bernoulli distributed and

$\displaystyle E\{\Ind(X_{i}\in B_{j})\} = P\left\{\Ind(X_{i}\in B_{j})=1\right\} =\int^{jh}_{(j-1)h} f(u)\,du\,,$ (2.10)

therefore

$\displaystyle E\{\widehat f_{h}(x)\}=\frac{1}{h}\int^{jh}_{(j-1)h} f(u)\,du.$ (2.11)

The last term is in general not equal to $ f(x)$. Consequently, the histogram is in general not an unbiased estimator of $ f(x)$:

$\displaystyle \bias\{\widehat f_{h}(x)\}=E\{\widehat f_{h}(x)-f(x)\}=\frac{1}{h}\int^{jh}_{(j-1)h} f(u)\,du-f(x).$ (2.12)

The precise value of the bias depends on the shape of the true density $ f(x)$. We can, however, derive an approximate bias formula that allows us to make some general remarks about which situations lead to a small or large bias. Using

$\displaystyle \frac{1}{h}\int^{jh}_{(j-1)h} f(u)\,du-f(x) = \frac{1}{h} \left\{\int^{jh}_{(j-1)h} \{f(u)-f(x)\}\,du\right\}$ (2.13)

and a first-order Taylor approximation of $ f(u)-f(x)$ around the center $ m_{j}=(j-\frac{1}{2})h$ of $ B_{j}$, i.e. $ f(u)-f(x)\approx f'(m_{j})(u-x)$, yields

$\displaystyle E\{\widehat f_{h}(x)-f(x)\} \approx f'\left\{m_j\right\} \left\{m_j-x\right\}.$ (2.14)

Note that the last step uses $ \frac{1}{h}\int_{B_{j}}(u-x)\,du=m_{j}-x$. The absolute value of the approximate bias increases with the slope of the true density at the midpoint $ m_j,$ and the approximate bias is zero if $ x=m_j.$
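These observations can be checked numerically. The following sketch compares the exact bias (2.12) with the approximation (2.14) for a standard normal density; the binwidth and evaluation point are arbitrary illustrative choices:

```python
import math

def Phi(z):                          # standard normal cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):                          # standard normal pdf
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

h, x = 0.5, 0.1                      # illustrative binwidth and evaluation point
j = math.floor(x / h) + 1            # x lies in B_j = [(j-1)h, jh)
m_j = (j - 0.5) * h                  # bin center

exact_bias = (Phi(j * h) - Phi((j - 1) * h)) / h - phi(x)   # (2.12)
approx_bias = -m_j * phi(m_j) * (m_j - x)                   # (2.14), using f'(z) = -z*phi(z)
print(exact_bias, approx_bias)       # both close to -0.014
```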

2.2.2 Variance

Let us calculate the variance of the histogram to see how its volatility depends on the binwidth $ h$:

$\displaystyle \mathop{\mathit{Var}}\{\widehat f_{h}(x)\}=\mathop{\mathit{Var}}\left\{\frac{1}{nh} \sum_{i=1}^n \Ind(X_{i}\in B_{j}) \right\}.$ (2.15)

Since the $ X_{i}$ are i.i.d. we can write

$\displaystyle \mathop{\mathit{Var}}\{\widehat f_{h}(x)\}=\frac{1}{n^{2}h^{2}}\sum_{i=1}^n \mathop{\mathit{Var}}\{\Ind(X_{i}\in B_{j})\}=\frac{1}{n^{2}h^{2}}\, n \mathop{\mathit{Var}}\{\Ind(X_{i}\in B_{j})\}.$ (2.16)

From (2.9) we know that $ \Ind(X_{i}\in B_{j})$ is Bernoulli distributed with parameter $ p=\int_{B_{j}}f(u) \, du$. Since the variance of any Bernoulli random variable $ X$ is given by $ p(1-p)$ we have

$\displaystyle \mathop{\mathit{Var}}\{\widehat f_{h}(x)\} = \frac{1}{n^{2}h^{2}} n \int_{B_{j}}f(u) \, du \left( 1-\int_{B_{j}}f(u) \, du \right).$ (2.17)

You are asked in Exercise 2.3 to show that the right-hand side of (2.17) can be written as the sum of $ f(x)/(nh)$ and terms of smaller order. That is, there also exists an approximate formula for $ \mathop{\mathit{Var}}\{\widehat f_{h}(x)\}:$

$\displaystyle \mathop{\mathit{Var}}\{\widehat f_{h}(x)\} \approx \frac{1}{nh} f\left( x \right).$ (2.18)

We observe that the variance of the histogram is proportional to $ f(x)$ and decreases when $ nh$ increases. This implies that increasing $ h$ will reduce the variance. We know from (2.14) that increasing $ h$ will do the opposite to the bias. So how should we choose $ h$ if we want to have a small variance and a small bias?

Since variance and bias vary in opposite directions with $ h,$ we have to settle for finding the value of $ h$ that yields (in some sense) the optimal compromise between variance and bias reduction.
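The variance approximation (2.18) can likewise be verified by simulation. The sketch below repeatedly draws standard normal samples and compares the empirical variance of $ \widehat f_{h}(x)$ with $ f(x)/(nh)$; the sample size, binwidth, and evaluation point are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, x = 400, 0.2, 0.1                       # illustrative sample size, binwidth, point
j = int(np.floor(x / h)) + 1                  # x lies in B_j = [(j-1)h, jh)
left, right = (j - 1) * h, j * h

# Empirical variance of the histogram estimator over repeated samples
reps = 5000
est = np.empty(reps)
for r in range(reps):
    data = rng.standard_normal(n)
    est[r] = np.sum((data >= left) & (data < right)) / (n * h)

f_x = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # true N(0,1) density at x
print(est.var(), f_x / (n * h))               # simulated variance vs. approximation (2.18)
```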

2.2.3 Mean Squared Error

Consider the mean squared error ($ \mse$) of the histogram

$\displaystyle \mse\{\widehat f_{h}(x)\}=E[\{\widehat f_{h}(x)-f(x)\}^{2}],$ (2.19)

which can be written as the sum of the variance and the squared bias (this is a general result, not limited to this particular problem)

$\displaystyle \mse\{\widehat f_{h}(x)\} =\mathop{\mathit{Var}}\{\widehat f_{h}(x)\}+[\bias\{\widehat{f}_{h}(x)\}]^{2}.$ (2.20)

Hence, finding the binwidth $ h$ that minimizes the $ \mse$ might yield a histogram that is neither oversmoothed (i.e. one that overemphasizes variance reduction by employing a relatively large value of $ h$) nor undersmoothed (i.e. one that overemphasizes bias reduction by using a small binwidth). It can be shown (see Exercise 2.4) that
$\displaystyle \mse\{\widehat f_{h}(x)\} = \frac{1}{nh}f(x)+ \left[f'\left\{\left(j-\frac{1}{2}\right)h\right\}\right]^{2} \left\{\left(j-\frac{1}{2}\right)h-x\right\}^{2} + o(h) + o\left(\frac{1}{nh}\right),$ (2.21)

where $ o(h)$ and $ o\left(\frac{1}{nh}\right)$ denote terms not explicitly written down which are of lower order than $ h$ and $ \frac{1}{nh}$, respectively.

From (2.21) we can conclude that the histogram converges in mean square to $ f(x)$ if we let $ h\to 0$, $ nh\to\infty.$ That is, if we use more and more observations ( $ n\to\infty$) and smaller and smaller binwidth ($ h\to 0$) but do not shrink the binwidth too quickly ( $ nh\to\infty$) then the $ \mse$ of $ \widehat f_{h}(x)$ goes to zero. Since convergence in mean square implies convergence in probability, it follows that $ \widehat f_{h}(x)$ converges to $ f(x)$ in probability or, in other words, that $ \widehat f_{h}(x)$ is a consistent estimator of $ f(x).$
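This consistency result can be illustrated by simulation: if we let $ h$ shrink at the rate $ n^{-1/3}$ (so that $ h\to 0$ and $ nh\to\infty$), the simulated $ \mse$ at a fixed point decreases as $ n$ grows. A sketch, again using a standard normal density purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = 0.5
f_x = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # true N(0,1) density at x
reps = 1000

for n in (100, 1000, 10000):
    h = n ** (-1 / 3)                         # h -> 0 while nh = n^(2/3) -> infinity
    j = int(np.floor(x / h)) + 1              # x lies in B_j = [(j-1)h, jh)
    left, right = (j - 1) * h, j * h
    sq_err = np.empty(reps)
    for r in range(reps):
        data = rng.standard_normal(n)
        fhat = np.sum((data >= left) & (data < right)) / (n * h)
        sq_err[r] = (fhat - f_x) ** 2
    print(n, sq_err.mean())                   # simulated mse shrinks as n grows
```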

Figure 2.4 shows the $ \mse$ (at $ x=0.5$) for estimating the density function given in Exercise 2.9 as a function of the binwidth $ h$.

Figure 2.4: Squared bias (thin solid line), variance (dashed line) and $ \mse$ (thick line) for the histogram [quantlet: SPMhistmse]
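A figure of this kind is easy to recompute. The sketch below evaluates the exact squared bias from (2.12) and the exact variance (2.17) over a grid of binwidths; note that it uses a standard normal density at $ x=0.5$ for illustration rather than the density of Exercise 2.9, so the numbers differ from Figure 2.4:

```python
import math

def Phi(z): return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
def phi(z): return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

n, x = 500, 0.5
for h in (0.1, 0.2, 0.4, 0.8, 1.2, 2.0):
    j = math.floor(x / h) + 1                 # x lies in B_j = [(j-1)h, jh)
    p = Phi(j * h) - Phi((j - 1) * h)         # P(X in B_j)
    bias2 = (p / h - phi(x)) ** 2             # squared bias, from (2.12)
    var = p * (1.0 - p) / (n * h * h)         # exact variance (2.17)
    print(f"h={h:3.1f}  bias^2={bias2:.5f}  var={var:.5f}  mse={bias2 + var:.5f}")
```

The printed $ \mse$ column traces the familiar U-shape: for small $ h$ the variance dominates, for large $ h$ the squared bias does.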

2.2.4 Mean Integrated Squared Error

The application of the $ \mse$ formula is difficult in practice, since the derived formula for the $ \mse$ depends on the unknown density function $ f$ both in the variance and the squared bias term. Instead of looking at the accuracy of $ \widehat f_{h}(x)$ as an estimator of $ f$ at a single point, it might be worthwhile to have a global measure of accuracy. The most widely used global measure of estimation accuracy is the mean integrated squared error ($ \mise$):

$\displaystyle \mise(\widehat f_{h})$ $\displaystyle =$ $\displaystyle E\left[\int^{\infty}_{-\infty} \{\widehat f_{h}(x)-f(x)\}^{2}\,dx\right]$ (2.22)
  $\displaystyle =$ $\displaystyle \int^{\infty}_{-\infty}E\left[\{\widehat f_{h}(x)-f(x)\}^{2}\right]\,dx$
  $\displaystyle =$ $\displaystyle \int^{\infty}_{-\infty}\mse\{\widehat f_{h}(x)\}\,dx.$ (2.23)

Using (2.21) we can write
$\displaystyle \mise(\widehat f_{h})$ $\displaystyle \approx$ $\displaystyle \int \frac{1}{nh}f(x)\,dx +\int \sum_{j}\Ind(x\in B_{j}) \left\{\left(j-\frac{1}{2}\right)h-x\right\}^{2} \left[f'\left\{\left(j-\frac{1}{2}\right)h\right\}\right]^{2}\,dx$
  $\displaystyle =$ $\displaystyle \frac{1}{nh}+\sum_{j}\int_{B_{j}} \left\{x-\left(j-\frac{1}{2}\right)h\right\}^{2} \left[f'\left\{\left(j-\frac{1}{2}\right)h\right\}\right]^{2}\,dx$
  $\displaystyle \approx$ $\displaystyle \frac{1}{nh}+\sum_{j} \left[f'\left\{\left(j-\frac{1}{2}\right)h\right\}\right]^{2} \cdot \int_{B_{j}} \left\{x-\left(j-\frac{1}{2}\right)h\right\}^{2}\,dx$
  $\displaystyle \approx$ $\displaystyle \frac{1}{nh}+\frac{h^2}{12} \int\left\{ f'(x) \right\}^{2}\,dx = \frac{1}{nh}+\frac{h^{2}}{12}\Vert f' \Vert^{2}_{2}, \quad \textrm{ for }h\to 0.$

The integral $ \int_{B_{j}}\{x-(j-\frac{1}{2})h\}^{2}\,dx$ equals $ \frac{h^{3}}{12}$, and the final approximation recognizes $ \sum_{j}[f'\{(j-\frac{1}{2})h\}]^{2}\,h$ as a Riemann sum converging to $ \int\{f'(x)\}^{2}\,dx$ as $ h\to 0$.

Here, $ \Vert f' \Vert _{2}^2$ denotes the so-called squared $ L_{2}$-norm of the function $ f'$. Now, the asymptotic $ \mise$, denoted $ \amise$, is given by

$\displaystyle \amise(\widehat f_{h})=\frac{1}{nh}+\frac{h^{2}}{12}\Vert f' \Vert^{2}_{2}.$ (2.24)
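To see how the two terms of (2.24) trade off against each other, one can evaluate the $ \amise$ on a grid of binwidths, taking $ \Vert f'\Vert^{2}_{2}$ as given; the sketch below uses the standard normal value $ \frac{1}{4\sqrt{\pi}}$, see (2.26) below. The next subsection derives the minimizer in closed form:

```python
import numpy as np

def amise(h, n, fprime_norm2):
    """AMISE (2.24): variance term 1/(nh) plus squared-bias term (h^2/12)*||f'||_2^2."""
    return 1.0 / (n * h) + h**2 / 12.0 * fprime_norm2

n = 500
norm2 = 1.0 / (4.0 * np.sqrt(np.pi))        # ||f'||_2^2 for the standard normal, see (2.26)
hs = np.linspace(0.05, 2.0, 400)            # grid of candidate binwidths
print(hs[np.argmin(amise(hs, n, norm2))])   # numerical minimizer, about 0.44 here
```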

2.2.5 Optimal Binwidth

We are now in a position to employ a precise criterion for selecting an optimal binwidth $ h$: select the binwidth $ h$ that minimizes $ \amise$! Differentiating $ \amise$ with respect to $ h$ gives

$\displaystyle \frac{\partial \{\amise(\widehat f_{h})\}}{\partial h}=\frac{-1}{nh^{2}}+\frac{1}{6}h\Vert f' \Vert^{2}_{2}=0\;,$

hence

$\displaystyle h_{0}=\left(\frac{6}{n\Vert f' \Vert^{2}_{2}}\right)^{1/3} \sim n^{-1/3},$ (2.25)

where $ h_{0}$ denotes the optimal binwidth. Looking at (2.25) it becomes clear that we run into a problem if we want to calculate $ h_{0}$ since $ \Vert f'\Vert^{2}_{2}$ is unknown.

A way out of this dilemma will be described later, when we deal with kernel density estimation. For the moment, assume that we know the true density $ f(x)$. More specifically assume a standard normal distribution, i.e.

$\displaystyle f(x)=\varphi(x)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^{2}}{2}\right).$

In order to calculate $ h_{0}$ we have to find $ \Vert f'\Vert^{2}_{2}$. Using the fact that for the standard normal distribution $ f'(x)=(-x)f(x),$ it can be shown that

$\displaystyle \Vert f'\Vert^{2}_{2} = \frac{1}{\sqrt{2\pi}}\sqrt{\frac{1}{2}}\int x^{2}\, \frac{1}{\sqrt{2\pi}} \frac{1}{\sqrt{\frac{1}{2}}}\exp\left(-x^{2}\right)\,dx. $

Since the integrand is $ x^{2}$ times the pdf of a normal distribution with mean $ \mu=0$ and variance $ \sigma^{2}=\frac{1}{2}$, the integral is just the variance of a random variable following that distribution, and we can write

$\displaystyle \Vert f'\Vert^{2}_{2} =\frac{1}{\sqrt{2\pi}}\sqrt{\frac{1}{2}} \cdot\frac{1}{2}=\frac{1}{4\sqrt{\pi}}.$ (2.26)

Using this result we can calculate $ h_{0}$ for this application:

$\displaystyle h_{0}=\left(\frac{6}{n\Vert f'\Vert^{2}_{2}}\right)^{1/3} =\left(\frac{6\cdot 4\sqrt{\pi}}{n}\right)^{1/3} =\left(\frac{24\sqrt{\pi}}{n}\right)^{1/3} \approx 3.5\, n^{-1/3}.$ (2.27)

Unfortunately, in practice we do not know $ f$ (if we did there would be no point in estimating it). However, (2.27) can serve as a rule-of-thumb binwidth (Scott, 1992, Subsection 3.2.3).
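In code, the rule of thumb is a one-liner. The sketch below additionally rescales by the sample standard deviation, which is the usual practical adjustment for data not on a standard normal scale (Scott's rule); note that this scaling step goes beyond (2.27) itself:

```python
import numpy as np

def rule_of_thumb_binwidth(data):
    """Binwidth (2.27), rescaled by the sample standard deviation (Scott's rule).
    The text derives the constant 3.5 for a standard normal; multiplying by an
    estimate of sigma is the usual practical adjustment for other scales."""
    n = len(data)
    return 3.5 * np.std(data, ddof=1) * n ** (-1.0 / 3.0)

rng = np.random.default_rng(7)
sample = 2.0 * rng.standard_normal(1000) + 5.0   # illustrative sample with sigma = 2
print(rule_of_thumb_binwidth(sample))            # about 0.7 here
```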