1.2 Histograms

Histograms are density estimates. A density estimate gives a good impression of the distribution of the data. In contrast to boxplots, density estimates can reveal multimodality in the data. The idea is to represent the data density locally by counting the number of observations in a sequence of consecutive intervals (bins) with origin $x_0$. Let $B_j(x_0,h)$ denote the bin of length $h$ which is an element of a bin grid starting at $x_0$:

\begin{displaymath}B_j(x_0,h)=[x_0+(j-1)h, x_0+jh),\quad j\in \mathbb{Z},\end{displaymath}

where $[.,.)$ denotes a left closed and right open interval. If $\{x_i\}^n_{i=1}$ is an i.i.d. sample with density $f$, the histogram is defined as follows:
\begin{displaymath}
\widehat f_h(x) = n^{-1}h^{-1}\sum_{j \in \mathbb{Z}}\sum^n_{i=1}
{\boldsymbol{I}}\{x_i\in B_j(x_0,h)\}
{\boldsymbol{I}}\{x\in B_j(x_0,h)\}.
\end{displaymath} (1.7)


In sum (1.7) the first indicator function ${\boldsymbol{I}}\{x_i\in B_j(x_0,h)\}$ (see Symbols & Notation in Appendix A) counts the number of observations falling into bin $B_j(x_0,h)$. The second indicator function is responsible for ``localizing'' the counts around $x$. The parameter $h$ is a smoothing or localizing parameter and controls the width of the histogram bins. An $h$ that is too large leads to very wide blocks and thus to a histogram that shows little structure. On the other hand, an $h$ that is too small gives a highly variable estimate with many unimportant peaks.
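The estimator (1.7) can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the function name `histogram_density` and the small sample are made up for this example and are not taken from the bank note data or from the quantlets cited in the figures.

```python
import numpy as np

def histogram_density(x, data, x0, h):
    """Evaluate the histogram estimate (1.7) at the points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    # Index j of the bin B_j(x0, h) = [x0 + (j-1)h, x0 + jh) containing a value.
    j_x = np.floor((x - x0) / h).astype(int) + 1
    j_data = np.floor((data - x0) / h).astype(int) + 1
    # For each evaluation point, count the observations in the same bin.
    counts = (j_data[None, :] == j_x[:, None]).sum(axis=1)
    # Divide by n * h so that the estimate integrates to one.
    return counts / (data.size * h)

# Hypothetical sample, chosen so that no value lies on a bin edge.
sample = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 1.10])
print(histogram_density([0.1, 0.3, 1.1], sample, x0=0.0, h=0.2))
```

Comparing bin indices makes the two indicator functions explicit: an observation contributes to $\widehat f_h(x)$ only if it falls into the same bin as $x$.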

Figure 1.6: Diagonal of counterfeit bank notes. Histograms with $x_0=137.8$ and $h=0.1$ (upper left), $h=0.2$ (lower left), $h=0.3$ (upper right), $h=0.4$ (lower right). MVAhisbank1.xpl
\includegraphics[width=1\defpicwidth]{histo1.ps}

The effect of $h$ is illustrated in Figure 1.6. The upper left panel shows the histogram for the diagonal of the counterfeit bank notes with $x_0=137.8$ (the minimum of these observations) and $h=0.1$. Increasing $h$ to $h=0.2$ with the same origin, $x_0=137.8$, results in the histogram shown in the lower left of the figure. This histogram is somewhat smoother due to the larger $h$. The binwidth is next set to $h=0.3$ (upper right). From this histogram, one has the impression that the distribution of the diagonal is bimodal with peaks at about 138.5 and 139.9. The detection of modes requires fine tuning of the binwidth. Using methods from smoothing methodology (Härdle et al., 2003) one can find an ``optimal'' binwidth $h$ for $n$ observations, derived for a standard normal reference density:

\begin{displaymath}
h_{opt}=\left( \frac{24\sqrt\pi}{n}\right)^{1/3}.
\end{displaymath}
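As a quick numerical check of this rule, the following Python sketch evaluates $h_{opt}$ for a few sample sizes. The function name `optimal_binwidth` is an assumption made for this illustration. Note that the formula contains no scale term, so it is applied here exactly as stated.

```python
import math

def optimal_binwidth(n):
    """Binwidth h_opt = (24 * sqrt(pi) / n)^(1/3) for n observations."""
    if n <= 0:
        raise ValueError("n must be positive")
    return (24.0 * math.sqrt(math.pi) / n) ** (1.0 / 3.0)

# The binwidth shrinks at the rate n^(-1/3): eight times as many
# observations halve the binwidth.
for n in (100, 800, 6400):
    print(n, round(optimal_binwidth(n), 3))
```

The slow $n^{-1/3}$ rate means that even large samples call for fairly wide bins.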

Unfortunately, the binwidth $h$ is not the only parameter determining the shape of $\widehat f_h$.

Figure 1.7: Diagonal of counterfeit bank notes. Histogram with $h=0.4$ and origins $x_{0}=137.65$ (upper left), $x_{0}=137.75$ (lower left), $x_{0}=137.85$ (upper right), $x_{0}=137.95$ (lower right). MVAhisbank2.xpl
\includegraphics[width=1\defpicwidth]{histo2.ps}

In Figure 1.7, we show histograms with $x_{0}=137.65$ (upper left), $x_0=137.75$ (lower left), $x_0=137.85$ (upper right), and $x_0=137.95$ (lower right). All the graphs have been scaled equally on the $y$-axis to allow comparison. One sees that, despite the fixed binwidth $h$, the four panels do not lead to the same interpretation. Shifting the origin $x_0$ to 4 different locations created 4 different histograms. This property of histograms strongly contradicts the goal of presenting data features: the same data are represented quite differently by the 4 histograms. A remedy has been proposed by Scott (1985): ``Average the shifted histograms!'' The result is presented in Figure 1.8.
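The dependence on the origin can be reproduced with a tiny artificial sample. The sketch below assumes NumPy; the helper `bin_counts` and the six data values are hypothetical and serve only to show how the same binwidth yields different bin counts for two origins.

```python
import numpy as np

def bin_counts(data, x0, h):
    """Counts per bin of width h on a grid with origin x0 (occupied range only)."""
    data = np.asarray(data, dtype=float)
    # Zero-based index of the bin [x0 + j*h, x0 + (j+1)*h) holding each value.
    j = np.floor((data - x0) / h).astype(int)
    # Shift so the leftmost occupied bin has index 0, then tally.
    return np.bincount(j - j.min())

# Hypothetical data: two groups of three observations.
data = [0.45, 0.55, 0.65, 1.45, 1.55, 1.65]

print(bin_counts(data, x0=0.0, h=0.5))   # → [1 2 1 2]
print(bin_counts(data, x0=0.25, h=0.5))  # → [3 0 3]
```

With origin $0.25$ the two groups appear as two clearly separated blocks, whereas with origin $0$ each group is split across two bins and the picture is much flatter.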

Figure 1.8: Averaged shifted histograms based on all (counterfeit and genuine) Swiss bank notes: there are 2 shifts (upper left), 4 shifts (lower left), 8 shifts (upper right), and 16 shifts (lower right). MVAashbank.xpl
\includegraphics[width=1\defpicwidth]{ashnotes.ps}

Here all bank note observations (genuine and counterfeit) have been used. The averaged shifted histogram is no longer dependent on the origin and shows a clear bimodality of the diagonals of the Swiss bank notes.
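One way to sketch the averaging idea in Python is shown below. It assumes NumPy and assumes that the shifted origins are spaced evenly at $x_0 + l\,h/M$, $l=0,\dots,M-1$, for $M$ shifts; the names `hist_density` and `ash_density` are made up for this illustration and do not refer to the quantlet used for Figure 1.8.

```python
import numpy as np

def hist_density(x, data, x0, h):
    """Histogram density estimate with origin x0 and binwidth h, evaluated at x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    j_x = np.floor((x - x0) / h).astype(int)
    j_data = np.floor((data - x0) / h).astype(int)
    counts = (j_data[None, :] == j_x[:, None]).sum(axis=1)
    return counts / (data.size * h)

def ash_density(x, data, x0, h, n_shifts):
    """Average of n_shifts histograms whose origins are shifted by h / n_shifts."""
    # Assumed shift scheme: evenly spaced origins within one binwidth.
    origins = x0 + h * np.arange(n_shifts) / n_shifts
    return np.mean([hist_density(x, data, o, h) for o in origins], axis=0)

# Hypothetical data: two groups of three observations.
data = np.array([0.45, 0.55, 0.65, 1.45, 1.55, 1.65])
grid = np.linspace(0.0, 2.0, 9)
print(ash_density(grid, data, x0=0.0, h=0.5, n_shifts=4))
```

Each shifted histogram integrates to one, so their average does too. With more shifts the estimate becomes piecewise constant on a finer grid of width $h/M$ and depends less and less on the particular choice of $x_0$.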

Summary
$\ast$
Modes of the density are detected with a histogram.
$\ast$
Modes correspond to strong peaks in the histogram.
$\ast$
Histograms with the same $h$ need not be identical. They also depend on the origin $x_0$ of the grid.
$\ast$
The influence of the origin $x_0$ is drastic. Changing $x_0$ creates different looking histograms.
$\ast$
The consequence of an $h$ that is too large is an unstructured histogram that is too flat.
$\ast$
A binwidth $h$ that is too small results in an unstable histogram.
$\ast$
There is an ``optimal'' $h=(24\sqrt\pi/n)^{1/3}$.
$\ast$
It is recommended to use averaged shifted histograms. They approximate kernel density estimates.