1.2 Histograms
Histograms are density estimates.
A density estimate gives a good impression of the distribution of the
data.
In contrast to boxplots, density estimates show possible multimodality
of the data. The idea is to locally represent the data density by counting
the number of observations in a sequence of consecutive intervals (bins)
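with origin $x_0$.
Let $B_j(x_0,h)$ denote the bin of length $h$ which is the $j$-th element
of a bin grid starting at $x_0$:
$$B_j(x_0,h) = [x_0 + (j-1)h,\; x_0 + jh), \qquad j \in \mathbb{Z},$$
where $[\cdot,\cdot)$ denotes a left closed and right open interval.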
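If $\{x_i\}_{i=1}^n$ is an i.i.d. sample with density $f$, the
histogram is defined as follows:
$$\hat f_h(x) = n^{-1} h^{-1} \sum_{j \in \mathbb{Z}} \sum_{i=1}^n I\{x_i \in B_j(x_0,h)\}\, I\{x \in B_j(x_0,h)\}. \tag{1.7}$$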
In sum (1.7) the first indicator function $I\{x_i \in B_j(x_0,h)\}$
(see Symbols & Notation in Appendix A)
counts the number of observations falling into bin $B_j(x_0,h)$.
The second indicator function is responsible for ``localizing''
the counts around $x$.
The parameter $h$ is a smoothing or localizing parameter and
controls the width of the histogram bins.
An $h$ that is too large leads to very big blocks and thus to a very
unstructured histogram. On the other hand, an $h$ that is too small
gives a very variable estimate with many unimportant peaks.
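The definition (1.7) translates almost literally into code. The following Python sketch (NumPy is assumed; the function name `hist_density` and the toy sample are illustrative and not part of the book's quantlets) determines the bin index $j$ of every evaluation point and of every observation, and then counts the observations that share a bin with $x$.

```python
import numpy as np

def hist_density(x, data, x0, h):
    """Evaluate the histogram estimate (1.7) at the points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    # Index j of the bin B_j(x0, h) = [x0 + (j-1)h, x0 + jh) containing each value.
    bin_of_x = np.floor((x - x0) / h).astype(int) + 1
    bin_of_data = np.floor((data - x0) / h).astype(int) + 1
    # For each x, count the observations lying in the same bin (the double sum in (1.7)).
    counts = (bin_of_x[:, None] == bin_of_data[None, :]).sum(axis=1)
    # Normalize by n * h so that the estimate integrates to one.
    return counts / (data.size * h)

# Toy sample (an assumption for illustration, not the bank note data).
sample = [0.1, 0.2, 0.7, 1.6]
values = hist_density([0.25, 0.9, 1.2], sample, x0=0.0, h=0.5)
```

For this toy sample the estimate at `x = 0.25` is $2/(4 \cdot 0.5) = 1$, since two of the four observations fall into the bin $[0, 0.5)$, while the empty bin $[1, 1.5)$ gives the value 0 at `x = 1.2`.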
Figure 1.6: Diagonal of counterfeit bank notes. Histograms with a common origin $x_0$ and four binwidths $h$ (upper left, lower left, upper right, lower right). MVAhisbank1.xpl
The effect of $h$ is given in detail in Figure 1.6.
It contains the histogram (upper left) for the diagonal of the
counterfeit bank notes with the origin $x_0$ set to the minimum of
these observations and a small binwidth $h$.
Increasing $h$ and using the same origin $x_0$
results in the histogram shown in the lower left of the figure.
This density histogram is somewhat smoother due to the larger $h$.
The binwidth $h$ is next set to a third value (upper right).
From this histogram, one has the impression that the distribution of
the diagonal is bimodal with peaks at about 138.5 and 139.9.
The detection of modes requires a fine tuning of the binwidth. Using
methods from smoothing methodology (Härdle et al.; 2003) one can
find an ``optimal'' binwidth $h$ for $n$ observations:
$$h_{opt} = \left(\frac{24\sqrt{\pi}}{n}\right)^{1/3}.$$
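A widely used binwidth rule of this type is the normal reference rule, $h = (24\sqrt{\pi}/n)^{1/3}\sigma \approx 3.49\,\sigma\,n^{-1/3}$, which minimizes the asymptotic mean integrated squared error when the data are Gaussian with standard deviation $\sigma$. The short sketch below evaluates it; the function name and the `sigma` argument are assumptions made for illustration, and `sigma = 1` corresponds to standardized data.

```python
import math

def normal_reference_binwidth(n, sigma=1.0):
    """Binwidth (24 * sqrt(pi) / n)^(1/3) * sigma for n observations."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sigma * (24.0 * math.sqrt(math.pi) / n) ** (1.0 / 3.0)

# Binwidth for n = 100 standardized observations.
h_100 = normal_reference_binwidth(100)
```

The binwidth shrinks only at the rate $n^{-1/3}$: eight times as many observations are needed to halve $h$.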
Unfortunately, the binwidth $h$ is not the only parameter
determining the shape of $\hat f_h$.
Figure 1.7: Diagonal of counterfeit bank notes. Histograms with a fixed binwidth $h$ and four origins $x_0$ (upper left, lower left, upper right, lower right). MVAhisbank2.xpl
In Figure 1.7, we show histograms with four different origins $x_0$
(upper left, lower left, upper right, and lower right).
All the graphs have been scaled equally on the $y$-axis
to allow comparison.
One sees that, despite the fixed binwidth $h$, the interpretation is not
facilitated. The shift of the origin $x_0$
(to 4 different locations)
created 4 different histograms.
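The dependence on the origin can be reproduced with a few lines of Python. The sketch below (NumPy assumed; the helper `hist_counts` and the four-point sample are hypothetical and chosen only to make the effect visible) bins the same data with the same binwidth on two grids that differ only in $x_0$.

```python
import numpy as np

def hist_counts(data, x0, h):
    """Bin counts on the grid x0, x0 + h, x0 + 2h, ... (requires x0 <= min(data))."""
    data = np.asarray(data, dtype=float)
    # Number of bins needed so that the largest observation is covered.
    n_bins = int(np.floor((data.max() - x0) / h)) + 1
    edges = x0 + h * np.arange(n_bins + 1)
    counts, _ = np.histogram(data, bins=edges)
    return edges, counts

# Toy sample (an assumption for illustration, not the bank note data).
sample = [0.9, 1.1, 1.9, 2.1]
_, counts_a = hist_counts(sample, x0=0.0, h=1.0)  # bins [0,1), [1,2), [2,3]
_, counts_b = hist_counts(sample, x0=0.5, h=1.0)  # bins [0.5,1.5), [1.5,2.5]
```

With origin 0 the counts are 1, 2, 1, which suggests a single peak in the middle; with origin 0.5 they are 2, 2, which looks flat, although the data and the binwidth are unchanged.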
This property of histograms strongly contradicts the goal of presenting
data features. Obviously, the same data are represented quite differently by
the 4 histograms. A remedy has been proposed by Scott (1985):
``Average the shifted histograms!''. The result is presented in
Figure 1.8.
Figure 1.8: Averaged shifted histograms based on all (counterfeit and genuine) Swiss bank notes: there are 2 shifts (upper left), 4 shifts (lower left), 8 shifts (upper right), and 16 shifts (lower right). MVAashbank.xpl
Here all bank note observations (genuine and counterfeit) have been used.
The averaged shifted histogram is no longer dependent on the origin and
shows a clear bimodality of the diagonals of the Swiss bank notes.
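The averaging idea can be sketched as follows: compute $m$ histograms with the same binwidth $h$ whose origins are shifted by $h/m$ relative to each other, and take their pointwise mean. This is a minimal Python illustration under stated assumptions (NumPy, hypothetical function names, a toy sample); it is not the quantlet MVAashbank.xpl.

```python
import numpy as np

def hist_density(x, data, x0, h):
    """Histogram estimate with origin x0 and binwidth h, evaluated at x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    bin_of_x = np.floor((x - x0) / h)
    bin_of_data = np.floor((data - x0) / h)
    # Count, for each x, the observations falling into the same bin.
    counts = (bin_of_x[:, None] == bin_of_data[None, :]).sum(axis=1)
    return counts / (data.size * h)

def ash_density(x, data, x0, h, m):
    """Average of m histograms with origins x0, x0 + h/m, ..., x0 + (m-1)h/m."""
    origins = x0 + h * np.arange(m) / m
    return np.mean([hist_density(x, data, o, h) for o in origins], axis=0)

# Toy sample (an assumption for illustration, not the bank note data).
sample = [0.1, 0.2, 0.7, 1.6]
ash_value = ash_density([0.3], sample, x0=0.0, h=0.5, m=2)
```

At `x = 0.3` the histogram with origin 0 gives 1.0 and the one with origin 0.25 gives 0.5, so the average over the two shifts is 0.75. As the number of shifts $m$ grows, the estimate becomes smoother and the influence of the initial origin $x_0$ fades.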
Summary

- Modes of the density are detected with a histogram.

- Modes correspond to strong peaks in the histogram.

- Histograms with the same $h$ need not be identical.
They also depend on the origin $x_0$ of the grid.

- The influence of the origin $x_0$ is drastic. Changing $x_0$
creates different looking histograms.

- The consequence of an $h$ that is too large is an
unstructured histogram that is too flat.

- A binwidth $h$ that is too small results in an unstable histogram.

- There is an ``optimal'' binwidth $h$.

- It is recommended to use averaged histograms. They are kernel densities.