Visualization of data is a fundamental task in modern statistical
practice. The most common figure for this purpose is the bivariate
scatter diagram. Figure 4.1a
displays the levels of blood fats in a sample of men with heart
disease. The data have been transformed to a logarithmic scale
to minimize the effects of skewness. At a first glance, the data
appear to follow a bivariate normal distribution. The sample
correlation is only 0.22. One might examine each of the two variables
separately as a univariate scatter diagram, which is commonly referred
to as a "dot plot," but such figures are rarely
presented. Tukey advocated the histogram-like stem-and-leaf plot or
the box-and-whiskers plot, which displays simple summaries including
the median and quartiles. Figure 4.1b displays
box-and-whisker plots for these variables. Clearly triglyceride values
vary more than cholesterol and may still be right-skewed.
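The summaries drawn by a box-and-whiskers plot can be computed directly. The following is a minimal sketch, not the procedure used for Fig. 4.1b: the raw values are hypothetical, the base-10 logarithm is an assumption made only for illustration, and the quartile rule is the default of Python's `statistics.quantiles`, which may differ slightly from other conventions.

```python
import math
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence of numbers."""
    ordered = sorted(values)
    # Three cut points dividing the data into quarters (default "exclusive" rule).
    q1, q2, q3 = statistics.quantiles(ordered, n=4)
    return ordered[0], q1, q2, q3, ordered[-1]

# Hypothetical right-skewed measurements (NOT the blood-fat data of the text).
raw = [80, 95, 110, 130, 150, 190, 260, 340, 520]

# Assumed base-10 log transform to reduce skewness.
logged = [math.log10(v) for v in raw]

summary = five_number_summary(logged)
iqr = summary[3] - summary[1]  # interquartile range, the height of the box
```

Comparing `iqr` between two variables gives a rough numerical counterpart to the visual comparison of box heights.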
As shown later in Sect. 4.3.3, there may be rather subtle
clusters within these data. The eye can readily detect clusters which
are well-separated, but the eye is not reliable when the clusters are
not well-separated, nor when the sample size is so large that the
scatter diagram is too crowded. For example, consider the Old
Faithful Geyser data ([1]), a pair of variables in which the first
measures the waiting time between successive eruptions of the geyser
and the second measures the duration of the subsequent eruption. The
data were blurred by adding uniform noise to the nearest minute for
the waiting time and to the nearest second for the duration.
Figure 4.2 displays histograms of these two variables.
Interestingly, neither appears to follow the normal distribution. The
common feature of interest is the appearance of two modes. One group
of eruptions is short in duration, while the other averages
considerably longer. Likewise, the waiting time between
eruptions clusters into two groups, one less than an hour and the
other greater than one hour. The distribution of eruption durations
appears to be a mixture of two normal densities, but the distribution
of the waiting times appears more complicated.
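To make the idea of a two-component normal mixture concrete, the following sketch evaluates such a density on a grid and counts its modes. The mixing weight, means, and standard deviations are hypothetical values chosen only for illustration; they are not estimates from the geyser data.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of the normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weight, mean1, sd1, mean2, sd2):
    """Two-component normal mixture: weight*N(mean1, sd1) + (1-weight)*N(mean2, sd2)."""
    return (weight * normal_pdf(x, mean1, sd1)
            + (1.0 - weight) * normal_pdf(x, mean2, sd2))

def count_modes(density_values):
    """Count strict interior local maxima of a sequence of density values."""
    return sum(
        1
        for i in range(1, len(density_values) - 1)
        if density_values[i - 1] < density_values[i] > density_values[i + 1]
    )

# Hypothetical parameters (assumptions, not fitted values).
params = dict(weight=0.35, mean1=2.0, sd1=0.3, mean2=4.5, sd2=0.4)

grid = [i / 100 for i in range(100, 601)]  # 1.00, 1.01, ..., 6.00
density = [mixture_pdf(x, **params) for x in grid]
n_modes = count_modes(density)
```

With components this well separated the mixture is bimodal; moving the two means closer together eventually merges the modes into one, which is why a histogram alone cannot always reveal the number of components.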
Finally, in Fig. 4.3 we examine the scatter diagrams
of waiting time against eruption duration and of eruption duration
against its lagged value. The common feature in these two densities is the
presence of three modes. As mentioned earlier, the eye is well-suited
to discerning clusters that are well-separated. From
Fig. 4.3a, short waiting periods are associated with
long eruption durations. From Fig. 4.3b, all
eruptions of short duration are followed by eruptions of long
duration. Missing from Fig. 4.3b are any examples
of eruptions of short duration following eruptions of short duration,
which should be a plus for the disappointed tourist. The observant
reader may notice an odd clustering of points at integer values of the
eruption duration. A quick count shows that a number of
the original values occur exactly at whole numbers of
minutes. Examining the original time
sequence suggests that these measurements occur in clumps; perhaps
accurate measurements were not taken after dark. Exploration of these
data has not only revealed interesting features but also suggested
possible data collection anomalies.
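Both checks described above, pairing each duration with its predecessor and counting values that fall exactly on an integer, are simple to carry out. The sketch below uses a short hypothetical series of durations in minutes (not the geyser data), and the 3-minute cutoff separating short from long eruptions is likewise an assumption made for illustration.

```python
def lagged_pairs(series, lag=1):
    """Pair each value with the one `lag` steps earlier: (y[t-lag], y[t])."""
    return list(zip(series[:-lag], series[lag:]))

def integer_counts(series):
    """Count how many values fall exactly on each integer."""
    counts = {}
    for value in series:
        if float(value).is_integer():
            key = int(value)
            counts[key] = counts.get(key, 0) + 1
    return counts

# Hypothetical eruption durations in minutes (NOT the Old Faithful data).
durations = [4.2, 1.9, 4.6, 4.0, 2.0, 4.4, 4.0, 2.0, 4.1, 3.0]

pairs = lagged_pairs(durations)      # points for a lagged scatter diagram
exact = integer_counts(durations)    # clumping at whole minutes

# Assumed cutoff of 3 minutes between "short" and "long" eruptions.
short_after_short = [(a, b) for a, b in pairs if a < 3.0 and b < 3.0]
```

An empty `short_after_short` list corresponds to the empty corner of the lagged scatter diagram, and unexpectedly large entries in `exact` flag possible rounding in the recorded measurements.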
Massive datasets present different challenges. For example, the
Landsat IV remote sensing dataset described
by [37] contains information on pixels of a scene
imaged in 1977 from North Dakota. The variables displayed in
Fig. 4.4 are the time of peak greenness of the crop in each
pixel and the derived value of the maximum greenness, scaled to values
0-255 and blurred with uniform noise. Overplotting is
apparent. Each successive figure drills down into the boxed region
shown. Only a small share of the points is eliminated going to the
second frame, and further points are eliminated between the second
and third frames and between the third and final frames, yet many
points remain. Overplotting is still apparent in the
final frame. Generally, gleaning detailed density information from
scatter diagrams is difficult at best. Nonparametric density
estimation solves this problem.
To see the difficulty of gleaning density information from the graphs in Fig. 4.4, consider the bivariate histogram displayed in Fig. 4.5 for the data in frame (b) of Fig. 4.4. From the scatter diagram alone, there is no way to judge the relative frequency of data in the two largest clusters; that information becomes visible only in the histogram.
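The bin counts behind such a histogram are easy to compute. The following minimal sketch bins a hypothetical two-cluster sample (not the Landsat data) on an equally spaced rectangular mesh; the cluster sizes, locations, mesh limits, and the function name `histogram2d` are all assumptions chosen for illustration.

```python
import random

def histogram2d(points, x_range, y_range, nx, ny):
    """Count points in an nx-by-ny mesh of equal rectangular bins."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = [[0] * ny for _ in range(nx)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # ignore points outside the mesh
        # A point on the upper edge is assigned to the last bin.
        i = min(int((x - x0) / (x1 - x0) * nx), nx - 1)
        j = min(int((y - y0) / (y1 - y0) * ny), ny - 1)
        counts[i][j] += 1
    return counts

# Hypothetical sample: a large cluster near (0, 0) and a smaller one near (3, 3).
rng = random.Random(0)
points = [(rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(3000)]
points += [(rng.gauss(3.0, 0.5), rng.gauss(3.0, 0.5)) for _ in range(1000)]

counts = histogram2d(points, (-3.0, 6.0), (-3.0, 6.0), nx=18, ny=18)
total = sum(sum(row) for row in counts)

# Bins 0..8 cover x in [-3, 1.5), which contains essentially all of the large cluster.
share_left = sum(sum(row) for row in counts[:9]) / total
```

Here `share_left` recovers the relative size of the larger cluster, a quantity that an overplotted scatter diagram of the same points would not reveal.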
The bivariate histogram uses rectangular bins. An interesting hybrid solution is to use hexagonal bins and a glyph to represent the bin count. [36] compared the statistical power of squares, hexagons, and equilateral triangles as bin shapes for bivariate histograms and concluded that hexagons were the best choice. [5] examined drawing a glyph in each bivariate bin rather than using the perspective view. For graphical reasons, Carr found hexagonal bins to be more effective. The bin count is represented by a hexagonal glyph whose area is proportional to the bin count. Figure 4.6 displays the hexagonal mosaic map of the same data as in Fig. 4.5. This representation gives a quite accurate summary of the density information. No bin counts are obscured, as they can be in the perspective view of the bivariate histogram.
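One way to assign points to hexagonal bins is to treat the hexagon centers as two interleaved rectangular lattices and to choose the nearer of the two candidate centers. The sketch below follows that idea; it is an illustration under stated assumptions, not the implementation used in [5] or for Fig. 4.6. The function names, the `size` parameter (distance between neighboring centers), and the sample points are hypothetical.

```python
import math
from collections import Counter

def hexbin_counts(points, size):
    """Count points per hexagonal bin; bins are keyed by (lattice, i, j)."""
    dx = size                    # spacing of centers within a row
    dy = size * math.sqrt(3.0)   # spacing of rows within one lattice
    counts = Counter()
    for x, y in points:
        # Lattice A: centers at (i*dx, j*dy).
        ia, ja = round(x / dx), round(y / dy)
        da = (x - ia * dx) ** 2 + (y - ja * dy) ** 2
        # Lattice B: centers at ((i+0.5)*dx, (j+0.5)*dy).
        ib, jb = math.floor(x / dx), math.floor(y / dy)
        db = (x - (ib + 0.5) * dx) ** 2 + (y - (jb + 0.5) * dy) ** 2
        # The nearer candidate is the center of the hexagon containing the point.
        key = ("A", ia, ja) if da <= db else ("B", ib, jb)
        counts[key] += 1
    return counts

def glyph_scale(counts):
    """Side-length scale per bin so that glyph AREA is proportional to count."""
    largest = max(counts.values())
    return {key: math.sqrt(n / largest) for key, n in counts.items()}

# Hypothetical points, for illustration only.
sample = [(0.0, 0.0), (0.1, -0.1), (0.5, 0.9), (2.0, 0.0)]
bins = hexbin_counts(sample, size=1.0)
scales = glyph_scale(bins)
```

Because area grows with the square of the side length, the glyph side is scaled by the square root of the relative count; scaling the side linearly would exaggerate the differences between bins.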
In the next section, some of the algorithms for nonparametric density estimation and their theoretical properties are discussed. We then return to the visualization of data in higher dimensions.