4.2 Visualization

4.2.1 Data Visualization

Visualization of data is a fundamental task in modern statistical practice. The most common figure for this purpose is the bivariate scatter diagram. Figure 4.1a displays the levels of blood fats in a sample of men with heart disease. The data have been transformed to a logarithmic scale to minimize the effects of skewness. At first glance, the data appear to follow a bivariate normal distribution. The sample correlation is only 0.22. One might examine each of the two variables separately as a univariate scatter diagram, commonly referred to as a "dot plot", but such figures are rarely presented. Tukey advocated the histogram-like stem-and-leaf plot or the box-and-whiskers plot, which displays simple summaries including the median and quartiles. Figure 4.1b displays box-and-whiskers plots for these variables. Clearly, triglyceride values vary more than cholesterol values and may still be right-skewed.
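The summaries mentioned above (log transform, sample correlation, and the quartiles behind a box-and-whiskers plot) can be sketched in a few lines of standard-library Python. This is only an illustration: the function names are hypothetical, base 10 is assumed for the logarithm, and the sample values are invented rather than taken from the blood-fat data.

```python
import math
import statistics


def log10_transform(values):
    """Return base-10 logarithms of strictly positive measurements."""
    return [math.log10(v) for v in values]


def sample_correlation(x, y):
    """Pearson sample correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def box_summary(values):
    """Minimum, quartiles, and maximum, as summarized by a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


# Hypothetical measurements, invented for illustration only.
cholesterol = [180, 195, 210, 225, 240, 260, 300]
triglyceride = [80, 150, 95, 210, 130, 320, 170]

log_chol = log10_transform(cholesterol)
log_trig = log10_transform(triglyceride)

print("correlation:", sample_correlation(log_chol, log_trig))
print("cholesterol:", box_summary(log_chol))
print("triglyceride:", box_summary(log_trig))
```

Comparing the interquartile ranges `q3 - q1` of the two summaries is one simple way to check the visual impression that one variable is more spread out than the other.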

**Figure 4.1:** Cholesterol and triglyceride blood levels for males with heart disease

As shown later in Sect. 4.3.3, there may be rather subtle clusters within these data. The eye readily detects clusters that are well-separated, but it is not reliable when the clusters are not well-separated, nor when the sample size is so large that the scatter diagram is too crowded. For example, consider the Old Faithful Geyser data ([1]), $(x_t,y_t)$, where $x_t$ measures the waiting time between successive eruptions of the geyser and $y_t$ measures the duration of the subsequent eruption. The data were blurred by adding uniform noise to the nearest minute for $x_t$ and to the nearest second for $y_t$. Figure 4.2 displays histograms of these two variables. Interestingly, neither appears to follow the normal distribution. The common feature of interest is the appearance of two modes. One group of eruptions lasts only about $2$ minutes, while the other averages over $4$ minutes in duration. Likewise, the waiting times between eruptions cluster into two groups, one shorter than an hour and the other longer than an hour. The distribution of eruption durations appears to be a mixture of two normal densities, but the distribution of the waiting times appears more complicated.
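For reference, a two-component normal mixture of the kind suggested by the duration histogram is conventionally written as follows. The notation is assumed here for illustration and is not taken from the text: $y$ denotes eruption duration, $w$ a mixing weight, and $\mu_k$, $\sigma_k^2$ the component means and variances. No parameter values are estimated.

$$
f(y) = w\,\phi(y;\mu_1,\sigma_1^2) + (1-w)\,\phi(y;\mu_2,\sigma_2^2),
\qquad 0 < w < 1,
\qquad
\phi(y;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}
\exp\!\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\}.
$$

Under this reading, $\mu_1$ would sit near $2$ minutes and $\mu_2$ somewhat above $4$ minutes, matching the two modes described above.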

**Figure 4.2:** Waiting time and duration of consecutive eruptions of the Old Faithful Geyser

**Figure 4.3:** Two scatter diagrams of the Old Faithful Geyser data

Finally, in Fig. 4.3 we examine scatter diagrams of both the waiting time against the subsequent eruption duration, $(x_t,y_t)$, and the lagged values of eruption duration, $(y_{t-1},y_t)$. The common feature in these two densities is the presence of three modes. As mentioned earlier, the eye is well-suited to discerning clusters that are well-separated. From Fig. 4.3a, short waiting periods are associated with long eruption durations. From Fig. 4.3b, all eruptions of short duration are followed by eruptions of long duration. Missing from Fig. 4.3b are any examples of eruptions of short duration following eruptions of short duration, which should be a plus for the disappointed tourist. The observant reader may notice an odd clustering of points at integer values of the eruption duration. A quick count shows that a number of the original $299$ values occur exactly at whole-minute durations, including $4$ minutes. Examining the original time sequence suggests that these measurements occur in clumps; perhaps accurate measurements were not taken after dark. Exploration of these data has not only revealed interesting features but also suggested possible data collection anomalies.
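The two checks described in this paragraph, forming the lagged pairs $(y_{t-1},y_t)$ and counting observations that fall exactly on whole minutes, can be sketched as below. The function names, the 3-minute cutoff separating short from long eruptions, and the sample durations are all assumptions made for illustration; they are not the Old Faithful measurements.

```python
from collections import Counter


def lagged_pairs(series):
    """Return the pairs (y[t-1], y[t]) for t = 1, ..., n-1."""
    return list(zip(series[:-1], series[1:]))


def integer_value_counts(series, tol=1e-9):
    """Count observations that sit (within tol) exactly on a whole number."""
    counts = Counter()
    for y in series:
        nearest = round(y)
        if abs(y - nearest) <= tol:
            counts[nearest] += 1
    return counts


def short_after_short(series, threshold=3.0):
    """Count lagged pairs in which both eruptions are shorter than threshold."""
    return sum(1 for prev, cur in lagged_pairs(series)
               if prev < threshold and cur < threshold)


# Hypothetical durations in minutes, invented for illustration only.
durations = [4.0, 1.9, 4.4, 2.0, 4.6, 3.9, 4.0, 2.1, 4.3]

print(lagged_pairs(durations)[:3])
print(integer_value_counts(durations))
print(short_after_short(durations))
```

Listing the positions of the whole-number values in the sequence, rather than only counting them, would show whether they arrive in clumps as the text suggests.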

Massive datasets present different challenges. For example, the Landsat IV remote sensing dataset described by [37] contains pixel-level information for a scene from North Dakota imaged in 1977. The variables displayed in Fig. 4.4 are the time of peak greenness of the crop in each pixel and the derived value of the maximum greenness, scaled to values 0-255 and blurred with uniform noise. Overplotting is apparent. Each successive frame drills down into the boxed region shown in the previous one. Only $5.6\,\%$ of the points are eliminated going to the second frame; $35.5\,\%$ are eliminated between the second and third frames; and $38.1\,\%$ between the third and final frames, still leaving $8624$ points. Overplotting is still apparent in the final frame. Generally, gleaning detailed density information from scatter diagrams is difficult at best. Nonparametric density estimation solves this problem.
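The elimination rates quoted above compound multiplicatively, which the following sketch makes explicit. The function name is hypothetical, and because the quoted percentages are rounded, the implied size of the first frame is only a back-of-envelope figure.

```python
def retained_fraction(eliminated_fractions):
    """Fraction of points surviving a sequence of drill-down steps."""
    kept = 1.0
    for frac in eliminated_fractions:
        kept *= 1.0 - frac
    return kept


# Elimination rates quoted in the text for the three drill-down steps.
steps = [0.056, 0.355, 0.381]
kept = retained_fraction(steps)      # roughly 0.377

# Rough size of the first frame implied by the 8624 points left at the end.
implied_start = 8624 / kept

print(f"retained: {kept:.3f}, implied starting points: {implied_start:.0f}")
```

So even after three zooms, well over a third of the points remain in the final frame, which is why overplotting persists there.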

**Figure 4.4:** Drilling into the Landsat IV data

To see the difficulty of gleaning density information from the graphs in Fig. 4.4, compare the scatter diagram in frame (b) of Fig. 4.4 with the bivariate histogram of the same data displayed in Fig. 4.5. The scatter diagram alone gives no way to judge the relative frequency of data in the two largest clusters; the histogram makes it visible.
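A bivariate histogram with rectangular bins reduces to counting points per grid cell, as in the sketch below. The function names, the bin origin `(x0, y0)`, the bin widths `hx` and `hy`, and the sample points are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def bivariate_histogram(points, x0, y0, hx, hy):
    """Count points in rectangular bins of size hx by hy anchored at (x0, y0).

    Returns a Counter mapping (column, row) bin indices to counts.
    """
    counts = Counter()
    for x, y in points:
        col = math.floor((x - x0) / hx)
        row = math.floor((y - y0) / hy)
        counts[(col, row)] += 1
    return counts


def relative_frequencies(counts):
    """Convert bin counts to proportions of the total sample."""
    total = sum(counts.values())
    return {bin_index: c / total for bin_index, c in counts.items()}


# Hypothetical points: three nearly coincide and would overplot in a scatter diagram.
points = [(0.50, 0.50), (0.50, 0.50), (0.51, 0.49), (1.50, 0.50)]

counts = bivariate_histogram(points, x0=0.0, y0=0.0, hx=1.0, hy=1.0)
print(relative_frequencies(counts))
```

The three overlapping points look like a single dot when plotted, but the bin count records all of them, which is the information the scatter diagram hides.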

**Figure 4.5:** Histogram of data in Fig. 4.4b

The bivariate histogram uses rectangular bins. An interesting hybrid solution is to use hexagonal bins and to represent each bin count with a glyph. [36] compared the statistical power of using squares, hexagons, and equilateral triangles as shapes for bins of bivariate histograms and concluded that hexagons were the best choice. [5] examined drawing a glyph in each bivariate bin rather than rendering a perspective view. For graphical reasons, Carr found that hexagonal bins were more effective. The bin count is represented by a hexagonal glyph whose area is proportional to the bin count. Figure 4.6 displays the hexagonal mosaic map of the same data as in Fig. 4.5. This representation gives a quite accurate summary of the density information. No bin counts are obscured, as they can be in the perspective view of the bivariate histogram.
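One common way to bin points into regular hexagons is to treat the hexagon centers as two staggered rectangular lattices and assign each point to the nearer of its closest center on each lattice. The sketch below follows that idea; it is not claimed to be the procedure used by [5], and the function names, the `width` parameter, and the sample points are assumptions. Because glyph area is proportional to the count, the glyph's linear size scales with the square root of the count.

```python
import math
from collections import Counter


def hexbin_counts(points, width):
    """Assign points to regular hexagonal bins.

    Centers lie on two rectangular lattices with horizontal spacing `width`
    and vertical spacing sqrt(3) * width; the second lattice is shifted by
    half a cell in both directions. Returns a Counter keyed by
    (lattice, i, j), where lattice is 0 or 1.
    """
    dx, dy = width, math.sqrt(3.0) * width
    counts = Counter()
    for x, y in points:
        # Closest center on lattice 0: (i * dx, j * dy).
        i0, j0 = round(x / dx), round(y / dy)
        d0 = (x - i0 * dx) ** 2 + (y - j0 * dy) ** 2
        # Closest center on lattice 1: ((i + 0.5) * dx, (j + 0.5) * dy).
        i1, j1 = round(x / dx - 0.5), round(y / dy - 0.5)
        d1 = (x - (i1 + 0.5) * dx) ** 2 + (y - (j1 + 0.5) * dy) ** 2
        counts[(0, i0, j0) if d0 <= d1 else (1, i1, j1)] += 1
    return counts


def glyph_scale(count, max_count):
    """Linear scale factor for a glyph whose area is proportional to count."""
    return math.sqrt(count / max_count)


# Hypothetical points, invented for illustration only.
points = [(0.0, 0.0), (0.1, 0.05), (0.5, 0.866), (0.45, 0.9)]
counts = hexbin_counts(points, width=1.0)
largest = max(counts.values())
for key, c in counts.items():
    print(key, c, glyph_scale(c, largest))
```

With this scaling, a bin holding a quarter of the maximum count is drawn at half the linear size, so the eye compares areas rather than lengths.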

**Figure 4.6:** Hexagonal bin glyph of the data in Fig. 4.4b

In the next section, some of the algorithms for nonparametric density estimation and their theoretical properties are discussed. We then return to the visualization of data in higher dimensions.
