
4.4 Visualization of Trivariate Functionals

The field of scientific visualization has greatly enhanced the set of tools available for the statistician interested in exploring the features of a density estimate in more than two dimensions. In this section, we demonstrate by example the exploration of trivariate data.

We continue our analysis of the durations of 299 consecutive eruptions of the Old Faithful geyser. A histogram of these data is displayed in Fig. 4.2b. We further modified the data as follows: the 105 values that were recorded only to the nearest minute were blurred by adding uniform noise of 30 seconds in duration. (The remaining data points were recorded to the nearest second.) An easy way to generate high-dimensional data from a univariate time series is to group adjacent values. In Fig. 4.12, ASHs of the univariate data $\{y_t\}$ and the lagged data $\{(y_{t-1},y_t)\}$ are shown. The obvious question is whether knowledge of $y_{t-1}$ is useful for predicting the value of $y_t$. Clearly, the answer is yes, but the structure would not be well represented by an autoregressive model.
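The grouping of adjacent values can be sketched in a few lines of Python. This is only an illustration: the helper name `lagged_tuples` is hypothetical, and the series `y` is a synthetic stand-in with two modes, not the geyser data.

```python
import random


def lagged_tuples(y, d):
    """Return the d-tuples (y[t-d+1], ..., y[t]) formed from adjacent values."""
    if d < 1 or d > len(y):
        raise ValueError("d must be between 1 and len(y)")
    return [tuple(y[t - d + 1 : t + 1]) for t in range(d - 1, len(y))]


# Synthetic bimodal series of 299 "durations" in minutes (assumption, not the real data).
rng = random.Random(0)
y = [rng.choice([rng.gauss(1.9, 0.2), rng.gauss(4.3, 0.3)]) for _ in range(299)]

pairs = lagged_tuples(y, 2)    # (y_{t-1}, y_t)
triples = lagged_tuples(y, 3)  # (y_{t-2}, y_{t-1}, y_t)
print(len(pairs), len(triples))  # 298 297
```

A series of length $n$ yields $n-d+1$ lagged $d$-tuples, so 299 durations give 298 pairs and 297 triples.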

**Figure 4.12:** Averaged shifted histograms of the Old Faithful geyser duration data

Next, we computed the ASH for the trivariate lagged data $\{(y_{t-2},y_{t-1},y_t)\}$. The resulting estimate, $\widehat{f}_{\mathrm{ASH}}(y_{t-2},y_{t-1},y_t)$, may be explored in several ways. The question is whether knowledge of $y_{t-2}$ helps to predict the joint behavior of $(y_{t-1},y_t)$. This may be investigated, for example, by examining slices of the trivariate density. Since the (univariate) density has two modes, at $1.88$ and $4.33$ minutes, we examine the slices $\widehat{f}_{\mathrm{ASH}}(1.88,y_{t-1},y_t)$ and $\widehat{f}_{\mathrm{ASH}}(4.33,y_{t-1},y_t)$; see Fig. 4.13. The 297 data points were divided into two groups, depending on whether $y_{t-2}<3.0$ or not. The first group of points was added to Fig. 4.13a, while the second group was added to Fig. 4.13b.
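A minimal sketch of the slicing and grouping steps, assuming the estimate is stored as a nested list indexed by bin along each axis. The function names and the lattice range of 1 to 6 minutes are assumptions made for illustration; only the 100 bins per axis, the slice values, and the cut at 3.0 come from the text.

```python
def nearest_bin(centers, value):
    """Index of the bin center closest to value."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


def slice_first_axis(grid, centers, value):
    """Return the 2-D slice grid[i] whose first-axis bin center is nearest value."""
    return grid[nearest_bin(centers, value)]


def split_by_first_lag(triples, cut=3.0):
    """Partition triples by whether the first coordinate y_{t-2} is below cut."""
    low = [p for p in triples if p[0] < cut]
    high = [p for p in triples if p[0] >= cut]
    return low, high


# Hypothetical lattice: 100 bins per axis over an assumed range of 1 to 6 minutes.
nbins = 100
lo, hi = 1.0, 6.0
width = (hi - lo) / nbins
centers = [lo + (i + 0.5) * width for i in range(nbins)]

i_short = nearest_bin(centers, 1.88)  # slice through the short-duration mode
i_long = nearest_bin(centers, 4.33)   # slice through the long-duration mode
print(i_short, i_long)
```

With a full estimate `grid`, the two panels would correspond to `slice_first_axis(grid, centers, 1.88)` and `slice_first_axis(grid, centers, 4.33)`, each overlaid with one of the two groups returned by `split_by_first_lag`.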

**Figure 4.13:** Slices of the trivariate averaged shifted histogram of lagged values of the Old Faithful geyser duration data

**Figure 4.14:** Visualization of the $\alpha = 58{\%}$ contour of the trivariate ASH of the lagged geyser duration data

Since each axis was divided into 100 bins, there are 98 other views like those in Fig. 4.13 that one might examine. (An animation is actually quite informative.) However, one may obtain a holistic view by examining level sets of the full trivariate density. A level set is the set of all points $\mathbf{x}$ such that $\widehat{f}_{\mathrm{ASH}}(\mathbf{x})=\alpha \widehat{f}_{\mathrm{max}}$, where $\widehat{f}_{\mathrm{max}}$ is the maximum or modal value of the density estimate, and $\alpha\in (0,1)$ is a constant that determines the contour level. Such contours are typically smooth surfaces in $\Re^3$. In the limiting case $\alpha=1$, the "contour" is simply the modal location. In Fig. 4.14, the contour corresponding to $\alpha=58\,\%$ is displayed. Clearly these data are multimodal, as five well-separated high-density regions are apparent. Each cluster corresponds to a different sequence of eruption durations, such as long-long-long. The five clusters are now also quite apparent in both frames of Fig. 4.13. Of the eight possible sequences, three are not observed in this sequence of 299 eruptions.
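The idea of counting clusters at a given contour level can be sketched as follows. The code thresholds a lattice of density values at $\alpha \widehat{f}_{\mathrm{max}}$ and counts face-connected components of the region at or above that level (the contour is the boundary of this region). Everything here is an assumption for illustration: the density is a synthetic mixture of equal Gaussian bumps, and the choice of which five of the eight short/long corners carry a bump is arbitrary, since the text does not say which sequences occur.

```python
import math
from collections import deque
from itertools import product


def superlevel_mask(f, alpha):
    """Lattice points where f >= alpha * max(f); f is a dict keyed by (i, j, k)."""
    level = alpha * max(f.values())
    return {idx for idx, v in f.items() if v >= level}


def count_components(mask):
    """Count face-connected (6-neighbour) components of a set of lattice points."""
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, n = set(), 0
    for start in mask:
        if start in seen:
            continue
        n += 1
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first flood fill of one component
            i, j, k = queue.popleft()
            for di, dj, dk in steps:
                nb = (i + di, j + dj, k + dk)
                if nb in mask and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
    return n


# Synthetic density (assumption): bumps at five of the eight (short, long)^3 corners.
S, L, sigma = 2.0, 4.5, 0.3
bumps = [(L, L, L), (L, L, S), (L, S, L), (S, L, L), (S, L, S)]
axis = [1.0 + 0.25 * i for i in range(19)]  # lattice from 1.0 to 5.5 minutes
f = {}
for (i, x), (j, y), (k, z) in product(enumerate(axis), repeat=3):
    f[(i, j, k)] = sum(
        math.exp(-((x - a) ** 2 + (y - b) ** 2 + (z - c) ** 2) / (2 * sigma**2))
        for a, b, c in bumps
    )

mask = superlevel_mask(f, 0.58)
print(count_components(mask))  # 5
```

Lowering $\alpha$ enlarges the region, so components that are separate at a high level may merge at a lower one; this is one reason several contour levels are more informative than a single one.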

A single contour does not convey as much information as several. Depending on the display device, one may reasonably view three to five contours, using transparency to see the higher-density contours that are "inside" the lower-density contours. Consider adding a second contour corresponding to $\alpha= 28\,\%$ to that in Fig. 4.14. Rather than attempt to use transparency, we choose an alternative representation that emphasizes the underlying algorithms. The software that produced these figures is called ashn and is available at the author's website. ASH values are computed on a three-dimensional lattice. The surfaces are constructed using the marching cubes algorithm ([20]), which generates thousands of triangles that make up each surface. In Fig. 4.15, we choose not to plot all of the triangles but only every other "row" along the second axis. The striped effect allows one to interpolate and complete the low-density contour, while allowing one to look inside and see the high-density contour. Since there are five clusters, this construction is repeated five times. A smaller sixth cluster is suggested as well.
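The following sketch illustrates only the first step of a marching-cubes-style procedure, not the triangle generation itself and not the ashn implementation: each lattice cell gets an 8-bit code recording which of its corners lie at or above the contour level, and only cells with mixed corners are crossed by the surface. The striping is imitated by keeping every other row of cells along the second axis. The function names, the corner ordering, and the toy lattice are all assumptions.

```python
from itertools import product


def cube_index(f, i, j, k, level):
    """8-bit code: bit b is set when corner b of cell (i, j, k) is at or above level.

    Corner order is simply itertools.product order, an arbitrary choice that
    need not match any published marching cubes triangle table.
    """
    code = 0
    for bit, (di, dj, dk) in enumerate(product((0, 1), repeat=3)):
        if f[i + di][j + dj][k + dk] >= level:
            code |= 1 << bit
    return code


def crossing_cells(f, level):
    """Cells whose corners are neither all below (0) nor all at/above (255) level."""
    n0, n1, n2 = len(f), len(f[0]), len(f[0][0])
    return [
        (i, j, k)
        for i, j, k in product(range(n0 - 1), range(n1 - 1), range(n2 - 1))
        if cube_index(f, i, j, k, level) not in (0, 255)
    ]


def every_other_row(cells, axis=1):
    """Keep only cells with an even index along the given axis (striped display)."""
    return [c for c in cells if c[axis] % 2 == 0]


# Toy 3 x 3 x 3 lattice (assumption): a single peak at the central lattice point.
f = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
f[1][1][1] = 1.0

cells = crossing_cells(f, 0.5)
striped = every_other_row(cells)
print(len(cells), len(striped))  # 8 4
```

In a full implementation, each crossing cell's code would select a triangle configuration from a lookup table, with vertex positions interpolated along the cell edges; drawing the triangles of only the kept rows would give a striped surface of the kind described above.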

**Figure 4.15:** Visualization of the $\alpha =28{\%}$ and $58{\%}$ contours of the trivariate ASH of the lagged geyser duration data
