
11.5 Statistics

Visualization and statistics are inseparable. Statisticians have known this for a long time, but non-statisticians in the visualization field have largely ignored the role of statistics in charts, maps, and graphics. Non-statisticians often believe that visualization follows data analysis. We aggregate, summarize, model, and then display the results. In this view, visualization is the last step in the chain and statistics is the first.

In GOG, statistics falls in the middle of the chain. The consequence of this architecture is that statistical methods are an integral part of the system. We can construct dynamic graphics, in which statistical methods can be changed (for exploratory purposes) without altering any other part of the specification and without restructuring the data. By including statistical methods in its architecture, GOG also makes plain the independence of statistical methods and geometric displays. There is no necessary connection between regression methods and curves or between confidence intervals and error bars or between histogram binning and histograms.
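The independence of statistic and geometry described above can be illustrated with a small sketch. This is not GOG's actual interface: the function names (`build_graphic`, `stat_identity`, `stat_mean`) and the dictionary layout are assumptions made for illustration only. The point is that the statistic is one replaceable step in the middle of the chain, and that the same statistic can feed different geometries.

```python
from statistics import mean

# Hypothetical statistics: each takes a list of values and returns a list of values.
def stat_identity(values):
    return list(values)

def stat_mean(values):
    return [mean(values)]

def build_graphic(data, stat, geom):
    """Run the chain data -> statistic -> geometry (geometry is just a label here)."""
    return {"geom": geom, "rows": stat(data)}

data = [3.0, 1.0, 4.0, 1.0, 5.0]  # illustrative values, not from the text

g1 = build_graphic(data, stat_identity, "point")  # raw data as points
g2 = build_graphic(data, stat_mean, "point")      # change only the statistic
g3 = build_graphic(data, stat_mean, "bar")        # change only the geometry
```

Neither the data nor the other parts of the call change when the statistic is swapped, which is the property that makes exploratory changes cheap.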

In GOG, the statistics component receives a varset, computes various statistics, and outputs another varset. In the simplest case, the statistical method is an identity. We do this for scatterplots. Data points are input and the same data points are output. In other cases, such as histogram binning, a varset with $n$ rows is input and a varset with $k$ rows is output, where $k$ is the number of bins ($k < n$). With smoothers (regression or interpolation), a varset with $n$ rows is input and a varset with $k$ rows is output, where $k$ is the number of knots in a mesh over which smoothed values are computed. With point summaries (means, medians, $\ldots$), a varset with $n$ rows is input and a varset with one row is output. With regions (confidence intervals, ranges, $\ldots$), a varset with $n$ rows is input and a varset with two rows is output.
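The row counts in this paragraph can be checked with a minimal sketch. It assumes a simplified representation in which a varset is a list of `(value, case_ids)` rows; that representation, the function names, and the sample values are all hypothetical and chosen only for illustration. Smoothers are omitted to keep the sketch short.

```python
from statistics import mean

def identity(rows):
    """Scatterplot case: every input row is passed through unchanged."""
    return list(rows)

def bin_stat(rows, k):
    """Histogram binning with k equal-width bins: one output row per bin."""
    values = [v for v, _ in rows]
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        width = 1.0  # all values equal: everything lands in the first bin
    bins = [[] for _ in range(k)]
    for v, ids in rows:
        i = min(int((v - lo) / width), k - 1)  # the maximum goes in the last bin
        bins[i].extend(ids)
    # Output value is the bin midpoint; the bin count is len(case_ids).
    return [(lo + (i + 0.5) * width, tuple(b)) for i, b in enumerate(bins)]

def mean_stat(rows):
    """Point summary: one output row carrying all case IDs."""
    all_ids = tuple(i for _, ids in rows for i in ids)
    return [(mean([v for v, _ in rows]), all_ids)]

def range_stat(rows):
    """Region: two output rows (lower and upper end)."""
    values = [v for v, _ in rows]
    all_ids = tuple(i for _, ids in rows for i in ids)
    return [(min(values), all_ids), (max(values), all_ids)]

rows = [(1.0, (1,)), (2.0, (2,)), (2.5, (3,)), (4.0, (4,)), (7.0, (5,)), (9.0, (6,))]
binned = bin_stat(rows, 3)

for name, out in [("identity", identity(rows)), ("bin", binned),
                  ("mean", mean_stat(rows)), ("range", range_stat(rows))]:
    print(name, len(out))
```

Every function has the same shape, rows in and rows out, so downstream components do not need to know which statistic produced the rows they receive.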

Understanding how the statistics component works reveals an important reason for mapping values to cases in a varset rather than the other way around. If

$\displaystyle \mathrm{A} = \left[\mathbb{R}, \{\langle \cdot \rangle, \langle \cdot, \cdot \rangle, \ldots \}, \{1.5\rightarrow \langle 1 \rangle, 2.7\rightarrow \langle 2 \rangle, 1.8\rightarrow \langle 3 \rangle\}\right]\,,$

then

$\displaystyle \mathop{\text{mean}}(\mathrm{A}) = \left[\mathbb{R}, \{\langle \cdot \rangle, \langle \cdot, \cdot \rangle, \ldots \}, \{2.0\rightarrow \langle 1, 2, 3 \rangle\}\right].$

Notice that the list of caseIDs that is produced by mean is contained in the one row of the output varset. We do not lose case information in this mapping, the way we do when we compute results from an ordinary SQL query on a database or when we compute a data cube for an OLAP or when we pre-summarize data to produce a simple graphic. This aspect of GOG is important for dynamic graphics systems that allow drill-down or queries regarding metadata when the user hovers over a particular graphic element.
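The contrast with an ordinary aggregate can be sketched as follows. The dictionary representation (value mapped to a tuple of caseIDs) and the function name `mean_varset` are assumptions of this sketch, not part of GOG. The values 2.7 and 1.8 come from the example above; the first value, 1.5, is an assumption chosen so that the mean comes out to 2.0.

```python
from statistics import mean

# Varset A modeled as value -> tuple of caseIDs (hypothetical representation).
A = {1.5: (1,), 2.7: (2,), 1.8: (3,)}

def mean_varset(varset):
    """Return a one-row varset: the mean mapped to every contributing caseID."""
    values, case_ids = [], []
    for value, ids in varset.items():
        values.extend([value] * len(ids))  # one copy of the value per case
        case_ids.extend(ids)
    return {mean(values): tuple(sorted(case_ids))}

summary = mean_varset(A)
((summary_value, summary_cases),) = summary.items()

# A plain aggregate, by contrast, returns only a number and drops the cases.
bare_mean = mean(A.keys())

# Drill-down: from the single summary row back to the rows behind it.
wanted = set(summary_cases)
drill_down = {v: ids for v, ids in A.items() if wanted & set(ids)}
```

Because the summary row still carries its caseIDs, a hover or drill-down on the graphic element for the mean can look up the original rows without re-querying the source data.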

Figure 11.8 shows an application of a statistical method to the city data. We linearly regress 2000 population on 1980 population to see if population growth is proportional to city size. On log-log scales, the estimated values fall on a line whose slope is greater than $1$, suggesting that larger cities grow faster than smaller ones. Ordinarily, we would draw a line to represent the regression and we would include the data points as well. We would also note that Lagos grew at an unusual rate (with a Studentized residual of 3.4). Nevertheless, our main point is to show that the statistical regression produces data points that are exchangeable with the raw data insofar as the entire GOG system is concerned. How we choose to represent the regressed values graphically is the subject of the next section.
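A rough sketch of what such an estimate step might compute is given below. It fits $\log_{10} y = a + b \log_{10} x$ by ordinary least squares and returns one fitted value per input row, so the output has the same shape as the raw data. The function name `estimate_loglog` is hypothetical, and the populations are synthetic numbers generated from a power law; they are not the city data used in the figure. The reading of the slope follows from the model: $y/x = 10^{a} x^{b-1}$, so the growth ratio increases with $x$ exactly when $b$ exceeds one.

```python
import math

def estimate_loglog(x, y):
    """Least-squares fit of log10(y) on log10(x).

    Returns (slope, intercept, fitted), where fitted holds one estimated
    y value per input x, back-transformed to the original scale.
    """
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = my - slope * mx
    fitted = [10 ** (intercept + slope * a) for a in lx]
    return slope, intercept, fitted

# Synthetic, noise-free populations (assumption: pop2000 = 1.5 * pop1980 ** 1.05).
pop1980 = [2.0e5, 5.0e5, 1.0e6, 3.0e6, 8.0e6]
pop2000 = [1.5 * p ** 1.05 for p in pop1980]

slope, intercept, fitted = estimate_loglog(pop1980, pop2000)

# The estimates pair up with pop1980 just as the raw pop2000 values do.
estimated_rows = list(zip(pop1980, fitted))
growth_ratio = [f / p for p, f in estimated_rows]  # rises with city size when slope > 1
```

The rows in `estimated_rows` could be handed to any geometry in place of the raw `(pop1980, pop2000)` pairs, which is the sense in which the regressed values are exchangeable with the data.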

**Figure 11.8:** `pop1980` $\ast$ estimate(`pop2000`), *xlog*, *ylog*
