11.10 Analytics

If conclusions based on statistical charts are to be useful, we must identify and interpret the statistical models underlying charts. A statistical model determines how the location of a representation element (point, line, ...) in a frame (a measurable region of representation) is computed from the values of a variable. Statistical models usually (but not necessarily) incorporate error terms and help us to articulate the domains of generalizations and inferences we make from examining a chart. [13] summarize these issues in a data mining context. Because chart algebra is based on statistical design algebras, it can be used to generate statistical models for visual data mining or predictive analytics.

This section presents the statistical model equivalents of chart algebra expressions. In each subsection, we show the chart algebra notation on the left of each equivalence expression and the statistical model notation on the right. The terms on the left comprise varsets and the terms on the right comprise variables. Note that some symbols (e.g., +) are common to both notations but have different meanings. The general linear statistical models presented in this section are due to [10,11]. More recent introductions to the design notation used for statistical models are [18] and [21].

11.10.1 Statistical Model Equivalents

In the following subsections, we assume a functional model $Z = f(X,Y)$, where $Z$ is a (possibly multivariate) variable. $Z$ corresponds to a varset Z, which itself might be produced from a chart algebra expression. In statistical terms, we sometimes call $Z$ a dependent variable and $X$ and $Y$ independent variables. In this section, we ignore $Z$ and focus on expressions involving $X$ and $Y$. These expressions are used to construct statistical models that help to predict or estimate $Z$.


11.10.1.1 Cross

$\mathrm{X} \ast \mathrm{Y} \sim \mathcal{C} + X + Y + XY$

The cross operator corresponds to a fully factorial experimental design specification. This design employs a product set that includes every combination of levels of a set of experimental factors or treatments. The terms on the right of the similarity represent the linear model for fitting fully factorial designs. The terms in the model are:

$\mathcal{C}$ : constant term (grand mean)
$X$ : levels of factor $\mathrm{X}$ ($\mathrm{X}$ main effect)
$Y$ : levels of factor $\mathrm{Y}$ ($\mathrm{Y}$ main effect)
$XY$ : product of factors $\mathrm{X}$ and $\mathrm{Y}$ (interactions)

We could write the variables on the right in boldface because, in its most general form, the model includes factors (multidimensional categorical variables) having more than one level. These multivariate terms consist of sets of binary categorical variables whose values denote the presence or absence of each level of the factor. Terms based on continuous variables, by contrast, are called covariates.

An example of a two-way factorial design would be the basis for a study of how teaching method and class size affect the job satisfaction of teachers. In such a design, each teaching method (factor $ \mathrm{X}$) is paired with each class size (factor $ \mathrm{Y}$) and teachers and students in a school are randomly assigned to the combinations.
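As an illustration only (the chapter prescribes no particular software), the fully factorial model above can be fit with any linear model routine that accepts a design formula. The sketch below uses Python's statsmodels formula interface, whose * operator expands to main effects plus their interaction; the data and the column names method, size, and satisfaction are hypothetical.

# Minimal sketch, assuming Python with pandas and statsmodels installed.
# Data are invented; column names method, size, satisfaction are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "method":       ["lecture", "lecture", "seminar", "seminar"] * 5,
    "size":         ["small", "large", "small", "large"] * 5,
    "satisfaction": [7, 5, 8, 6, 6, 5, 9, 6, 7, 4,
                     8, 7, 6, 5, 9, 5, 7, 6, 8, 6],
})

# 'C(method) * C(size)' expands to C + method + size + method:size,
# the linear model for the fully factorial (cross) design.
fit = smf.ols("satisfaction ~ C(method) * C(size)", data=df).fit()
print(fit.summary())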

11.10.1.2 Nest

$\mathrm{X} / \mathrm{Y} \sim \mathcal{C} + Y + X(Y)$

The terms on the right of the similarity are:

$\mathcal{C}$ : constant term
$Y$ : levels of factor $\mathrm{Y}$ ($\mathrm{Y}$ main effect)
$X(Y)$ : $\mathrm{X}$ levels nested within levels of $\mathrm{Y}$

The term $X(Y)$ represents the series $X \mid (Y=Y_{1}) + X \mid (Y=Y_{2}) + \ldots$

Notice that there is no interaction term involving $ X$ and $ Y$ because $ X$ is nested within $ Y$. Not all combinations of the levels of $ X$ and $ Y$ are defined. An example of a nested design would be the basis for a study of the effectiveness of different teachers and schools in raising reading scores. Teachers are nested within schools when no teacher in the study can teach at more than one school. With nesting, two teachers with the same name in different schools are different people. With crossing, two teachers with the same name in different schools may be the same person.
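A hedged sketch of the nested model in the same formula style follows; the school, teacher, and score columns are hypothetical. The nested term X(Y) is written out explicitly as a school main effect plus teacher effects within school (many formula languages also provide a / shorthand for this expansion).

# Minimal sketch of the nested model C + Y + X(Y); data and names are invented.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "school":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "teacher": ["t1", "t1", "t2", "t2", "t1", "t1", "t2", "t2"],
    "score":   [52, 55, 61, 58, 47, 50, 66, 63],
})

# C(school) + C(school):C(teacher): a school main effect plus teacher effects
# nested within school. Teacher "t1" in school A and "t1" in school B are
# treated as different people, as the nest operator requires.
fit = smf.ols("score ~ C(school) + C(school):C(teacher)", data=df).fit()
print(fit.params)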

11.10.1.3 Blend

$\mathrm{X} + \mathrm{Y} \sim \mathcal{C} + F_{XY}$

The terms on the right of the similarity are:

$\mathcal{C}$ : constant term
$F_{XY}$ : function of $X$ and $Y$ (e.g., $X - Y$)

The blend operator usually corresponds to a time series design. In such a design, we predict using functions of a time series. When the blend involves dependent variables, this is often called a repeated measures design. The simplest case is a prediction based on first differences of a series. Time is not the only possible dimension for ordering variables, of course. Other multivariate functional models can be used to analyze the results of blends [31].

An example of a repeated measures design would be the basis for a study of improvement in reading scores under different curricula. Students are randomly assigned to curriculum and the comparison of interest involves differences between pre-test and post-test reading scores.
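The simplest case mentioned above, a prediction based on first differences, can be sketched as follows; the curriculum, pre, and post columns are hypothetical, and the code only illustrates analyzing one possible function of the blended scores (here X - Y).

# Minimal sketch: analyze first differences (post - pre) across curricula.
# Data and column names are invented for illustration.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "curriculum": ["phonics"] * 4 + ["whole_language"] * 4,
    "pre":        [61, 55, 70, 58, 60, 57, 68, 62],
    "post":       [72, 60, 78, 66, 63, 59, 71, 64],
})

# F_XY is taken here to be the difference function X - Y applied to the blend.
df["gain"] = df["post"] - df["pre"]

# Model the gain as a constant plus a curriculum effect.
fit = smf.ols("gain ~ C(curriculum)", data=df).fit()
print(fit.params)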

It would appear that analytics have little to do with the process of building a chart. If visualization is at the end of a data-flow pipeline, then statistics is simply a form of pre-processing. In our model, however, analytics are an intrinsic part of chart construction. Through chart algebra, the structure of a graph implies a statistical model. Given this model, we can employ likelihood, information, or goodness-of-fit measures to identify parsimonious models. We will explore some graphic uses of statistical models in this section.
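For instance, an information measure such as AIC can be used to rank candidate models implied by a chart algebra expression. The sketch below is illustrative only; the response and factor names (z, x, y) are hypothetical and the data are synthetic.

# Minimal sketch: compare candidate submodels with AIC (synthetic data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({"x": rng.choice(["a", "b"], size=80),
                   "y": rng.choice(["p", "q"], size=80)})
df["z"] = rng.normal(size=80) + (df["x"] == "a") * 1.5   # only x matters here

for formula in ["z ~ C(x)", "z ~ C(y)", "z ~ C(x) + C(y)", "z ~ C(x) * C(y)"]:
    fit = smf.ols(formula, data=df).fit()
    print(f"{formula:20s} AIC = {fit.aic:8.2f}")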

11.10.2 Subset Model Fitting

The factorial structure of most chart algebra expressions can produce rather complex models. We need to consider strategies for selecting subset models that are adequate fits to the data. We will discuss one simple approach in this section. This approach involves eliminating interactions (products of factors) in factorial designs.

Interactions are often regarded as nuisances because they are difficult to interpret. Comprehending interactions requires thinking about partial derivatives. A three-way interaction $XYZ$, for example, means that the relation between $X$ and $Y$ depends on the level of $Z$. And the relation between $X$ and $Z$ depends on the level of $Y$. And the relation between $Y$ and $Z$ depends on the level of $X$. Without any interaction, we can speak about these simple relations unconditionally. Thus, one strategy for fitting useful subset models is to search for subsets with as few interactions as possible. In this search, we require that any variable appearing in an interaction also be present as a main effect in the model.

Figure 11.13 shows a submodel tree for the three-way crossing $ X \ast Y \ast Z$. Not all possible submodels are included in the tree, because the convention in modeling factorial designs is to include main effects for every variable appearing in an interaction. This reduces the search space for plausible submodels. By using branch-and-bound methods, we can reduce the search even further. [27] and [22] cover this area in more detail.

Figure 11.13: Model subset tree
\includegraphics{text/2-11/models.eps}
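To make the pruning concrete, the sketch below enumerates the hierarchical submodels of the X*Y*Z crossing by brute force, keeping only models in which every margin of an interaction is present. It merely illustrates the convention; it is not the branch-and-bound search cited above.

# Minimal sketch: enumerate hierarchical submodels of X*Y*Z by brute force.
from itertools import combinations

terms = ["X", "Y", "Z", "XY", "XZ", "YZ", "XYZ"]

def margins(term):
    # All proper, non-empty sub-terms of an interaction, e.g. "XY" -> {"X", "Y"}.
    return {"".join(c) for r in range(1, len(term))
            for c in combinations(term, r)}

def hierarchical(model):
    model_set = set(model)
    return all(margins(t) <= model_set for t in model if len(t) > 1)

submodels = [m for r in range(1, len(terms) + 1)
             for m in combinations(terms, r) if hierarchical(m)]

for m in submodels:
    print(" + ".join(m))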

11.10.3 Lack of Fit

Statistical modeling and data mining focus on regularity: averages, summaries, smooths, and rules that account for the significant variation in a dataset. Often, however, the main interest of statistical graphics is in locating aspects that are discrepant, surprising, or unusual: under-performing sales people, aberrant voting precincts, defective parts.

An outlier is a case whose data value and fitted value (using some model) are highly discrepant relative to the other data-fitted discrepancies in the dataset [4]. Casewise discrepancies are called residuals by statisticians. Outliers can be flagged in a display by highlighting (e.g., coloring) large residuals in the frame. Outliers are only one of several indicators of a poorly fitting model, however. Relative badness-of-fit can occur in one or more cells of a table, for example. We can use subset modeling to highlight such cells in a display. [33] do this for log-linear models. Also, we can use autocorrelation and cross-correlation diagnostic methods to identify dependencies in the residuals and highlight areas of the display where these occur.
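A minimal sketch of residual-based flagging follows, using an ordinary least-squares fit and an arbitrary cutoff of 2.5 standardized residuals; the data are synthetic and the cutoff is a choice for illustration, not a recommendation.

# Minimal sketch: flag cases with large standardized residuals (synthetic data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": np.arange(50, dtype=float)})
df["y"] = 2.0 * df["x"] + rng.normal(scale=3.0, size=50)
df.loc[10, "y"] += 40                     # plant one discrepant case

fit = smf.ols("y ~ x", data=df).fit()
standardized = (fit.resid - fit.resid.mean()) / fit.resid.std()

# Cases to highlight (e.g., color) in the frame.
df["outlier"] = standardized.abs() > 2.5
print(df[df["outlier"]])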

11.10.4 Scalability

Subset design modeling is most suited for deep and narrow (many rows, few columns) data tables or low-dimensional data cubes. Other data mining methods are designed for wide data tables or high-dimensional cubes ([15,17]). Subset design modeling makes sense for visualization applications because the design space in these applications does not tend to be high-dimensional. Visual data exploration works best in a few dimensions. Higher-dimensional applications work best under the guidance of other data mining algorithms.

Estimating design models requires $O(n)$ computation with respect to cases, because only one pass through the cases is needed to compute the statistics for estimating the model. Although computing design models can be worst-case $O(p^{2})$ in the number of dimensions, sparse matrix methods can be used to reduce this overhead because many of the covariance terms are usually zero.
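The one-pass property can be sketched as follows: accumulate the cross-product statistics X'X and X'y while streaming over the cases, then solve the normal equations. The chunked iteration and dimensions below are hypothetical stand-ins for a real data source.

# Minimal sketch: one pass over the cases accumulates X'X and X'y.
import numpy as np

def fit_one_pass(chunks, p):
    xtx = np.zeros((p, p))
    xty = np.zeros(p)
    for X, y in chunks:                   # each chunk: design-matrix block, response block
        xtx += X.T @ X
        xty += X.T @ y
    return np.linalg.solve(xtx, xty)      # least-squares coefficients

# Hypothetical usage with two chunks of a three-column design matrix.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=100)
print(fit_one_pass([(X[:50], y[:50]), (X[50:], y[50:])], p=3))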

11.10.5 An Example

Smoothing data reveals systematic structure. Tukey [39] used the word "smooth" in a specific sense, pairing the two equations

$\mathit{data} = \mathit{fit} + \mathit{residual}$
$\mathit{data} = \mathit{smooth} + \mathit{rough}$

Tukey's use of the word is different from other mathematical meanings, such as functions having many derivatives.
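As a toy illustration of the decomposition (not Tukey's own smoother), a running median splits a synthetic series into a smooth and a rough component; the window width is an arbitrary choice.

# Minimal sketch: data = smooth + rough, using a running median (window 5).
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
data = pd.Series(np.sin(np.linspace(0, 6, 60)) + rng.normal(scale=0.2, size=60))

smooth = data.rolling(window=5, center=True, min_periods=1).median()
rough = data - smooth                     # what the smoother leaves behind

print(pd.DataFrame({"data": data, "smooth": smooth, "rough": rough}).head())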

Figure 11.14: Crash data
\includegraphics{text/2-11/crash1.eps}

We smooth data in graphics to highlight selected patterns in order to make inferences. We present an example involving injury to the heads of dummies in government frontal crash tests. Figure 11.14 shows NHTSA crash test results for selected vehicles tested before 1999. The dependent variable shown on the horizontal axis of the chart is the Head Injury Index computed by the agency. The full model is generated by the chart algebra expression $H \ast T/(M \ast V) \ast O$. This expression corresponds to the model:

$H = \mathcal{C} + M + V + O + T(MV) + MV + MO + VO + OT(MV) + MVO$

where the symbols are:

$H$ : Head Injury Index
$\mathcal{C}$ : constant term (grand mean)
$M$ : Manufacturer
$V$ : Vehicle (car/truck)
$O$ : Occupant (driver/passenger)
$T$ : Model

This display is difficult to interpret. We need to fit a model and order the display to reveal the results of the model fit. Figure 11.15 charts fitted values from the following subset model:

$H = \mathcal{C} + V + O + T(MV)$

Figure 11.15 has several notable features. First, the models are sorted according to the estimated Head Injury Index. This makes it easier to compare different cells. Second, some values have been estimated for vehicles with missing values (e.g., GM G-20 driver data). Third, the trends are smoother than the raw data. This is the result of fitting a subset model. We conclude that passengers receive more head injuries than drivers, occupants of trucks and SUVs (sports utility vehicles) receive more head injuries than occupants of cars, and occupants of some models receive more injuries than occupants of others.
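In formula terms, the subset model can be sketched as below. The rows are synthetic placeholders, not the NHTSA data, and the column names (manuf, vehicle, model, occupant, head) are hypothetical; the nested term T(MV) is encoded as an effect for each model label within each manufacturer-vehicle combination.

# Minimal sketch: fit H = C + V + O + T(MV) with a formula interface.
# Synthetic placeholder data; names are hypothetical, not the NHTSA fields.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
rows = []
for rep in range(2):
    for manuf in ["GM", "Ford", "Honda"]:
        for vehicle in ["car", "truck"]:
            for model in ["m1", "m2"]:
                for occupant in ["driver", "passenger"]:
                    rows.append((manuf, vehicle, model, occupant,
                                 500 + rng.normal(scale=100)))
crash = pd.DataFrame(rows, columns=["manuf", "vehicle", "model",
                                    "occupant", "head"])

# T(MV): model effects nested within manufacturer-vehicle combinations.
fit = smf.ols("head ~ C(vehicle) + C(occupant)"
              " + C(manuf):C(vehicle):C(model)", data=crash).fit()

# Fitted values like these, sorted by estimated Head Injury Index,
# are what a display such as Figure 11.15 charts.
crash["fitted"] = fit.fittedvalues
print(crash.sort_values("fitted").head())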

Figure 11.15: Smoothed crash data
\includegraphics{text/2-11/crash2.eps}

