Next: 11.3 Algebra Up: 11. The Grammar of Previous: 11.1 Introduction

11.2 Variables

We begin with data. We assume the data that we wish to graph are organized in one or more tables. The column(s) of each table represent a set of fields, each field containing a set of measurements or attributes. The row(s) of this table represent a set of logical records, each record containing the measurements of an object on each field. Usually, a relational database management system (RDBMS) produces such a table from organized queries specified in Structured Query Language (SQL) or another relational language. When we do not have data stored in a relational database (e.g., live data feeds), we need custom software to provide such a table using Extensible Markup Language (XML), Perl scripts, or other languages.

The first thing we have to do is convert such tables of data to something called a varset. A varset is a set of one or more variables. While a column of a table of data might superficially be considered to be a variable, there are differences. A variable is both more general (in regard to generalizability across samples) and more specific (in regard to data typing and other constraints) than a column of data. First, we define a variable, then a varset.

11.2.1 Variable

A statistical variable is a mapping $f: O \rightarrow V$ , which we consider as a triple:

$\displaystyle X = \left[O, V, f\right]$

The domain

is a set of objects.
The codomain

is a set of values.
The function

assigns to each element of

an element in

The image of under contains the values of . We denote a possible value as , where $x\in V$ . We denote a value of an object as , where $o\in O$ . A variable is continuous if is an interval. A variable is categorical if is a finite subset of the integers (or there exists an injective map from to a finite subset of the integers).

Variables may be multidimensional. $\textbf{\textit{X}}$ is a -dimensional variable made up of one-dimensional variables:

$\displaystyle \boldsymbol{X}$	$\displaystyle = (X_{1}, \ldots, X_{p})$
	$\displaystyle = [{O},{V}_{i}, {f} \ ]\,, \quad i = 1, \ldots, p$
	$\displaystyle = [{O},\boldsymbol{V}, {f}\ ]$

The element $\boldsymbol{x} = (x_{1}, \ldots, x_{p})$ , $\boldsymbol{x} \in \boldsymbol{V}$ , is a

-dimensional value of $\boldsymbol {X}$ . We use multidimensional variables in multivariate analysis.

A random variable is a real-valued function defined on a sample space $\Omega$ :

$\displaystyle \nonumber X = [\Omega, \mathbb{R}, f]$

Random variables may be multidimensional. In the elementary probability model, each element $\omega\in\Omega$ is associated with a probability function

. The range of

is the interval $\left[0, 1\right]$ , and $P(\Omega)=1$ . Because of the associated probability model, we can make probability statements about outcomes in the range of the random variable, such as:

$\displaystyle P(X=2) = P({\omega:\omega\in\Omega, X(\omega) =2})$

11.2.2 Varset

We call the triple

$\displaystyle \mathrm{X} = [V, \widetilde{O}, f]$

a varset. The word varset stands for variable set. If X is multidimensional, we use boldface $\boldsymbol {X}$ . A varset inverts the mapping used for variables. That is,

The domain is a set of values.
The codomain $\widetilde{O}$ is a set of all possible ordered lists of objects.
The function assigns to each element of an element in $\widetilde{O}$ .

We invert the mapping customarily used for variables in order to simplify the definitions of graphics algebra operations on varsets. In doing so, we also replace the variable's set of objects with the varset's set of ordered lists. We use lists in the codomain because it is possible for a value to be mapped to an object more than once (as in repeated measurements).

11.2.3 Converting a Table of Data to a Varset

To convert a table to a varset, we must define the varset's domain of values and range of objects and specify a reference function that maps each row in the table to an element in the varset.

We define the domain of values in each varset by identifying the measurements its values represent. If our measurements include weight, for example, we need to record the measurement units and the interval covered by the range of weights. If the domain is categorical, we need to decide whether there are categories not found in the data that should be included in the domain (refused to reply, don't know, missing, $\ldots$ ). And we may need to identify categories found in the data that are not defined in the domain (mistakes, intransitivities, $\ldots$ ). The varset's domain is a key component in the GOG system. It is used for axes, legends, and other components of a graphic. Because the actual data for a chart are only an instance of what we expect to see in a varset's domain, we let the domain control the structure of the chart.

We define the range of potential objects in each varset by identifying the class they represent. If the rows of our table represent measurements of a group of school children, for example, we may define the range to be school children, children in general, people, or (at the most abstract level) objects. Our decision about the level of generality may affect how a graphic will be titled, how legends are designed, and so on.

Finally, we devise a reference system by indexing objects in the domain. This is usually as simple as deriving a caseID from a table row index. Or, as is frequently done, we may choose the value of a key variable (e.g., Social Security Number) to create a unique index.

Next: 11.3 Algebra Up: 11. The Grammar of Previous: 11.1 Introduction