3.3 Multivariate Graphics


plot (x1 {, x2 {, ... {x5}}})
plots the three-dimensional data sets x1, ..., x5
gr = grsurface (x {, col})
generates a surface from the function $ f(x,y)$
gr = grcontour2 (x {, c {,col}})
generates the contour lines $ f(x,y)=c$
gr = grcontour3 (x {, c {,col}})
generates the contour lines $ f(x,y,z)=c$
gr = grsunflower (x {, d {, o {,col}}})
generates a sunflower plot
gr = grlinreg (x {,col})
generates the linear regression line
gr = grlinreg2 (x {, n {,col}})
generates the linear regression plane
plot2 (x {, prep {,col}})
plots two variables
plotstar (x {, prep {,col}})
plots a star diagram
plotscml (x {, varnames})
plots a scatter-plot matrix
plotandrews (x {, prep {,col}})
plots Andrews curves
plotpcp (x {, prep {,col}})
plots parallel coordinates

The optional parameter col allows us to produce a graphical object in a color other than black. For details, see Subsection 3.4.3.


3.3.1 Three-Dimensional Plots

Up to now, we have generated graphical objects or data sets and plotted them in a two-dimensional plot. Sometimes the analysis of data is easier if we can show them as three-dimensional data. Rotating the data point cloud gives us more insight into the data. We can use the plot quantlet for this.

  library("plot")             ; loads library plot
  data = read ("bostonh")     ; reads Boston Housing data
  x = data[,6|13|14]          ; selects columns 6, 13 and 14
  plot(x)                     ; plots the data set
XLGgraph41.xpl

Figure: Variables 6 (RM), 13 (LSTAT) and 14 (MEDV) of the Boston Housing data plotted in a 3D scatter plot.
\includegraphics[scale=0.425]{grfig41}

Now click with the mouse in the window and use the cursor keys to rotate the data set. Note that the only change we made for three-dimensional plotting is that the input matrix we pass to plot consists of three column vectors. Here we can also apply the quantlet setmask to color the data points.

We see in the plot of RM, LSTAT and MEDV a nonlinear relationship which may allow us to estimate the median house prices (MEDV) as a parametric function of percentage of lower status people (LSTAT) and average number of rooms (RM).
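To make concrete what rotating the point cloud means, here is a minimal Python sketch (not XploRe code). It rotates three-dimensional points about the vertical axis and then drops the depth coordinate, which is one simple way to map a 3D scatter plot to the screen. The function names rotate_y and project and the sample values are assumptions made for illustration; how XploRe itself performs the projection is not described here.

```python
import math

def rotate_y(point, angle):
    """Rotate a 3D point (x, y, z) about the vertical y-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def project(point):
    """Orthographic projection: keep the two screen coordinates, drop the depth."""
    x, y, _ = point
    return (x, y)

# Three hypothetical observations (values chosen only for illustration).
cloud = [(1.0, 2.0, 0.0), (0.0, 1.0, 1.0), (-1.0, 0.5, 2.0)]

# Screen coordinates of the cloud after a quarter turn.
view = [project(rotate_y(p, math.pi / 2)) for p in cloud]
print(view)
```

A rotation leaves all distances between points unchanged, so patterns that appear while turning the cloud come from the data and not from the viewing transformation.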


3.3.2 Surface Plots

Surfaces are the three-dimensional analogs of curves in two dimensions. Since we have already plotted three-dimensional data, we can imagine what we have to do: generate a data set, generate some lines, and plot them. The quantlet grsurface does this for data on a rectangular mesh.

  library ("plot")               ; loads library plot
  x0 = #(-3, -3)              
  h  = #(0.2, 0.2)
  n  = #(31, 31)
  x  = grid(x0, h, n)            ; generates a bivariate grid
  f  = exp(-(x[,1]^2+x[,2]^2)/1.5)/(1.5*pi)
                                 ; computes density of an uncorrelated
                                 ;   bivariate normal, variance 0.75
  gr = grsurface(x~f)            ; generates surface
  plot(gr)                       ; plots the surface
XLGgraph42.xpl

Figure: Surface plot of the density of a bivariate normal distribution.
\includegraphics[scale=0.425]{grfig42}

Most of the program above is used to generate the underlying grid and the density function. We may plot the data set itself by plot(x$ \sim$f). The surface quantlet grsurface needs a matrix with three columns: the $ x$- and the $ y$-coordinates of the grid and the function values $ f(x_i,y_j)$.
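The grid and the function values can be reproduced outside XploRe to check what is being plotted. The following Python sketch uses a hypothetical stand-in grid() for the XploRe function of the same name; the order in which XploRe returns the grid points is an assumption. The formula in the program equals the density of two independent normal variables with variance 0.75 each, so a Riemann sum over the mesh should be close to 1.

```python
import math

def grid(x0, h, n):
    """All mesh points x0 + (i*h[0], j*h[1]) as (x, y) pairs (assumed ordering)."""
    return [(x0[0] + i * h[0], x0[1] + j * h[1])
            for i in range(n[0]) for j in range(n[1])]

def density(x, y):
    """The function used in the surface program."""
    return math.exp(-(x * x + y * y) / 1.5) / (1.5 * math.pi)

h = (0.2, 0.2)
points = grid((-3.0, -3.0), h, (31, 31))
values = [density(x, y) for x, y in points]

# Riemann sum over the mesh: close to 1, as expected for a density.
total = sum(values) * h[0] * h[1]
print(len(points), round(total, 3))
```

The three columns handed to grsurface correspond to the x-coordinates, the y-coordinates and the entries of values in this sketch.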


3.3.3 Contour Plots

Surfaces can be difficult to understand, even if we can rotate them. As an alternative, we can produce contour plots, which show the contour lines $ f(x,y) = $ constant.

  library ("plot")         ; loads the library plot
  x0 = #(-3, -3)              
  h  = #(0.2, 0.2)
  n  = #(31, 31)
  x  = grid(x0, h, n)      ; generates a bivariate grid
  f  = exp(-(x[,1]^2+x[,2]^2)/1.5)/(1.5*pi)
                           ; computes density of an uncorrelated
                           ;    bivariate normal, variance 0.75
  c  = 0.2*(1:4).*max(f)   ; selects contour lines at 20%,...,80% 
                           ;    of the maximum density
  gr = grcontour2(x~f, c)  ; generates contours
  plot(gr)                 ; plots the contours
XLGgraph43.xpl

Figure: Contour plot of the density of a bivariate normal distribution.
\includegraphics[scale=0.425]{grfig43}
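For this particular function the contour lines can be worked out by hand, which gives a quick check of the plot. The Python sketch below recomputes the levels 0.2*(1:4) times the maximum and solves $ f(x,y)=c$ for the radius of the corresponding circle. The helper name contour_radius is an assumption for illustration, and the sketch takes the maximum of the function at the origin, which is a grid point up to rounding.

```python
import math

# Maximum of exp(-(x^2+y^2)/1.5)/(1.5*pi), attained at (0, 0).
f_max = 1.0 / (1.5 * math.pi)

# Mirrors c = 0.2*(1:4).*max(f): 20%, 40%, 60% and 80% of the maximum.
levels = [0.2 * k * f_max for k in range(1, 5)]

def contour_radius(c):
    """Radius r that solves exp(-r^2/1.5)/(1.5*pi) = c."""
    return math.sqrt(-1.5 * math.log(c * 1.5 * math.pi))

for c in levels:
    print(round(c / f_max, 1), round(contour_radius(c), 3))
```

Because the function depends on $ x$ and $ y$ only through $ x^2+y^2$, each contour line is a circle around the origin, and higher levels give smaller circles.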

The quantlet grcontour3 can be used to compute contour lines for a function $ f(x,y,z) = $ constant. This gives a two-dimensional surface in a three-dimensional space. An example is the XploRe logo.


3.3.4 Sunflower Plots

Sunflower plots avoid overplotting when we have many data points. We can consider a sunflower plot as a combination of a two-dimensional histogram with a contour plot. As with histograms, we must define a two-dimensional binwidth d and a two-dimensional origin o.

Let's compare a scatter plot of $ 1000$ data points from a bivariate normal distribution with the equivalent sunflower plot.

  library ("graphic")          ; loads library graphic
  x  = normal(1000, 2)         ; generates bivariate normal data 
  d  = createdisplay(2,1)      ; creates a display with two plots
  show (d, 1, 1, x)            ; plots the original data
  gr = grsunflower(x)          ; generates sunflower plot
  show (d, 2, 1, gr)           ; plots the sunflower plot
XLGgraph44.xpl

Figure: Standard 2D plot and sunflower plot of a large random sample of a bivariate standard normal distribution.
\includegraphics[scale=0.425]{grfig44}
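The role of the binwidth d and the origin o can be illustrated with a two-dimensional histogram count. The Python sketch below assumes rectangular bins; the bin shape that grsunflower actually uses is not stated here, and the function name bin_counts and the chosen binwidth are illustrative assumptions.

```python
import math
import random
from collections import Counter

def bin_counts(points, d, o):
    """Count points per rectangular bin with binwidth d = (dx, dy) and origin o."""
    counts = Counter()
    for x, y in points:
        i = math.floor((x - o[0]) / d[0])   # bin index along the first axis
        j = math.floor((y - o[1]) / d[1])   # bin index along the second axis
        counts[(i, j)] += 1
    return counts

random.seed(0)  # fixed seed so that the sketch is reproducible
sample = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]

counts = bin_counts(sample, d=(0.5, 0.5), o=(0.0, 0.0))
print(len(counts), max(counts.values()))
```

A sunflower plot then draws one symbol per occupied bin and encodes the count in the symbol, so 1000 overlapping points reduce to a much smaller number of readable symbols.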


3.3.5 Linear Regression

An important statistical task is to find a relationship between two variables. The most frequently used technique to quantify the relationship is the least squares linear regression:

$\displaystyle \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2 \rightarrow \textrm{ minimal.} $

To understand (graphically) how well the linear regression picks up the true relationship between two variables, we show the data points and the regression line in one plot:

  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  x0 = data[,13:14]              ; selects columns 13 and 14
  l0 = grlinreg(x0)              ; generates regression line
  x1 = log(data[,13:14])         ; logarithm of columns 13 and 14
  l1 = grlinreg(x1)              ; generates regression line
  d  = createdisplay(1,2)        ; creates display with two plots
  show (d, 1, 1, x0, l0)         ; plots data and regression l0
  show (d, 1, 2, x1, l1)         ; plots data and regression l1
XLGgraph45.xpl

Figure: Linear regressions of the 14th variable (MEDV) on the 13th variable (LSTAT), left: original variables, right: both variables transformed by logarithm.
\includegraphics[scale=0.425]{grfig45}

We see that the regression line (median house price = $ b_0$ + $ b_1$ percentage of lower status people) does not fit the data well. Either a transformation of the data or a nonlinear regression technique seems to be useful. In our example we have taken logarithms of $ x$ and $ y$, thus our model is

$\displaystyle \textrm{median house price}
= \exp(b_0)\ \textrm{percentage lower status people}^{b_1} $

The transformation of the house price turns out to be especially useful, since it prevents the model from predicting negative house prices.
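The least squares criterion has a closed-form solution, and the log-log model can be fitted with the same formula after taking logarithms. The Python sketch below uses synthetic data that follow a power law exactly (it does not use the Boston Housing data); the function name linreg is an assumption, and the sketch is not meant to reproduce how grlinreg is implemented.

```python
import math

def linreg(x, y):
    """Least squares estimates (b0, b1) minimizing sum (y_i - b0 - b1*x_i)^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx                 # slope
    return my - b1 * mx, b1        # intercept, slope

# Synthetic data following y = 30 * x^(-0.5) exactly (illustration only).
x = [2.0, 5.0, 10.0, 20.0, 30.0]
y = [30.0 * xi ** -0.5 for xi in x]

# Fit log(y) = b0 + b1*log(x), then transform back: y = exp(b0) * x^b1.
b0, b1 = linreg([math.log(v) for v in x], [math.log(v) for v in y])
print(round(math.exp(b0), 6), round(b1, 6))
```

Since $ \exp(b_0)$ is positive for every $ b_0$, the fitted values of the back-transformed model are positive whenever $ x$ is positive.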

Of course we can choose other explanatory variables, e.g. nitric oxides concentration in parts per ten million (NOXSQ) and average number of rooms (RM). This leads to the following program, which draws the three-dimensional data set and the regression plane ( $ b_0 + b_1 x_1 + b_2 x_2$) with a $ 4\times4$ mesh.

  library("plot")           ; loads library plot
  data = read ("bostonh")   ; reads Boston Housing data
  x  = data[,5|6|14]        ; selects columns 5, 6 and 14
  p  = grlinreg2(x, 5|5)    ; generates regression plane 
                            ;    with 4x4 mesh
  plot(x, p)                ; plots data set and regression plane
XLGgraph46.xpl

Figure: Linear regression of the 14th variable (MEDV) on the 5th (NOXSQ) and 6th variables (RM).
\includegraphics[scale=0.425]{grfig46}


3.3.6 Bivariate Plots

We have already plotted two variables, the percentage of lower status people against the median house price. The quantlet plot2 provides a much more powerful way of plotting multivariate data sets. The following program shows the first two principal components (based on the correlation matrix) of the Boston Housing data set. Additionally, if the median house price is less than the mean of the median house price, then we color the observation green, otherwise we color the observation blue.

  library("plot")            ; loads library plot
  data = read ("bostonh")    ; reads Boston Housing data
  col  = grc.col.green-grc.col.blue
  col  = grc.col.blue+col*(data[,14]<mean(data[,14])) 
                             ; colors observations blue and green
  plot2 (data, grc.prep.pcacorr, col)    
                             ; plots two first principal axes
XLGgraph47.xpl

Figure: First two principal components based on the correlation matrix of the Boston Housing data.
\includegraphics[scale=0.425]{grfig47}

When we load the library graphic, the library installs an object grc which contains some often used graphical constants, e.g. grc.col.red for the color red. grc.prep.none makes no transformation on the data and plots the first two variables (here: per capita crime rate, CRIM, and proportion of residential land zoned for lots over 25000 square feet, ZN).

grc.prep.standard
standardizes the data,
grc.prep.zeroone
transforms the data to the interval $ [0,1]$ before plotting,
grc.prep.pcacov
takes the first two principal components based on the covariance instead of the first two variables,
grc.prep.pcacorr
takes the first two principal components based on the correlation instead of the first two variables and
grc.prep.sphere
spheres the data and then takes the first two components.

Since we assume a relation between all variables, we choose grc.prep.pcacorr and color the data points as described above. There seems to be an interesting (nonlinear) relationship for the house prices. A more complex example in Subsection 3.4.3 describes how the coloring works.
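The two lines that build col in the program rely on a small arithmetic trick: a comparison yields 1 or 0, so adding the difference of two color codes switches between them. The Python sketch below shows the same arithmetic. The integer codes for blue and green are hypothetical, since the actual values of grc.col.blue and grc.col.green are not given here, and the prices are illustrative.

```python
# Hypothetical integer color codes; the real values of grc.col.* are not given here.
BLUE, GREEN = 1, 2

def two_colors(values, base=BLUE, other=GREEN):
    """Return `other` where a value lies below the mean and `base` elsewhere."""
    mean = sum(values) / len(values)
    step = other - base                      # corresponds to grc.col.green-grc.col.blue
    # A comparison is True (1) or False (0), so base + step*flag picks one code.
    return [base + step * (v < mean) for v in values]

medv = [24.0, 21.6, 34.7, 33.4, 16.5]        # illustrative prices only
print(two_colors(medv))
```

The result is one color code per observation, which matches the role of the col argument passed to plot2 in the program.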


3.3.7 Star Diagrams

Star diagrams are used to visualize a multivariate data set. For each observation, we plot a star. Each axis of a star represents one variable. Obviously, we need to standardize the variables to a common interval, which is done internally by $ z_{i,j}=(x_{i,j}-\min_j)/(\max_j-\min_j)$. The values $ \min_j$ and $ \max_j$ depend on the method the user chooses. The default method is grc.prep.zeroone.

  library("plot")            ; loads library plot
  data = read ("bostonh")    ; reads Boston Housing data
  data = data[1:70,]         ; select first 70 observations
  col  = grc.col.green-grc.col.blue
  col  = grc.col.blue+col*(data[,14]<mean(data[,14])) 
                             ; colors observations blue and green
  plotstar (data, grc.prep.zeroone, col)
                             ; shows star diagram of the data
XLGgraph48.xpl

Figure: Star diagram of the Boston Housing data. Green stars represent observations with house prices below the average, blue stars represent observations with house prices above the average.
\includegraphics[scale=0.425]{grfig48}

We note immediately that we have several groups of similar looking data points. The Boston Housing data set is well-known for its outliers and subgroups in the data.
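The construction of a single star can be sketched in a few lines of Python. The sketch scales each variable with $ z_{i,j}=(x_{i,j}-\min_j)/(\max_j-\min_j)$, taking the column minimum and maximum, and places variable $ j$ on a ray with angle $ 2\pi j/p$. The arrangement of the rays and the function names are assumptions for illustration, not a description of how plotstar draws its stars, and the data values are made up.

```python
import math

def zeroone(data):
    """Scale every column to [0, 1] via (x - min) / (max - min).

    Assumes no column is constant, otherwise max - min would be zero.
    """
    cols = list(zip(*data))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) for v, l, h in zip(row, lo, hi)] for row in data]

def star(z_row):
    """Vertices of one star: variable j lies on the ray with angle 2*pi*j/p."""
    p = len(z_row)
    return [(z * math.cos(2 * math.pi * j / p), z * math.sin(2 * math.pi * j / p))
            for j, z in enumerate(z_row)]

data = [[1.0, 10.0, 5.0], [2.0, 30.0, 5.5], [3.0, 20.0, 7.0]]  # illustrative values
stars = [star(row) for row in zeroone(data)]   # one star per observation
print(stars[1])
```

Observations with similar values on all variables produce stars of similar shape, which is why groups of similar looking stars point to subgroups in the data.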


3.3.8 Scatter-Plot Matrices

Another way of analyzing multivariate data is the scatter-plot matrix. Here, we plot a set of scatter plots such that we see every possible combination of two variables. Obviously we should not put too many variables into a scatter-plot matrix, since our screen size is limited. In fact, plotscml limits the number of variables to eight.

  library("plot")                   ; loads library plot
  data = read ("bostonh")           ; reads Boston Housing data
  x = data[,5|6|13|14]              ; selects columns 5, 6, 13, 14
  names="NOXSQ"~"RM"~"LSTAT"~"MEDV" ; names of the variables
  plotscml (x, names)               ; shows scatter-plot matrix
XLGgraph49.xpl

We see a clear nonlinear relationship between RM and LSTAT as well as between LSTAT and MEDV.

Figure: Scatter-plot matrix of the 5th (NOXSQ), 6th (RM), 13th (LSTAT) and 14th variables (MEDV) of the Boston Housing data.
\includegraphics[scale=0.55]{grfig49}


3.3.9 Andrews Curves

The idea of Andrews curves is based on replacing each point by a curve such that some properties of the data points are transferred to properties of the curves. For example, it holds that

$\displaystyle \int_{-\pi}^{\pi} (f_i(t)-f_j(t))^2 dt = \pi \Vert x_i - x_j \Vert^2 $

with $ x_i$, $ x_j$ as data points and $ f_i$, $ f_j$ representing curves generated from the data points. We see that distant points in $ p$-space will generate quite different curves. The curve generation is defined by

$\displaystyle f_i(t) = \frac{x_{i,1}}{\sqrt{2}} + x_{i,2} \sin(t) + x_{i,3} \cos(t) + x_{i,4} \sin(2t) + x_{i,5} \cos(2t) + \ldots. $

Again, we need the variables to be on comparable levels. In our example, we choose grc.prep.pcacorr. We see that the curves are quite different, but they cross near $ t=2$. For a fixed $ t$, the value $ f_i(t)$ is the projection of the data point $ x_i$ onto a specific vector, here $ (1/\sqrt{2},\sin(2),\cos(2),...)^T$. Thus we conclude that one principal component of the correlation matrix is zero. In fact the fourth variable (CHAS), the Charles river index, has a rather small correlation with all other variables. Since the variable is not continuous, we should not include it in our picture.

  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  data = data[,1:3]~data[,5:14]  ; Boston Housing without the fourth column
  data = data[21:40,]            ; observations 21 to 40
  plotandrews (data, grc.prep.pcacorr)
                                 ; shows Andrews curves based on 
                                 ;    principal components of
                                 ; correlation matrix
XLGgraph4A.xpl

Figure: Andrews curves based on the principal components of 20 observations of the Boston Housing data.
\includegraphics[scale=0.425]{grfig410}

Note also that the order of the variables plays an important role. The last variable is represented with a rather high frequency. The human brain will not easily perceive two high-frequency curves as really distinct.
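The curve definition and the distance property can be checked numerically. The Python sketch below evaluates $ f_i(t)$ for two made-up data points with five variables and compares a Riemann sum of $ (f_i(t)-f_j(t))^2$ over $ [-\pi,\pi]$ with $ \pi$ times the squared Euclidean distance of the points. The function names and the data values are assumptions for illustration.

```python
import math

def andrews(x, t):
    """Andrews curve f(t) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + ..."""
    value = x[0] / math.sqrt(2)
    for k, xk in enumerate(x[1:], start=1):
        freq = (k + 1) // 2                  # frequencies 1, 1, 2, 2, 3, ...
        value += xk * (math.sin(freq * t) if k % 2 == 1 else math.cos(freq * t))
    return value

def squared_curve_distance(xi, xj, steps=2000):
    """Riemann sum of (f_i(t) - f_j(t))^2 over one period starting at -pi."""
    h = 2 * math.pi / steps
    return h * sum((andrews(xi, -math.pi + m * h) - andrews(xj, -math.pi + m * h)) ** 2
                   for m in range(steps))

xi = [1.0, 0.5, -2.0, 0.3, 1.2]              # two illustrative data points
xj = [0.0, 1.5, 1.0, -0.7, 0.2]
lhs = squared_curve_distance(xi, xj)
rhs = math.pi * sum((a - b) ** 2 for a, b in zip(xi, xj))
print(round(lhs, 6), round(rhs, 6))
```

Both numbers agree up to rounding error, because the functions $ 1/\sqrt{2}$, $ \sin(kt)$ and $ \cos(kt)$ are orthogonal on $ [-\pi,\pi]$ and each has squared norm $ \pi$.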


3.3.10 Parallel Coordinate Plots

Parallel coordinate plots take a completely different approach. Instead of insisting on orthogonality of the projections (e.g. in the scatter-plot matrix), we give it up. On the $ j$th parallel axis we plot the values $ x_{ij}$ of all data points. Then we connect the points on neighboring axes such that each line represents one data point. Certain relationships between the variables create specific patterns in the graphic. For example, a correlation of 1 between two variables is represented by parallel lines between the axes. A correlation of -1 results in all lines crossing in one point midway between the two axes.

  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  data = data[21:40,]            ; observations 21 to 40
  x = data[,6|13|14]             ; selects columns 6, 13 and 14
  plotpcp (x, grc.prep.standard) ; shows parallel coordinate plot
XLGgraph4B.xpl

Figure: Parallel coordinate plot of 20 observations of the standardized 6th (RM), 13th (LSTAT) and 14th variables (MEDV) of the Boston Housing data.
\includegraphics[scale=0.425]{grfig411}

The Boston Housing data show some kind of negative correlation between RM and LSTAT and between LSTAT and MEDV.
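The two patterns mentioned above, parallel lines for correlation 1 and a single crossing point for correlation -1, follow directly from the standardization. The Python sketch below demonstrates this with made-up values. The function names are assumptions, and it does not matter for this argument whether the standard deviation is computed with divisor $ n-1$ (as here) or $ n$.

```python
def standardize(column):
    """Center a column at zero and scale it to unit standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in column]

def segment_height(left, right, s):
    """Height at position s in [0, 1] of the line joining two axis values."""
    return (1 - s) * left + s * right

a = [1.0, 2.0, 4.0, 7.0]                     # illustrative values
za = standardize(a)
z_pos = standardize([2 * v + 3 for v in a])  # correlation +1 with a
z_neg = standardize([10 - v for v in a])     # correlation -1 with a

# Correlation +1: every segment has (numerically) zero slope, so all are parallel.
slopes = [r - l for l, r in zip(za, z_pos)]
# Correlation -1: every segment passes (numerically) through height 0 at s = 0.5.
midpoints = [segment_height(l, r, 0.5) for l, r in zip(za, z_neg)]
print(slopes, midpoints)
```

Real data such as RM, LSTAT and MEDV lie between these two extremes, so a negative correlation shows up as many, but not all, lines crossing between two neighboring axes.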