The optional parameter col allows us to produce a graphical object in a color other than black. For details, see Subsection 3.4.3.
Up to now, we have just generated graphical objects or data sets and plotted them in a two-dimensional plot. Sometimes the analysis of data is easier if we can show them as three-dimensional data. Rotating the data point cloud will give us more insight into the data. We can use the plot quantlet for this.
library("plot") ; loads library plot data = read ("bostonh") ; reads Boston Housing data x = data[,6|13|14] ; selects columns 6, 13 and 14 plot(x) ; plots the data set
Now click with the mouse in the window and use the cursor keys to rotate the data set. Note that the only change we made for three-dimensional plotting is that the input matrix we use in plot consists of three vectors.
We can apply the quantlet setmask here to color the data points.
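A minimal sketch of such a coloring is given below; the exact setmask syntax (string options with a color name) is an assumption here, see Subsection 3.4.3 for the details:

library("plot")         ; loads library plot
data = read("bostonh")  ; reads Boston Housing data
x = data[,6|13|14]      ; selects columns 6, 13 and 14
x = setmask(x, "red")   ; assumed call: marks all data points in red
plot(x)                 ; plots the colored data set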
We see in the plot of RM, LSTAT and MEDV a nonlinear relationship which may allow us to estimate the median house prices (MEDV) as a parametric function of percentage of lower status people (LSTAT) and average number of rooms (RM).
Surfaces are the three-dimensional analogs of curves in two dimensions. Since we have already plotted three-dimensional data, we can imagine what we have to do: generate a data set, generate some lines, and plot them. The quantlet grsurface does this for data on a rectangular mesh.
library ("plot") ; loads library plot x0 = #(-3, -3) h = #(0.2, 0.2) n = #(31, 31) x = grid(x0, h, n) ; generates a bivariate grid f = exp(-(x[,1]^2+x[,2]^2)/1.5)/(1.5*pi) ; computes density of bivariate ; normal with correlation 0.5 gr = grsurface(x~f) ; generates surface plot(gr) ; plots the surface
Most of the above program is used to generate the underlying grid and the density function. We may plot the data set itself with plot(x~f). The surface quantlet grsurface needs three pieces of information: the x- and the y-coordinates of the grid and the function values f(x, y) at the grid points.
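If we want to compare the raw grid points with the fitted surface directly, we can put both into one display. The following is only a small sketch that reuses createdisplay and show (introduced further below) and assumes the variables x, f and gr from the program above are still in the workspace:

d = createdisplay(1, 2)  ; creates a display with two plots
show(d, 1, 1, x~f)       ; plots the grid points as a raw point cloud
show(d, 1, 2, gr)        ; plots the generated surface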
If we view surfaces, understanding them might be difficult, even if we can rotate them. We can produce contour plots, i.e. plots of the contour lines f(x, y) = constant.
library ("plot") ; loads the library plot x0 = #(-3, -3) h = #(0.2, 0.2) n = #(31, 31) x = grid(x0, h, n) ; generates a bivariate grid f = exp(-(x[,1]^2+x[,2]^2)/1.5)/(1.5*pi) ; computes density of bivariate ; normal with correlation 0.5 c = 0.2*(1:4).*max(f) ; selects contour lines as 10%,...,90% ; times the maximum density gr = grcontour2(x~f, c) ; generates contours plot(gr) ; plots the contours
The quantlet grcontour3 can be used to compute contour lines for a function of three variables, f(x, y, z) = constant. This gives a two-dimensional surface in a three-dimensional space. An example is the XploRe logo.
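The text gives no grcontour3 program, so the following is only a sketch under two assumptions: that grid also accepts three-dimensional origin, step and count vectors, and that grcontour3 mirrors the calling convention of grcontour2, i.e. it takes the grid joined with the function values and a vector of contour levels:

library("plot")                   ; loads library plot
x0 = #(-2, -2, -2)
h  = #(0.4, 0.4, 0.4)
n  = #(11, 11, 11)
x  = grid(x0, h, n)               ; assumed: generates a trivariate grid
f  = x[,1]^2 + x[,2]^2 + x[,3]^2  ; a simple function of three variables
gr = grcontour3(x~f, 1)           ; assumed syntax: contour surface f(x,y,z) = 1,
                                  ; i.e. a sphere of radius 1
plot(gr)                          ; plots the contour surface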
Sunflower plots avoid overplotting when we have many data points. We can consider a sunflower plot as a combination of a two-dimensional histogram with a contour plot. As with histograms, we must define a two-dimensional binwidth d and a two-dimensional origin o.
Let us compare a scatter plot of 1000 bivariate normal data points with the equivalent sunflower plot.
library ("graphic") ; loads library graphic x = normal(1000, 2) ; generates bivariate normal data d = createdisplay(2,1) ; creates a display with two plots show (d, 1, 1, x) ; plots the original data gr = grsunflower(x) ; generates sunflower plot show (d, 2, 1, gr) ; plots the sunflower plot
An important statistical task is to find a relationship between two variables. The most frequently used technique to quantify the relationship is least squares linear regression:
library("plot") ; loads library plot data = read ("bostonh") ; reads Boston Housing data x0 = data[,13:14] ; selects columns 13 and 14 l0 = grlinreg(x0) ; generates regression line x1 = log(data[,13:14]) ; logarithm of columns 13 and 14 l1 = grlinreg(x1) ; generates regression line d = createdisplay(1,2) ; creates display with two plots show (d, 1, 1, x0, l0) ; plots data and regression l0 show (d, 1, 2, x1, l1) ; plots data and regression l1
We see that the regression line (median house price = α + β · percentage of lower status people) does not fit the data well. Either a transformation of the data or a nonlinear regression technique seems to be useful. In our example we have taken logarithms of LSTAT and MEDV, thus our model is

log(MEDV) = α + β · log(LSTAT).
Obviously we can choose other explanatory variables, e.g. nitric oxides concentration in parts per ten million (NOXSQ) and the average number of rooms (RM). This results in the following program, which draws the three-dimensional data set and the regression plane (MEDV = α + β1 · NOXSQ + β2 · RM) on a 4x4 mesh.
library("plot") ; loads library plot data = read ("bostonh") ; reads Boston Housing data x = data[,5|6|14] ; selects columns 5, 6 and 14 p = grlinreg2(x, 5|5) ; generates regression plane ; with 4x4 mesh plot(x, p) ; plots data set and regression plane
We have already plotted two variables, the percentage of lower status people against the median house price. The quantlet plot2 provides a much more powerful way of plotting multivariate data sets. The following program shows the first two principal components (based on the correlation matrix) of the Boston Housing data set. Additionally, if the median house price is less than the mean of the median house price, then we color the observation green, otherwise we color the observation blue.
library("plot") ; loads library plot data = read ("bostonh") ; reads Boston Housing data col = grc.col.green-grc.col.blue col = grc.col.blue+col*(data[,14]<mean(data[,14])) ; colors observations blue and green plot2 (data, grc.prep.pcacorr, col) ; plots two first principal axes
When we load the library graphic, the library installs an object grc which contains some often used graphical constants, e.g. grc.col.red for the color red. grc.prep.none makes no transformation on the data and plots the first two variables (here: per capita crime rate, CRIM, and proportion of residential land zoned for lots over 25000 square feet, ZN).
Since we assume a relation between all variables, we choose grc.prep.pcacorr and color the data points as described above. There seems to be an interesting (nonlinear) relationship for the house prices. A more complex example in Subsection 3.4.3 describes how the coloring works.
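For comparison, the untransformed plot mentioned above can be produced as follows (a small sketch assuming the color argument of plot2 may be omitted):

library("plot")             ; loads library plot
data = read("bostonh")      ; reads Boston Housing data
plot2(data, grc.prep.none)  ; no transformation: plots CRIM against ZN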
Star diagrams are used to visualize a multivariate data set. For each observation we plot a star, and each axis of a star represents one variable. Obviously, we need to standardize the variables to a common interval; this is done internally by plotstar. The standardization depends on the method the user chooses; the default method is grc.prep.zeroone, which maps each variable to the interval [0, 1].
library("plot") ; loads library plot data = read ("bostonh") ; reads Boston Housing data data = data[1:70,] ; select first 70 observations col = grc.col.green-grc.col.blue col = grc.col.blue+col*(data[,14]<mean(data[,14])) ; colors observations blue and green plotstar (data, grc.prep.zeroone, col) ; shows star diagram of the data
We note immediately that there are several groups of similar-looking data points. The Boston Housing data set is well known for its outliers and subgroups.
Another possibility for analyzing multivariate data is the scatter-plot matrix. Here, we plot a set of scatter plots such that we see every possible combination of two variables. Obviously we should not put too many variables into a scatter-plot matrix, since our screen size is limited. In fact, plotscml limits the number of variables to eight.
library("plot") ; loads library plot data = read ("bostonh") ; reads Boston Housing data x = data[,5|6|13|14] ; selects columns 5, 6, 13 14 names="NOXSQ"~"RM"~"LSTAT"~"MEDV" ; names of the variables plotscml (x, names) ; shows scatter-plot matrix
The idea of Andrews curves is based on replacing each data point by a curve, such that some properties of the data points are transferred to properties of the curves. For a data point x = (x_1, ..., x_p), the curve is

f_x(t) = x_1/sqrt(2) + x_2 sin(t) + x_3 cos(t) + x_4 sin(2t) + x_5 cos(2t) + ...,   -pi <= t <= pi.

For example, it holds that the integrated squared distance between two curves f_x and f_y over [-pi, pi] equals pi times the squared Euclidean distance between the corresponding data points x and y.
library("plot") ; loads library plot data = read ("bostonh") ; reads Boston Housing data data = data[,1:3]~data[,5:14] ; Boston Housing without the fourth column data = data[21:40] ; observations 21 to 40 plotandrews (data, grc.prep.pcacorr) ; shows Andrews curves based on ; principal components of ; correlation matrix
Note also that the order of the variables plays an important role. The last variable is represented with a rather high frequency, and the human brain will not easily perceive two high-frequency curves as really distinct.
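To see how much the picture depends on the variable order, we can permute the columns before calling plotandrews. The sketch below assumes that plotandrews also accepts grc.prep.zeroone, so that the original variables rather than principal components are drawn; moving the first variable to the last position (an arbitrary choice) assigns it the highest frequency:

library("plot")                ; loads library plot
data = read("bostonh")         ; reads Boston Housing data
data = data[,1:3]~data[,5:14]  ; Boston Housing without the fourth column
data = data[21:40]             ; observations 21 to 40
perm = data[,2:13]~data[,1]    ; moves the first variable to the last position
plotandrews(data, grc.prep.zeroone)
                               ; Andrews curves in the original variable order
plotandrews(perm, grc.prep.zeroone)
                               ; the same curves with the first variable mapped
                               ; to the highest frequency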
A completely different approach is the parallel coordinate plot. Instead of insisting on orthogonality of the projections (as, e.g., in the scatter-plot matrix), we give it up. On the j-th parallel axis we plot the j-th components x_ij of all data points x_i, and then connect the points of each observation on neighboring axes, such that each polygonal line represents one data point. Some relations between the variables create specific patterns in the graphics. For example, a correlation of 1 between two variables is represented by parallel lines between their axes. A correlation of -1 results in all lines crossing in one point midway between the two axes.
library("plot") ; loads library plot data = read ("bostonh") ; reads Boston Housing data data = data[21:40] ; observations 21 to 40 x = data[,6|13|14] ; selects columns 6, 13 and 14 plotpcp (x, grc.prep.standard) ; shows parallel coordinate plot
The Boston Housing data show some kind of negative correlation between RM and LSTAT, and between LSTAT and MEDV.
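To see the two extreme patterns in isolation, here is a small synthetic sketch (assuming normal(50) returns a column vector of 50 standard normal values, analogous to normal(1000, 2) above). The first two columns have correlation +1 and the last two have correlation -1, so we expect parallel lines between the first pair of axes and a single crossing point between the last pair.

library("plot")                ; loads library plot
z = normal(50)                 ; assumed: 50 univariate standard normal values
x = z~z~(-z)                   ; correlation +1 between columns 1 and 2,
                               ; correlation -1 between columns 2 and 3
plotpcp(x, grc.prep.standard)  ; parallel lines between the first two axes,
                               ; one crossing point between the last two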