The optional parameter col allows us to produce a graphical
object in a color other than black. For details, see Subsection 3.4.3.
The other optional parameters will be explained when we introduce
grhist and grash.
In the following examples, we use a mix of graphical primitives, which are part of the library graphic, and high-level routines, which are part of the library plot. Since loading the library plot also loads the library graphic, we do not need to load the library graphic explicitly.
Let us now examine some variables of the Boston Housing data with statistical
graphics. Since the aim of the data exploration is to predict the median house price
from the other variables, let us make a boxplot with the quantlet grbox.

    library("plot")                    ; loads library plot
    data = read("bostonh")             ; reads Boston Housing data
    gr = grbox(data[,14])              ; generates a graphical object
    plot(gr)                           ; plots graphical object

We might not be satisfied with the boxplot, since the window size is chosen such that all the data are visible. Let us now apply an often helpful trick to get a better plot.

    library("plot")                    ; loads library plot
    data = read("bostonh")             ; reads Boston Housing data
    x = data[,14]                      ; selects the 14th column
    gr = grbox(x)                      ; generates graphical object
    scale = #(min(x),max(x))~#(-1, 2)  ; generates scaling data set
    scale = setmask(scale,"white")     ; makes scaling data "invisible"
    plot(gr, scale)                    ; plots boxplot and scaling data

We have generated an invisible data set which helps us to scale the boxplot better in the window.
We learn from the boxplot that the variable MEDV contains several large outliers.
The mean (broken line) and the median (solid line in the box) differ.
Moreover, we see outliers on the right, marked with circles and
crosses. Since the box borders (25%- and 75%-quantile) and the whiskers
(25%-quantile minus 1.5 times the interquartile range and
75%-quantile plus 1.5 times the interquartile range)
have more or less the same distance from the median, we may
consider the variable to have a symmetrical distribution.
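The boxplot quantities named above can be computed by hand. The following is a minimal sketch in Python, used here only as a neutral illustration and not as XploRe code. The function name box_summary, the toy sample and the quantile interpolation rule are assumptions; grbox may compute its quantiles differently.

```python
import statistics

def box_summary(values):
    """Return box borders, whisker limits and outliers for a sample."""
    # Box borders: 25%- and 75%-quantile (interpolation rule is an assumption).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr   # 25%-quantile minus 1.5 times the IQR
    upper_limit = q3 + 1.5 * iqr   # 75%-quantile plus 1.5 times the IQR
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(values),
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "outliers": outliers,
    }

# Toy values for illustration only, not the Boston Housing data.
sample = [10, 12, 13, 14, 15, 16, 17, 18, 20, 50]
summary = box_summary(sample)
print(summary)
```

In this toy sample the single large value pulls the mean above the median and lies beyond the upper limit, which is the same pattern we read off the boxplot of MEDV.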
Let us now examine the median house price in a little more detail. We use
the quantlet grdot to generate a dotplot.
In the horizontal direction, a dotplot takes
the value of the observations; in the vertical direction, it takes
a generated uniformly distributed random number.

    library("plot")                    ; loads library plot
    data = read("bostonh")             ; reads Boston Housing data
    x = data[,14]                      ; selects 14th column
    gr = grdot(x)                      ; generates dotplot
    scale = #(min(x),max(x))~#(-1, 2)  ; generates scaling data set
    scale = setmask(scale,"white")     ; makes scaling data "invisible"
    plot(gr, scale)                    ; plots dotplot and scaling data

After having rescaled the display, we can detect patterns within the variable MEDV.
It seems we have a rather sparse area of data until
, then a denser area
of data until
with a sharp break at
and finally another break
at
. We also see that behind the cross in the dotplot, there is more than one
observation.
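The construction of a dotplot described above is simple enough to sketch. The following Python fragment is an illustration only, not the implementation of grdot. The function name dotplot_coordinates, the seed and the toy sample are assumptions.

```python
import random

def dotplot_coordinates(values, seed=0):
    """Pair each observation with a uniform random vertical position."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Horizontal coordinate: the observation; vertical: uniform on [0, 1).
    return [(v, rng.random()) for v in values]

# Toy values with ties, for illustration only.
sample = [21.0, 21.0, 22.5, 50.0, 50.0, 50.0]
points = dotplot_coordinates(sample)
for x_coord, y_coord in points:
    print(x_coord, round(y_coord, 3))
```

Tied observations share the same horizontal position but receive different vertical positions, so they become visible as separate dots. In a boxplot, by contrast, several identical outliers are drawn on top of each other.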
If we want to plot discrete variables, it does not make sense to use a boxplot or a dotplot.
For this purpose we can use bar charts. We generate a bar chart with the
quantlet grbar and use the fourth variable (CHAS),
which is an indicator variable as to
whether the Charles river is part of the school district.

    library("plot")                    ; loads library plot
    data = read("bostonh")             ; reads Boston Housing data
    x = data[,4]                       ; selects 4th column
    gr = grbar(x)                      ; generates a bar chart
    gr = setmask(gr, "line", "medium") ; changes line thickness
    plot(gr)                           ; plots bar chart

We see in the bar chart a large bar representing zeros (school district does not include Charles river) and a small bar representing ones (school district does include Charles river).
Although most gr... quantlets already generate a useful graphic, they are intended to be building blocks for high-level routines. If the Charles river index variable were coded by the numbers -1 and 0, we would not be able to tell which bar represents the -1 and which represents the 0. The left bar would still start at 0 and the right bar at 1.
Fortunately, the more sophisticated quantlet plotbar is available, which
generates a much better bar chart.

    library("plot")                    ; loads library plot
    data = read("bostonh")             ; reads Boston Housing data
    x = data[,4]                       ; selects 4th column
    plotbar(x)                         ; plots the bar chart

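What a bar chart of a discrete variable needs is the frequency of each distinct value, labelled by the value itself rather than by its position. The following Python sketch illustrates this idea only. It makes no claim about how plotbar works internally, and the function name bar_heights and the toy data are assumptions.

```python
from collections import Counter

def bar_heights(values):
    """Return (value, frequency) pairs, sorted by value."""
    counts = Counter(values)
    return sorted(counts.items())

# Toy indicator variable, coded once as 0/1 and once as -1/0.
coded_01 = [0, 0, 0, 1, 0, 1, 0]
coded_m10 = [v - 1 for v in coded_01]

heights_01 = bar_heights(coded_01)
heights_m10 = bar_heights(coded_m10)
print(heights_01)
print(heights_m10)
```

Both codings give the same bar heights. Only the labels differ, so a chart that places the bars at positions 0 and 1 regardless of the coding hides which value each bar stands for.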
Quantile-quantile plots are used to compare the distributions of two
variables (grqq) or to compare one variable with a given
distribution (grqqu for the uniform, grqqn for the normal distribution).
Let us compare the percentage of lower status people with the appropriate normal distribution.

    library("plot")                    ; loads library plot
    data = read("bostonh")             ; reads Boston Housing data
    x = data[,13]                      ; selects 13th column
    gr = grqqn(x)                      ; generates a qq plot
    plot(gr)                           ; plots the qq plot

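The points of a normal quantile-quantile plot can be computed as follows. This Python sketch is an illustration under stated assumptions, not the implementation of grqqn: we assume that the "appropriate" normal distribution is the one with the sample mean and sample standard deviation, and we assume the plotting positions (i - 0.5)/n. The function name normal_qq_points and the toy sample are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Return (theoretical quantile, observed value) pairs."""
    n = len(values)
    # Assumption: compare with the normal fitted by sample mean and std.
    fitted = NormalDist(mean(values), stdev(values))
    ordered = sorted(values)
    # Assumption: plotting positions (i - 0.5) / n for i = 1, ..., n.
    return [
        (fitted.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]

# Toy right-skewed values, for illustration only.
sample = [2.0, 4.0, 4.5, 6.0, 9.5, 12.0, 30.0]
points = normal_qq_points(sample)
for theoretical, observed in points:
    print(round(theoretical, 2), observed)
```

If the data followed the fitted normal distribution, the points would lie close to the diagonal. Systematic departures at the ends indicate skewness or heavy tails.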
The most frequently used statistical graphic for visualizing continuous data is the histogram. Let us now generate a histogram of the median house prices.

    library("plot")                    ; loads library plot
    data = read("bostonh")             ; reads Boston Housing data
    x = data[,14]                      ; selects 14th column
    gr = grhist(x)                     ; generates histogram
    plot(gr)                           ; plots histogram

We already noticed some characteristics of this variable when we generated the boxplots.
Here we find some of them again, e.g. the central dense region of data
and the outliers at the right border. In contrast to all other
univariate graphics, grhist has two optional parameters: h,
the binwidth, and o, the origin of the histogram. Changing
the binwidth as well as the origin might reveal some more patterns
within the data. Let us first change the binwidth to 1.

    library("plot")                    ; loads library plot
    data = read("bostonh")             ; reads Boston Housing data
    x = data[,14]                      ; selects 14th column
    gr = grhist(x,1)                   ; generates histogram with binwidth 1
    plot(gr)                           ; plots histogram

Now change the origin to 0.5.

    library("plot")                    ; loads library plot
    data = read("bostonh")             ; reads Boston Housing data
    x = data[,14]                      ; selects 14th column
    gr = grhist(x,1,0.5)               ; generates histogram with binwidth 1 and origin 0.5
    plot(gr)                           ; plots histogram

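To see why the origin matters, it helps to count the observations per bin by hand. The following Python sketch is an illustration only. It assumes that the bins are the intervals from origin + k*h up to, but not including, origin + (k+1)*h, which may differ in detail from grhist. The function name histogram_counts and the toy sample are assumptions.

```python
import math
from collections import Counter

def histogram_counts(values, h, origin=0.0):
    """Map each bin's left edge to the number of observations in the bin."""
    # Bin index k such that origin + k*h <= v < origin + (k+1)*h.
    counts = Counter(math.floor((v - origin) / h) for v in values)
    return {origin + k * h: c for k, c in sorted(counts.items())}

# Toy values, for illustration only.
sample = [4.9, 5.2, 5.4, 5.6, 6.1, 6.4, 6.6, 7.5]
counts_origin_0 = histogram_counts(sample, 1, 0.0)
counts_origin_05 = histogram_counts(sample, 1, 0.5)
print(counts_origin_0)
print(counts_origin_05)
```

With the same binwidth, the first origin yields a symmetric shape and the second a decreasing one. The data have not changed, only the bin edges.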
Well-known problems with histograms are the choices of the origin and the binwidth. To overcome these problems, the concept of average shifted histograms has been developed. In principle, we generate a set of histograms with different origins, instead of one histogram, and then we average these histograms. We apply average shifted histograms to our last example with binwidth 1 but 10 different origins.

    library("plot")                    ; loads library plot
    data = read("bostonh")             ; reads Boston Housing data
    x = data[,14]                      ; selects 14th column
    gr = grash(x,1,10)                 ; generates average shifted histogram
    plot(gr)                           ; plots average shifted histogram

In both histograms, Figures 3.15 and 3.16, we can
speculate about multimodality of the data, since they show
different histograms. However, the average shifted histogram suggests the
existence of three modes between and
.
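The averaging of shifted histograms can be written down compactly. The following Python sketch is an illustration under stated assumptions, not the implementation of grash: it averages m histograms of binwidth h whose origins are shifted by h/m, using the equivalent formulation with counts on a fine grid and triangular weights. The function name ash_density, the parameter names and the toy sample are assumptions.

```python
import math

def ash_density(values, h, m, origin=0.0):
    """Average m histograms of binwidth h with origins shifted by h / m.

    Returns a dict mapping the left edge of each fine bin (width h / m)
    to the averaged density estimate on that bin.
    """
    n = len(values)
    delta = h / m  # width of the fine grid
    fine = {}
    for v in values:
        k = math.floor((v - origin) / delta)
        fine[k] = fine.get(k, 0) + 1
    lo = min(fine) - (m - 1)
    hi = max(fine) + (m - 1)
    density = {}
    for k in range(lo, hi + 1):
        # Averaging the m shifted histograms is equivalent to weighting
        # the neighbouring fine-bin counts with the weights 1 - |i| / m.
        weighted = sum(
            (1 - abs(i) / m) * fine.get(k + i, 0) for i in range(1 - m, m)
        )
        density[origin + k * delta] = weighted / (n * h)
    return density

# Toy values, for illustration only.
sample = [4.9, 5.2, 5.4, 5.6, 6.1, 6.4, 6.6, 7.5]
ash = ash_density(sample, h=1, m=10)
plain = ash_density(sample, h=1, m=1)  # m = 1 is an ordinary histogram
print(round(sum(ash.values()) * (1 / 10), 6))
```

With m = 1 the estimate reduces to an ordinary histogram on the density scale. Larger m makes the result far less dependent on the choice of the origin.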