3.2 Univariate Graphics


gr = grbox (x {,col})
generates a boxplot of the data set x
gr = grdot (x {,col})
generates a dotplot of the data set x
gr = grbar (x {,col})
generates a bar chart of the data set x
gr = grqq (y, x {,col})
generates a QQ-plot from the data sets y and x
gr = grqqn (x {,col})
generates a QQ-plot from the data set x and a normal distribution
gr = grqqu (x {,col})
generates a QQ-plot from the data set x and a uniform distribution
gr = grhist (x {, h {, o {,col}}})
generates a histogram of the data set x
gr = grash (x {, h {, k {,col}}})
generates an averaged shifted histogram of the data set x

The optional parameter col allows us to produce a graphical object in a color other than black. For details, see Subsection 3.4.3. The other optional parameters will be explained when we introduce grhist and grash.

In the following examples, we use a mix of graphical primitives, which are part of the library graphic, and high-level routines, which are part of the library plot. Since loading the library plot also loads the library graphic, we do not need to load the library graphic explicitly.


3.2.1 Boxplots

Let us now examine some variables of the Boston Housing data with statistical graphics. Since the aim of the data exploration is to predict the median house price (MEDV) from the other variables, let us start with a boxplot of this variable, generated with the quantlet grbox.

  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  gr = grbox(data[,14])          ; generates a graphical object
  plot(gr)                       ; plots graphical object
XLGgraph31.xpl

Note that we generate the boxplot in two steps. First we generate the graphical object gr and then we plot it.

Figure: Boxplot of the 14th variable (MEDV) of the Boston Housing data.

We might not be satisfied with this boxplot, since the window size is chosen such that all the data are visible. Let us now apply a trick that is often helpful for getting a better plot.

  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  x = data[,14]                  ; selects the 14th column  
  gr = grbox(x)                  ; generates graphical object
  scale = #(min(x),max(x))~#(-1, 2) ; generates scaling data set
  scale = setmask(scale,"white") ; makes scaling data "invisible"
  plot(gr, scale)                ; plots boxplot and scaling data
XLGgraph32.xpl

Figure: Rescaled boxplot of the 14th variable (MEDV) of the Boston Housing data.

We have generated an invisible data set which helps us to scale the boxplot better in the window.

We learn from the boxplot that the variable MEDV contains several large outliers, which are marked on the right with circles and crosses. The mean (broken line) and the median (solid line in the box) differ. The box borders are the $25\%$- and $75\%$-quantiles, and the whiskers end at the most extreme observations that are $\geq$ 25%-quantile $-$ 1.5 $\times$ interquartile range and $\leq$ 75%-quantile $+$ 1.5 $\times$ interquartile range. Since the box borders and the whiskers have more or less the same distance from the median on both sides, we may consider the distribution of the variable to be roughly symmetric.
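The quantities read off the boxplot can be reproduced numerically. The following is a minimal Python sketch (not XploRe code) that assumes the quartiles are computed with the "inclusive" interpolation rule of Python's statistics module; grbox may use a different quantile definition. The function name boxplot_stats and the toy data are hypothetical and are not the Boston Housing data.

```python
import statistics

def boxplot_stats(values):
    """Return the summary numbers a boxplot is drawn from."""
    data = sorted(values)
    # Quartiles; the interpolation rule is an assumption of this sketch.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "mean": statistics.fmean(data),
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }

# Hypothetical toy data with one large outlier
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
stats = boxplot_stats(sample)
print(stats)
```

In the toy data the single large value pulls the mean well above the median, which is the same effect that separates the broken and the solid line in the boxplot of MEDV.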


3.2.2 Dotplots

Let us now examine the median house price in a little more detail. We use the quantlet grdot to generate a dotplot. In the horizontal direction, a dotplot shows the values of the observations; in the vertical direction, each observation is placed at a generated, uniformly distributed random number.

  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  x = data[,14]                  ; selects 14th column  
  gr = grdot(x)                  ; generates dotplot
  scale = #(min(x),max(x))~#(-1, 2) ; generates scaling data set
  scale = setmask(scale,"white") ; makes scaling data "invisible"
  plot(gr, scale)                ; plots dotplot and scaling data
XLGgraph33.xpl

Figure: Rescaled dotplot of the 14th variable (MEDV) of the Boston Housing data.

After rescaling the display, we can detect patterns within the variable MEDV. There appears to be a rather sparse area of data up to $x \approx 12$, then a denser area of data up to $x \approx 18$, a sharp break at $x \approx 25$, and finally another break at $x \approx 36$. We also see that there is more than one observation behind the cross in the dotplot.
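The construction of the dotplot and of the invisible scaling data set can be sketched in a few lines of Python (not XploRe code). The sketch assumes that the vertical coordinate is drawn uniformly from [0, 1), which the source does not state explicitly; the function names dotplot_points and scaling_corners are hypothetical.

```python
import random

def dotplot_points(values, seed=0):
    """Pair each observation with a uniform random vertical position."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Assumption: the vertical jitter is uniform on [0, 1).
    return [(v, rng.random()) for v in values]

def scaling_corners(values, y_low=-1.0, y_high=2.0):
    """Two extra points that stretch the vertical range of the display.

    Mirrors the scaling data set #(min(x),max(x))~#(-1, 2) of the listing:
    the points (min, -1) and (max, 2) are plotted invisibly, so the window
    covers [-1, 2] vertically and the dots occupy only its middle third.
    """
    return [(min(values), y_low), (max(values), y_high)]

# Hypothetical toy data
x = [3.0, 1.0, 2.0]
points = dotplot_points(x)
corners = scaling_corners(x)
print(points)
print(corners)
```

Because the vertical position carries no information, only the horizontal clustering of the points should be interpreted.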


3.2.3 Bar Charts

If we want to plot discrete variables, it does not make sense to use a boxplot or a dotplot. For this purpose we can use bar charts. We generate a bar chart with the quantlet grbar and use the fourth variable (CHAS), which is an indicator variable showing whether the Charles river is part of the school district.

  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  x = data[,4]                   ; selects 4th column  
  gr = grbar(x)                  ; generates a bar chart
  gr = setmask(gr, "line", "medium") ; changes line thickness
  plot(gr)                       ; plots bar chart
XLGgraph34.xpl

We see in the bar chart a large bar representing zeros (school district does not include Charles river) and a small bar representing ones (school district does include Charles river).

Figure: Bar chart of the 4th variable (CHAS) of the Boston Housing data generated by a graphic primitive routine.

Although most gr... quantlets already generate a useful graphic, they are intended as building blocks for high-level routines. If the Charles river indicator variable were coded by the numbers -1 and 0, we would not be able to tell which bar represents the -1 and which represents the 0. The left bar would still start at 0 and the right bar at 1.
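The labelling problem can be made concrete with a small Python sketch (not XploRe code). It counts the observations per category and keeps the category value next to each count, which is the information that is lost when the bars are simply placed at positions 0 and 1. The function name bar_heights and the toy data are hypothetical.

```python
from collections import Counter

def bar_heights(values):
    """Return (category, count) pairs sorted by category value."""
    counts = Counter(values)
    return sorted(counts.items())

# Hypothetical indicator variable coded with -1 and 0 instead of 0 and 1
coded = [-1, 0, -1, -1, 0, -1]
bars = bar_heights(coded)
print(bars)

# Placing the bars at positions 0, 1, ... discards the category values
positions_only = [(i, count) for i, (_, count) in enumerate(bars)]
print(positions_only)
```

A high-level routine can use the first list to label the axis with the actual category values.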

Fortunately, the more sophisticated quantlet plotbar is available, which generates a much better bar chart.

  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  x = data[,4]                   ; selects 4th column  
  plotbar(x)                     ; plots the bar chart
XLGgraph35.xpl

Figure: Bar chart of the 4th variable (CHAS) of the Boston Housing data generated by a graphic high-level routine.


3.2.4 Quantile-Quantile Plots

Quantile-Quantile plots are used to compare the distributions of two variables (grqq) or to compare one variable with a given distribution (grqqu uniform, grqqn normal).

Let us compare the distribution of the percentage of lower-status people (LSTAT) with the corresponding normal distribution.

  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  x = data[,13]                  ; selects 13th column  
  gr = grqqn(x)                  ; generates a qq plot
  plot(gr)                       ; plots the qq plot
XLGgraph36.xpl

We see a clear deviation from the line, which indicates that the 13th variable is not normally distributed. Since the data points cross the 45-degree line twice, we can say that the distribution is steeper in the center and thicker in the tails.
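The points of a normal QQ-plot can be computed as in the following Python sketch (not XploRe code). It assumes that the reference normal distribution is fitted with the sample mean and standard deviation and that the plotting positions are (i + 0.5)/n; grqqn may use other conventions, and the function name normal_qq_points is hypothetical.

```python
import statistics

def normal_qq_points(values):
    """Pair theoretical normal quantiles with the sorted observations."""
    data = sorted(values)
    n = len(data)
    # Assumption: the reference normal uses the sample mean and std. dev.
    fitted = statistics.NormalDist(statistics.fmean(data),
                                   statistics.stdev(data))
    # Assumption: plotting positions (i + 0.5) / n for i = 0, ..., n - 1.
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(data)]

# Hypothetical toy data
qq = normal_qq_points([1.0, 2.0, 3.0, 4.0, 5.0])
for theoretical, observed in qq:
    print(round(theoretical, 3), observed)
```

If the data were normally distributed, the pairs would scatter around the 45-degree line; systematic crossings of that line point to differences in the center and in the tails.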

Figure: QQ-plot of the 13th variable (LSTAT) of the Boston Housing data.


3.2.5 Histograms

The most frequently used statistical graphic for visualizing continuous data is the histogram. Let us now generate a histogram of the median house prices.

  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  x = data[,14]                  ; selects 14th column  
  gr = grhist(x)                 ; generates histogram
  plot(gr)                       ; plots histogram
XLGgraph37.xpl

Figure: Histogram of the 14th variable (MEDV) of the Boston Housing data.

We already noticed some characteristics of this variable when we generated boxplots. Here we find some of them again, e.g. the central dense region of data and the outliers at the right border. Unlike the quantlets discussed so far, grhist has two further optional parameters: h, the binwidth, and o, the origin of the histogram. Changing the binwidth or the origin might reveal more patterns within the data. Let us first change the binwidth to 1.

  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  x = data[,14]                  ; selects 14th column  
  gr = grhist(x,1)               ; generates histogram with 
                                 ;    binwidth 1
  plot(gr)                       ; plots histogram
XLGgraph38.xpl

Figure: Histogram of the 14th variable (MEDV) of the Boston Housing data with a different binwidth.

Now change the origin to 0.5.

  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  x = data[,14]                  ; selects 14th column  
  gr = grhist(x,1,0.5)           ; generates histogram with 
                                 ;    binwidth 1 and origin 0.5
  plot(gr)                       ; plots histogram
XLGgraph39.xpl

Figure: Histogram of the 14th variable (MEDV) of the Boston Housing data with a different origin of the bin.
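How the binwidth and the origin determine the bin counts can be sketched in Python (not XploRe code). The sketch assumes half-open bins of the form [origin + k h, origin + (k+1) h); the binning convention of grhist is not stated in the text, and the function name histogram_counts and the toy data are hypothetical.

```python
import math
from collections import Counter

def histogram_counts(values, binwidth, origin=0.0):
    """Map the left edge of each non-empty bin to its number of observations."""
    # Bin index k such that origin + k*h <= v < origin + (k+1)*h
    counts = Counter(math.floor((v - origin) / binwidth) for v in values)
    return {origin + k * binwidth: c for k, c in sorted(counts.items())}

# Hypothetical toy data
sample = [0.6, 0.9, 1.2, 1.4, 1.6, 2.4]
at_zero = histogram_counts(sample, binwidth=1, origin=0.0)
at_half = histogram_counts(sample, binwidth=1, origin=0.5)
print(at_zero)
print(at_half)
```

With the same binwidth, shifting the origin from 0 to 0.5 turns three bins with a peak in the middle into two bins that decrease from left to right, so the visual impression depends on a rather arbitrary choice.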

Well-known problems with histograms are the choices of the origin and the binwidth. To overcome these problems, the concept of average shifted histograms has been developed. In principle, we generate a set of histograms with different origins, instead of one histogram, and then we average these histograms. We apply average shifted histograms to our last example with binwidth 1 but 10 different origins.

  library("plot")          ; loads library plot
  data = read ("bostonh")  ; reads Boston Housing data
  x = data[,14]            ; selects 14th column  
  gr = grash(x,1,10)       ; generates average shifted histogram
  plot(gr)                 ; plots average shifted histogram
XLGgraph3A.xpl

Figure: Average shifted histogram of the 14th variable (MEDV) of the Boston Housing data.

From the two histograms in Figures 3.15 and 3.16 we can only speculate about multimodality of the data, since they give different pictures. The average shifted histogram, however, suggests the existence of three modes between $10$ and $25$.
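The averaging of shifted histograms can be sketched in Python (not XploRe code). The sketch assumes that the k origins are equally spaced at 0, h/k, 2h/k, ..., and it reports averaged counts on a fine grid of width h/k rather than a normalized density; grash may differ in both respects. The function name ash_counts is hypothetical.

```python
import math

def ash_counts(values, binwidth, shifts):
    """Average the bin counts of `shifts` histograms with shifted origins.

    Returns a dict that maps the index j of the fine cell
    [j*delta, (j+1)*delta), delta = binwidth/shifts, to the averaged count.
    """
    delta = binwidth / shifts
    fine = {}
    for v in values:  # counts on the fine grid
        j = math.floor(v / delta)
        fine[j] = fine.get(j, 0) + 1
    result = {}
    for j in range(min(fine) - shifts + 1, max(fine) + shifts):
        total = 0
        for s in range(shifts):
            # First fine cell of the bin of histogram s that contains cell j
            start = ((j - s) // shifts) * shifts + s
            total += sum(fine.get(i, 0) for i in range(start, start + shifts))
        result[j] = total / shifts
    return result

# One observation, binwidth 1, two origins (0 and 0.5)
print(ash_counts([0.5], binwidth=1, shifts=2))
```

For a single observation the result is a small triangle centered on the observation, which shows that averaging over the origins replaces the rectangular bins by smoother, triangular weights.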