1.1 Boxplots
EXAMPLE 1.1
The Swiss bank data (see Appendix, Table B.2) consist of 200
measurements on Swiss bank notes. The first half
of these measurements are from genuine bank notes,
the other half are from counterfeit bank notes.
Figure 1.1:
An old Swiss 1000-franc bank note.
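The authorities have measured several variables on each note, as indicated in Figure 1.1,
among them the length of the bill ($X_1$) and the length of the diagonal ($X_6$).
These data are taken from Flury and Riedwyl (1988).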
The aim is to study how these measurements may be used in determining
whether a bill is genuine or counterfeit.
The boxplot is a graphical technique that
displays the distribution of
variables. It helps us see the location, skewness, spread, tail length
and outlying points.
It is particularly useful in comparing different batches.
The boxplot is a graphical representation of the Five Number
Summary. To introduce the Five Number Summary,
let us consider for a moment a smaller, one-dimensional data set:
the population of the 15 largest U.S. cities in 1960 (Table 1.1).
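In the Five Number Summary,
we calculate the upper quartile $F_U$,
the lower quartile $F_L$, the median $M$ and the extremes. Recall that
the order statistics $\{x_{(1)}, x_{(2)}, \ldots, x_{(n)}\}$
are the values $x_1, x_2, \ldots, x_n$ arranged in increasing order,
where $x_{(1)}$ denotes the minimum and $x_{(n)}$ the maximum.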
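The median $M$ typically cuts the set of observations
in two equal parts, and is defined as
$$M = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\ \frac{1}{2}\left\{ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right\} & n \text{ even.} \end{cases}$$ (1.1)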
The quartiles cut the set into four equal parts, which are often called
fourths (that is why we use the
letter $F$). Using a definition that goes back to Hoaglin et al. (1983),
the definition of the median can be generalized to fourths, eighths, etc.
Considering the order statistics, we can define the depth of a data
value $x_{(i)}$ as
$\min\{i, n-i+1\}$. If $n$ is odd, the depth
of the median is $\frac{n+1}{2}$. If $n$ is even, $\frac{n+1}{2}$ is a
fraction. Thus, the median is determined to be the average of the two
data values belonging to the next larger and smaller order statistics,
i.e.,
$M = \frac{1}{2}\left\{ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right\}$. In our example, we have $n = 15$,
hence the median is
$M = x_{(8)} = 88$.
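We proceed in the same way to get
the fourths. Take the depth of the median and calculate
$$\text{depth of fourth} = \frac{[\text{depth of median}] + 1}{2}$$
with $[z]$ denoting the largest integer smaller than or equal to $z$.
In our example this gives $4.5$ and thus leads to the two fourths
$$F_L = \frac{1}{2}\left\{ x_{(4)} + x_{(5)} \right\}, \qquad F_U = \frac{1}{2}\left\{ x_{(11)} + x_{(12)} \right\}$$
(recalling that a depth which is a fraction corresponds to the average of
the two nearest data values).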
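The depth rule above can be sketched in a few lines of Python. This is a minimal illustration, not the book's XploRe code. The function names are hypothetical, and the population values (in units of 10,000 inhabitants) are an assumption reconstructed to match the summary values quoted in this section, since Table 1.1 is not reproduced here; they should be checked against the table.

```python
import math

# Assumed values for the 15 largest U.S. cities in 1960 (units of 10,000);
# reconstructed for illustration, verify against Table 1.1.
populations = [778, 355, 248, 200, 167, 94, 94, 88, 76, 75, 74, 74, 70, 68, 63]


def value_at_depth(sorted_x, depth):
    """Value at a depth counted from the minimum.

    A fractional depth is resolved as the average of the two
    nearest order statistics.
    """
    lower = sorted_x[math.floor(depth) - 1]
    upper = sorted_x[math.ceil(depth) - 1]
    return (lower + upper) / 2


def median_and_fourths(x):
    """Return (depth of median, depth of fourth, M, F_L, F_U)."""
    xs = sorted(x)
    n = len(xs)
    depth_m = (n + 1) / 2
    depth_f = (math.floor(depth_m) + 1) / 2
    m = value_at_depth(xs, depth_m)
    f_l = value_at_depth(xs, depth_f)
    # A depth d counted from the maximum is position n + 1 - d from the minimum.
    f_u = value_at_depth(xs, n + 1 - depth_f)
    return depth_m, depth_f, m, f_l, f_u


print(median_and_fourths(populations))  # → (8.0, 4.5, 88.0, 74.0, 183.5)
```

With the assumed data the sketch gives the depths 8 and 4.5 and the values $M = 88$, $F_L = 74$ and $F_U = 183.5$ used in the text.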
Table 1.2:
Five number summary.
              Depth   Values
  Median M      8     88
  Fourths F     4.5   74      183.5
  Extremes      1     63      778
The $F$-spread, $d_F$, is defined as $d_F = F_U - F_L$.
The outside bars
$$F_U + 1.5\, d_F$$ (1.2)
$$F_L - 1.5\, d_F$$ (1.3)
are the borders beyond which a point is regarded as an outlier.
For the number of points outside these bars see Exercise 1.3.
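For the $n = 15$ data points the fourths are
$74 = \frac{1}{2}\left\{ x_{(4)} + x_{(5)} \right\}$ and
$183.5 = \frac{1}{2}\left\{ x_{(11)} + x_{(12)} \right\}$.
Therefore the $F$-spread and the upper and lower outside bars
in the above example are calculated as follows:
$$d_F = F_U - F_L = 183.5 - 74 = 109.5$$
$$F_L - 1.5\, d_F = 74 - 1.5 \cdot 109.5 = -90.25$$
$$F_U + 1.5\, d_F = 183.5 + 1.5 \cdot 109.5 = 347.75$$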
Since New York and Chicago are beyond the outside bars they are
considered to be outliers. The minimum and the maximum are called the
extremes. The mean is defined as
$$\bar{x} = n^{-1} \sum_{i=1}^{n} x_i,$$
which is $168.27$ in our example. The mean is a measure of location.
The median (88), the fourths (74;183.5) and the extremes (63;778) constitute
basic information about the data. The combination of these five numbers
leads to the Five Number Summary as displayed in Table 1.2.
The depths of each of the five numbers have been added as an additional column.
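As a check on the arithmetic, the $F$-spread, the outside bars (1.2) and (1.3), the outliers and the mean can be recomputed in a short Python sketch. The city-to-population mapping (units of 10,000 inhabitants) is an assumption reconstructed for illustration, because Table 1.1 is not reproduced in this section; the fourths are taken from the text.

```python
# Assumed data for the 15 largest U.S. cities in 1960 (units of 10,000);
# reconstructed for illustration, verify against Table 1.1.
cities = {
    "New York": 778, "Chicago": 355, "Los Angeles": 248,
    "Philadelphia": 200, "Detroit": 167, "Baltimore": 94,
    "Houston": 94, "Cleveland": 88, "Washington D.C.": 76,
    "Saint Louis": 75, "Milwaukee": 74, "San Francisco": 74,
    "Boston": 70, "Dallas": 68, "New Orleans": 63,
}

# Fourths as quoted in the text.
f_l, f_u = 74.0, 183.5

d_f = f_u - f_l                # F-spread
lower_bar = f_l - 1.5 * d_f    # lower outside bar (1.3)
upper_bar = f_u + 1.5 * d_f    # upper outside bar (1.2)

# Points beyond the outside bars are regarded as outliers.
outliers = sorted(
    name for name, pop in cities.items()
    if pop < lower_bar or pop > upper_bar
)

mean = sum(cities.values()) / len(cities)

print(d_f, lower_bar, upper_bar)  # → 109.5 -90.25 347.75
print(outliers)                   # → ['Chicago', 'New York']
print(round(mean, 2))             # → 168.27
```

Under these assumed values only New York and Chicago fall beyond the outside bars, in agreement with the text.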
A boxplot is constructed from these quantities as follows:
- Draw a box with borders (edges) at $F_L$ and $F_U$
(i.e., 50% of the data are in this box).
- Draw the median as a solid line and the mean as a dotted line.
- Draw ``whiskers'' from each end of the box to the most
remote point that is NOT an outlier.
- Show outliers as either ``$\circ$'' or ``$\star$'' depending
on whether they are outside of
$[F_L - 1.5\, d_F,\; F_U + 1.5\, d_F]$ or
$[F_L - 3\, d_F,\; F_U + 3\, d_F]$,
respectively. Label them if possible.
In the U.S. cities example the cutoff points (outside bars) are at
$-90.25$ and $347.75$, hence we draw whiskers to New Orleans and Los Angeles.
We can see from Figure 1.2
that the data are strongly skewed: the
upper half of the data (above the median) is more spread out than the
lower half (below the median). The data contain two outliers, marked
as a star and a circle. The more distinct outlier is shown as a star.
The mean (as a non-robust measure of location) is pulled
away from the median.
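The elements needed to draw the boxplot (whisker ends and the two classes of outliers) can be derived with a small Python sketch. The function name and the labels `mild_outliers` and `far_outliers` are hypothetical and used only for this illustration. The data values are again an assumption reconstructed from the summary figures in the text rather than copied from Table 1.1.

```python
# Assumed values for the 15 largest U.S. cities in 1960 (units of 10,000);
# reconstructed for illustration, verify against Table 1.1.
populations = [778, 355, 248, 200, 167, 94, 94, 88, 76, 75, 74, 74, 70, 68, 63]


def boxplot_elements(x, f_l, f_u):
    """Whisker ends and outlier classes for given fourths f_l and f_u."""
    d_f = f_u - f_l
    inner_lo, inner_hi = f_l - 1.5 * d_f, f_u + 1.5 * d_f  # outside bars
    outer_lo, outer_hi = f_l - 3.0 * d_f, f_u + 3.0 * d_f  # 3 d_F bars

    # Whiskers reach the most remote points that are not outliers.
    inside = [v for v in x if inner_lo <= v <= inner_hi]
    # Beyond the 1.5 d_F bars but still within the 3 d_F bars.
    mild = sorted(
        v for v in x
        if (outer_lo <= v < inner_lo) or (inner_hi < v <= outer_hi)
    )
    # Beyond the 3 d_F bars.
    far = sorted(v for v in x if v < outer_lo or v > outer_hi)
    return {
        "whiskers": (min(inside), max(inside)),
        "mild_outliers": mild,
        "far_outliers": far,
    }


elements = boxplot_elements(populations, f_l=74.0, f_u=183.5)
print(elements)
# → {'whiskers': (63, 248), 'mild_outliers': [355], 'far_outliers': [778]}
```

With the assumed data the whiskers end at 63 (New Orleans) and 248 (Los Angeles), while 355 (Chicago) lies beyond the $1.5\, d_F$ bar only and 778 (New York) lies beyond the $3\, d_F$ bar as well.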
Figure 1.3:
Boxplot for the mileage of American, Japanese and
European cars (from left to right).
MVAboxcar.xpl
Boxplots are very useful tools in comparing batches. The relative
location of the distribution of different batches tells us a lot
about the batches themselves. Before we come back to the Swiss bank
data let us compare the fuel economy of vehicles from different
countries, see Figure 1.3 and Table B.3.
The data are from the second column of Table B.3 and show
the mileage (miles per gallon) of U.S.,
Japanese and European cars. The five-number summaries of these three
data sets reflect the information shown in Figure 1.3.
The following conclusions can be made:
- Japanese cars achieve higher fuel efficiency than U.S. and European cars.
- There is one outlier, a very fuel-efficient car (VW-Rabbit Diesel).
- The main body of the U.S. car data (the box) lies below the Japanese
car data.
- The worst Japanese car is more fuel-efficient than almost 50 percent
of the U.S. cars.
- The spreads of the Japanese and the U.S. car data are almost equal.
- The median of the Japanese data is above that of the
European data and the U.S. data.
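A comparison of batches of this kind amounts to placing their five-number summaries side by side. The following Python sketch illustrates the idea under stated assumptions: the two toy batches are hypothetical values invented for the example and are NOT the car data of Table B.3, and the function name is likewise illustrative.

```python
import math


def five_number_summary(x):
    """Return (minimum, F_L, median, F_U, maximum) using the depth rule."""
    xs = sorted(x)
    n = len(xs)
    depth_f = (math.floor((n + 1) / 2) + 1) / 2

    def at_depth(d):
        # Average of the two order statistics nearest to depth d,
        # counted from the minimum.
        return (xs[math.floor(d) - 1] + xs[math.ceil(d) - 1]) / 2

    median = at_depth((n + 1) / 2)
    f_l = at_depth(depth_f)
    f_u = at_depth(n + 1 - depth_f)  # same depth counted from the maximum
    return (xs[0], f_l, median, f_u, xs[-1])


# Hypothetical toy batches (miles per gallon), for illustration only.
batch_a = [12, 14, 15, 16, 18, 19, 22]
batch_b = [18, 21, 23, 25, 27, 30, 35]

summary_a = five_number_summary(batch_a)
summary_b = five_number_summary(batch_b)
print(summary_a)  # → (12, 14.5, 16.0, 18.5, 22)
print(summary_b)  # → (18, 22.0, 25.0, 28.5, 35)

# The box of batch A lies entirely below the box of batch B
# when the upper fourth of A is smaller than the lower fourth of B.
box_a_below_box_b = summary_a[3] < summary_b[1]
print(box_a_below_box_b)  # → True
```

Statements such as "the main body of one batch lies below that of another" then reduce to simple comparisons between the fourths of the two summaries.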
Figure 1.4:
The variable $X_6$ of Swiss bank data (diagonal of bank
notes).
MVAboxbank6.xpl
Now let us apply the boxplot technique to the bank data
set. In Figure 1.4 we show the parallel
boxplot of the diagonal variable $X_6$.
On the left is the boxplot of the genuine bank notes and on the right
that of the counterfeit bank notes; each displays the five-number
summary of its group.
Figure 1.5:
The variable $X_1$ of Swiss bank data
(length of bank notes).
MVAboxbank1.xpl
One sees that the diagonals of the genuine bank notes tend to be larger.
It is harder to see a clear distinction when comparing the length
of the bank notes, $X_1$; see Figure 1.5. There are a few outliers
in both plots. Almost all the observations of the diagonal of the
genuine notes are above the ones from the counterfeit. There is one
observation in Figure 1.4 of the genuine notes that is almost
equal to the median of the counterfeit notes.
Can the parallel boxplot technique help us distinguish between the two types
of bank notes?
Summary
- The median and mean bars are measures of location.
- The relative location of the median (and the mean)
in the box is a measure of skewness.
- The lengths of the box and whiskers are measures of spread.
- The length of the whiskers indicates the tail length
of the distribution.
- The outlying points are indicated with a ``$\circ$'' or
``$\star$'' depending on whether they are outside of
$[F_L - 1.5\, d_F,\; F_U + 1.5\, d_F]$ or
$[F_L - 3\, d_F,\; F_U + 3\, d_F]$,
respectively.
- Boxplots do not indicate multimodality or clusters.
- If we compare the relative size and location of the boxes,
we are comparing distributions.
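The point that boxplots do not reveal clusters can be illustrated with a small Python sketch. The two samples below are hypothetical toy values chosen for this illustration: one is evenly spread, the other is concentrated around two values, yet both have the same five-number summary and would therefore produce the same box and whiskers.

```python
import math


def five_number_summary(x):
    """Return (minimum, F_L, median, F_U, maximum) using the depth rule."""
    xs = sorted(x)
    n = len(xs)
    depth_f = (math.floor((n + 1) / 2) + 1) / 2

    def at_depth(d):
        # Average of the two order statistics nearest to depth d,
        # counted from the minimum.
        return (xs[math.floor(d) - 1] + xs[math.ceil(d) - 1]) / 2

    return (
        xs[0],
        at_depth(depth_f),
        at_depth((n + 1) / 2),
        at_depth(n + 1 - depth_f),  # fourth counted from the maximum
        xs[-1],
    )


# Hypothetical toy samples, for illustration only.
evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]
two_clusters = [1, 3, 3, 3, 5, 7, 7, 7, 9]  # mass concentrated at 3 and 7

print(five_number_summary(evenly_spread))  # → (1, 3.0, 5.0, 7.0, 9)
print(five_number_summary(two_clusters))   # → (1, 3.0, 5.0, 7.0, 9)
```

Since the summaries coincide, any difference in the shape of the two samples inside the box is invisible in a boxplot.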