1.1 Boxplots

EXAMPLE 1.1   The Swiss bank data (see Appendix, Table B.2) consists of 200 measurements on Swiss bank notes. The first half of these measurements are from genuine bank notes, the other half are from counterfeit bank notes.

Figure 1.1: An old Swiss 1000-franc bank note.
\includegraphics[width=1\defpicwidth]{fig111.ps}

The authorities have measured, as indicated in Figure 1.1,

\begin{eqnarray*}
X_1 &=& \textrm{length of the bill}\\
X_2 &=& \textrm{height \ldots}\\
&\vdots& \\
X_6 &=& \textrm{length of the diagonal of the central picture.}
\end{eqnarray*}



These data are taken from Flury and Riedwyl (1988). The aim is to study how these measurements may be used in determining whether a bill is genuine or counterfeit.

The boxplot is a graphical technique that displays the distribution of variables. It helps us see the location, skewness, spread, tail length and outlying points.

It is particularly useful in comparing different batches. The boxplot is a graphical representation of the Five Number Summary. To introduce the Five Number Summary, let us consider for a moment a smaller, one-dimensional data set: the population of the 15 largest U.S. cities in 1960 (Table 1.1).


Table 1.1: The 15 largest U.S. cities in 1960.
\begin{tabular}{lrc}
\hline
City & Pop. (in 10,000) & Order Statistics \\
\hline
New York & 778 & $x_{\ord{15}}$ \\
Chicago & 355 & $x_{\ord{14}}$ \\
Los Angeles & 248 & $x_{\ord{13}}$ \\
Philadelphia & 200 & $x_{\ord{12}}$ \\
Detroit & 167 & $x_{\ord{11}}$ \\
Baltimore & 94 & $x_{\ord{10}}$ \\
Houston & 94 & $x_{\ord{9}}$ \\
Cleveland & 88 & $x_{\ord{8}}$ \\
Washington D.C. & 76 & $x_{\ord{7}}$ \\
Saint Louis & 75 & $x_{\ord{6}}$ \\
Milwaukee & 74 & $x_{\ord{5}}$ \\
San Francisco & 74 & $x_{\ord{4}}$ \\
Boston & 70 & $x_{\ord{3}}$ \\
Dallas & 68 & $x_{\ord{2}}$ \\
New Orleans & 63 & $x_{\ord{1}}$ \\
\hline
\end{tabular}


In the Five Number Summary, we calculate the upper quartile $F_U$, the lower quartile $F_L$, the median and the extremes. Recall that the order statistics $\{x_{\ord{1}},x_{\ord{2}},\ldots,x_{\ord{n}}\}$ are the observations $x_{1},x_{2},\ldots,x_{n}$ arranged in increasing order, so that $x_{\ord{1}}$ denotes the minimum and $x_{\ord{n}}$ the maximum. The median $M$ cuts the set of observations into two equal parts, and is defined as

\begin{displaymath}
M = \left\{
\begin{array}{cl}
x_{\ord{\frac{n+1}{2}}} & n \textrm{ odd}\\
\frac{1}{2} \left\{ x_{\ord{\frac{n}{2}}} + x_{\ord{\frac{n}{2}+1}} \right\} & n \textrm{ even} \end{array}\right. .
\end{displaymath} (1.1)

The quartiles cut the set into four equal parts, which are often called fourths (that is why we use the letter $F$). Using a definition that goes back to Hoaglin et al. (1983), the definition of the median can be generalized to fourths, eighths, etc. Considering the order statistics, we can define the depth of a data value $x_{\ord{i}}$ as $\min\{ i,n-i+1\}$. If $n$ is odd, the depth of the median is $\frac{n+1}{2}$. If $n$ is even, $\frac{n+1}{2}$ is a fraction. In that case the median is the average of the two data values at the neighboring integer depths, i.e., $M=\frac{1}{2} \left\{ x_{\ord{\frac{n}{2}}} + x_{\ord{\frac{n}{2}+1}} \right\}$. In our example, we have $n=15$, hence the median is $M=x_{\ord{8}}=88$.
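The depth rule and definition (1.1) can be checked numerically. The following is a minimal Python sketch, not part of the original quantlets; the function names `depth` and `median` are our own illustrative choices, and the data are the populations from Table 1.1.

```python
def depth(i, n):
    """Depth of the i-th order statistic (1-based) in a sample of size n."""
    return min(i, n - i + 1)


def median(values):
    """Median via order statistics, following definition (1.1)."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 1:
        # n odd: x_((n+1)/2), shifted to 0-based indexing
        return x[(n + 1) // 2 - 1]
    # n even: average of x_(n/2) and x_(n/2 + 1)
    return 0.5 * (x[n // 2 - 1] + x[n // 2])


# Populations (in 10,000) of the 15 largest U.S. cities in 1960 (Table 1.1)
cities = [778, 355, 248, 200, 167, 94, 94, 88, 76, 75, 74, 74, 70, 68, 63]

print(median(cities))          # 88
print(depth(8, len(cities)))   # 8, the depth of the median
print(depth(15, len(cities)))  # 1, the depth of the maximum
```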

We proceed in the same way to get the fourths. Take the depth of the median and calculate

\begin{displaymath}\textrm{depth of fourth }=\frac{[\textrm{depth of median}]+1}{2}\end{displaymath}

with $[z]$ denoting the largest integer smaller than or equal to $z$. In our example this gives $4.5$ and thus leads to the two fourths

\begin{eqnarray*}
F_{L}&=& \frac{1}{2} \left\{x_{\ord{4}}+x_{\ord{5}}\right\}\\
F_{U}&=& \frac{1}{2} \left\{x_{\ord{11}}+x_{\ord{12}}\right\}
\end{eqnarray*}



(recalling that a depth which is a fraction corresponds to the average of the two nearest data values).
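As a hedged illustration of the depth rule for the fourths, the sketch below counts the depth once from the bottom (for $F_L$) and once from the top (for $F_U$). The helper names `value_at_depth` and `fourths` are hypothetical and introduced only for this example.

```python
import math


def value_at_depth(x_sorted, d, from_top=False):
    """Value at depth d; a fractional depth averages the two nearest values."""
    n = len(x_sorted)
    lo, hi = math.floor(d), math.ceil(d)
    if from_top:
        # depth k from the top is the order statistic x_(n - k + 1)
        a, b = x_sorted[n - lo], x_sorted[n - hi]
    else:
        # depth k from the bottom is the order statistic x_(k)
        a, b = x_sorted[lo - 1], x_sorted[hi - 1]
    return 0.5 * (a + b)


def fourths(values):
    """Lower and upper fourths (F_L, F_U) from the depth rule."""
    x = sorted(values)
    depth_median = (len(x) + 1) / 2
    depth_fourth = (math.floor(depth_median) + 1) / 2
    return (value_at_depth(x, depth_fourth),
            value_at_depth(x, depth_fourth, from_top=True))


# Populations (in 10,000) from Table 1.1
cities = [778, 355, 248, 200, 167, 94, 94, 88, 76, 75, 74, 74, 70, 68, 63]

print(fourths(cities))  # (74.0, 183.5)
```

With $n=15$ the depth of the fourth is $4.5$, so each fourth is the average of the values at depths 4 and 5 from the respective end.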


Table 1.2: Five number summary.
\begin{table}
\begin{tabular}{lr|rr}
\hline
 & Depth & \multicolumn{2}{c}{U.S. cities ($n=15$)} \\
\hline
Median $M$ & 8 & \multicolumn{2}{c}{88} \\
Fourths $F$ & 4.5 & 74 & 183.5 \\
Extremes & 1 & 63 & 778 \\
\hline
\end{tabular}
\end{table}


The $F$-spread, $d_F$, is defined as $d_F = F_U-F_L$. The outside bars

\begin{displaymath}F_U+1.5 d_F\end{displaymath} (1.2)
\begin{displaymath}F_L-1.5 d_F\end{displaymath} (1.3)

are the borders beyond which a point is regarded as an outlier. For the number of points outside these bars see Exercise 1.3. For the $n=15$ data points the fourths are $74=\frac{1}{2}\left\{x_{\ord{4}}+x_{\ord{5}}\right\}$ and $183.5=\frac{1}{2}\left\{x_{\ord{11}}+x_{\ord{12}}\right\}$. Therefore the $F$-spread and the upper and lower outside bars in the above example are calculated as follows:
\begin{eqnarray*}
d_F &=& F_U-F_L=183.5-74 =109.5 \qquad (1.4)\\
F_L -1.5 d_F &=& 74-1.5\cdot 109.5=-90.25 \qquad (1.5)\\
F_U+1.5 d_F &=& 183.5+1.5\cdot 109.5=347.75. \qquad (1.6)
\end{eqnarray*}
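The arithmetic in (1.4) to (1.6) can be reproduced with a few lines of Python. This is only a sketch: the variable names are our own, and the fourths are taken from the text rather than recomputed.

```python
# Populations (in 10,000) of the 15 largest U.S. cities in 1960 (Table 1.1)
cities = {
    "New York": 778, "Chicago": 355, "Los Angeles": 248, "Philadelphia": 200,
    "Detroit": 167, "Baltimore": 94, "Houston": 94, "Cleveland": 88,
    "Washington D.C.": 76, "Saint Louis": 75, "Milwaukee": 74,
    "San Francisco": 74, "Boston": 70, "Dallas": 68, "New Orleans": 63,
}

F_L, F_U = 74.0, 183.5        # fourths as computed in the text

d_F = F_U - F_L               # F-spread, (1.4)
lower_bar = F_L - 1.5 * d_F   # lower outside bar, (1.5)
upper_bar = F_U + 1.5 * d_F   # upper outside bar, (1.6)

# A point is regarded as an outlier if it lies beyond either outside bar.
outliers = [name for name, pop in cities.items()
            if pop < lower_bar or pop > upper_bar]

print(d_F, lower_bar, upper_bar)  # 109.5 -90.25 347.75
print(outliers)                   # ['New York', 'Chicago']
```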

Since New York and Chicago are beyond the outside bars they are considered to be outliers. The minimum and the maximum are called the extremes. The mean is defined as

\begin{displaymath}\overline{x} = n^{-1} \sum_{i=1}^n x_{i},
\end{displaymath}

which is $168.27$ in our example. The mean is a measure of location. The median (88), the fourths (74;183.5) and the extremes (63;778) constitute basic information about the data. The combination of these five numbers leads to the Five Number Summary as displayed in Table 1.2. The depths of each of the five numbers have been added as an additional column.
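Written out for the populations in Table 1.1, the mean is obtained as

\begin{displaymath}
\overline{x} = \frac{1}{15}\,(778+355+248+\cdots+68+63) = \frac{2524}{15} \approx 168.27 ,
\end{displaymath}

which lies well above the median $M=88$.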

Construction of the Boxplot

  1. Draw a box with borders (edges) at $F_L$ and $F_U$ (i.e., 50\% of the data are in this box).
  2. Draw the median as a solid line ($\mid$) and the mean as a dotted line.
  3. Draw ``whiskers'' from each end of the box to the most remote point that is NOT an outlier.
  4. Show outliers as either ``$\star$'' or ``$\bullet$'' depending on whether they are outside of $F_{UL}\pm 1.5 d_F$ or $F_{UL} \pm 3d_F$ respectively. Label them if possible.
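Steps 3 and 4 can be sketched numerically without drawing anything. The function below is a hypothetical helper, not the MVAboxcity quantlet; the labels `mild` (beyond $1.5 d_F$ only) and `far` (beyond $3 d_F$) are our own names for the two outlier classes of step 4.

```python
def boxplot_elements(values, f_l, f_u):
    """Return whisker ends plus mild and far outliers for one batch."""
    d_f = f_u - f_l
    inner_lo, inner_hi = f_l - 1.5 * d_f, f_u + 1.5 * d_f   # outside bars
    outer_lo, outer_hi = f_l - 3.0 * d_f, f_u + 3.0 * d_f   # 3 d_F bars

    # Step 3: whiskers reach the most remote points that are not outliers.
    regular = [v for v in values if inner_lo <= v <= inner_hi]
    whiskers = (min(regular), max(regular))

    # Step 4: split the outliers by how far out they lie.
    far = [v for v in values if v < outer_lo or v > outer_hi]
    mild = [v for v in values
            if (v < inner_lo or v > inner_hi) and v not in far]
    return whiskers, mild, far


# Populations (in 10,000) from Table 1.1, with the fourths from the text
cities = [778, 355, 248, 200, 167, 94, 94, 88, 76, 75, 74, 74, 70, 68, 63]
whiskers, mild, far = boxplot_elements(cities, 74.0, 183.5)

print(whiskers)   # (63, 248)
print(mild, far)  # [355] [778]
```

For the city data the whiskers end at 63 (New Orleans) and 248 (Los Angeles), while 355 (Chicago) and 778 (New York) fall into the two outlier classes.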

Figure 1.2: Boxplot for U.S. cities. MVAboxcity.xpl
\includegraphics[width=1\defpicwidth]{boxcity.ps}

In the U.S. cities example the cutoff points (outside bars) are at $-90.25$ and $347.75$, hence we draw whiskers to New Orleans and Los Angeles. We can see from Figure 1.2 that the data are strongly skewed: the upper half of the data (above the median) is more spread out than the lower half (below the median). The data contain two outliers marked as a star and a circle. The more distinct outlier is shown as a star. The mean (as a non-robust measure of location) is pulled away from the median.
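To see how strongly the two outliers pull the mean, one may compare mean and median with and without them. The sketch below uses Python's standard `statistics` module; dropping the outliers is done purely for illustration and is not a step of the boxplot construction.

```python
from statistics import mean, median

# Populations (in 10,000) from Table 1.1
cities = [778, 355, 248, 200, 167, 94, 94, 88, 76, 75, 74, 74, 70, 68, 63]

# Keep only the points inside the upper outside bar (347.75);
# no point lies below the lower bar.
without_outliers = [v for v in cities if v <= 347.75]

print(round(mean(cities), 2), median(cities))                      # 168.27 88
print(round(mean(without_outliers), 2), median(without_outliers))  # 107 76
```

Removing New York and Chicago lowers the mean from about 168 to 107, whereas the median only moves from 88 to 76.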

Figure 1.3: Boxplot for the mileage of American, Japanese and European cars (from left to right). MVAboxcar.xpl
\includegraphics[width=1\defpicwidth]{boxcar.ps}

Boxplots are very useful tools in comparing batches. The relative location of the distribution of different batches tells us a lot about the batches themselves. Before we come back to the Swiss bank data let us compare the fuel economy of vehicles from different countries, see Figure 1.3 and Table B.3.

The data are from the second column of Table B.3 and show the mileage (miles per gallon) of U.S., Japanese and European cars. The five-number summaries for these data sets are $\{12, 16.8, 18.8, 22, 30 \}$, $\{ 18, 22, 25, 30.5, 35 \}$, and $\{ 14, 19, 23, 25, 28 \}$ for American, Japanese, and European cars, respectively. This reflects the information shown in Figure 1.3. Several conclusions can be read off these summaries: the median mileage of the Japanese cars (25) lies above that of the European (23) and U.S. (18.8) cars, and the least efficient Japanese car (18) is nearly as fuel-efficient as the median U.S. car (18.8).

Figure 1.4: The $X_{6}$ variable of Swiss bank data (diagonal of bank notes). MVAboxbank6.xpl
\includegraphics[width=1\defpicwidth]{boxbank6.ps}

Now let us apply the boxplot technique to the bank data set. In Figure 1.4 we show the parallel boxplot of the diagonal variable $X_6$. On the left are the values for the genuine bank notes and on the right the values for the counterfeit bank notes. The two five-number summaries are $\{ 140.65, 141.25, 141.5, 141.8, 142.4 \}$ for the genuine bank notes, and $\{ 138.3, 139.2, 139.5, 139.8, 140.65 \}$ for the counterfeit ones.

Figure 1.5: The $X_{1}$ variable of Swiss bank data (length of bank notes). MVAboxbank1.xpl
\includegraphics[width=1\defpicwidth]{boxbank1.ps}

One sees that the diagonals of the genuine bank notes tend to be larger. It is harder to see a clear distinction when comparing the length of the bank notes $X_{1}$, see Figure 1.5. There are a few outliers in both plots. Almost all the observations of the diagonal of the genuine notes lie above those of the counterfeit notes. There is one observation in Figure 1.4 of the genuine notes that is almost equal to the median of the counterfeit notes. Can the parallel boxplot technique help us distinguish between the two types of bank notes?
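The comparison of the two batches can also be made directly from the five-number summaries quoted for $X_6$. The following sketch takes those summaries as given (ordered as lower extreme, $F_L$, $M$, $F_U$, upper extreme); the helper names `f_spread` and `boxes_overlap` are hypothetical.

```python
# Five-number summaries of the diagonal X_6, as quoted in the text
genuine = (140.65, 141.25, 141.5, 141.8, 142.4)
counterfeit = (138.3, 139.2, 139.5, 139.8, 140.65)


def f_spread(summary):
    """F-spread d_F = F_U - F_L of a five-number summary."""
    _, f_l, _, f_u, _ = summary
    return f_u - f_l


def boxes_overlap(a, b):
    """True if the boxes [F_L, F_U] of the two batches share any value."""
    return a[1] <= b[3] and b[1] <= a[3]


print(round(f_spread(genuine), 2), round(f_spread(counterfeit), 2))  # 0.55 0.6
print(boxes_overlap(genuine, counterfeit))                           # False
print(round(genuine[2] - counterfeit[2], 2))                         # 2.0
```

The two boxes have similar length but do not overlap, and the medians differ by 2, which is several times the F-spread of either batch.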

Summary
$\ast$ The median and mean bars are measures of location.
$\ast$ The relative location of the median (and the mean) in the box is a measure of skewness.
$\ast$ The lengths of the box and the whiskers are measures of spread.
$\ast$ The length of the whiskers indicates the tail length of the distribution.
$\ast$ The outlying points are indicated with a ``$\star$'' or ``$\bullet$'' depending on whether they are outside of $F_{UL}\pm 1.5 d_F$ or $F_{UL} \pm 3d_F$ respectively.
$\ast$ Boxplots do not indicate multimodality or clusters.
$\ast$ If we compare the relative size and location of the boxes, we are comparing distributions.