1.4 Scatterplots

Scatterplots are bivariate or trivariate plots of variables against each other. They help us understand relationships among the variables of a data set. A downward-sloping scatter indicates that as we increase the variable on the horizontal axis, the variable on the vertical axis decreases. An analogous statement can be made for upward-sloping scatters.

Figure 1.12: 2D scatterplot for $X_{5}$ vs. $X_{6}$ of the bank notes. Genuine notes are circles, counterfeit notes are stars. MVAscabank56.xpl
\includegraphics[width=1\defpicwidth]{scabank56.ps}

Figure 1.12 plots the 5th column (upper inner frame) of the bank data against the 6th column (diagonal). The scatter is downward-sloping. As we already know from the previous section on marginal comparison (e.g., Figure 1.9), a good separation between genuine and counterfeit bank notes is visible for the diagonal variable. The sub-cloud in the upper half (circles) of Figure 1.12 corresponds to the true bank notes. As noted before, this separation is not clear-cut, since the two groups overlap somewhat.

This can be verified in an interactive computing environment by showing the index and coordinates of certain points in this scatterplot. In Figure 1.12, the 70th observation in the merged data set is drawn as a thick circle, and it comes from a genuine bank note. This observation lies well embedded in the cloud of counterfeit bank notes. One straightforward approach to telling the counterfeit from the genuine bank notes is to draw a straight line and define notes above this line as genuine. We would of course misclassify the 70th observation, but can we do better?
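The straight-line rule described above can be sketched in a few lines of Python. The six observations and the threshold of 140.5 are made-up values chosen only for illustration, not the bank data; the third point mimics a genuine note that lies inside the counterfeit cloud. The names `classify_by_line` and `misclassified` are hypothetical helpers.

```python
# Toy (hypothetical) observations: (x5, x6, is_genuine). Not the real bank data.
notes = [
    (8.0, 141.8, True),
    (9.1, 141.5, True),
    (10.4, 139.6, True),   # genuine note lying inside the counterfeit cloud
    (10.9, 139.4, False),
    (11.3, 139.0, False),
    (12.0, 138.8, False),
]


def classify_by_line(x6, threshold):
    """Horizontal-line rule: a note is called genuine if x6 lies above the threshold."""
    return x6 > threshold


def misclassified(data, threshold):
    """Indices of observations whose predicted label differs from the true label."""
    return [
        i
        for i, (_, x6, genuine) in enumerate(data)
        if classify_by_line(x6, threshold) != genuine
    ]


# The threshold 140.5 is an assumed value picked by eye for this toy data.
errors = misclassified(notes, threshold=140.5)
print(errors)
```

With this toy data the rule gets every note right except the embedded genuine one, which is the kind of error a single axis-parallel line cannot avoid.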

Figure 1.13: 3D Scatterplot of the bank notes for $(X_4, X_5,X_6)$. Genuine notes are circles, counterfeit are stars. MVAscabank456.xpl
\includegraphics[width=1\defpicwidth]{scabank456.ps}

If we extend the two-dimensional scatterplot by adding a third variable, e.g., $X_4$ (lower distance to inner frame), we obtain the scatterplot in three dimensions shown in Figure 1.13. It becomes apparent from the location of the point clouds that a better separation is obtained. We have rotated the three-dimensional data until this satisfactory 3D view was obtained. Later, we will see that rotation is the same as bundling a high-dimensional observation into one or more linear combinations of the elements of the observation vector. In other words, the ``separation line'' that is parallel to the horizontal coordinate axis in Figure 1.12 becomes a plane in Figure 1.13 and is no longer parallel to one of the axes. The formula for such a separation plane is a linear combination of the elements of the observation vector:

\begin{displaymath}
a_{1}x_{1} + a_{2}x_{2} + \ldots + a_{6}x_{6} = \textrm{const}.
\end{displaymath} (1.12)

The algorithm that automatically finds the weights $(a_{1}, \ldots, a_{6})$ will be investigated later on in Chapter 12.
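To make the role of the weights concrete, the following minimal sketch evaluates the left-hand side of the separation-plane formula for one observation and reports on which side of the plane it falls. The observation, the weights and the constants are hypothetical placeholder values, not the output of the algorithm from Chapter 12, and `plane_score` is a name introduced here for illustration.

```python
def plane_score(x, a, const):
    """Return a_1*x_1 + ... + a_p*x_p - const.

    The sign tells on which side of the plane a^T x = const the point x lies.
    """
    if len(x) != len(a):
        raise ValueError("x and a must have the same length")
    return sum(ai * xi for ai, xi in zip(a, x)) - const


# Hypothetical six-dimensional observation (x_1, ..., x_6); illustrative values only.
x = (214.8, 131.0, 131.1, 9.0, 9.7, 141.0)

# Axis-parallel rule: only x_6 receives weight, like a horizontal line in a 2D plot.
axis_parallel = plane_score(x, a=(0, 0, 0, 0, 0, 1), const=140.5)

# Oblique rule: x_4, x_5 and x_6 are combined, so the plane is tilted against the axes.
# These weights and the constant are assumed values, not fitted ones.
oblique = plane_score(x, a=(0, 0, 0, -1, -1, 1), const=121.0)

print(axis_parallel > 0, oblique > 0)
```

The axis-parallel case shows that a line parallel to a coordinate axis is just the special case in which all weights but one are zero.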

Let us study yet another technique: the scatterplot matrix. If we want to draw all possible two-dimensional scatterplots for the variables, we can create a so-called draftman's plot (named after a draftman who prepares drafts for parliamentary discussions). Like a draftman's draft, the scatterplot matrix helps in creating new ideas and in building knowledge about dependencies and structure.
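For $p$ variables there are $p(p-1)/2$ distinct two-dimensional scatterplots. As a rough sketch of how such a matrix could be laid out, the snippet below enumerates the pairs for four variables and assigns each pair a cell below the diagonal. The labels `X3` to `X6` and the helper `draftman_panels` are assumptions made for illustration; no plotting is done here.

```python
from itertools import combinations


def draftman_panels(variables):
    """Map each unordered pair of variables to its (row, column) cell below the diagonal."""
    position = {name: k for k, name in enumerate(variables)}
    return {
        (x, y): (position[y], position[x])
        for x, y in combinations(variables, 2)
    }


# Assumed variable labels; any list of distinct names works.
panels = draftman_panels(["X3", "X4", "X5", "X6"])
for (x, y), (row, col) in panels.items():
    print(f"({x}, {y}) -> row {row}, column {col}")
```

The cells above the diagonal stay free in this layout, so they can hold a different display of the same pairs.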

Figure 1.14: Draftman plot of the bank notes. The pictures in the left column show $(X_3, X_4)$, $(X_{3}, X_{5})$ and $(X_{3}, X_{6})$, in the middle we have $(X_4, X_5)$ and $(X_4, X_6)$, and in the lower right is $(X_5,X_6)$. The upper right half contains the corresponding density contour plots. MVAdrafbank4.xpl
\includegraphics[width=1\defpicwidth]{drafbank4.ps}

Figure 1.14 shows a draftman plot applied to the last four columns of the full bank data set. For ease of interpretation, we have distinguished the counterfeit from the genuine bank notes by color. As discussed several times before, the separability of the two types of notes differs from scatterplot to scatterplot. Not only is it difficult to perform this separation on, say, the scatterplot of $X_{3}$ vs. $X_{4}$, but the ``separation line'' is also no longer parallel to one of the axes. The most obvious separation happens in the scatterplot in the lower right where we show, as in Figure 1.12, $X_5$ vs. $X_6$. The separation line here would be upward-sloping with an intercept at about $X_{6} = 139$. The upper right half of the draftman plot shows the density contours that we introduced in Section 1.3.

The power of the draftman plot lies in its ability to show the internal connections of the scatter diagrams. Define a brush as a re-scalable rectangle that we can move over the screen via keyboard or mouse. Inside the brush we can highlight or color observations. Suppose the technique is set up in such a way that as we move the brush in one scatter, the corresponding observations in the other scatters are also highlighted. By moving the brush, we can study conditional dependence.

If we brush (i.e., highlight or color the observations with the brush) the $X_{5}$ vs. $X_{6}$ plot and move through the upper point cloud, we see that in other plots (e.g., $X_{3}$ vs. $X_{4}$), the corresponding observations are more embedded in the other sub-cloud.
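The linking behind brushing can be sketched without any graphics: the brush selects observation indices in one pair of variables, and the same indices are then looked up in another pair. The four observations, the brush limits and the helper `brush` below are hypothetical and serve only to illustrate the idea; an interactive environment would redraw the highlighted points instead of printing them.

```python
# Hypothetical observations; the values are invented for illustration.
data = [
    {"X3": 129.7, "X4": 8.1, "X5": 9.5, "X6": 141.7},
    {"X3": 130.1, "X4": 9.6, "X5": 10.2, "X6": 141.2},
    {"X3": 130.4, "X4": 10.8, "X5": 11.4, "X6": 139.3},
    {"X3": 130.2, "X4": 9.9, "X5": 10.6, "X6": 139.8},
]


def brush(data, xvar, yvar, xlim, ylim):
    """Indices of observations whose (xvar, yvar) coordinates fall inside the rectangle."""
    (x_lo, x_hi), (y_lo, y_hi) = xlim, ylim
    return [
        i
        for i, row in enumerate(data)
        if x_lo <= row[xvar] <= x_hi and y_lo <= row[yvar] <= y_hi
    ]


# Brush the upper part of the X5 vs. X6 panel (assumed rectangle limits).
selected = brush(data, "X5", "X6", xlim=(9.0, 10.5), ylim=(141.0, 142.0))

# Linked view: the same observations as they appear in the X3 vs. X4 panel.
linked = [(data[i]["X3"], data[i]["X4"]) for i in selected]
print(selected, linked)
```

Because the selection is stored as indices rather than coordinates, it carries over to every other panel of the matrix unchanged.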

Summary
$\ast$
Scatterplots in two and three dimensions help in identifying separated points, outliers or sub-clusters.
$\ast$
Scatterplots help us in judging positive or negative dependencies.
$\ast$
Draftman scatterplot matrices help detect structures conditioned on values of other variables.
$\ast$
As the brush of a scatterplot matrix moves through a point cloud, we can study conditional dependence.