1.6 Andrews' Curves

The basic problem of graphical displays of multivariate data is the dimensionality. Scatterplots work well up to three dimensions (if we use interactive displays). Data with more than three dimensions have to be coded into displayable 2D or 3D structures (e.g., faces). The idea of coding and representing multivariate data by curves was suggested by Andrews (1972). Each multivariate observation $X_i=(X_{i,1}, \ldots ,X_{i,p})$ is transformed into a curve as follows:

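$\begin{displaymath} f_i(t) =\left\{\begin{array}{ll} \frac{X_{i,1}}{\sqrt{2}} + X_{i,2}\sin(t) + X_{i,3}\cos(t) + \cdots + X_{i,p-1}\sin(\frac{p-1}{2}t) + X_{i,p}\cos(\frac{p-1}{2}t) & \textrm{for $p$\ odd} \\ \frac{X_{i,1}}{\sqrt{2}} + X_{i,2}\sin(t) + X_{i,3}\cos(t) + \cdots + X_{i,p}\sin(\frac{p}{2}t) & \textrm{for $p$\ even} \end{array}\right. \end{displaymath}$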

(1.13)

so that the components of the observation serve as the coefficients of a so-called Fourier series ($t \in [-\pi,\pi]$).

Suppose that we have three-dimensional observations: $X_{1} = (0,0,1)$, $X_{2} = (1,0,0)$ and $X_{3} = (0,1,0)$. Here $p=3$ and the following representations correspond to the Andrews' curves:

$\begin{eqnarray*} f_1(t) & = & \cos(t) \\ f_2(t) & = & \frac{1}{\sqrt{2}} \quad \textrm{and} \\ f_3(t) & = & \sin(t). \end{eqnarray*}$

These curves are indeed quite distinct, since the observations $X_{1}$ , $X_{2}$ , and $X_{3}$ are the 3D unit vectors: each observation has mass only in one of the three dimensions. The order of the variables plays an important role.
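
The transformation (1.13) and the three unit-vector curves can be checked numerically. The following is a minimal sketch in plain Python; the function name `andrews_curve` and its signature are illustrative assumptions, not part of the original text or of the `MVAandcur.xpl` quantlet.

```python
import math

def andrews_curve(x, t):
    """Evaluate the Andrews' curve (1.13) of one observation x at the point t.

    x is a sequence (X_1, ..., X_p). The name and signature are illustrative.
    """
    value = x[0] / math.sqrt(2)              # constant term X_1 / sqrt(2)
    for j, coef in enumerate(x[1:], start=2):
        freq = j // 2                        # X_2, X_3 -> 1; X_4, X_5 -> 2; ...
        if j % 2 == 0:                       # even position: sine term
            value += coef * math.sin(freq * t)
        else:                                # odd position: cosine term
            value += coef * math.cos(freq * t)
    return value

# The three unit vectors from the text, evaluated at an arbitrary t in [-pi, pi]
t = 0.7
f1 = andrews_curve((0, 0, 1), t)   # should equal cos(t)
f2 = andrews_curve((1, 0, 0), t)   # should equal 1/sqrt(2)
f3 = andrews_curve((0, 1, 0), t)   # should equal sin(t)
```

Evaluating the function on a grid of values of $t$ in $[-\pi,\pi]$ gives the points needed to draw one curve per observation.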

EXAMPLE 1.2 Let us take the 96th observation of the Swiss bank note data set,

$\begin{displaymath}X_{96}=(215.6, 129.9, 129.9, 9.0, 9.5, 141.7). \end{displaymath}$

By (1.13), the Andrews' curve is:

$\begin{displaymath}f_{96}(t)= \frac{215.6}{\sqrt{2}} + {129.9} \sin(t) + {129.9} \cos(t) + {9.0} \sin(2t) + {9.5} \cos(2t) + {141.7} \sin(3t).\end{displaymath}$
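
As a quick check of this expression (a worked evaluation added here for illustration), set $t=0$: all sine terms vanish and each cosine equals one, so only the constant term and the cosine coefficients remain:

$\begin{displaymath}f_{96}(0)= \frac{215.6}{\sqrt{2}} + 129.9 + 9.5 \approx 152.45 + 139.4 = 291.85.\end{displaymath}$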

**Figure 1.20:** Andrews' curves of the observations 96-105 from the Swiss bank note data. The order of the variables is 1,2,3,4,5,6. `MVAandcur.xpl`
$\includegraphics[width=1\defpicwidth]{andcur.ps}$

Figure 1.20 shows the Andrews' curves for observations 96-105 of the Swiss bank note data set. We already know that the observations 96-100 represent genuine bank notes, and that the observations 101-105 represent counterfeit bank notes. We see that at least four curves differ from the others, but it is hard to tell which curve belongs to which group.

We know from Figure 1.4 that the sixth variable is an important one. Therefore, the Andrews' curves are calculated again using a reversed order of the variables.

EXAMPLE 1.3 Let us consider again the 96th observation of the Swiss bank note data set,

$\begin{displaymath}X_{96}=(215.6, 129.9, 129.9, 9.0, 9.5, 141.7). \end{displaymath}$

The Andrews' curve is computed using the reversed order of variables:

$\begin{displaymath}f_{96}(t)= \frac{141.7}{\sqrt{2}} + {9.5} \sin(t) + {9.0} \cos(t) + {129.9} \sin(2t) + {129.9} \cos(2t) + {215.6} \sin(3t).\end{displaymath}$
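
To see which variable lands on which frequency under each ordering, the coefficients can be paired with the basis functions of (1.13). This is only an illustrative sketch; `basis_labels` is a hypothetical helper introduced here, not something defined in the text.

```python
def basis_labels(p):
    """Label the p basis functions of (1.13) in order (hypothetical helper)."""
    labels = ["1/sqrt(2)"]
    for j in range(2, p + 1):
        k = j // 2                                   # frequency of the j-th term
        arg = "t" if k == 1 else f"{k}t"
        labels.append(("sin" if j % 2 == 0 else "cos") + f"({arg})")
    return labels

x96 = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7)
labels = basis_labels(len(x96))

original = list(zip(labels, x96))              # variable order 1,2,3,4,5,6
reversed_order = list(zip(labels, x96[::-1]))  # variable order 6,5,4,3,2,1

for (label, a), (_, b) in zip(original, reversed_order):
    print(f"{label:>10}: order 1..6 -> {a:6.1f}   order 6..1 -> {b:6.1f}")
```

In the original order the sixth variable (141.7) multiplies $\sin(3t)$, the highest frequency; in the reversed order it moves to the constant term.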

In Figure 1.21 the curves $f_{96}$ to $f_{105}$ for observations 96-105 are plotted. Instead of a difference in the high-frequency part, we now have a difference in the intercept, which makes it more difficult to see the differences between the observations.

**Figure 1.21:** Andrews' curves of the observations 96-105 from the Swiss bank note data. The order of the variables is 6,5,4,3,2,1. `MVAandcur2.xpl`
$\includegraphics[width=1\defpicwidth]{andcur2.ps}$

This shows that the order of the variables plays an important role for the interpretation. If $X$ is high-dimensional, then the last variables will have only a small visible contribution to the curve: they fall into the high-frequency part of the curve. To overcome this problem, Andrews proposed using an order suggested by Principal Component Analysis. This technique will be treated in detail in Chapter 9. In fact, the sixth variable will appear there as the most important variable for discriminating between the two groups.

If the number of observations is more than 20, there may be too many curves in one graph. This results in overplotting of curves or a bad "signal-to-ink ratio", see Tufte (1983). It is therefore advisable to present multivariate observations via Andrews' curves only for a limited number of observations.

Summary

$\ast$: Outliers appear as single Andrews' curves that look different from the rest.
$\ast$: A subgroup of data is characterized by a set of similar curves.
$\ast$: The order of the variables plays an important role for interpretation.
$\ast$: The order of variables may be optimized by Principal Component Analysis.
$\ast$: For more than 20 observations we may obtain a bad "signal-to-ink ratio", i.e., too many curves are overlaid in one picture.