1. Comparison of Batches

Multivariate statistical analysis is concerned with analyzing and understanding data in high dimensions. We suppose that we are given a set $\{x_i\}^n_{i=1}$ of $n$ observations of a variable vector $X$ in $\mathbb{R}^p$. That is, we suppose that each observation $x_i$ has $p$ dimensions:

\begin{displaymath}x_i = (x_{i1}, x_{i2}, \dots, x_{ip}),\end{displaymath}

and that it is an observed value of a variable vector $X \in \mathbb{R}^p$. Therefore, $X$ is composed of $p$ random variables:

\begin{displaymath}X = (X_{1}, X_{2}, \dots, X_{p})\end{displaymath}

where $X_j$, for $j=1,\dots,p$, is a one-dimensional random variable. How do we begin to analyze this kind of data? Before we investigate what inferences we can draw from the data, we should think about how to look at them. This involves descriptive techniques. Typical questions that descriptive techniques can answer are, for example: Are there outliers in the components of $X$? Do the observations fall into clusters or subgroups? What does the distribution of the individual components look like?
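
Concretely, the data can be arranged as an $n \times p$ data matrix whose rows are the observations $x_i$ and whose columns contain the observed values of the components $X_j$. The following minimal Python sketch, using purely hypothetical simulated data, illustrates this arrangement.

\begin{verbatim}
import numpy as np

# Hypothetical example: n = 200 observations of a p = 5 dimensional
# variable vector X, arranged as an n x p data matrix.
rng = np.random.default_rng(seed=0)
n, p = 200, 5
X = rng.normal(size=(n, p))   # each row is one observation x_i

x_1 = X[0]      # first observation, a point in R^p
X_3 = X[:, 2]   # third component X_3: its n observed values

print(X.shape)                 # (200, 5)
print(x_1.shape, X_3.shape)    # (5,) (200,)
\end{verbatim}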

One difficulty of descriptive methods for high-dimensional data is the human perceptual system. Point clouds in two dimensions are easy to understand and to interpret. With modern interactive computing techniques we can view real-time 3D rotations and thus perceive three-dimensional data as well. A ``sliding technique'' as described in Härdle and Scott (1992) may give insight into four-dimensional structures by presenting dynamic 3D density contours as the fourth variable is changed over its range.
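
Such a real-time 3D rotation is straightforward to reproduce; the sliding technique itself is described in Härdle and Scott (1992) and is not shown here. The sketch below, using a hypothetical simulated point cloud, opens an interactive 3D scatterplot in matplotlib that can be rotated with the mouse.

\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical three-dimensional point cloud; in an interactive
# matplotlib window the axes can be rotated in real time.
rng = np.random.default_rng(1)
pts = rng.normal(size=(300, 3))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], s=10)
ax.set_xlabel("X1"); ax.set_ylabel("X2"); ax.set_zlabel("X3")
plt.show()
\end{verbatim}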

A qualitative jump in presentation difficulties occurs for dimensions greater than or equal to five, unless the high-dimensional structure can be mapped into lower-dimensional components (Klinke and Polzehl, 1995). Features such as clustered subgroups or outliers, however, can be detected using a purely graphical analysis.

In this chapter, we investigate the basic descriptive and graphical techniques that allow simple exploratory data analysis. We begin the exploration of a data set with boxplots. A boxplot is a simple univariate device that detects outliers component by component and that can compare the distributions of the data among different groups. Next, several multivariate techniques are introduced (Flury faces, Andrews' curves and parallel coordinate plots) which provide graphical displays addressing the questions formulated above. The advantages and disadvantages of each of these techniques are stressed.
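
As a sketch of how such displays might be produced in practice, the following Python code uses pandas and matplotlib on a hypothetical two-group data set. Boxplots, Andrews' curves and parallel coordinate plots are readily available there; Flury faces are not part of these libraries and are therefore omitted.

\begin{verbatim}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import andrews_curves, parallel_coordinates

# Hypothetical data set: two groups of 4-dimensional observations,
# with group B shifted to create a visible subgroup structure.
rng = np.random.default_rng(2)
cols = ["X1", "X2", "X3", "X4"]
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=cols)
df["group"] = np.repeat(["A", "B"], 50)
df.loc[df["group"] == "B", cols] += 1.5

# Boxplots component by component, split by group.
df.boxplot(column=cols, by="group")

# Andrews' curves and parallel coordinate plots for the same data.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
andrews_curves(df, "group", ax=axes[0])
parallel_coordinates(df, "group", ax=axes[1])
plt.show()
\end{verbatim}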

Two basic techniques for estimating densities are also presented: histograms and kernel density estimates. A density estimate gives a quick insight into the shape of the distribution of the data. We show that kernel density estimates overcome some of the drawbacks of histograms.
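
A minimal sketch of both estimators, assuming a hypothetical bimodal sample and a Gaussian kernel with the default bandwidth choice of scipy's gaussian_kde, could look as follows.

\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Hypothetical univariate sample with two modes.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 1.0, 150), rng.normal(2, 0.5, 50)])

grid = np.linspace(x.min() - 1, x.max() + 1, 400)
kde = gaussian_kde(x)   # Gaussian kernel, bandwidth by Scott's rule

plt.hist(x, bins=20, density=True, alpha=0.4, label="histogram")
plt.plot(grid, kde(grid), label="kernel density estimate")
plt.legend()
plt.show()
\end{verbatim}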

Finally, scatterplots are shown to be very useful for plotting bivariate or trivariate variables against each other: they help us understand the nature of the relationships among the variables in a data set and make it possible to detect groups or clusters of points. Draftsman plots or matrix plots display several bivariate scatterplots on the same graph. They help detect structures in conditional dependencies by brushing across the plots.
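
A draftsman (matrix) plot of a hypothetical three-dimensional data set can be sketched with pandas' scatter_matrix; interactive brushing across the panels requires an interactive graphics tool and is not shown here.

\begin{verbatim}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Hypothetical data with a built-in correlation between X1 and X2.
rng = np.random.default_rng(4)
z = rng.normal(size=(200, 3))
df = pd.DataFrame({"X1": z[:, 0],
                   "X2": 0.8 * z[:, 0] + 0.6 * z[:, 1],
                   "X3": z[:, 2]})

# Draftsman (matrix) plot: all pairwise scatterplots on one display.
scatter_matrix(df, diagonal="hist", figsize=(6, 6))
plt.show()
\end{verbatim}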