8.5 Practical Computation

The practical implementation of the techniques introduced above begins with the computation of the eigenvalues $\lambda_{1} \ge \lambda_{2} \ge \ldots \ge \lambda_{p}$ and the corresponding eigenvectors $u_{1}, \ldots, u_{p}$ of $\data{X}^{\top}\data{X}$. (Since $p$ is usually less than $n$, this is numerically less involved than computing $v_{k}$ directly for $k = 1,\ldots, p$.) The representation of the $n$ individuals on a plane is then obtained by plotting $z_{1}=\data{X} u_{1}$ versus $z_{2}=\data{X} u_{2}$ ($z_{3} = \data{X}u_{3}$ may be added if a third dimension is helpful). Using the Duality Relation (8.13), representations for the $p$ variables can easily be obtained. These representations can be visualized in a scatterplot of $w_{1} = \sqrt{\lambda_{1}}\, u_{1}$ against $w_{2} = \sqrt{\lambda_{2}}\, u_{2}$ (and possibly against $w_{3} = \sqrt{\lambda_{3}}\, u_{3}$). Higher-dimensional factorial resolutions can be obtained (by computing $z_{k}$ and $w_{k}$ for $k > 3$) but, of course, cannot be plotted.
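The computation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `factorial_representation` and the synthetic $12 \times 7$ matrix are hypothetical and not part of the text, and `numpy.linalg.eigh` is used because $\data{X}^{\top}\data{X}$ is symmetric.

```python
import numpy as np

def factorial_representation(X, q=2):
    """Hypothetical helper: eigenvalues, individual coordinates Z (n x q)
    and variable coordinates W (p x q) for the first q factors."""
    X = np.asarray(X, dtype=float)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    lam, U = np.linalg.eigh(X.T @ X)
    order = np.argsort(lam)[::-1]          # reorder so that lambda_1 >= ... >= lambda_p
    lam = np.clip(lam[order], 0.0, None)   # remove tiny negative round-off values
    U = U[:, order]
    Z = X @ U[:, :q]                       # z_k = X u_k
    W = U[:, :q] * np.sqrt(lam[:q])        # w_k = sqrt(lambda_k) u_k
    return lam, Z, W

# Synthetic data (an assumption for illustration only): n = 12, p = 7
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 7))
lam, Z, W = factorial_representation(X, q=2)
```

Plotting the two columns of `Z` against each other would give the representation of the individuals, and the two columns of `W` the representation of the variables. The signs of the eigenvectors are arbitrary, so a plot may appear mirrored compared with other software.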

A standard way of evaluating the quality of the factorial representations in a subspace of dimension $q$ is given by the ratio

\begin{displaymath}
\tau_q = \frac{\lambda_1 + \lambda_2 + \ldots + \lambda_q }
{\lambda_1 + \lambda_2 + \ldots + \lambda_p }, \qquad (8.15)
\end{displaymath}

where $ 0 \le \tau_{q} \le 1 $. In general, the scalar product $y^{\top}y$ is called the inertia of $y \in \mathbb{R}^n$ w.r.t. the origin. Therefore, the ratio $\tau_q$ is usually interpreted as the percentage of the inertia explained by the first $q$ factors. Note that $ \lambda_{j} = (\data{X}u_{j})^{\top} (\data{X}u_{j}) = z_{j}^{\top}z_{j}$. Thus, $\lambda_{j}$ is the inertia of the $j$-th factorial variable w.r.t. the origin. The denominator in (8.15) is a measure of the total inertia of the $p$ variables, $x_{\column{j}}$. Indeed, by (2.3)

\begin{displaymath}\sum_{j=1}^p \lambda_{j} = \mathop{\hbox{tr}}(\data{X}^{\top}\data{X}) = \sum_{i=1}^n \sum_{j=1}^p x_{ij}^2 = \sum_{j=1}^p x_{\column{j}}^{\top}
x_{\column{j}}.\end{displaymath}

REMARK 8.1   It is clear that the sum $\sum_{j=1}^q\lambda_{j}$ is the sum of the inertia of the first $q$ factorial variables $z_{1},z_{2},\dots,z_{q}$.
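The inertia identities and the ratio $\tau_q$ can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic data matrix; the name `explained_inertia` is hypothetical and chosen only for this illustration.

```python
import numpy as np

def explained_inertia(X):
    """Hypothetical helper: eigenvalues of X^T X (descending) and
    the ratios tau_1, ..., tau_p."""
    X = np.asarray(X, dtype=float)
    lam = np.linalg.eigvalsh(X.T @ X)[::-1]   # eigvalsh returns ascending order
    lam = np.clip(lam, 0.0, None)             # guard against round-off below zero
    tau = np.cumsum(lam) / lam.sum()          # tau_q for q = 1, ..., p
    return lam, tau

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 7))
lam, tau = explained_inertia(X)

# Total inertia of the p variables: sum of all squared entries of X
total_inertia = np.sum(X ** 2)
```

Here `lam.sum()` should agree with `total_inertia` up to floating-point error, and `tau` is nondecreasing with its last entry equal to 1.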

EXAMPLE 8.1   We consider the data set in Table B.6 which gives the food expenditures of various French families (manual workers = MA, employees = EM, managers = CA) with varying numbers of children (2, 3, 4 or 5 children). We are interested in investigating whether certain household types prefer certain food types. We can answer this question using the factorial approximations developed here.

The correlation matrix corresponding to the data is

\begin{displaymath}\data{R} = \left( \begin{array}{rrrrrrr}
1.00 & 0.59 & 0.20 &...
... -0.49 & -0.44 & -0.40 & 0.01 & 1.00
\end{array} \right)\cdot \end{displaymath}

We observe a rather high correlation between meat and poultry, whereas the correlation between the expenditures for milk and wine is rather small. Are there household types that prefer, say, meat over bread?

We shall now represent food expenditures and households simultaneously using two factors. First, note that in this particular problem the origin has no specific meaning (it represents a ``zero'' consumer). So it makes sense to compare the consumption of any family to that of an ``average family'' rather than to the origin. Therefore, the data are first centered (the origin is translated to the center of gravity, $\overline{x}$). Furthermore, since the dispersions of the 7 variables are quite different, each variable is standardized (mean 0 and variance 1) so that each has the same weight in the analysis. Finally, for convenience, we divide each element in the matrix by $\sqrt{n} = \sqrt{12}$. (This only changes the scaling of the plots in the graphical representation.)
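The three preprocessing steps (centering, standardizing, dividing by $\sqrt{n}$) can be sketched as follows. This is an illustration under assumptions: the function name `standardize_for_factors` is hypothetical, the data are synthetic rather than the values of Table B.6, and the variances are computed with divisor $n$, a choice under which the cross-product of the result reproduces the correlation matrix.

```python
import numpy as np

def standardize_for_factors(X):
    """Hypothetical helper: center, scale to unit variance (divisor n),
    and divide by sqrt(n)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    Xc = H @ X                             # columns now have mean zero
    d = (Xc ** 2).sum(axis=0) / n          # column variances with divisor n (assumption)
    return Xc / np.sqrt(d) / np.sqrt(n)

# Synthetic data with very different dispersions (illustration only)
rng = np.random.default_rng(2)
X = rng.normal(loc=100.0, scale=[1, 5, 10, 20, 50, 100, 200], size=(12, 7))
X_star = standardize_for_factors(X)

# Correlation matrix of the original columns, for comparison
R = np.corrcoef(X, rowvar=False)
```

With this scaling, `X_star.T @ X_star` should match `R` up to floating-point error, so the eigenvalues used for the factors are those of the correlation matrix.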

The data matrix to be analyzed is

\begin{displaymath}\data{X}_{*} = \frac{1}{\sqrt{n}}\data{H}\data{X}\data{D}^{-1/2},\end{displaymath}

where $\data{H}$ is the centering matrix and $\data{D} = \mathop{\hbox{diag}}(s_{X_{i}X_{i}})$ (see Section 3.3). Note that, because of the division by $\sqrt{n}$, it follows that $\data{X}_{*}^{\top}\data{X}_{*} = \data{R}$, where $\data{R}$ is the correlation matrix of the original data. Calculating the eigenvalues

\begin{displaymath}\lambda = \left(4.33, 1.83, 0.63, 0.13, 0.06, 0.02, 0.00\right)^{\top}\end{displaymath}

shows that the directions of the first two eigenvectors play a dominant role ($\tau_{2} = 88\%$), whereas the remaining directions together contribute less than 15% of the inertia. A two-dimensional plot should therefore suffice for interpreting this data set.
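The value of $\tau_2$ can be reproduced from the rounded eigenvalues reported above. The short sketch below uses plain Python; the variable names are chosen only for this illustration. Note that the eigenvalues sum to 7, the number of variables, as expected for the trace of a correlation matrix.

```python
# Rounded eigenvalues as reported in the text
lam = [4.33, 1.83, 0.63, 0.13, 0.06, 0.02, 0.00]

total = sum(lam)            # total inertia, equal to p = 7 up to rounding

# Cumulative ratios tau_1, ..., tau_7
tau = []
running = 0.0
for value in lam:
    running += value
    tau.append(running / total)

tau_2 = tau[1]              # share of inertia explained by the first two factors
remainder = 1.0 - tau_2     # share left to the remaining five directions
```

Here `tau_2` is $6.16 / 7.00$, so the remaining five directions account for the rest of the inertia, consistent with the bound stated above.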

Figure 8.6: Representation of food expenditures and family types in two dimensions. MVAdecofood.xpl
\includegraphics[width=1\defpicwidth]{decofood.ps}

The coordinates of the projected data points are given in the two lower windows of Figure 8.6. Let us first examine the food expenditure window. In this window we see the representation of the $p=7$ variables given by the first two factors. The plot shows the factorial variables $w_{1}$ and $w_{2}$ in the same fashion as Figure 8.4. We see that the points for meat, poultry, vegetables and fruits are close to each other in the lower left of the graph. The expenditures for bread and milk can be found in the upper left whereas wine stands alone in the upper right. The first factor, $w_{1}$, may be interpreted as the meat/fruit factor of consumption, the second factor, $w_{2}$, as the bread/wine component.

In the lower window on the right-hand side, we show the factorial variables $z_{1}$ and $z_{2}$ from the fit of the $n=12$ household types. Note that by the Duality Relations of Theorem 8.4, the factorial variables $z_j$ are linear combinations of the factors $w_{k}$ from the left window. The points displayed in the consumer window (graph on the right) are plotted relative to an average consumer represented by the origin. The manager families are located in the lower left corner of the graph whereas the manual workers and employees tend to be in the upper right. The factorial variables for CA5 (managers with five children) lie close to the meat/fruit factor. Relative to the average consumer this household type is a large consumer of meat/poultry and fruits/vegetables. In Chapter 9, we will return to these plots interpreting them in a much deeper way. At this stage, it suffices to notice that the plots provide a graphical representation in $\mathbb{R}^2$ of the information contained in the original, high-dimensional ($12 \times 7$) data matrix.

Summary
$\ast$
The practical implementation of factor decomposition of matrices consists of computing the eigenvalues $\lambda_{1},\ldots,\lambda_{p}$ and the eigenvectors $u_{1},\ldots,u_{p}$ of $ \data{X}^{\top}\data{X}$. The representation of the $n$ individuals is obtained by plotting $z_{1}=\data{X} u_{1}$ vs. $z_{2}=\data{X} u_{2}$ (and, if necessary, vs. $z_{3} = \data{X}u_{3}$). The representation of the $p$ variables is obtained by plotting $ w_{1} = \sqrt{\lambda_{1}}\ u_{1}$ vs. $ w_{2} = \sqrt{\lambda_{2}} u_{2} $ (and, if necessary, vs. $ w_{3} = \sqrt{\lambda_{3}} \ u_{3} $).
$\ast$
The quality of the factorial representation can be evaluated using $\tau_{q}$ which is the percentage of inertia explained by the first $q$ factors.