9.6 Principal Components as a Factorial Method

The empirical PCs (normalized or not) turn out to be equivalent to the factors that one would obtain by decomposing the appropriate data matrix into its factors (see Chapter 8). It will be shown that the PCs are the factors representing the rows of the centered data matrix and that the NPCs correspond to the factors of the standardized data matrix. The representation of the columns of the standardized data matrix provides (up to a scale factor) the correlations between the NPCs and the original variables. The derivation of the (N)PCs presented above will have a nice geometric justification here since they are the best fit in subspaces generated by the columns of the (transformed) data matrix $\data{X}$. This analogy provides complementary interpretations of the graphical representations shown above.

Assume, as in Chapter 8, that we want to obtain representations of the individuals (the rows of $\data{X}$) and of the variables (the columns of $\data{X}$) in spaces of smaller dimension. To keep the representations simple, some prior transformations are performed. Since the origin has no particular statistical meaning in the space of individuals, we will first shift the origin to the center of gravity, $\overline{x}$, of the point cloud. This is the same as analyzing the centered data matrix $\data{X}_{C} = \data{H}\data{X}$. Now all of the variables have zero means, thus the technique used in Chapter 8 can be applied to the matrix $\data{X}_{C}$. Note that the spectral decomposition of $\data{X}_{C}^{\top}\data{X}_{C}$ is related to that of $\data{S}_{X}$, namely

\begin{displaymath}
\data{X}_{C}^{\top}\data{X}_{C}
= \data{X}^{\top}\data{H}^{\top}\data{H}\data{X}
= n\data{S}_{X} = n \data{G}\data{L}\data{G}^{\top}.
\end{displaymath} (9.28)
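The following is a minimal numerical sketch in Python/numpy (using a small artificial data matrix, not the book's XploRe quantlets) of relation (9.28) between the centered data matrix and the spectral decomposition of $\data{S}_{X}$:

\begin{verbatim}
import numpy as np

n, p = 20, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, p))            # artificial data matrix

H = np.eye(n) - np.ones((n, n)) / n    # centering matrix H
X_C = H @ X                            # centered data matrix X_C = H X

S = X_C.T @ X_C / n                    # empirical covariance S_X (divisor n)
ell, G = np.linalg.eigh(S)             # spectral decomposition S_X = G L G^T
order = np.argsort(ell)[::-1]          # eigenvalues in decreasing order
ell, G = ell[order], G[:, order]

# relation (9.28): X_C^T X_C = n S_X = n G L G^T
assert np.allclose(X_C.T @ X_C, n * S)
assert np.allclose(n * S, n * (G * ell) @ G.T)
\end{verbatim}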

The factorial variables are obtained by projecting $\data{X}_{C}$ on $\data{G}$,
\begin{displaymath}
\data{Y} = \data{X}_{C}\data{G} = (y_{1}, \ldots, y_{p}).
\end{displaymath} (9.29)

These are the same principal components obtained above, see formula (9.10). (Note that the $y$'s here correspond to the $z$'s in Section 8.2.) Since $\data{H}\data{X}_{C} = \data{X}_C$, it immediately follows that
$\displaystyle \overline{y} = 0,$ (9.30)
$\displaystyle \data{S}_{Y} = \data{G}^{\top}\data{S}_{X}\data{G}
= \data{L} = \mathop{\hbox{diag}}(\ell_{1}, \ldots, \ell_{p}).$ (9.31)

The scatterplot of the individuals on the factorial axes is thus centered around the origin and is more spread out in the first direction (the first PC has variance $\ell_{1}$) than in the second direction (the second PC has variance $\ell_{2}$).
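Continuing the numerical sketch above, the projection (9.29) and the properties (9.30) and (9.31) can be checked directly:

\begin{verbatim}
Y = X_C @ G                            # factorial variables (9.29)

assert np.allclose(Y.mean(axis=0), 0)  # (9.30): the PCs are centered
S_Y = Y.T @ Y / n                      # (9.31): S_Y = G^T S_X G = L
assert np.allclose(S_Y, np.diag(ell))
\end{verbatim}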

The representation of the variables can be obtained using the Duality Relations (8.11) and (8.12). The projections of the columns of $\data{X}_{C}$ onto the eigenvectors $v_{k}$ of $\data{X}_{C}\data{X}_{C}^{\top}$ are

\begin{displaymath}
\data{X}_{C}^{\top}v_{k}
= \frac{1}{\sqrt{n \ell_{k}}} \data{X}_{C}^{\top}\data{X}_{C}g_{k} =
\sqrt{n \ell_{k}} g_{k}.
\end{displaymath} (9.32)

Thus the projections of the variables on the first $p$ axes are the columns of the matrix
\begin{displaymath}
\data{X}_{C}^{\top}\data{V} = \sqrt{n}\data{G}\data{L}^{1/2}.
\end{displaymath} (9.33)
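Continuing the sketch, the unit eigenvectors $v_{k}$ of $\data{X}_{C}\data{X}_{C}^{\top}$ and the variable projections (9.33) follow from the same objects:

\begin{verbatim}
# v_k = X_C g_k / sqrt(n l_k) are the unit eigenvectors of X_C X_C^T
V = X_C @ G / np.sqrt(n * ell)

# (9.33): projections of the variables, X_C^T V = sqrt(n) G L^{1/2}
assert np.allclose(X_C.T @ V, np.sqrt(n) * G * np.sqrt(ell))
\end{verbatim}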

Considering the geometric representation, there is a nice statistical interpretation of the angle between two columns of $\data{X}_{C}$. Given that
$\displaystyle x^{\top}_{C \column{j}}x_{C \column{k}} = n s_{X_{j}X_{k}},$     (9.34)
$\displaystyle \vert\vert x_{C \column{j}}\vert\vert^2 = n s_{X_{j}X_{j}},$     (9.35)

where $x_{C \column{j}}$ and $x_{C \column{k}}$ denote the $j$-th and $k$-th column of $\data{X}_{C}$, it holds that in the full space of the variables, if $\theta_{jk}$ is the angle between two variables, $x_{C \column{j}}$ and $x_{C \column{k}}$, then
\begin{displaymath}
\cos \theta_{jk} = \frac{x^{\top}_{C \column{j}}x_{C \column{k}}}{\Vert x_{C \column{j}}\Vert\ \Vert x_{C \column{k}}\Vert} = r_{X_{j}X_{k}}
\end{displaymath} (9.36)

(Example 2.11 shows the general connection between the angle between two variables and their correlation.) As a result, the relative positions of the variables in the scatterplot of the first columns of $\data{X}_{C}^{\top}\data{V}$ may be interpreted in terms of their correlations; the plot provides a picture of the correlation structure of the original data set. Clearly, one should take the percentage of variance explained by the chosen axes into account when evaluating these correlations.
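In the same sketch, the angle interpretation (9.36) can be verified by comparing the cosines between the centered columns with the empirical correlation matrix:

\begin{verbatim}
# (9.36): the cosine of the angle between two centered columns
# equals the empirical correlation of the corresponding variables
norms = np.linalg.norm(X_C, axis=0)
cosines = (X_C.T @ X_C) / np.outer(norms, norms)
assert np.allclose(cosines, np.corrcoef(X, rowvar=False))
\end{verbatim}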

The NPCs can also be viewed as a factorial method for reducing the dimension. The variables are again standardized so that each one has zero mean and unit variance, which makes the analysis independent of the scales of the original variables. The factorial analysis of $\data{X}_S$ provides the NPCs. The spectral decomposition of $\data{X}^{\top}_S \data{X}_S$ is related to that of $\data{R}$, namely

\begin{displaymath}
\data{X}_{S}^{\top}\data{X}_{S}
= \data{D}^{-1/2}\data{X}^{\top}\data{H}\data{X}\data{D}^{-1/2}
= n\data{D}^{-1/2}\data{S}_{X}\data{D}^{-1/2}
= n\data{R} = n\data{G}_{\data{R}}\data{L}_{\data{R}}\data{G}_{\data{R}}^{\top}.
\end{displaymath}

The NPCs $Z_j$, given by (9.21), may be viewed as the projections of the rows of $\data{X}_S$ onto $\data{G}_R$.

The representation of the variables is again given by the columns of

\begin{displaymath}
\data{X}_{S}^{\top}\data{V}_{\data{R}}
= \sqrt{n}\data{G}_{\data{R}}\data{L}_{\data{R}}^{1/2} .
\end{displaymath} (9.37)

Comparing (9.37) and (9.25) we see that the projections of the variables in the factorial analysis provide the correlations between the NPCs $\data{Z}_{k}$ and the original variables $x_{\column{j}}$ (up to the factor $\sqrt{n}$, which can be absorbed into the scale of the axes).
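A corresponding numerical sketch for the NPCs (reusing the objects defined above) standardizes the data, decomposes the empirical correlation matrix and checks that the variable projections (9.37) equal $\sqrt{n}$ times the correlations between the original variables and the NPCs:

\begin{verbatim}
D_inv_sqrt = np.diag(1 / np.sqrt(np.diag(S)))
X_S = X_C @ D_inv_sqrt                 # standardized data matrix X_S
R_mat = X_S.T @ X_S / n                # empirical correlation matrix R
ell_R, G_R = np.linalg.eigh(R_mat)
order = np.argsort(ell_R)[::-1]
ell_R, G_R = ell_R[order], G_R[:, order]

Z = X_S @ G_R                          # normalized PCs
V_R = X_S @ G_R / np.sqrt(n * ell_R)   # eigenvectors of X_S X_S^T
proj = X_S.T @ V_R                     # variable representation (9.37)

# the projections equal sqrt(n) times the correlations r_{X_j Z_k}
corr = np.array([[np.corrcoef(X[:, j], Z[:, k])[0, 1] for k in range(p)]
                 for j in range(p)])
assert np.allclose(proj, np.sqrt(n) * corr)
\end{verbatim}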

This implies that a deeper interpretation of the representation of the individuals can be obtained by looking simultaneously at the graphs plotting the variables. Note that

$\displaystyle x_{S \column{j}}^{\top} x_{S \column{k}}$ $\textstyle =$ $\displaystyle n r_{X_{j}X_{k}},$ (9.38)
$\displaystyle \Vert x_{S \column{j}}\Vert^2$ $\textstyle =$ $\displaystyle n,$ (9.39)

where $x_{S \column{j}}$ and $x_{S \column{k}}$ denote the $j$-th and $k$-th columns of $\data{X}_{S}$. Hence, in the full space, all the standardized variables (columns of $\data{X}_{S}$) lie on the ``sphere'' in $\mathbb{R}^n$ which is centered at the origin and has radius $\sqrt{n}$ (the scale of the graph). As in (9.36), for the angle $\theta_{jk}$ between two columns $x_{S \column{j}}$ and $x_{S \column{k}}$ it holds that
\begin{displaymath}
\cos \theta_{jk} = r_{X_{j}X_{k}}.
\end{displaymath} (9.40)

Therefore, when looking at the representation of the variables in the spaces of reduced dimension (for instance the first two factors), we have a picture of the correlation structure between the original $X_i$'s in terms of their angles. Of course, the quality of the representation in those subspaces has to be taken into account, which is presented in the next section.


Quality of the representations

As said before, an overall measure of the quality of the representation is given by

\begin{displaymath}\psi = \frac{\ell_{1} + \ell_{2} + \ldots +
\ell_{q}}{\sum\limits_{j=1}^p \ell_{j}}.\end{displaymath}

In practice, $q$ is chosen to be equal to 1, 2 or 3. Suppose for instance that $\psi = 0.93$ for $q=2$. This means that the graphical representation in two dimensions captures 93% of the total variance; the remaining directions together account for no more than 7% of the dispersion.
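In the numerical sketch above, $\psi$ is obtained directly from the eigenvalues:

\begin{verbatim}
q = 2
psi = ell[:q].sum() / ell.sum()   # share of variance captured by the first q PCs
\end{verbatim}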

It can be useful to check if each individual is well represented by the PCs. Clearly, the proximity of two individuals on the projected space may not necessarily coincide with the proximity in the full original space $\mathbb{R}^p$, which may lead to erroneous interpretations of the graphs. In this respect, it is worth computing the angle $\vartheta_{ik}$ between the representation of an individual $i$ and the $k$-th PC or NPC axis. This can be done using (2.40), i.e.,

\begin{displaymath}\cos \vartheta_{ik}
= \frac{y_{i}^{\top}e_{k}}{\Vert y_{i}\Vert \Vert e_{k}\Vert}
= \frac{y_{ik}}{\Vert x_{C i}\Vert}\end{displaymath}

for the PCs or analogously

\begin{displaymath}\cos \zeta_{ik}
= \frac{z_{i}^{\top}e_{k}}{\Vert z_{i}\Vert \Vert e_{k}\Vert}
= \frac{z_{ik}}{\Vert x_{S i}\Vert}\end{displaymath}

for the NPCs, where $e_{k}$ denotes the $k$-th unit vector $e_{k}=(0,\ldots,1,\ldots,0)^{\top}$. An individual $i$ is well represented on the $k$-th PC axis if the corresponding angle is small, i.e., if $\cos^2\vartheta_{ik}$ is close to one. Note that for each individual $i$,

\begin{displaymath}\sum_{k=1}^p \cos^2 \vartheta_{ik}
= \frac{y_{i}^{\top}y_{i}}{x_{C i}^{\top}x_{C i}}
= \frac{x_{C i}^{\top}\data{G}\data{G}^{\top}x_{C i}}{x_{C i}^{\top}x_{C i}} = 1.
\end{displaymath}

The values $\cos^2\vartheta_{ik}$ are sometimes called the relative contributions of the $k$-th axis to the representation of the $i$-th individual, e.g., if $\cos^2 \vartheta_{i1} +
\cos^2 \vartheta_{i2}$ is large (near one), we know that the individual $i$ is well represented on the plane of the first two principal axes since its corresponding angle with the plane is close to zero.
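In the numerical sketch, the relative contributions and the fact that they sum to one over all $p$ axes can be checked as follows:

\begin{verbatim}
# squared cosines: relative contribution of the k-th axis to individual i
cos2 = Y**2 / (np.linalg.norm(X_C, axis=1) ** 2)[:, None]
assert np.allclose(cos2.sum(axis=1), 1)  # contributions sum to one over all p axes
quality_plane = cos2[:, :2].sum(axis=1)  # quality in the plane of the first two axes
\end{verbatim}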

We already know that the quality of the representation of the variables can be evaluated by the percentage of $X_i$'s variance that is explained by a PC, which is given by $r^2_{X_i Y_j}$ or $r^2_{X_i Z_j}$ according to (9.16) and (9.27) respectively.

EXAMPLE 9.6   Let us return to the French food expenditure example, see Appendix B.6. The NPC analysis yields the two-dimensional representation of the individuals shown in Figure 9.7.

Figure 9.7: Representation of the individuals. MVAnpcafood.xpl
\includegraphics[width=1\defpicwidth]{npcafood.ps}

Figure 9.8: Representation of the variables. MVAnpcafood.xpl
\includegraphics[width=1\defpicwidth]{food2.ps}

Calculating the matrix ${\data{G}}_{\data{R}}$ we have

\begin{displaymath}\data{G}_{\data{R}}=
\left(\begin{array}{rrrrrr}
-0.240& 0.6...
...0.206& 0.479& 0.780& 0.306& -0.069& -0.138
\end{array}\right),\end{displaymath}

which gives the weights of the variables (milk, vegetables, etc.). The eigenvalues $\ell_j$ and the proportions of explained variance are given in Table 9.3.

Table 9.3: Eigenvalues and explained variance

\begin{tabular}{ccc}
eigenvalue & proportion of variance & cumulated proportion (\%)\\
\hline
4.333 & 0.6190 & 61.9\\
1.830 & 0.2620 & 88.1\\
0.631 & 0.0900 & 97.1\\
0.128 & 0.0180 & 98.9\\
0.058 & 0.0080 & 99.7\\
0.019 & 0.0030 & 99.9\\
0.001 & 0.0001 & 100.0\\
\end{tabular}


The interpretation of the principal components is best understood by looking at the correlations between the original $X_i$'s and the PCs. Since the first two PCs explain 88.1% of the variance, we restrict the analysis to them. The results are shown in Table 9.4.

Table 9.4: Correlations with PCs

\begin{tabular}{lrrr}
 & $r_{X_iZ_1}$ & $r_{X_iZ_2}$ & $r^2_{X_iZ_1} + r^2_{X_iZ_2}$\\
\hline
$X_1$: bread & $-0.499$ & $0.842$ & $0.957$\\
$X_2$: vegetables & $-0.970$ & $0.133$ & $0.958$\\
$X_3$: fruits & $-0.929$ & $-0.278$ & $0.941$\\
$X_4$: meat & $-0.962$ & $-0.191$ & $0.962$\\
$X_5$: poultry & $-0.911$ & $-0.266$ & $0.901$\\
$X_6$: milk & $-0.584$ & $0.707$ & $0.841$\\
$X_7$: wine & $0.428$ & $0.648$ & $0.604$\\
\end{tabular}
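As a check of how Table 9.4 relates to the weights and eigenvalues reported above (assuming the first row of $\data{G}_{\data{R}}$ corresponds to $X_1$, bread), relation (9.25) reproduces the first entry of the table:

\begin{displaymath}
r_{X_1 Z_1} = \sqrt{\ell_1}\, g_{\data{R},11} \approx \sqrt{4.333}\,(-0.240) \approx -0.499 .
\end{displaymath}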


The two-dimensional graphical representation of the variables in Figure 9.8 is based on the first two columns of Table 9.4.

The plots are the projections of the variables into $\mathbb{R}^2$. Since the quality of the representation is good for all the variables (except maybe $X_7$), their relative angles give a picture of their original correlations: wine is negatively correlated with the vegetables, fruits, meat and poultry groups ($\theta >90^o$), whereas the variables in this latter group are highly positively correlated with each other ( $\theta \approx 0$). Bread and milk are positively correlated with each other but poorly correlated with meat, fruits and poultry ( $\theta \approx 90^o$).

Now the representation of the individuals in Figure 9.7 can be interpreted better. From Figure 9.8 and Table 9.4 we can see that the first factor $Z_1$ is a vegetable-meat-poultry-fruit factor (with a negative sign), whereas the second factor is a milk-bread-wine factor (with a positive sign). Note that this corresponds to the most important weights in the first columns of ${\data{G}}_{\data{R}}$. In Figure 9.7 lines were drawn to connect families of the same size and families of the same professional type. A grid can clearly be seen (with a slight deformation by the manager families) that shows the families with higher expenditures (higher numbers of children) on the left.

Considering both figures together explains what types of expenditures are responsible for similarities in food expenditures. Bread, milk and wine expenditures are similar for manual workers and employees. Families of managers are characterized by higher expenditures on vegetables, fruits, meat and poultry. Very often when analyzing NPCs (and PCs), it is illuminating to use such a device to introduce qualitative aspects of individuals in order to enrich the interpretations of the graphs.

Summary
$\ast$
NPCs are PCs applied to the standardized (normalized) data matrix $\data{X}_{S}$.
$\ast$
The graphical representation of NPCs provides a similar type of picture as that of PCs, the difference being in the relative position of individuals, i.e., each variable in NPCs has the same weight (in PCs, the variable with the largest variance has the largest weight).
$\ast$
The quality of the representation is evaluated by $ \psi=
({\sum_{j=1}^p \ell_{j}})^{-1}
{(\ell_{1} + \ell_{2} + \ldots + \ell_{q})}.$
$\ast$
The quality of the representation of a variable can be evaluated by the percentage of $X_i$'s variance that is explained by a PC, i.e., $r^2_{X_i Y_j}$.