9.6 Principal Components as a Factorial Method

The empirical PCs (normalized or not) turn out to be equivalent to the factors that one would obtain by decomposing the appropriate data matrix into its factors (see Chapter 8). It will be shown that the PCs are the factors representing the rows of the centered data matrix and that the NPCs correspond to the factors of the standardized data matrix. The representation of the columns of the standardized data matrix provides (up to a scale factor) the correlations between the NPCs and the original variables. The derivation of the (N)PCs presented above will have a nice geometric justification here since they are the best fit in subspaces generated by the columns of the (transformed) data matrix $\data{X}$. This analogy provides complementary interpretations of the graphical representations shown above.

Assume, as in Chapter 8, that we want to obtain representations of the individuals (the rows of $\data{X}$) and of the variables (the columns of $\data{X}$) in spaces of smaller dimension. To keep the representations simple, some prior transformations are performed. Since the origin has no particular statistical meaning in the space of individuals, we will first shift the origin to the center of gravity, $\overline{x}$, of the point cloud. This is the same as analyzing the centered data matrix $\data{X}_{C} = \data{H}\data{X}$. Now all of the variables have zero means, thus the technique used in Chapter 8 can be applied to the matrix $\data{X}_{C}$. Note that the spectral decomposition of $\data{X}_{C}^{\top}\data{X}_{C}$ is related to that of $\data{S}_{X}$, namely

\begin{displaymath}
\data{X}_{C}^{\top}\data{X}_{C}
= \data{X}^{\top}\data{H}^{\top}\data{H}\data{X}
= n\data{S}_{X} = n \data{G}\data{L}\data{G}^{\top}.
\end{displaymath} (9.28)
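The following is a minimal numerical sketch in Python/numpy (using a small artificial data matrix, not the book's XploRe quantlets) of relation (9.28) between the centered data matrix and the spectral decomposition of $\data{S}_{X}$:

\begin{verbatim}
import numpy as np

n, p = 20, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, p))            # artificial data matrix

H = np.eye(n) - np.ones((n, n)) / n    # centering matrix H
X_C = H @ X                            # centered data matrix X_C = H X

S = X_C.T @ X_C / n                    # empirical covariance S_X (divisor n)
ell, G = np.linalg.eigh(S)             # spectral decomposition S_X = G L G^T
order = np.argsort(ell)[::-1]          # eigenvalues in decreasing order
ell, G = ell[order], G[:, order]

# relation (9.28): X_C^T X_C = n S_X = n G L G^T
assert np.allclose(X_C.T @ X_C, n * S)
assert np.allclose(n * S, n * (G * ell) @ G.T)
\end{verbatim}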

The factorial variables are obtained by projecting $\data{X}_{C}$ on $\data{G}$,
\begin{displaymath}
\data{Y} = \data{X}_{C}\data{G} = (y_{1}, \ldots, y_{p}).
\end{displaymath} (9.29)

These are the same principal components obtained above, see formula (9.10). (Note that the $y$'s here correspond to the $z$'s in Section 8.2.) Since $\data{H}\data{X}_{C} = \data{X}_C$, it immediately follows that
$\displaystyle \overline{y} = 0,$ (9.30)
$\displaystyle \data{S}_{Y} = \data{G}^{\top}\data{S}_{X}\data{G}
= \data{L} = \mathop{\hbox{diag}}(\ell_{1}, \ldots, \ell_{p}).$ (9.31)

The scatterplot of the individuals on the factorial axes is thus centered around the origin and is more spread out in the first direction (the first PC has variance $\ell_{1}$) than in the second direction (the second PC has variance $\ell_{2}$).
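Continuing the numerical sketch above, the projection (9.29) and the properties (9.30) and (9.31) can be checked directly:

\begin{verbatim}
Y = X_C @ G                            # factorial variables (9.29)

assert np.allclose(Y.mean(axis=0), 0)  # (9.30): the PCs are centered
S_Y = Y.T @ Y / n                      # (9.31): S_Y = G^T S_X G = L
assert np.allclose(S_Y, np.diag(ell))
\end{verbatim}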

The representation of the variables can be obtained using the Duality Relations (8.11) and (8.12). The projections of the columns of $\data{X}_{C}$ onto the eigenvectors $v_{k}$ of $\data{X}_{C}\data{X}_{C}^{\top}$ are

\begin{displaymath}
\data{X}_{C}^{\top}v_{k}
= \frac{1}{\sqrt{n \ell_{k}}} \data{X}_{C}^{\top}\data{X}_{C}g_{k} =
\sqrt{n \ell_{k}} g_{k}.
\end{displaymath} (9.32)

Thus the projections of the variables on the first $p$ axes are the columns of the matrix
\begin{displaymath}
\data{X}_{C}^{\top}\data{V} = \sqrt{n}\data{G}\data{L}^{1/2}.
\end{displaymath} (9.33)
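Continuing the sketch, the unit eigenvectors $v_{k}$ of $\data{X}_{C}\data{X}_{C}^{\top}$ and the variable projections (9.33) follow from the same objects:

\begin{verbatim}
# v_k = X_C g_k / sqrt(n l_k) are the unit eigenvectors of X_C X_C^T
V = X_C @ G / np.sqrt(n * ell)

# (9.33): projections of the variables, X_C^T V = sqrt(n) G L^{1/2}
assert np.allclose(X_C.T @ V, np.sqrt(n) * G * np.sqrt(ell))
\end{verbatim}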

Considering the geometric representation, there is a nice statistical interpretation of the angle between two columns of $\data{X}_{C}$. Given that
$\displaystyle x^{\top}_{C \column{j}}x_{C \column{k}} = n s_{X_{j}X_{k}},$     (9.34)
$\displaystyle \vert\vert x_{C \column{j}}\vert\vert^2 = n s_{X_{j}X_{j}},$     (9.35)

where $x_{C \column{j}}$ and $x_{C \column{k}}$ denote the $j$-th and $k$-th column of $\data{X}_{C}$, it holds that in the full space of the variables, if $\theta_{jk}$ is the angle between two variables, $x_{C \column{j}}$ and $x_{C \column{k}}$, then
\begin{displaymath}
\cos \theta_{jk} = \frac{x^{\top}_{C \column{j}}x_{C \column{k}}}{\Vert x_{C \column{j}}\Vert\ \Vert x_{C \column{k}}\Vert} = r_{X_{j}X_{k}}
\end{displaymath} (9.36)

(Example 2.11 shows the general connection between the angle between two variables and their correlation.) As a result, the relative positions of the variables in the scatterplot of the first columns of $\data{X}_{C}^{\top}\data{V}$ may be interpreted in terms of their correlations; the plot provides a picture of the correlation structure of the original data set. Clearly, one should take the percentage of variance explained by the chosen axes into account when evaluating these correlations.
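In the same sketch, the angle interpretation (9.36) can be verified by comparing the cosines between the centered columns with the empirical correlation matrix:

\begin{verbatim}
# (9.36): the cosine of the angle between two centered columns
# equals the empirical correlation of the corresponding variables
norms = np.linalg.norm(X_C, axis=0)
cosines = (X_C.T @ X_C) / np.outer(norms, norms)
assert np.allclose(cosines, np.corrcoef(X, rowvar=False))
\end{verbatim}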

The NPCs can also be viewed as a factorial method for reducing the dimension. The variables are again standardized so that each one has zero mean and unit variance, which makes the analysis independent of the scales of the original variables. The factorial analysis of $\data{X}_S$ provides the NPCs. The spectral decomposition of $\data{X}^{\top}_S \data{X}_S$ is related to that of $\data{R}$, namely

\begin{displaymath}
\data{X}_{S}^{\top}\data{X}_{S}
= \data{D}^{-1/2}\data{X}^{\top}\data{H}\data{X}\data{D}^{-1/2}
= n\data{D}^{-1/2}\data{S}_{X}\data{D}^{-1/2}
= n\data{R} = n\data{G}_{\data{R}}\data{L}_{\data{R}}\data{G}_{\data{R}}^{\top}.
\end{displaymath}

The NPCs $Z_j$, given by (9.21), may be viewed as the projections of the rows of $\data{X}_S$ onto $\data{G}_R$.

The representation of the variables is again given by the columns of

\begin{displaymath}
\data{X}_{S}^{\top}\data{V}_{\data{R}}
= \sqrt{n}\data{G}_{\data{R}}\data{L}_{\data{R}}^{1/2} .
\end{displaymath} (9.37)

Comparing (9.37) and (9.25) we see that the projections of the variables in the factorial analysis provide the correlations between the NPCs $\data{Z}_{k}$ and the original variables $x_{\column{j}}$ (up to the factor $\sqrt{n}$, which can be absorbed into the scale of the axes).
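A corresponding numerical sketch for the NPCs (reusing the objects defined above) standardizes the data, decomposes the empirical correlation matrix and checks that the variable projections (9.37) equal $\sqrt{n}$ times the correlations between the original variables and the NPCs:

\begin{verbatim}
D_inv_sqrt = np.diag(1 / np.sqrt(np.diag(S)))
X_S = X_C @ D_inv_sqrt                 # standardized data matrix X_S
R_mat = X_S.T @ X_S / n                # empirical correlation matrix R
ell_R, G_R = np.linalg.eigh(R_mat)
order = np.argsort(ell_R)[::-1]
ell_R, G_R = ell_R[order], G_R[:, order]

Z = X_S @ G_R                          # normalized PCs
V_R = X_S @ G_R / np.sqrt(n * ell_R)   # eigenvectors of X_S X_S^T
proj = X_S.T @ V_R                     # variable representation (9.37)

# the projections equal sqrt(n) times the correlations r_{X_j Z_k}
corr = np.array([[np.corrcoef(X[:, j], Z[:, k])[0, 1] for k in range(p)]
                 for j in range(p)])
assert np.allclose(proj, np.sqrt(n) * corr)
\end{verbatim}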

This implies that a deeper interpretation of the representation of the individuals can be obtained by looking simultaneously at the graphs plotting the variables. Note that

$\displaystyle x_{S \column{j}}^{\top} x_{S \column{k}}$ $\textstyle =$ $\displaystyle n r_{X_{j}X_{k}},$ (9.38)
$\displaystyle \Vert x_{S \column{j}}\Vert^2$ $\textstyle =$ $\displaystyle n,$ (9.39)

where $x_{S \column{j}}$ and $x_{S \column{k}}$ denote the $j$-th and $k$-th columns of $\data{X}_{S}$. Hence, in the full space, all the standardized variables (columns of $\data{X}_{S}$) lie on the ``sphere'' in $\mathbb{R}^n$ which is centered at the origin and has radius $\sqrt{n}$ (the scale of the graph). As in (9.36), for the angle $\theta_{jk}$ between two columns $x_{S \column{j}}$ and $x_{S \column{k}}$ it holds that
\begin{displaymath}
\cos \theta_{jk} = r_{X_{j}X_{k}}.
\end{displaymath} (9.40)

Therefore, when looking at the representation of the variables in the spaces of reduced dimension (for instance the first two factors), we have a picture of the correlation structure between the original $X_i$'s in terms of their angles. Of course, the quality of the representation in those subspaces has to be taken into account, which is presented in the next section.


Quality of the representations

As said before, an overall measure of the quality of the representation is given by

\begin{displaymath}\psi = \frac{\ell_{1} + \ell_{2} + \ldots +
\ell_{q}}{\sum\limits_{j=1}^p \ell_{j}}.\end{displaymath}

In practice, $q$ is chosen to be equal to 1, 2 or 3. Suppose for instance that $\psi = 0.93$ for $q=2$. This means that the graphical representation in two dimensions captures 93% of the total variance; the remaining directions together account for no more than 7% of the dispersion.
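In the numerical sketch above, $\psi$ is obtained directly from the eigenvalues:

\begin{verbatim}
q = 2
psi = ell[:q].sum() / ell.sum()   # share of variance captured by the first q PCs
\end{verbatim}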

It can be useful to check if each individual is well represented by the PCs. Clearly, the proximity of two individuals on the projected space may not necessarily coincide with the proximity in the full original space $\mathbb{R}^p$, which may lead to erroneous interpretations of the graphs. In this respect, it is worth computing the angle $\vartheta_{ik}$ between the representation of an individual $i$ and the $k$-th PC or NPC axis. This can be done using (2.40), i.e.,

\begin{displaymath}\cos \vartheta_{ik}
= \frac{y_{i}^{\top}e_{k}}{\Vert y_{i}\Vert \Vert e_{k}\Vert}
= \frac{y_{ik}}{\Vert x_{C i}\Vert}\end{displaymath}

for the PCs or analogously

\begin{displaymath}\cos \zeta_{ik}
= \frac{z_{i}^{\top}e_{k}}{\Vert z_{i}\Vert \Vert e_{k}\Vert}
= \frac{z_{ik}}{\Vert x_{S i}\Vert}\end{displaymath}

for the NPCs, where $e_{k}$ denotes the $k$-th unit vector $e_{k}=(0,\ldots,1,\ldots,0)^{\top}$. An individual $i$ is well represented on the $k$-th PC axis if the corresponding angle is small, i.e., if $\cos^2\vartheta_{ik}$ is close to one. Note that for each individual $i$,

\begin{displaymath}\sum_{k=1}^p \cos^2 \vartheta_{ik}
= \frac{y_{i}^{\top}y_{i}}{x_{C i}^{\top}x_{C i}}
= \frac{x_{C i}^{\top}\data{G}\data{G}^{\top}x_{C i}}{x_{C i}^{\top}x_{C i}} = 1.
\end{displaymath}

The values $\cos^2\vartheta_{ik}$ are sometimes called the relative contributions of the $k$-th axis to the representation of the $i$-th individual, e.g., if $\cos^2 \vartheta_{i1} +
\cos^2 \vartheta_{i2}$ is large (near one), we know that the individual $i$ is well represented on the plane of the first two principal axes since its corresponding angle with the plane is close to zero.
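In the numerical sketch, the relative contributions and the fact that they sum to one over all $p$ axes can be checked as follows:

\begin{verbatim}
# squared cosines: relative contribution of the k-th axis to individual i
cos2 = Y**2 / (np.linalg.norm(X_C, axis=1) ** 2)[:, None]
assert np.allclose(cos2.sum(axis=1), 1)  # contributions sum to one over all p axes
quality_plane = cos2[:, :2].sum(axis=1)  # quality in the plane of the first two axes
\end{verbatim}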

We already know that the quality of the representation of the variables can be evaluated by the percentage of $X_i$'s variance that is explained by a PC, which is given by $r^2_{X_i Y_j}$ or $r^2_{X_i Z_j}$ according to (9.16) and (9.27) respectively.

EXAMPLE 9.6   Let us return to the French food expenditure example, see Appendix B.6. The NPC analysis yields the two-dimensional representation of the individuals shown in Figure 9.7.

Figure 9.7: Representation of the individuals. MVAnpcafood.xpl
\includegraphics[width=1\defpicwidth]{npcafood.ps}

Figure 9.8: Representation of the variables. MVAnpcafood.xpl
\includegraphics[width=1\defpicwidth]{food2.ps}

Calculating the matrix ${\data{G}}_{\data{R}}$ we have

\begin{displaymath}\data{G}_{\data{R}}=
\left(\begin{array}{rrrrrr}
-0.240& 0.6...
...0.206& 0.479& 0.780& 0.306& -0.069& -0.138
\end{array}\right),\end{displaymath}

which gives the weights of the variables (milk, vegetables, etc.). The eigenvalues $\ell_j$ and the proportions of explained variance are given in Table 9.3.

Table 9.3: Eigenvalues and explained variance

\begin{tabular}{ccc}
eigenvalue & proportion of variance & cumulated proportion (\%)\\
\hline
4.333 & 0.6190 & 61.9\\
1.830 & 0.2620 & 88.1\\
0.631 & 0.0900 & 97.1\\
0.128 & 0.0180 & 98.9\\
0.058 & 0.0080 & 99.7\\
0.019 & 0.0030 & 99.9\\
0.001 & 0.0001 & 100.0\\
\end{tabular}


The interpretation of the principal components is best understood by looking at the correlations between the original $X_i$'s and the PCs. Since the first two PCs explain 88.1% of the variance, we restrict the analysis to them. The results are shown in Table 9.4.

Table 9.4: Correlations with PCs

\begin{tabular}{lrrr}
 & $r_{X_iZ_1}$ & $r_{X_iZ_2}$ & $r^2_{X_iZ_1} + r^2_{X_iZ_2}$\\
\hline
$X_1$: bread & $-0.499$ & $0.842$ & $0.957$\\
$X_2$: vegetables & $-0.970$ & $0.133$ & $0.958$\\
$X_3$: fruits & $-0.929$ & $-0.278$ & $0.941$\\
$X_4$: meat & $-0.962$ & $-0.191$ & $0.962$\\
$X_5$: poultry & $-0.911$ & $-0.266$ & $0.901$\\
$X_6$: milk & $-0.584$ & $0.707$ & $0.841$\\
$X_7$: wine & $0.428$ & $0.648$ & $0.604$\\
\end{tabular}
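As a check of how Table 9.4 relates to the weights and eigenvalues reported above (assuming the first row of $\data{G}_{\data{R}}$ corresponds to $X_1$, bread), relation (9.25) reproduces the first entry of the table:

\begin{displaymath}
r_{X_1 Z_1} = \sqrt{\ell_1}\, g_{\data{R},11} \approx \sqrt{4.333}\,(-0.240) \approx -0.499 .
\end{displaymath}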


The two-dimensional graphical representation of the variables in Figure 9.8 is based on the first two columns of Table 9.4.

The plots are the projections of the variables into $\mathbb{R}^2$. Since the quality of the representation is good for all the variables (except maybe $X_7$), their relative angles give a picture of their original correlations: wine is negatively correlated with the vegetables, fruits, meat and poultry groups ($\theta >90^o$), whereas the variables in this latter group are highly positively correlated with each other ( $\theta \approx 0$). Bread and milk are positively correlated with each other but poorly correlated with meat, fruits and poultry ( $\theta \approx 90^o$).

Now the representation of the individuals in Figure 9.7 can be interpreted better. From Figure 9.8 and Table 9.4 we can see that the first factor $Z_1$ is a vegetable-meat-poultry-fruit factor (with a negative sign), whereas the second factor is a milk-bread-wine factor (with a positive sign). Note that this corresponds to the most important weights in the first columns of ${\data{G}}_{\data{R}}$. In Figure 9.7 lines were drawn to connect families of the same size and families of the same professional type. A grid can clearly be seen (with a slight deformation by the manager families) that shows the families with higher expenditures (higher numbers of children) on the left.

Considering both figures together explains what types of expenditures are responsible for similarities in food expenditures. Bread, milk and wine expenditures are similar for manual workers and employees. Families of managers are characterized by higher expenditures on vegetables, fruits, meat and poultry. Very often when analyzing NPCs (and PCs), it is illuminating to use such a device to introduce qualitative aspects of individuals in order to enrich the interpretations of the graphs.

Summary
$\ast$
NPCs are PCs applied to the standardized (normalized) data matrix $\data{X}_{S}$.
$\ast$
The graphical representation of NPCs provides a similar type of picture as that of PCs, the difference being in the relative position of individuals, i.e., each variable in NPCs has the same weight (in PCs, the variable with the largest variance has the largest weight).
$\ast$
The quality of the representation is evaluated by $ \psi=
({\sum_{j=1}^p \ell_{j}})^{-1}
{(\ell_{1} + \ell_{2} + \ldots + \ell_{q})}.$
$\ast$
The quality of the representation of a variable can be evaluated by the percentage of $X_i$'s variance that is explained by a PC, i.e., $r^2_{X_i Y_j}$.