The empirical PCs (normalized or not) turn out to be equivalent to the factors that one would obtain by decomposing the appropriate data matrix into its factors (see Chapter 8). It will be shown that the PCs are the factors representing the rows of the centered data matrix and that the NPCs correspond to the factors of the standardized data matrix. The representation of the columns of the standardized data matrix provides (up to a scale factor) the correlations between the NPCs and the original variables. The derivation of the (N)PCs presented above will have a nice geometric justification here since they are the best fit in subspaces generated by the columns of the (transformed) data matrix $\mathcal{X}$. This analogy provides complementary interpretations of the graphical representations shown above.
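Assume, as in Chapter 8, that we want to obtain representations of the individuals (the rows of $\mathcal{X}$) and of the variables (the columns of $\mathcal{X}$) in spaces of smaller dimension. To keep the representations simple, some prior transformations are performed. Since the origin has no particular statistical meaning in the space of individuals, we first shift the origin to the center of gravity, $\bar{x}$, of the point cloud. This is the same as analyzing the centered data matrix $\mathcal{X}_C = \mathcal{H}\mathcal{X}$, where $\mathcal{H}$ is the centering matrix. Now all of the variables have zero means, thus the technique used in Chapter 8 can be applied to the matrix $\mathcal{X}_C$. Note that the spectral decomposition of $\mathcal{X}_C^{\top}\mathcal{X}_C$ is related to that of the empirical covariance matrix $\mathcal{S}_X$, namely

$$\mathcal{X}_C^{\top}\mathcal{X}_C = \mathcal{X}^{\top}\mathcal{H}^{\top}\mathcal{H}\mathcal{X} = n\,\mathcal{S}_X = n\,\mathcal{G}\mathcal{L}\mathcal{G}^{\top}. \tag{9.28}$$

The factorial variables are obtained by projecting $\mathcal{X}_C$ on $\mathcal{G}$,

$$\mathcal{Y} = \mathcal{X}_C\mathcal{G} = (y_1, \dots, y_p). \tag{9.29}$$

These are the principal components obtained above. Since $\mathcal{H}\mathcal{X}_C = \mathcal{X}_C$, it follows that

$$\bar{y} = 0, \tag{9.30}$$

$$\mathcal{S}_Y = \mathcal{G}^{\top}\mathcal{S}_X\mathcal{G} = \mathcal{L} = \operatorname{diag}(\ell_1, \dots, \ell_p). \tag{9.31}$$

The scatterplot of the individuals on the factorial axes is thus centered at the origin and is more spread out along the first axis (variance $\ell_1$) than along the second (variance $\ell_2$).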
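The relations (9.28) to (9.31) can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming synthetic data and the divisor $n$ in the empirical covariance matrix. The names `X_C`, `G`, `ell`, `Y` and `S_Y` are illustrative choices for this sketch, not notation taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
# Synthetic data with correlated columns and a nonzero mean (illustrative only).
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p)) + 5.0

H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
X_C = H @ X                            # centered data matrix
S = X_C.T @ X_C / n                    # empirical covariance matrix (divisor n)

# Spectral decomposition S = G L G^T with eigenvalues in decreasing order.
ell, G = np.linalg.eigh(S)
order = np.argsort(ell)[::-1]
ell, G = ell[order], G[:, order]

Y = X_C @ G                            # principal components of the individuals
S_Y = Y.T @ Y / n                      # covariance matrix of the PCs (Y has mean zero)
```

Under these assumptions, `Y.mean(axis=0)` is zero up to rounding and `S_Y` is the diagonal matrix of the eigenvalues `ell`, so the first column of `Y` has the largest spread.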
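The representation of the variables can be obtained using the Duality Relations (8.11) and (8.12). Let $g_k$ and $\ell_k$ denote the eigenvectors and eigenvalues of $\mathcal{S}_X$. The projections of the columns of the centered data matrix $\mathcal{X}_C$ onto the eigenvectors $v_k$ of $\mathcal{X}_C\mathcal{X}_C^{\top}$ are

$$\mathcal{X}_C^{\top}v_k = \frac{1}{\sqrt{n\ell_k}}\,\mathcal{X}_C^{\top}\mathcal{X}_C\,g_k = \sqrt{n\ell_k}\,g_k. \tag{9.32}$$

Thus the projections of the variables on the first $p$ axes are the columns of the matrix

$$\mathcal{X}_C^{\top}\mathcal{V} = \sqrt{n}\,\mathcal{G}\mathcal{L}^{1/2}. \tag{9.33}$$

The angle between two columns of $\mathcal{X}_C$ has a statistical interpretation. Writing $x_{C[j]}$ for the $j$-th column of $\mathcal{X}_C$, we have

$$x_{C[j]}^{\top}x_{C[k]} = n\,s_{X_jX_k}, \tag{9.34}$$

$$\|x_{C[j]}\|^2 = n\,s_{X_jX_j}, \tag{9.35}$$

so the cosine of the angle $\theta_{jk}$ between $x_{C[j]}$ and $x_{C[k]}$ in the full space equals the empirical correlation $r_{X_jX_k}$. The relative positions of the variables in the scatterplot of the first columns of $\mathcal{X}_C^{\top}\mathcal{V}$ can therefore be read as a picture of the correlation structure of the data, to the extent that the chosen axes explain enough of the variance.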
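The duality between the two representations, and the link between angles and correlations, can be illustrated on synthetic data. This is a sketch under the same assumptions as before (divisor $n$, strictly positive eigenvalues); the names `V`, `var_coords` and `cosines` are hypothetical and chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # illustrative data
X_C = X - X.mean(axis=0)               # centered data matrix

S = X_C.T @ X_C / n
ell, G = np.linalg.eigh(S)
order = np.argsort(ell)[::-1]
ell, G = ell[order], G[:, order]

# Duality: v_k = X_C g_k / sqrt(n * ell_k) are unit eigenvectors of X_C X_C^T.
V = X_C @ G / np.sqrt(n * ell)
var_coords = X_C.T @ V                 # coordinates of the variables on the factorial axes

# Cosine of the angle between two columns of X_C.
norms = np.linalg.norm(X_C, axis=0)
cosines = (X_C.T @ X_C) / np.outer(norms, norms)
```

Here `var_coords` agrees with $\sqrt{n}$ times the eigenvector matrix with its columns scaled by the square roots of the eigenvalues, and `cosines` coincides with the empirical correlation matrix of the columns of `X`.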
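The NPCs can also be viewed as a factorial method for reducing the dimension. The variables are again standardized so that each has mean zero and unit variance, which makes the analysis independent of the scale of the variables. The factorial analysis of the standardized data matrix $\mathcal{X}_S$ provides the NPCs. The spectral decomposition of $\mathcal{X}_S^{\top}\mathcal{X}_S$ is related to that of the correlation matrix $\mathcal{R}$, namely

$$\mathcal{X}_S^{\top}\mathcal{X}_S = n\,\mathcal{R} = n\,\mathcal{G}_R\mathcal{L}_R\mathcal{G}_R^{\top}.$$

The representation of the variables is again given by the columns of

$$\mathcal{X}_S^{\top}\mathcal{V}_R = \sqrt{n}\,\mathcal{G}_R\mathcal{L}_R^{1/2}.$$

Up to the factor $\sqrt{n}$, these projections are the correlations between the NPCs and the original variables.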
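This implies that a deeper interpretation of the representation of the individuals can be obtained by looking simultaneously at the graphs plotting the variables. Note that

$$x_{S[j]}^{\top}x_{S[k]} = n\,r_{X_jX_k}, \tag{9.38}$$

$$\|x_{S[j]}\|^2 = n, \tag{9.39}$$

where $x_{S[j]}$ and $x_{S[k]}$ denote the $j$-th and $k$-th columns of the standardized data matrix $\mathcal{X}_S$. Hence, in the full space, all standardized variables lie on the sphere in $\mathbb{R}^n$ that is centered at the origin and has radius $\sqrt{n}$, and the angle $\theta_{jk}$ between two of them satisfies

$$\cos\theta_{jk} = r_{X_jX_k}. \tag{9.40}$$

In a subspace of reduced dimension, the angles between the projected variables thus give a picture of their correlations, provided that the variables are well represented in that subspace.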
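The corresponding statements for the standardized data matrix can be verified in the same way. The sketch below assumes standardization with the divisor-$n$ standard deviation (NumPy's default) and uses synthetic data; `X_S`, `G_R`, `ell_R`, `Z` and `corr_XZ` are names introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # illustrative data

# Standardize each column: mean zero, unit variance (divisor n, i.e. ddof=0).
X_S = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(X, rowvar=False)       # empirical correlation matrix

ell_R, G_R = np.linalg.eigh(R)
order = np.argsort(ell_R)[::-1]
ell_R, G_R = ell_R[order], G_R[:, order]

Z = X_S @ G_R                                    # NPCs of the individuals
var_coords = np.sqrt(n) * G_R * np.sqrt(ell_R)   # coordinates of the variables

# Correlation between variable j (rows) and NPC k (columns), computed from the data.
corr_XZ = np.corrcoef(X, Z, rowvar=False)[:p, p:]
```

Dividing `var_coords` by $\sqrt{n}$ reproduces `corr_XZ`, and every column of `X_S` has Euclidean norm $\sqrt{n}$, which is why the variables can be drawn inside a circle of that radius.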
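As said before, an overall measure of the quality of the representation is given by the proportion of variance explained by the first $q$ PCs,

$$\psi = \frac{\ell_1 + \ell_2 + \dots + \ell_q}{\sum_{j=1}^{p}\ell_j}.$$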
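It can be useful to check if each individual is well represented by the PCs. Clearly, the proximity of two individuals in the projected space may not coincide with their proximity in the full original space $\mathbb{R}^p$, which may lead to erroneous interpretations of the graphs. In this respect, it is worth computing the angle $\vartheta_{ik}$ between the representation of an individual $x_i$ and the $k$-th PC or NPC axis. This can be done using (2.40), i.e.,

$$\cos\vartheta_{ik} = \frac{y_i^{\top}e_k}{\|y_i\|\,\|e_k\|} = \frac{y_{ik}}{\|x_{Ci}\|}$$

for the PCs, or analogously

$$\cos\zeta_{ik} = \frac{z_i^{\top}e_k}{\|z_i\|\,\|e_k\|} = \frac{z_{ik}}{\|x_{Si}\|}$$

for the NPCs, where $e_k$ denotes the $k$-th unit vector and $x_{Ci}$, $x_{Si}$ denote the centered and standardized observation vectors of the $i$-th individual. An individual is well represented on the $k$-th axis when the squared cosine is close to one.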
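Both quality measures for the individuals are easy to compute. The following sketch assumes synthetic data, the divisor $n$, and a projection on the first $q = 2$ axes; `psi`, `cos_ik` and `quality_i` are hypothetical names used only here.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 30, 5, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # illustrative data
X_C = X - X.mean(axis=0)

ell, G = np.linalg.eigh(X_C.T @ X_C / n)
order = np.argsort(ell)[::-1]
ell, G = ell[order], G[:, order]
Y = X_C @ G                                    # PC coordinates of the individuals

psi = ell[:q].sum() / ell.sum()                # overall share of variance kept by q PCs

# Cosine of the angle between individual i and the k-th PC axis: y_ik / ||x_Ci||.
cos_ik = Y / np.linalg.norm(X_C, axis=1, keepdims=True)
quality_i = (cos_ik[:, :q] ** 2).sum(axis=1)   # squared cosine with the first q axes
```

Because the rotation by the eigenvectors preserves lengths, the squared cosines of each individual sum to one over all $p$ axes, so `quality_i` lies between zero and one. Individuals with a small `quality_i` are poorly represented in the plot, and their apparent proximity to other points should be read with caution.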
We already know that the quality of the representation of the variables can be evaluated by the percentage of $X_i$'s variance that is explained by a PC, which is given by $r^2_{X_iY_j}$ or $r^2_{X_iZ_j}$ according to (9.16) and (9.27) respectively.
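As an illustration, the squared correlations between the variables and the PCs can be computed directly from the spectral decomposition. This sketch assumes synthetic data and the divisor $n$; the array `r2` and the choice of the first two PCs are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 45, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # illustrative data
X_C = X - X.mean(axis=0)
S = X_C.T @ X_C / n

ell, G = np.linalg.eigh(S)
order = np.argsort(ell)[::-1]
ell, G = ell[order], G[:, order]

# r2[i, j]: share of the variance of variable i explained by the j-th PC,
# i.e. ell_j * g_ij^2 / s_ii.
r2 = (G ** 2) * ell / np.diag(S)[:, None]
r2_first_two = r2[:, :2].sum(axis=1)   # quality of each variable in the first factorial plane
```

Each row of `r2` sums to one, so `r2_first_two` gives the fraction of each variable's variance captured by the first two PCs. A variable with a low value is poorly represented in the plane, and its angles with other variables there should not be over-interpreted.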
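Calculating the matrix $\mathcal{G}_R$, we obtain the weights of the variables in the NPCs.
[Matrix $\mathcal{G}_R$ not reproduced.]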
The interpretation of the principal components is best understood when looking at the correlations between the original $X_i$'s and the PCs. Since the first two PCs explain 88.1% of the variance, we limit ourselves to the first two PCs. The results are shown in Table 9.4.
The plots are the projections of the variables into $\mathbb{R}^2$. Since the quality of the representation is good for all the variables (with perhaps one exception), their relative angles give a picture of their original correlation: wine is negatively correlated with the vegetables, fruits, meat and poultry groups (angles greater than $90^\circ$), whereas the variables in this latter group are highly positively correlated with each other (angles close to $0^\circ$). Bread and milk are positively correlated with each other but poorly correlated with meat, fruits and poultry (angles close to $90^\circ$).
Now the representation of the individuals in Figure 9.7 can be interpreted better. From Figure 9.8 and Table 9.4 we can see that the first factor is a vegetable-meat-poultry-fruit factor (with a negative sign), whereas the second factor is a milk-bread-wine factor (with a positive sign). Note that this corresponds to the most important weights in the first columns of $\mathcal{G}_R$. In Figure 9.7 lines were drawn to connect families of the same size and families of the same professional types. A grid can clearly be seen (with a slight deformation by the manager families) that shows the families with higher expenditures (higher number of children) on the left.
Considering both figures together explains what types of expenditures are responsible for similarities in food expenditures. Bread, milk and wine expenditures are similar for manual workers and employees. Families of managers are characterized by higher expenditures on vegetables, fruits, meat and poultry. Very often when analyzing NPCs (and PCs), it is illuminating to use such a device to introduce qualitative aspects of individuals in order to enrich the interpretations of the graphs.