A set of transformations was defined in Chapter 1
for the Boston Housing data set that resulted in ``regular''
marginal distributions.
The usefulness of principal component analysis for such
high-dimensional data sets will now be shown.
The Charles River indicator is dropped because it is a discrete
0-1 variable. It will be used later, however, in the graphical
representations. The difference in scale among the
remaining 13 variables motivates a normalized PCA (NPCA) based on the correlation matrix.
The eigenvalues and the percentage of explained
variance are given in Table 9.5.
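As an illustration of the computation behind such a table, the following is a minimal sketch of an NPCA, assuming NumPy is available. It runs on synthetic data with 13 columns of very different scales, not on the Boston Housing data, and the function name `npca` is a hypothetical helper introduced here.

```python
import numpy as np

def npca(X):
    """Normalized PCA: eigen-decomposition of the correlation matrix of X (n x p)."""
    # Standardize each column to mean 0 and unit sample variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the original variables.
    R = np.corrcoef(X, rowvar=False)
    # eigh returns eigenvalues in ascending order; sort them in descending order.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Proportion of total variance explained by each component.
    explained = eigvals / eigvals.sum()
    # Principal component scores of the standardized data.
    scores = Z @ eigvecs
    return eigvals, eigvecs, explained, scores

# Synthetic stand-in data (assumption): 200 observations, 13 variables
# whose scales range from 0.1 to 1000.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13)) * np.logspace(-1, 3, 13)

eigvals, eigvecs, explained, scores = npca(X)
for k, (lam, prop, cum) in enumerate(
        zip(eigvals, explained, np.cumsum(explained)), start=1):
    print(f"PC{k:2d}  eigenvalue={lam:6.3f}  "
          f"proportion={prop:6.3f}  cumulative={cum:6.3f}")
```

Because the correlation matrix has ones on its diagonal, the eigenvalues sum to the number of variables (here 13), so each proportion is simply the eigenvalue divided by 13.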
The first principal component explains 56% of the total variance, and the first three components together explain more than 75%. It is therefore sufficient to examine two, or at most three, principal components.
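The rule of keeping enough components to reach a given share of the total variance can be sketched as follows. The eigenvalues below are hypothetical values chosen only so that they sum to 13; they are not the entries of Table 9.5, and the function name `n_components_for` is an assumption made for this example.

```python
import numpy as np

def n_components_for(eigvals, threshold):
    """Smallest number of leading components whose cumulative
    proportion of variance reaches `threshold` (a value in (0, 1])."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # Index of the first cumulative proportion >= threshold, plus one.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding when threshold is (nearly) 1.
    return min(k, len(eigvals))

# Hypothetical eigenvalues of a 13 x 13 correlation matrix (illustrative only).
hypothetical_eigvals = [6.0, 2.0, 1.5, 1.0, 0.8, 0.6, 0.4,
                        0.3, 0.2, 0.1, 0.05, 0.03, 0.02]

print(n_components_for(hypothetical_eigvals, 0.75))
```

A fixed threshold is only one possible criterion; a scree plot of the eigenvalues is a common alternative.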
Table 9.6 provides the correlations between the first three PCs and the original variables. These correlations are displayed in Figure 9.10.
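Correlations of this kind can be computed in two equivalent ways in an NPCA: directly from the data and the PC scores, or in closed form as the eigenvector entry times the square root of the corresponding eigenvalue. The sketch below checks this on synthetic data, assuming NumPy; the variable names (`G`, `lam`, `loadings`) are choices made for this example.

```python
import numpy as np

# Synthetic stand-in data (assumption): 300 observations, 5 correlated variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
X[:, 1] += 0.8 * X[:, 0]
X[:, 2] -= 0.5 * X[:, 0]
p = X.shape[1]

# Standardize, then decompose the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)
lam, G = np.linalg.eigh(R)
order = np.argsort(lam)[::-1]
lam, G = lam[order], G[:, order]
Y = Z @ G  # PC scores

# Closed form: corr(X_i, Y_j) = g_ij * sqrt(lambda_j).
loadings = G * np.sqrt(lam)

# Direct computation: correlate every variable with every PC.
direct = np.corrcoef(np.hstack([Z, Y]), rowvar=False)[:p, p:]

# Correlations of each variable with the first three PCs.
print(np.round(loadings[:, :3], 3))
```

Each row of `loadings` has squared entries summing to one, so plotting the correlations with two PCs places every variable inside the unit circle.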
The correlations with the first PC show a very clear pattern. The
variables
, and
are strongly positively
correlated with the first PC, whereas the remaining variables are highly
negatively correlated. The smallest correlation in absolute value
is 0.5. The first PC axis could be interpreted as a quality of life
and house indicator. The second axis, given the polarities of
and
and of
and
, can be interpreted as a social factor
that explains only 10% of the total variance. The third axis is
dominated by a polarity between
and
.
The individuals plotted in the plane of the first two PCs can be
interpreted graphically if the points are color coded
with respect to a particular variable of interest.
Figure 9.11 color codes
as red
points. Clearly the first and second PCs are related
to house value. The situation is less clear in Figure 9.12
where the color code corresponds to
, the Charles River indicator,
i.e., houses near the river are colored red.