9.8 Boston Housing

A set of transformations was defined in Chapter 1 for the Boston Housing data set that resulted in ``regular'' marginal distributions. The usefulness of principal component analysis for such high-dimensional data sets will now be shown. The variable $X_4$ is dropped because it is a discrete 0-1 variable; it will, however, be used later in the graphical representations. The differences in scale among the remaining 13 variables motivate an NPCA based on the correlation matrix.
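As a rough illustration of what an NPCA on the correlation matrix involves, the following sketch standardizes each column, decomposes the correlation matrix, and computes the PC scores. It is not the book's `MVAnpcahous` quantlet: the function name `npca` and the synthetic stand-in data are assumptions made for this example, and the Boston Housing data are not used.

```python
import numpy as np

def npca(X):
    """Normalized PCA sketch: eigen-decomposition of the correlation matrix of X (n x p)."""
    X = np.asarray(X, dtype=float)
    # Standardize every column to mean 0 and unit (sample) variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)        # p x p correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)    # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                    # principal component scores
    return eigvals, eigvecs, scores

# Synthetic stand-in data (NOT the Boston Housing data):
# four columns measured on very different scales.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base + 0.3 * rng.normal(size=(200, 1)),
    1000.0 * (base + 0.5 * rng.normal(size=(200, 1))),
    0.01 * rng.normal(size=(200, 1)),
    50.0 * (-base + 0.4 * rng.normal(size=(200, 1))),
])

eigvals_demo, eigvecs_demo, scores_demo = npca(X_demo)
share_demo = eigvals_demo / eigvals_demo.sum()   # proportion of explained variance
```

Because every standardized variable has unit variance, the eigenvalues add up to the number of variables, so no single large-scale column can dominate the decomposition.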

The eigenvalues and the percentage of explained variance are given in Table 9.5.

Table 9.5: Eigenvalues and percentage of explained variance for Boston Housing data. MVAnpcahous.xpl
eigenvalue percentages cumulated percentages
7.2852 0.5604 0.5604
1.3517 0.1040 0.6644
1.1266 0.0867 0.7510
0.7802 0.0600 0.8111
0.6359 0.0489 0.8600
0.5290 0.0407 0.9007
0.3397 0.0261 0.9268
0.2628 0.0202 0.9470
0.1936 0.0149 0.9619
0.1547 0.0119 0.9738
0.1405 0.0108 0.9846
0.1100 0.0085 0.9931
0.0900 0.0069 1.0000


The first principal component explains 56% of the total variance, and the first three components together explain more than 75%. These results imply that two, or at most three, principal components are sufficient.
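The percentages in Table 9.5 can be recomputed from the eigenvalues alone. The short sketch below uses only the eigenvalues listed in the table; the variable names are chosen for this example.

```python
from itertools import accumulate

# Eigenvalues copied from Table 9.5.
eigenvalues = [7.2852, 1.3517, 1.1266, 0.7802, 0.6359, 0.5290, 0.3397,
               0.2628, 0.1936, 0.1547, 0.1405, 0.1100, 0.0900]

# For an NPCA the total is (up to rounding) the number of variables, here 13.
total = sum(eigenvalues)
proportions = [l / total for l in eigenvalues]
cumulated = list(accumulate(proportions))

for l, p, c in zip(eigenvalues, proportions, cumulated):
    print(f"{l:7.4f} {p:7.4f} {c:7.4f}")
```

The cumulated proportion first exceeds 0.75 at the third component, which is the basis for keeping two or three PCs.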


Table 9.6: Correlations of the first three PCs with the original variables. MVAnpcahous.xpl
  $PC_1$ $PC_2$ $PC_3$
$X_{1}$ $-$0.9076 0.2247 0.1457
$X_{2}$ 0.6399 $-$0.0292 0.5058
$X_{3}$ $-$0.8580 0.0409 $-$0.1845
$X_{5}$ $-$0.8737 0.2391 $-$0.1780
$X_{6}$ 0.5104 0.7037 0.0869
$X_{7}$ $-$0.7999 0.1556 $-$0.2949
$X_{8}$ 0.8259 $-$0.2904 0.2982
$X_{9}$ $-$0.7531 0.2857 0.3804
$X_{10}$ $-$0.8114 0.1645 0.3672
$X_{11}$ $-$0.5674 $-$0.2667 0.1498
$X_{12}$ 0.4906 $-$0.1041 $-$0.5170
$X_{13}$ $-$0.7996 $-$0.4253 $-$0.0251
$X_{14}$ 0.7366 0.5160 $-$0.1747


Table 9.6 provides the correlations between the first three PCs and the original variables. They are displayed in Figure 9.10.

Figure 9.10: NPCA for the Boston housing data, correlations of first three PCs with the original variables. MVAnpcahousi.xpl
\includegraphics[width=1\defpicwidth]{npcahousi.ps}

The correlations with the first PC show a very clear pattern. The variables $X_2, X_6, X_8, X_{12}$, and $X_{14}$ are strongly positively correlated with the first PC, whereas the remaining variables are highly negatively correlated. The smallest correlation in absolute value is 0.49 (for $X_{12}$). The first PC axis could be interpreted as an indicator of quality of life and housing. The second axis, given the polarities of $X_{11}$ and $X_{13}$ and of $X_6$ and $X_{14}$, can be interpreted as a social factor; it explains only about 10% of the total variance. The third axis is dominated by a polarity between $X_2$ and $X_{12}$.
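The sign pattern described above can be read directly from Table 9.6, and the table can also be checked against Table 9.5. In an NPCA the correlation between $X_i$ and $PC_j$ is $g_{ij}\sqrt{\ell_j}$, with $g_j$ the unit-length eigenvector belonging to the eigenvalue $\ell_j$, so the squared correlations in column $j$ add up to $\ell_j$. The following sketch uses only the tabulated values; the variable names are chosen for this example.

```python
# Correlations copied from Table 9.6: (variable, PC1, PC2, PC3).
corr = [
    ("X1",  -0.9076,  0.2247,  0.1457),
    ("X2",   0.6399, -0.0292,  0.5058),
    ("X3",  -0.8580,  0.0409, -0.1845),
    ("X5",  -0.8737,  0.2391, -0.1780),
    ("X6",   0.5104,  0.7037,  0.0869),
    ("X7",  -0.7999,  0.1556, -0.2949),
    ("X8",   0.8259, -0.2904,  0.2982),
    ("X9",  -0.7531,  0.2857,  0.3804),
    ("X10", -0.8114,  0.1645,  0.3672),
    ("X11", -0.5674, -0.2667,  0.1498),
    ("X12",  0.4906, -0.1041, -0.5170),
    ("X13", -0.7996, -0.4253, -0.0251),
    ("X14",  0.7366,  0.5160, -0.1747),
]

# First three eigenvalues copied from Table 9.5.
first_eigenvalues = [7.2852, 1.3517, 1.1266]

# Column-wise sums of squared correlations; each should match an eigenvalue.
col_ss = [sum(row[j] ** 2 for row in corr) for j in (1, 2, 3)]

# Variables with a positive correlation with the first PC.
positive_on_pc1 = [row[0] for row in corr if row[1] > 0]

# Variable with the weakest (absolute) correlation with the first PC.
weakest = min(corr, key=lambda row: abs(row[1]))

for j, (ss, l) in enumerate(zip(col_ss, first_eigenvalues), start=1):
    print(f"PC{j}: sum of squared correlations = {ss:.4f}, eigenvalue = {l:.4f}")
print("positive on PC1:", positive_on_pc1)
print("weakest on PC1:", weakest[0], abs(weakest[1]))
```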

Figure 9.11: NPC analysis for the Boston housing data, scatterplot of the first two PCs. More expensive houses are marked in red. MVAnpcahous.xpl
\includegraphics[width=1\defpicwidth]{MVAnpcahous.ps}

The scatterplot of the individuals on the first two PCs is easier to interpret if the points are color coded with respect to a variable of interest. Figure 9.11 shows the observations with $X_{14}> \textrm{median}$ as red points. Clearly the first and second PCs are related to house value. The situation is less clear in Figure 9.12, where the color code corresponds to $X_4$, the Charles River indicator, i.e., houses near the river are colored red.
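The median split used for the color coding can be sketched in a few lines. The helper `median_split_colors`, the choice of black for the remaining points, and the sample values are hypothetical and serve only to illustrate the rule $X_{14}> \textrm{median}$; they are not taken from the data set or from the book's quantlet.

```python
from statistics import median

def median_split_colors(values, above="red", below="black"):
    """Return one color per observation: `above` if the value exceeds the median."""
    cut = median(values)
    return [above if v > cut else below for v in values]

# Hypothetical values standing in for X14 (NOT taken from the data set).
x14_demo = [24.0, 21.6, 34.7, 16.5, 18.9, 50.0, 13.1]
colors_demo = median_split_colors(x14_demo)
print(colors_demo)
```

A list of color names like this can be passed as the `c` argument of matplotlib's `scatter` when plotting the first two PC scores. For a 0-1 variable such as $X_4$ the same idea applies with the rule "value equals 1" in place of the median split.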

Figure 9.12: NPC analysis for the Boston housing data, scatterplot of the first two PCs. Houses close to the Charles River are indicated with red squares. MVAnpcahous.xpl
\includegraphics[width=1\defpicwidth]{MVAnpcahous2.ps}