11.4 Boston Housing

Figure 11.7: Dendrograms of the Boston housing data using the Ward algorithm. MVAclusbh.xpl
\includegraphics[width=1\defepswidth]{MVAclusbh1.ps}


Table 11.3: Means and standard errors of the 13 standardized variables for Cluster 1 (251 observations) and Cluster 2 (255 observations). MVAclusbh.xpl
\begin{tabular}{rrrrr}
Variable & Mean C1 & $SE$ C1 & Mean C2 & $SE$ C2 \\
\hline
$1$ & $-0.7105$ & $0.0332$ & $0.6994$ & $0.0535$ \\
$2$ & $0.4848$ & $0.0786$ & $-0.4772$ & $0.0047$ \\
$3$ & $-0.7665$ & $0.0510$ & $0.7545$ & $0.0279$ \\
$5$ & $-0.7672$ & $0.0365$ & $0.7552$ & $0.0447$ \\
$6$ & $0.4162$ & $0.0571$ & $-0.4097$ & $0.0576$ \\
$7$ & $-0.7730$ & $0.0429$ & $0.7609$ & $0.0378$ \\
$8$ & $0.7140$ & $0.0472$ & $-0.7028$ & $0.0417$ \\
$9$ & $-0.5429$ & $0.0358$ & $0.5344$ & $0.0656$ \\
$10$ & $-0.6932$ & $0.0301$ & $0.6823$ & $0.0569$ \\
$11$ & $-0.5464$ & $0.0469$ & $0.5378$ & $0.0582$ \\
$12$ & $0.3547$ & $0.0080$ & $-0.3491$ & $0.0824$ \\
$13$ & $-0.6899$ & $0.0401$ & $0.6791$ & $0.0509$ \\
$14$ & $0.5996$ & $0.0431$ & $-0.5902$ & $0.0570$ \\
\end{tabular}


We have motivated the transformation of the variables of the Boston housing data many times before. We now illustrate cluster analysis on the transformed data $\widetilde{\data{X}}$, excluding $\widetilde X_4$ (the Charles River indicator). Of the algorithms tried, only the results of the Ward algorithm are presented, since this algorithm gave the most sensible results. To be consistent with our previous analysis, we standardize each variable. The dendrogram of the Ward method is displayed in Figure 11.7. Two dominant clusters are visible. A further refinement into, say, four clusters could be considered at a lower level of distance.
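
The steps described above (standardize, apply Ward linkage, cut the tree into two or four clusters) can be sketched in Python with SciPy. This is only an illustration under stated assumptions: the matrix `X` is a synthetic stand-in for the $506 \times 13$ transformed data, not the Boston housing data itself, and SciPy's `linkage` is not the quantlet MVAclusbh.xpl used for the figures, so merge heights need not be on the same scale as in Figure 11.7.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic stand-in for the 506 x 13 transformed data matrix.
# Two artificial groups are generated only so that the example runs.
X = np.vstack([
    rng.normal(-1.0, 1.0, size=(251, 13)),
    rng.normal(+1.0, 1.0, size=(255, 13)),
])

# Standardize each variable to mean 0 and unit standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Ward linkage on Euclidean distances between the standardized rows.
tree = linkage(Z, method="ward")

# Cut the tree into two clusters, and into four as a possible refinement.
labels2 = fcluster(tree, t=2, criterion="maxclust")
labels4 = fcluster(tree, t=4, criterion="maxclust")

print("sizes, 2 clusters:", np.bincount(labels2)[1:])
print("sizes, 4 clusters:", np.bincount(labels4)[1:])
```

Because both label vectors come from cuts of the same tree, each of the four clusters lies entirely inside one of the two dominant clusters, which is what "refinement at a lower level of distance" means here.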

To interpret the two clusters, we present the mean values and their respective standard errors of the thirteen $\widetilde{\data{X}}$ variables by group in Table 11.3. Comparing the mean values of the two groups shows that all the differences in the means are individually significant. Cluster one corresponds to housing districts with better living quality and higher house prices, whereas cluster two corresponds to less favored districts in Boston. This is confirmed, for instance, by a lower crime rate, a higher proportion of residential land, a lower proportion of Black residents, etc. for cluster one. Cluster two is identified by a higher proportion of older houses, a higher pupil/teacher ratio and a higher percentage of the lower status population.
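
The claim that every difference in means is individually significant can be checked by simple arithmetic on the values in Table 11.3. The sketch below assumes a two-sample statistic of the form $(\bar x_1 - \bar x_2)/\sqrt{SE_1^2 + SE_2^2}$; the text does not say which test was used, so this choice, and the names `table` and `z_stat`, are illustrative assumptions.

```python
import math

# (variable, mean C1, SE C1, mean C2, SE C2), copied from Table 11.3.
table = [
    (1, -0.7105, 0.0332, 0.6994, 0.0535),
    (2, 0.4848, 0.0786, -0.4772, 0.0047),
    (3, -0.7665, 0.0510, 0.7545, 0.0279),
    (5, -0.7672, 0.0365, 0.7552, 0.0447),
    (6, 0.4162, 0.0571, -0.4097, 0.0576),
    (7, -0.7730, 0.0429, 0.7609, 0.0378),
    (8, 0.7140, 0.0472, -0.7028, 0.0417),
    (9, -0.5429, 0.0358, 0.5344, 0.0656),
    (10, -0.6932, 0.0301, 0.6823, 0.0569),
    (11, -0.5464, 0.0469, 0.5378, 0.0582),
    (12, 0.3547, 0.0080, -0.3491, 0.0824),
    (13, -0.6899, 0.0401, 0.6791, 0.0509),
    (14, 0.5996, 0.0431, -0.5902, 0.0570),
]


def z_stat(m1, se1, m2, se2):
    """Difference of means divided by the combined standard error.

    ASSUMPTION: the two clusters are treated as independent samples.
    """
    return (m1 - m2) / math.sqrt(se1 ** 2 + se2 ** 2)


z = {var: z_stat(m1, se1, m2, se2) for var, m1, se1, m2, se2 in table}

# Variable with the weakest separation between the two clusters.
weakest = min(z, key=lambda var: abs(z[var]))

for var, value in z.items():
    print(f"variable {var:2d}: z = {value:7.2f}")
print("weakest separation: variable", weakest)
```

Under this assumption the smallest statistic in absolute value belongs to variable 12 and is roughly 8.5, far beyond any conventional critical value, which is consistent with the statement in the text.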

Figure 11.8: Scatterplot matrix for variables $\widetilde X_{1}$ to $\widetilde X_{7}$ of the Boston housing data. MVAclusbh.xpl
\includegraphics[width=1\defepswidth]{MVAclusbh3.ps}

Figure 11.9: Scatterplot matrix for variables $\widetilde X_{8}$ to $\widetilde X_{14}$ of the Boston housing data. MVAclusbh.xpl
\includegraphics[width=1\defepswidth]{MVAclusbh4.ps}

Figure 11.10: Scatterplot of the first two PCs displaying the two clusters. MVAclusbh.xpl
\includegraphics[width=1\defepswidth]{MVAclusbh2.ps}

This interpretation is supported by visual inspection of all the variables in the scatterplot matrices of Figures 11.8 and 11.9. For example, the lower right boxplot of Figure 11.9 and the correspondingly colored clusters in the last row confirm the role of each variable in determining the clusters. The interpretation agrees with the previous PC analysis (Figure 9.11). The quality of life factor is clearly visible in Figure 11.10, where cluster membership is distinguished by the shape and color of the points plotted against the first two principal components. The first PC separates the two clusters completely and corresponds, as discussed in Chapter 9, to a quality of life and house indicator.
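
The link between the clusters and the first principal component can be illustrated as follows. This is a sketch on synthetic data, not a reproduction of Figure 11.10: `X` is an assumed stand-in for the transformed Boston housing data, the PCs are obtained from an SVD of the standardized matrix, and `agreement` is a hypothetical summary that measures how well the sign of the first PC score reproduces the two Ward clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)

# ASSUMPTION: synthetic 506 x 13 stand-in with two artificial groups.
X = np.vstack([
    rng.normal(-1.0, 1.0, size=(251, 13)),
    rng.normal(+1.0, 1.0, size=(255, 13)),
])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Two clusters from the Ward tree, labelled 1 and 2.
labels = fcluster(linkage(Z, method="ward"), t=2, criterion="maxclust")

# Principal components of the standardized data via SVD.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * s                      # PC scores, one column per component
explained = s ** 2 / np.sum(s ** 2)  # share of variance per component

# Share of observations whose sign on PC1 matches their cluster.
# The sign of a PC is arbitrary, so both orientations are tried.
side = scores[:, 0] > 0
agreement = max(np.mean(side == (labels == 1)),
                np.mean(side == (labels == 2)))

print("variance explained by PC1:", round(float(explained[0]), 3))
print("agreement of PC1 sign with clusters:", round(float(agreement), 3))
```

An agreement close to one means that a vertical line at zero in the plot of the first two PCs would split the observations almost exactly as the clustering does, which is the pattern the text describes for Figure 11.10.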