1.8 Boston Housing

Aim of the analysis

The Boston Housing data set was analyzed by Harrison and Rubinfeld (1978) who wanted to find out whether ``clean air'' had an influence on house prices. We will use this data set in this chapter and in most of the following chapters to illustrate the presented methodology. The data are described in Appendix B.1.

What can be seen from the PCPs

In order to highlight the relations of $X_{14}$ to the remaining 13 variables we color all of the observations with $X_{14}>$median$(X_{14})$ as red lines in Figure 1.24. Some of the variables seem to be strongly related. The most obvious relation is the negative dependence between $X_{13}$ and $X_{14}$. It can also be argued that there exists a strong dependence between $X_{12}$ and $X_{14}$ since no red lines are drawn in the lower part of $X_{12}$. The opposite can be said about $X_{11}$: there are only red lines plotted in the lower part of this variable. Low values of $X_{11}$ induce high values of $X_{14}$.

For the PCP, the variables have been rescaled over the interval $[0,1]$ for better graphical representations. The PCP shows that the variables are not distributed in a symmetric manner. It can be clearly seen that the values of $X_1$ and $X_9$ are much more concentrated around 0. Therefore it makes sense to consider transformations of the original data.
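The two preprocessing steps behind the PCP, rescaling each variable to $[0,1]$ and coloring observations by whether $X_{14}$ exceeds its median, can be sketched as follows. This is a minimal Python illustration and not the XploRe quantlet that produced the figure; the function names and the toy values are hypothetical, and min-max scaling is assumed as the rescaling rule.

```python
from statistics import median

def rescale_unit(values):
    """Linearly map values onto [0, 1] (min-max scaling, assumed here)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no spread; map it to 0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def above_median(values):
    """Flag observations that lie strictly above the sample median."""
    m = median(values)
    return [v > m for v in values]

# Toy values (hypothetical, not taken from the Boston Housing data)
x13 = [4.0, 9.0, 12.0, 17.0, 30.0]
x14 = [36.0, 24.0, 21.0, 15.0, 8.0]

scaled = rescale_unit(x13)
# Red lines mark observations with X14 above its median, black the rest.
colors = ["red" if flag else "black" for flag in above_median(x14)]
```

Each observation would then be drawn as one line through its rescaled coordinates, in the color assigned to it.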

Figure 1.24: Parallel coordinates plot for Boston Housing data. MVApcphousing.xpl
\includegraphics[width=1\defpicwidth]{pcp.ps}

The scatterplot matrix

One characteristic of the PCPs is that many lines are drawn on top of each other. This problem is reduced by depicting the variables in pairs of scatterplots. Including all 14 variables in one large scatterplot matrix is possible, but makes it hard to see anything from the plots. Therefore, for illustrative purposes we will analyze only one such matrix from a subset of the variables in Figure 1.25. On the basis of the PCP and the scatterplot matrix we would like to interpret each of the thirteen variables and its possible relation to the 14th variable. The figure includes plots for $X_1$-$X_5$ and $X_{14}$, although each variable is discussed in detail below. All references made to scatterplots in the following refer to Figure 1.25.

Figure 1.25: Scatterplot matrix for variables $X_1,\dots ,X_5$ and $X_{14}$ of the Boston Housing data. MVAdrafthousing.xpl
\includegraphics[width=1\defpicwidth]{MVAdrafthousing.ps}

Figure 1.26: Scatterplot matrix for variables $\widetilde X_1,\dots,\widetilde X_5$ and $\widetilde X_{14}$ of the Boston Housing data. MVAdrafthousingt.xpl
\includegraphics[width=1\defpicwidth]{MVAdrafthousingt.ps}

Per-capita crime rate $X_1$

Taking the logarithm makes the variable's distribution more symmetric. This can be seen in the boxplot of $\widetilde X_1$ in Figure 1.27, which shows that the median and the mean have moved closer to each other than they were for the original $X_1$. Plotting the kernel density estimate (KDE) of $\widetilde X_1=\log{(X_1)}$ would reveal that two subgroups might exist with different mean values. However, the scatterplots of the logarithms in Figure 1.26 that involve $X_1$ do not clearly reveal such groups. Given that the scatterplot of $\log{(X_1)}$ vs. $\log{(X_{14})}$ shows a relatively strong negative relation, it might be the case that the two subgroups of $X_1$ correspond to houses with two different price levels. This is confirmed by the two boxplots shown to the right of the $X_1$ vs. $X_2$ scatterplot (in Figure 1.25): the red boxplot's shape differs a lot from the black one's, having a much higher median and mean.
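The claim that the logarithm brings mean and median closer together can be checked numerically. The following Python sketch uses a small, hypothetical right-skewed sample (not the actual crime-rate data) and measures the distance between mean and median in units of the standard deviation; the helper name `mean_median_gap` is an assumption made for this illustration.

```python
import math
from statistics import mean, median, stdev

def mean_median_gap(values):
    """Distance between mean and median, in units of the sample standard deviation."""
    return abs(mean(values) - median(values)) / stdev(values)

# Hypothetical right-skewed sample (not taken from the Boston Housing data)
x1 = [0.01, 0.05, 0.1, 0.3, 1.0, 5.0, 20.0, 80.0]
log_x1 = [math.log(v) for v in x1]

gap_raw = mean_median_gap(x1)      # gap before the transformation
gap_log = mean_median_gap(log_x1)  # gap after taking logarithms
print(f"raw gap: {gap_raw:.2f}, log gap: {gap_log:.2f}")
```

For this sample the gap shrinks after the transformation, which is the same effect the boxplots display graphically.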

Proportion of residential area zoned for large lots $X_2$

Figure 1.25 shows a large cluster of observations for which $X_2$ is equal to 0. The scatterplot of $X_1$ vs. $X_2$ also shows a strong, though non-linear, negative relation between $X_1$ and $X_2$: almost all observations for which $X_2$ is high have an $X_1$-value close to zero, and vice versa, many observations for which $X_2$ is zero have quite a high per-capita crime rate $X_1$. This could be due to the location of the areas, e.g., downtown districts might have a higher crime rate and at the same time it is unlikely that any residential land there would be zoned in a generous manner.

As far as the house prices are concerned, there seems to be no clear (linear) relation between $X_2$ and $X_{14}$, but it is obvious that the more expensive houses are situated in areas where $X_2$ is large (this can be seen from the two boxplots in the second position on the diagonal, where the red one has a clearly higher mean/median than the black one).
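The comparison of red and black boxplots used throughout this section amounts to splitting a variable by whether $X_{14}$ lies above its median and comparing the location of the two groups. A minimal Python sketch of that comparison follows; the function name `split_summary` and the toy values are hypothetical and not taken from the data set.

```python
from statistics import mean, median

def split_summary(x, price):
    """Mean and median of x for districts above ('red') and not above
    ('black') the median price."""
    cut = median(price)
    red = [xi for xi, p in zip(x, price) if p > cut]
    black = [xi for xi, p in zip(x, price) if p <= cut]
    return {
        "red": {"mean": mean(red), "median": median(red)},
        "black": {"mean": mean(black), "median": median(black)},
    }

# Toy values (hypothetical): many zeros in x2, as described in the text
x2 = [0.0, 0.0, 0.0, 12.5, 20.0, 80.0]
x14 = [15.0, 18.0, 20.0, 22.0, 30.0, 45.0]

summary = split_summary(x2, x14)
```

A higher mean and median in the `"red"` group than in the `"black"` group corresponds to the pattern described for $X_2$.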

Proportion of non-retail business acres $X_3$

The PCP (in Figure 1.24) as well as the scatterplot of $X_3$ vs. $X_{14}$ shows an obvious negative relation between $X_3$ and $X_{14}$. The relationship between the logarithms of both variables seems to be almost linear. This negative relation might be explained by the fact that non-retail business sometimes causes annoying sounds and other pollution. Therefore, it seems reasonable to use $X_3$ as an explanatory variable for the prediction of $X_{14}$ in a linear-regression analysis.

As far as the distribution of $X_3$ is concerned it can be said that the kernel density estimate of $X_3$ clearly has two peaks, which indicates that there are two subgroups. According to the negative relation between $X_3$ and $X_{14}$ it could be the case that one subgroup corresponds to the more expensive houses and the other one to the cheaper houses.

Charles River dummy variable $X_4$

The observation made from the PCP that there are more expensive houses than cheap houses situated on the banks of the Charles River is confirmed by inspecting the scatterplot matrix. Still, we might have some doubt that the proximity to the river influences the house prices. Looking at the original data set, it becomes clear that the observations for which $X_4$ equals one are districts that are close to each other. Apparently, the Charles River does not flow through too many different districts. Thus, it may be pure coincidence that the more expensive districts are close to the Charles River--their high values might be caused by many other factors such as the pupil/teacher ratio or the proportion of non-retail business acres.

Nitric oxides concentration $X_5$

The scatterplot of $X_5$ vs. $X_{14}$ and the separate boxplots of $X_5$ for more and less expensive houses reveal a clear negative relation between the two variables. As it was the main aim of the authors of the original study to determine whether pollution had an influence on housing prices, it should be considered very carefully whether $X_5$ can serve as an explanatory variable for the price $X_{14}$. A possible reason for it being an explanatory variable is that people might not like to live in areas where the emissions of nitric oxides are high. Nitric oxides are emitted mainly by automobiles, by factories and from heating private homes. However, as one can imagine there are many good reasons besides nitric oxides not to live downtown or in industrial areas! Noise pollution, for example, might be a much better explanatory variable for the price of housing units. As the emission of nitric oxides is usually accompanied by noise pollution, using $X_5$ as an explanatory variable for $X_{14}$ might lead to the false conclusion that people run away from nitric oxides, whereas in reality it is noise pollution that they are trying to escape.

Average number of rooms per dwelling $X_6$

The number of rooms per dwelling is a possible measure for the size of the houses. Thus we expect $X_6$ to be strongly correlated with $X_{14}$ (the houses' median price). Indeed--apart from some outliers--the scatterplot of $X_6$ vs. $X_{14}$ shows a point cloud which is clearly upward-sloping and which seems to be a realisation of a linear dependence of $X_{14}$ on $X_6$. The two boxplots of $X_6$ confirm this notion by showing that the quartiles, the mean and the median are all much higher for the red than for the black boxplot.

Proportion of owner-occupied units built prior to 1940 $X_7$

There is no clear connection visible between $X_7$ and $X_{14}$. There could be a weak negative correlation between the two variables, since the (red) boxplot of $X_7$ for the districts whose price is above the median price indicates a lower mean and median than the (black) boxplot for the districts whose price is below the median price. The fact that the correlation is not so clear could be explained by two opposing effects. On the one hand, house prices should decrease if the older houses are not in good shape. On the other hand, prices could increase, because people often like older houses better than newer houses, preferring their atmosphere of space and tradition. Nevertheless, it seems reasonable that the houses' age has an influence on their price $X_{14}$.

Raising $X_7$ to the power of 2.5 reveals again that the data set might consist of two subgroups. But in this case it is not obvious that the subgroups correspond to more expensive or cheaper houses. One can furthermore observe a negative relation between $X_7$ and $X_8$. This could reflect the way the Boston metropolitan area developed over time: the districts with the newer buildings are farther away from employment centres with industrial facilities.

Weighted distance to five Boston employment centres $X_8$

Since most people like to live close to their place of work, we expect a negative relation between the distances to the employment centres and the houses' price. The scatterplot hardly reveals any dependence, but the boxplots of $X_8$ indicate that there might be a slightly positive relation, as the red boxplot's median and mean are higher than the black one's. Again, there might be two effects in opposite directions at work. The first is that living too close to an employment centre might not provide enough shelter from the pollution created there. The second, as mentioned above, is that people do not like to travel very far to their workplace.

Index of accessibility to radial highways $X_9$

The first obvious thing one can observe in the scatterplots, as well as in the histograms and the kernel density estimates, is that there are two subgroups of districts containing $X_9$ values which are close to the respective group's mean. The scatterplots deliver no hint as to what might explain the occurrence of these two subgroups. The boxplots indicate that for the cheaper and for the more expensive houses the average of $X_9$ is almost the same.

Full-value property tax $X_{10}$

$X_{10}$ shows a behavior similar to that of $X_9$: two subgroups exist. A downward-sloping curve seems to underlie the relation of $X_{10}$ and $X_{14}$. This is confirmed by the two boxplots drawn for $X_{10}$: the red one has a lower mean and median than the black one.

Pupil/teacher ratio $X_{11}$

The red and black boxplots of $X_{11}$ indicate a negative relation between $X_{11}$ and $X_{14}$. This is confirmed by inspection of the scatterplot of $X_{11}$ vs. $X_{14}$: the point cloud is downward sloping, i.e., the fewer teachers there are per pupil, the lower the median price people pay for their dwellings.

Proportion of blacks $B$, $X_{12} = 1000 (B - 0.63)^2 {\boldsymbol{I}}(B<0.63)$

Interestingly, $X_{12}$ is negatively, though not linearly, correlated with $X_3$, $X_7$ and $X_{11}$, whereas it is positively related with $X_{14}$. Having a look at the data set reveals that for almost all districts $X_{12}$ takes on a value around 390. Since $X_{12}$ is zero whenever $B \ge 0.63$, such values can only be caused by $B$ close to zero. Therefore, the higher $X_{12}$ is, the lower the actual proportion of blacks is! Among observations 405 through 470 there are quite a few that have an $X_{12}$ value that is much lower than 390. This means that in these districts the proportion of blacks is above zero. We can observe two clusters of points in the scatterplots of $\log{(X_{12})}$: one cluster for which $X_{12}$ is close to 390 and a second one for which $X_{12}$ is between 3 and 100. When $X_{12}$ is positively related with another variable, the actual proportion of blacks is negatively correlated with this variable and vice versa. This means that blacks live in areas where there is a high proportion of non-retail business acres, where there are older houses and where there is a high (i.e., bad) pupil/teacher ratio. It can be observed that districts with housing prices above the median can only be found where the proportion of blacks is virtually zero!
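The inverse reading of $X_{12}$ follows directly from its definition and can be made concrete with a short calculation. The Python sketch below implements the formula from the heading above and its inverse on the branch $B<0.63$; the function names are hypothetical and introduced only for this illustration.

```python
import math

# Largest attainable value of X12, reached at B = 0
X12_MAX = 1000.0 * 0.63 ** 2

def x12_from_b(b):
    """X12 = 1000 (B - 0.63)^2 I(B < 0.63), with B a proportion in [0, 1]."""
    if not 0.0 <= b <= 1.0:
        raise ValueError("B must be a proportion between 0 and 1")
    return 1000.0 * (b - 0.63) ** 2 if b < 0.63 else 0.0

def b_from_x12(x12):
    """Recover B from X12 on the branch B < 0.63 (needs 0 < x12 <= X12_MAX)."""
    if not 0.0 < x12 <= X12_MAX:
        raise ValueError("x12 must lie in (0, 1000 * 0.63^2]")
    return 0.63 - math.sqrt(x12 / 1000.0)

# Values mentioned in the text: around 390, and between 3 and 100
for value in (390.0, 100.0, 3.0):
    print(value, round(b_from_x12(value), 4))
```

Under this definition, $B=0$ gives the maximum $1000 \cdot 0.63^2 = 396.9$, so a value near 390 corresponds to a proportion close to zero, while values between 3 and 100 correspond to considerably larger proportions.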

Proportion of lower status of the population $X_{13}$

Of all the variables $X_{13}$ exhibits the clearest negative relation with $X_{14}$--hardly any outliers show up. Taking the square root of $X_{13}$ and the logarithm of $X_{14}$ transforms the relation into a linear one.
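What it means for a pair of transformations to linearize a relation can be illustrated on synthetic data. In the Python sketch below the values are constructed (an assumption, not the real data) so that $\log(X_{14})$ is exactly linear in $\sqrt{X_{13}}$; the Pearson correlation then equals $-1$ after the transformations, while it is weaker on the original scale. The coefficients 4.0 and 0.3 are arbitrary choices for the illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Synthetic data: log(x14) = 4.0 - 0.3 * sqrt(x13) holds exactly by construction
x13 = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
x14 = [math.exp(4.0 - 0.3 * math.sqrt(v)) for v in x13]

r_raw = pearson(x13, x14)
r_transformed = pearson([math.sqrt(v) for v in x13],
                        [math.log(v) for v in x14])
```

On real data the transformed relation is only approximately linear, so the correlation would move towards $-1$ without reaching it.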

4.3 Transformations

Since most of the variables exhibit an asymmetry with a higher density on the left side, the following transformations are proposed:

\begin{eqnarray*}
\widetilde{X_1} &=& \log{(X_1)}\\
\widetilde{X_2} &=& X_2/10\\
&\vdots& \\
\widetilde{X_{13}} &=& \sqrt{X_{13}}\\
\widetilde{X_{14}} &=& \log{(X_{14})}
\end{eqnarray*}



Taking the logarithm or raising the variables to the power of something smaller than one helps to reduce the asymmetry. This is due to the fact that lower values move further away from each other, whereas the distance between greater values is reduced by these transformations.
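The stretching of small values and compression of large values can be seen from the gaps between consecutive observations. The following Python sketch uses hypothetical values and compares the ratio of the largest to the smallest gap before and after a logarithm and a square root; the helper names are assumptions, and base 10 is used for the logarithm only to keep the numbers readable.

```python
import math

def gaps(values):
    """Differences between consecutive values of an increasing sequence."""
    return [b - a for a, b in zip(values, values[1:])]

def gap_ratio(values):
    """Largest gap divided by smallest gap; 1 means evenly spaced values."""
    g = gaps(values)
    return max(g) / min(g)

x = [0.1, 1.0, 10.0, 100.0]          # hypothetical right-skewed values
log_x = [math.log10(v) for v in x]   # logarithm (base 10 for readability)
sqrt_x = [math.sqrt(v) for v in x]   # power 1/2, a power smaller than one

ratio_raw = gap_ratio(x)
ratio_sqrt = gap_ratio(sqrt_x)
ratio_log = gap_ratio(log_x)
```

The gap ratio is largest for the raw values, smaller after the square root and smallest after the logarithm, so the logarithm is the stronger of the two corrections for right skewness.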

Figure 1.27: Boxplots for all of the variables from the Boston Housing data before and after the proposed transformations. MVAboxbhd.xpl
\includegraphics[width=1\defpicwidth]{MVAboxbhd.ps}

Figure 1.27 displays boxplots for the original variables, scaled by mean and variance, as well as for the proposed transformed variables. The transformed variables' boxplots are more symmetric and have fewer outliers than the original variables' boxplots.
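Scaling by mean and variance places variables with very different units on a common axis so that their boxplots can be drawn side by side. A minimal Python sketch follows, assuming the scaling subtracts the sample mean and divides by the sample standard deviation; whether the original quantlet uses the sample or the population variance is not stated in the text, and the function name and values are hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """Subtract the sample mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    if s == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - m) / s for v in values]

# Hypothetical values for one variable
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(x)  # has mean 0 and sample standard deviation 1
```

After this step every variable has mean zero and unit standard deviation, so differences between boxplots reflect shape and outliers instead of measurement units.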