The Boston Housing data set was analyzed by Harrison and Rubinfeld (1978), who wanted to find out whether "clean air" had an influence on house prices. We will use this data set in this chapter and in most of the following chapters to illustrate the presented methodology. The data are described in Appendix B.1.
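In order to highlight the relations of the median house price (the 14th variable)
to the remaining 13 variables,
we color all of the observations whose price lies above the median price
as red lines
in the parallel coordinates plot (PCP) in Figure 1.24.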
Some of the variables seem to be strongly related.
The most obvious relation is the negative dependence
between
and
.
It can also be argued that there exists a strong dependence
between
and
since no red lines are drawn
in the lower part of
.
The opposite can be said about
: there are only red lines
plotted in the lower part of this variable.
Low values of
induce high values of
.
For the PCP, the variables have been rescaled over the interval
for better graphical representations.
The PCP shows that the variables are not distributed in a symmetric manner.
It can be clearly seen that the values of
and
are much more concentrated around 0. Therefore it makes
sense to consider transformations of the original data.
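The two preprocessing steps behind the PCP, rescaling each variable and marking the observations whose price lies above the median, can be sketched as follows. This is a minimal illustration, not the code used for Figure 1.24: the target interval is assumed to be [0, 1] (the usual min-max choice), and the function names and the toy values are hypothetical.

```python
from statistics import median

def rescale_unit(values):
    """Min-max rescale a list of numbers onto [0, 1] (assumed target interval)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no spread; map it to 0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def above_median_flags(values):
    """True for every observation strictly above the sample median."""
    m = median(values)
    return [v > m for v in values]

# Hypothetical toy values, not taken from the Boston Housing data.
price = [24.0, 21.6, 34.7, 33.4, 16.5, 18.9]
rooms = [6.6, 6.4, 7.2, 7.0, 5.9, 6.0]

scaled_rooms = rescale_unit(rooms)
# One colour per observation: red above the median price, black otherwise.
colors = ["red" if flag else "black" for flag in above_median_flags(price)]
print(colors)
```

Each observation then becomes one polyline through its rescaled coordinates, drawn in the colour assigned here.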
One drawback of PCPs is that many
lines are drawn on top of each other.
This problem is reduced by depicting the variables
in pairs of scatterplots. Including all 14 variables
in one large scatterplot matrix
is possible, but the individual plots then become too small to read.
Therefore, for illustrative purposes, we analyze only one such matrix,
built from a subset of the variables, in Figure 1.25.
On the basis of the PCP and the scatterplot matrix
we would like to interpret each of the thirteen variables
and their possible relation to the 14th variable.
Included in the figure are images
for -
and
, although each
variable is discussed in detail below. All references made to scatterplots
in the following refer to Figure 1.25.
Taking the logarithm makes the variable's distribution more symmetric.
This can be seen in the boxplot of the log-transformed variable
in Figure 1.27,
which shows that the median and the mean have
moved closer to each other than they were for the original
variable.
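The effect described above can be checked numerically. The sketch below uses simulated right-skewed data (a lognormal sample, chosen only for illustration and not meant to mimic any Boston Housing variable) and compares the distance between mean and median, measured in standard deviations, before and after taking logarithms.

```python
import math
import random
from statistics import mean, median, stdev

def mean_median_gap(values):
    """Absolute distance between mean and median, in units of the standard deviation."""
    return abs(mean(values) - median(values)) / stdev(values)

random.seed(0)
# Simulated right-skewed sample; an assumption made for this illustration.
raw = [random.lognormvariate(0.0, 1.0) for _ in range(2000)]
logged = [math.log(v) for v in raw]

gap_raw = mean_median_gap(raw)
gap_log = mean_median_gap(logged)
print(f"gap before log: {gap_raw:.3f}, gap after log: {gap_log:.3f}")
```

A smaller gap after the transformation is what the boxplot shows graphically: the mean marker moves towards the median line.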
Plotting the kernel density estimate (KDE) of
would reveal that two subgroups might exist with
different mean values. However, taking a look at the scatterplots
in Figure 1.26 of the logarithms
which include
does not clearly reveal such groups.
Given that the scatterplot of
vs.
shows a relatively strong negative relation, it might be the case that
the two subgroups of
correspond to houses with two different price levels. This
is confirmed by the two boxplots shown to the right of the
vs.
scatterplot (in Figure 1.25): the red
boxplot's shape differs a lot from the black one's,
having a much higher median and mean.
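The comparison of a red and a black boxplot amounts to splitting a variable by whether the district's price lies above the median price and comparing the location of the two groups. A minimal sketch, with hypothetical helper names and made-up toy values:

```python
from statistics import mean, median

def split_by_median_price(x, price):
    """Split x into (red, black): districts above vs. at or below the median price."""
    m = median(price)
    red = [xi for xi, p in zip(x, price) if p > m]
    black = [xi for xi, p in zip(x, price) if p <= m]
    return red, black

def centre(values):
    """Median and mean, the two location measures marked in the boxplots."""
    return {"median": median(values), "mean": mean(values)}

# Hypothetical toy values, not taken from the Boston Housing data.
price = [24.0, 21.6, 34.7, 33.4, 16.5, 18.9, 15.0, 27.1]
x = [5.1, 2.0, 7.3, 6.8, 1.1, 1.9, 0.7, 5.9]

red, black = split_by_median_price(x, price)
print("red:", centre(red))
print("black:", centre(black))
```

If both location measures of one group lie clearly above those of the other, the variable and the price are related, even when a scatterplot does not show a clean linear pattern.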
It strikes the eye in Figure 1.25 that
there is a large cluster of observations for which the zoning variable is equal to 0.
The scatterplot of the zoning variable vs. the crime rate also shows
a strong, though non-linear, negative relation between the two:
almost all observations for which the zoning variable is high
have a crime rate close to zero and,
vice versa, many observations for which the zoning variable is zero have quite a
high per-capita crime rate. This
could be due to the location of the areas, e.g., downtown districts might have a higher
crime rate, and at the same time it is unlikely that any residential land there
would be zoned in a generous manner.
As far as the house prices are concerned, there seems
to be no clear (linear) relation between the zoning variable and the price,
but the more expensive houses are clearly situated in areas where the zoning variable
is large
(this can be seen from the two boxplots in the second position of the diagonal,
where the red one has a clearly higher mean/median than the black one).
The PCP (in Figure 1.24) as well as
the scatterplot of the proportion of non-retail business acres vs. the house price
shows an obvious negative relation
between the two variables.
The relationship between the logarithms of both variables
seems to be almost linear. This negative relation might be explained by the
fact that non-retail business sometimes causes
annoying sounds and other pollution.
Therefore, it seems reasonable to use
the proportion of non-retail business acres as an explanatory variable
for the prediction of the house price
in a linear-regression analysis.
As far as the distribution of this proportion is concerned, its kernel density
estimate
clearly has two peaks, which indicates that there are two subgroups.
Given the negative relation between
the proportion of non-retail business acres and the house price,
it could be the case that one
subgroup corresponds to the more expensive houses and the other one to the cheaper houses.
The observation made from the PCP that there are more expensive houses
than cheap houses
situated on the banks of the Charles River is confirmed by inspecting
the scatterplot
matrix. Still, we might have some doubt that the proximity to the
river influences the house prices.
Looking at the original data set, it becomes
clear that the observations for which the Charles River indicator equals one are districts
that are close to each other. Apparently, the Charles River
does not flow through many different
districts. Thus, it may be pure coincidence that the more expensive districts are
close to the Charles River; their high values might be caused by many other factors
such as the pupil/teacher ratio or the proportion of non-retail business acres.
The scatterplot of the nitric oxides variable vs. the house price
and the separate boxplots of
the nitric oxides variable for more and less
expensive houses reveal a clear negative relation between the two variables.
As it was the main aim of the authors of the original study to determine whether
pollution had an influence on housing prices, it should be considered very carefully
whether
the nitric oxides variable can serve as an explanatory variable for the price.
A possible reason in favor of it being an explanatory variable is
that people might not like to live in areas where the emissions
of nitric oxides are high.
Nitric oxides are emitted mainly by automobiles, by factories and by the
heating of private homes. However, as one can imagine, there are
many good reasons besides
nitric oxides not to live downtown or in industrial areas. Noise pollution,
for example,
might be a much better explanatory variable for the price of housing units. As the
emission of nitric oxides is usually accompanied by noise pollution, using
the nitric oxides variable as an
explanatory variable for
the price might lead to the false conclusion that people run away
from nitric oxides, whereas in reality it is noise pollution that they are
trying to escape.
The number of rooms per dwelling is a possible measure for the size
of the houses. Thus we expect
it to be strongly correlated with
the houses' median price. Indeed, apart from
some outliers, the scatterplot of
the number of rooms vs. the price
shows a point cloud which is clearly
upward-sloping and which seems to be a realisation of
a linear dependence of
the price on
the number of rooms.
The two boxplots of
the number of rooms confirm this notion by showing that the quartiles, the mean
and the median are all much higher for the red than for the black boxplot.
There is no clear connection visible between the age of the houses and
their price. There
could be a weak negative correlation between the two variables, since
the (red) boxplot of
the age variable for
the districts whose price is above the median price indicates
a lower mean and median than the (black) boxplot for the districts
whose price is below the median price. The
fact that the correlation is not so clear could be explained by two
opposing effects. On the one hand, house prices should decrease if the older
houses are not in good shape. On the other hand, prices could increase,
because people often like older houses better than newer houses,
preferring their atmosphere of space and tradition. Nevertheless,
it seems reasonable that the houses' age has an
influence on their price.
Raising the age variable to the power of 2.5 again reveals that the data set might consist of two
subgroups. In this case, however, it is not obvious that the subgroups correspond to more
expensive or cheaper houses.
One can furthermore observe a negative relation between
the age of the houses and
the distance to the employment centres.
This could
reflect the way the Boston metropolitan area developed over time:
the districts with the newer buildings are farther away from
employment centres with industrial facilities.
Since most people like to live close to their place of work, we expect a
negative relation between the distances to the employment centres and the houses' price.
The scatterplot hardly reveals any dependence, but the boxplots of the distance variable indicate that
there might be a slightly positive relation, as the red boxplot's median and mean
are higher than the black one's.
Again, there might be two effects in opposite directions at work. The first
is that living too close to an employment centre might not provide
enough shelter from the pollution created there. The second,
as mentioned above, is that people do not like to travel very far
to their workplace.
The first obvious thing one can observe in the scatterplots,
as well as in the histograms
and the kernel density estimates,
is that there are two subgroups of districts whose
values are close to the respective group's mean. The
scatterplots give no hint as to
what might explain the occurrence of these two subgroups.
The boxplots indicate that the
average of this variable
is almost the same for the cheaper and for the more expensive houses.
The next variable shows a behavior similar to that of
the previous one:
two subgroups exist.
A downward-sloping curve
seems to underlie its relation
to the house price. This is confirmed by the two boxplots
drawn for
this variable: the red one has a lower mean and median than the black one.
The red and black boxplots of the pupil/teacher ratio indicate a negative relation between
this ratio and
the house price.
This is confirmed by inspection of the scatterplot of
the pupil/teacher ratio vs.
the price: the point
cloud is downward sloping, i.e., the fewer teachers there are per pupil, the less people
pay on median for their dwellings.
Interestingly, the transformed variable based on the proportion of blacks is negatively, though not linearly, correlated
with
the proportion of non-retail business acres,
the age of the houses
and
the pupil/teacher ratio,
whereas it is positively related with
the house price.
Having a look at the data set reveals that for almost all districts
the transformed variable
takes on a
value around 390. Since
the underlying proportion
cannot be larger than 0.63, such values can only
be caused by
a proportion close to zero. Therefore, the higher
the transformed variable
is,
the lower the actual proportion
of blacks is. Among observations
405 through 470 there are quite a few that have a
value that is much
lower than 390. This means that in these districts the proportion of blacks is above
zero.
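The arithmetic behind this argument can be made explicit. The sketch below assumes the definition commonly documented for this variable in the Boston Housing data, 1000(B - 0.63)^2 with B the proportion of blacks; that formula is not stated in this section and should be checked against Appendix B.1. The function names are hypothetical, and the inverse uses the branch B <= 0.63 that the text relies on.

```python
import math

def transformed(b):
    """Assumed definition of the variable: 1000 * (B - 0.63)**2."""
    return 1000.0 * (b - 0.63) ** 2

def proportion_from(value):
    """Invert the assumed definition on the branch B <= 0.63."""
    return 0.63 - math.sqrt(value / 1000.0)

# Largest attainable value on this branch, reached at B = 0.
print(round(transformed(0.0), 1))  # 396.9

# Values mentioned in the text: around 390, and between 3 and 100.
for v in (390.0, 100.0, 3.0):
    print(v, round(proportion_from(v), 3))
```

Under this assumption a value near 390 sits just below the maximum of 396.9 and therefore corresponds to a proportion close to zero, while smaller values correspond to larger proportions.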
We can observe two clusters of points in the scatterplots of
the transformed variable:
one cluster for
which
it is close to 390 and a second one for which
it is between 3 and 100.
When
the transformed variable
is positively related with another variable, the actual proportion of
blacks is negatively correlated with this variable, and vice versa. This means that
blacks live in areas where there is a high proportion of non-retail business
acres, where there are older houses and where there is a high (i.e., bad)
pupil/teacher
ratio. It can be observed that districts with housing prices above the
median can only be found where the proportion of blacks is virtually zero.
Of all the variables, the thirteenth exhibits the clearest negative relation with
the house price; hardly any
outliers show up. Taking the square root of
the former and
the logarithm of
the latter transforms the relation into a
linear one.
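How a pair of transformations can turn a curved relation into a linear one is easy to check on simulated data. In the sketch below the data are generated so that the logarithm of y is linear in the square root of x plus noise. This is an assumption made purely for illustration; it says nothing about the actual Boston Housing values, and the reading that the square root applies to the explanatory variable and the logarithm to the price is inferred from the order of the sentence.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)
# Simulated data: log(y) = 4 - 0.4 * sqrt(x) + noise (illustrative assumption).
x = [random.uniform(1.0, 40.0) for _ in range(500)]
y = [math.exp(4.0 - 0.4 * math.sqrt(v) + random.gauss(0.0, 0.1)) for v in x]

r_raw = pearson(x, y)
r_lin = pearson([math.sqrt(v) for v in x], [math.log(v) for v in y])
print(f"raw scale: {r_raw:.3f}, transformed scale: {r_lin:.3f}")
```

A correlation closer to -1 on the transformed scale is the numerical counterpart of a point cloud that straightens out in the scatterplot.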
Since most of the variables exhibit an asymmetry with a higher density on the left side, the following transformations are proposed:
[Table of the proposed transformations; not recoverable from the source image.]
Figure 1.27 displays boxplots for the original variables, scaled by mean and variance, as well as for the proposed transformed variables. The transformed variables' boxplots are more symmetric and have fewer outliers than the original variables' boxplots.
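Scaling by mean and variance puts variables with very different units on a common scale, so that their boxplots can be drawn side by side. A minimal sketch, assuming the scaling is the usual standardization (subtract the mean, divide by the standard deviation); the use of the sample standard deviation with divisor n - 1 and the toy values are assumptions:

```python
from statistics import mean, stdev

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical toy values with one large observation on the right.
z = standardize([2.0, 4.0, 4.0, 5.0, 10.0])
print([round(v, 3) for v in z])
```

Standardization changes location and scale only. It does not remove skewness, which is why the transformed variables are shown in addition to the scaled originals.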