3.7 Boston Housing


Table 3.3: Descriptive statistics for the Boston Housing data set. MVAdescbh.xpl

\begin{tabular}{lrrrr}
\hline
$X$ & $\overline x$ & median$(X)$ & $\mathop{\mathit{Var}}(X)$ & std$(X)$ \\
\hline
$X_1$ & 3.61 & 0.26 & 73.99 & 8.60 \\
$X_2$ & 11.36 & 0.00 & 543.94 & 23.32 \\
$X_3$ & 11.14 & 9.69 & 47.06 & 6.86 \\
$X_4$ & 0.07 & 0.00 & 0.06 & 0.25 \\
$X_5$ & 0.55 & 0.54 & 0.01 & 0.12 \\
$X_6$ & 6.28 & 6.21 & 0.49 & 0.70 \\
$X_7$ & 68.57 & 77.50 & 792.36 & 28.15 \\
$X_8$ & 3.79 & 3.21 & 4.43 & 2.11 \\
$X_9$ & 9.55 & 5.00 & 75.82 & 8.71 \\
$X_{10}$ & 408.24 & 330.00 & 28405.00 & 168.54 \\
$X_{11}$ & 18.46 & 19.05 & 4.69 & 2.16 \\
$X_{12}$ & 356.67 & 391.44 & 8334.80 & 91.29 \\
$X_{13}$ & 12.65 & 11.36 & 50.99 & 7.14 \\
$X_{14}$ & 22.53 & 21.20 & 84.59 & 9.20 \\
\hline
\end{tabular}


The main statistics presented so far can be computed for the data matrix ${\data{X}} (506\times 14)$ from our Boston Housing data set. The sample means and the sample medians of each variable are displayed in Table 3.3. The table also provides the unbiased estimates of the variance of each variable and the corresponding standard deviations. The comparison of the means and the medians confirms the asymmetry of the components of ${\data{X}}$ that was pointed out in Section 1.8.
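As a minimal sketch of how such a table can be produced, the following Python fragment computes the four statistics column by column. The function name `describe` and the toy columns are assumptions made for illustration; they are not the Boston Housing values and not the XploRe quantlet cited in the table caption.

```python
import statistics

def describe(columns):
    """Return (mean, median, unbiased variance, std) for each column."""
    rows = []
    for col in columns:
        mean = statistics.mean(col)
        median = statistics.median(col)
        var = statistics.variance(col)  # divides by n - 1 (unbiased estimate)
        std = var ** 0.5
        rows.append((mean, median, var, std))
    return rows

# Toy data (hypothetical): one right-skewed column and one symmetric column.
toy = [
    [1.0, 2.0, 3.0, 4.0, 40.0],
    [1.0, 2.0, 3.0, 4.0, 5.0],
]
stats = describe(toy)
for j, (mean, median, var, std) in enumerate(stats, start=1):
    print(f"X{j}: mean={mean:.2f} median={median:.2f} var={var:.2f} std={std:.2f}")
```

In the skewed toy column the mean lies far above the median, while in the symmetric column the two coincide. This is the same comparison used above to detect asymmetry.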

The (unbiased) sample covariance matrix is given by the following $ (14\times 14)$ matrix ${\data{S}}_n$:

\begin{displaymath}
\left(
\begin{array}{r@{\,}r@{\,}r@{\,}r@{\,}r@{\,}r@{\,}r@{...
....26 & -10.11 & 279.99 & -48.45 & 84.59 \\
\end{array}\right),
\end{displaymath}

and the corresponding correlation matrix ${\data{R}} (14\times 14)$ is:

\begin{displaymath}
\left(
\begin{array}{r@{\,}r@{\,}r@{\,}r@{\,}r@{\,}r@{\,}r@{...
... & -0.47 & -0.51 & 0.33 & -0.74 & 1.00 \\
\end{array}\right).
\end{displaymath}

Analyzing ${\data{R}}$ confirms most of the comments made when examining the scatterplot matrix in Chapter 1. In particular, the correlations between $X_{14}$ (the value of the house) and all the other variables are given by the last row (or column) of ${\data{R}}$. In decreasing order of absolute value, the highest correlations are those with $X_{13}$, $X_{6}$, $X_{11}$, $X_{10}$, etc.
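The step from the covariance matrix to the correlation matrix can be sketched as follows, assuming the usual rescaling $r_{jk} = s_{jk}/\sqrt{s_{jj}s_{kk}}$. The helper names and the small toy matrix are hypothetical and serve only to show the mechanics.

```python
def covariance_matrix(data):
    """Unbiased sample covariance matrix of a list of rows (n x p)."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[sum((row[j] - means[j]) * (row[k] - means[k]) for row in data) / (n - 1)
             for k in range(p)] for j in range(p)]

def correlation_matrix(S):
    """Rescale S by the standard deviations: r_jk = s_jk / sqrt(s_jj * s_kk)."""
    p = len(S)
    return [[S[j][k] / (S[j][j] * S[k][k]) ** 0.5 for k in range(p)] for j in range(p)]

# Toy data (hypothetical), 4 observations of 3 variables.
toy = [[1.0, 2.0, 9.0],
       [2.0, 1.0, 7.0],
       [3.0, 4.0, 4.0],
       [4.0, 3.0, 2.0]]
S = covariance_matrix(toy)
R = correlation_matrix(S)

# Rank the other variables by |correlation| with the last one (last row of R).
last = R[-1][:-1]
order = sorted(range(len(last)), key=lambda j: abs(last[j]), reverse=True)
print([round(v, 2) for v in last], order)
```

Reading the last row of `R` and sorting it by absolute value mirrors the way the ranking of the variables against $X_{14}$ is obtained in the text.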

Applying Fisher's Z-transform to each of the correlations between $X_{14}$ and the other variables would confirm that all are significantly different from zero, except the correlation between $X_{14}$ and $X_{4}$ (the indicator variable for the Charles River). We know, however, that the correlation and Fisher's Z-transform are not appropriate for binary variables.
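A test of this kind can be sketched as below, assuming the standard approximation that $W = \frac{1}{2}\log\{(1+r)/(1-r)\}$ is roughly normal with variance $1/(n-3)$. The function name `fisher_z_test` is hypothetical. The input $r=-0.74$ is the entry for $X_{13}$ and $X_{14}$ in the last row of ${\data{R}}$, and $n=506$ is the number of observations.

```python
import math

def fisher_z_test(r, n):
    """Test H0: rho = 0 using W = atanh(r) with Var(W) ~ 1 / (n - 3)."""
    w = math.atanh(r)                            # 0.5 * log((1 + r) / (1 - r))
    z = w * math.sqrt(n - 3)                     # approximately N(0, 1) under H0
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return z, p_value

# r = -0.74: correlation between X_13 and X_14 taken from the last row of R.
z, p = fisher_z_test(-0.74, 506)
print(f"z = {z:.2f}, p-value = {p:.3g}")
```

The statistic lies far outside any conventional critical value, so this particular correlation is clearly different from zero. The same call would be repeated for each of the remaining continuous variables.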

The same descriptive statistics can be calculated for the transformed variables (the transformations were motivated in Section 1.8). The results are given in Table 3.4; as can be seen, most of the variables are now more symmetric.

Table 3.4: Descriptive statistics for the Boston Housing data set after the transformation. MVAdescbh.xpl

\begin{tabular}{lrrrr}
\hline
$\widetilde X$ & $\overline{\widetilde x}$ & median$(\widetilde X)$ & $\mathop{\mathit{Var}}(\widetilde X)$ & std$(\widetilde X)$ \\
\hline
$\widetilde X_1$ & $-$0.78 & $-$1.36 & 4.67 & 2.16 \\
$\widetilde X_2$ & 1.14 & 0.00 & 5.44 & 2.33 \\
$\widetilde X_3$ & 2.16 & 2.27 & 0.60 & 0.78 \\
$\widetilde X_4$ & 0.07 & 0.00 & 0.06 & 0.25 \\
$\widetilde X_5$ & $-$0.61 & $-$0.62 & 0.04 & 0.20 \\
$\widetilde X_6$ & 1.83 & 1.83 & 0.01 & 0.11 \\
$\widetilde X_7$ & 5.06 & 5.29 & 12.72 & 3.57 \\
$\widetilde X_8$ & 1.19 & 1.17 & 0.29 & 0.54 \\
$\widetilde X_9$ & 1.87 & 1.61 & 0.77 & 0.87 \\
$\widetilde X_{10}$ & 5.93 & 5.80 & 0.16 & 0.40 \\
$\widetilde X_{11}$ & 2.15 & 2.04 & 1.86 & 1.36 \\
$\widetilde X_{12}$ & 3.57 & 3.91 & 0.83 & 0.91 \\
$\widetilde X_{13}$ & 3.42 & 3.37 & 0.97 & 0.99 \\
$\widetilde X_{14}$ & 3.03 & 3.05 & 0.17 & 0.41 \\
\hline
\end{tabular}

Note that the covariances and the correlations are sensitive to these nonlinear transformations. For example, the correlation matrix is now

\begin{displaymath}
\left(
\begin{array}{r@{\,}r@{\,}r@{\,}r@{\,}r@{\,}r@{\,}r@{...
... & -0.56 & -0.51 & 0.40 & -0.83 & 1.00 \\
\end{array}\right).
\end{displaymath}

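Notice that some of the correlations between $\widetilde X_{14}$ and the other variables have increased in absolute value; the last row shows, for instance, $-0.83$ in place of $-0.74$ for the thirteenth variable.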
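The sensitivity of the correlation coefficient to nonlinear transformations can be illustrated with a small constructed example. The data below are hypothetical, and the logarithm is used purely for illustration; it is not meant to reproduce the specific transformations of Section 1.8.

```python
import math

def pearson(x, y):
    """Sample correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Constructed data: y = x^(-2) is monotone in x but far from linear.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [1.0 / a ** 2 for a in x]

r_raw = pearson(x, y)
r_log = pearson([math.log(a) for a in x], [math.log(b) for b in y])
print(f"raw: {r_raw:.2f}, after log transform: {r_log:.2f}")
```

On the raw scale the relation is only moderately linear, whereas after taking logarithms it is exactly linear. The correlation therefore moves toward $-1$, which is the same direction of change observed for several entries of the last row of the correlation matrix.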

If we want to explain the variations of the price $\widetilde X_{14}$ by the variations of all the other variables $\widetilde X_{1},\dots,\widetilde X_{13}$, we could estimate the linear model

\begin{displaymath}
\widetilde X_{14}=\beta_0+\sum_{j=1}^{13}\beta_j \widetilde X_{j}+\varepsilon.
\end{displaymath} (3.57)

The result is given in Table 3.5.

Table 3.5: Linear regression results for all variables of the Boston Housing data set. MVAlinregbh.xpl

\begin{tabular}{lrrrr}
\hline
Variable & $\hat\beta_j$ & $SE(\hat\beta_j)$ & $t$ & $p$-value \\
\hline
constant & 4.1769 & 0.3790 & 11.020 & 0.0000 \\
$\widetilde X_1$ & $-$0.0146 & 0.0117 & $-$1.254 & 0.2105 \\
$\widetilde X_2$ & 0.0014 & 0.0056 & 0.247 & 0.8051 \\
$\widetilde X_3$ & $-$0.0127 & 0.0223 & $-$0.570 & 0.5692 \\
$\widetilde X_4$ & 0.1100 & 0.0366 & 3.002 & 0.0028 \\
$\widetilde X_5$ & $-$0.2831 & 0.1053 & $-$2.688 & 0.0074 \\
$\widetilde X_6$ & 0.4211 & 0.1102 & 3.822 & 0.0001 \\
$\widetilde X_7$ & 0.0064 & 0.0049 & 1.317 & 0.1885 \\
$\widetilde X_8$ & $-$0.1832 & 0.0368 & $-$4.977 & 0.0000 \\
$\widetilde X_9$ & 0.0684 & 0.0225 & 3.042 & 0.0025 \\
$\widetilde X_{10}$ & $-$0.2018 & 0.0484 & $-$4.167 & 0.0000 \\
$\widetilde X_{11}$ & $-$0.0400 & 0.0081 & $-$4.946 & 0.0000 \\
$\widetilde X_{12}$ & 0.0445 & 0.0115 & 3.882 & 0.0001 \\
$\widetilde X_{13}$ & $-$0.2626 & 0.0161 & $-$16.320 & 0.0000 \\
\hline
\end{tabular}


The values of $r^2$ (0.765) and $r^2_{adj}$ (0.759) show that most of the variance of $\widetilde X_{14}$ is explained by the linear model (3.57).
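The quantities reported for such a fit (coefficients, standard errors, $t$-statistics, $r^2$ and $r^2_{adj}$) can be obtained from the normal equations. The following is a minimal sketch in plain Python, not the quantlet cited in the table caption. The function `ols` and the toy data are hypothetical; the noise is constructed to be orthogonal to the regressors so that the fitted coefficients are known in advance.

```python
def ols(X, y):
    """Least squares fit of y on the columns of X (X includes a column of ones)."""
    n, k = len(X), len(X[0])
    # Build [X'X | I] and invert X'X by Gauss-Jordan elimination.
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
         + [1.0 if a == c else 0.0 for c in range(k)] for a in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        d = A[col][col]
        A[col] = [v / d for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    inv = [row[k:] for row in A]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = [sum(inv[a][b] * Xty[b] for b in range(k)) for a in range(k)]
    resid = [y[i] - sum(X[i][a] * beta[a] for a in range(k)) for i in range(n)]
    rss = sum(e * e for e in resid)
    ybar = sum(y) / n
    tss = sum((v - ybar) ** 2 for v in y)
    sigma2 = rss / (n - k)                       # unbiased error variance
    se = [(sigma2 * inv[a][a]) ** 0.5 for a in range(k)]
    t = [b / s for b, s in zip(beta, se)]        # t_j = beta_j / SE(beta_j)
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    return beta, se, t, r2, r2_adj

# Toy data (hypothetical): y = 1 + 2*x1 - 3*x2 + e, with e orthogonal to the regressors.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
e = [0.1, 0.1, -0.2, -0.2, 0.1, 0.1]
X = [[1.0, a, b] for a, b in zip(x1, x2)]
y = [1.0 + 2.0 * a - 3.0 * b + c for a, b, c in zip(x1, x2, e)]
beta, se, t, r2, r2_adj = ols(X, y)

# Consistency check of the reported values: n = 506 observations, 14 coefficients.
adj_check = 1.0 - (1.0 - 0.765) * (506 - 1) / (506 - 14)
print([round(b, 4) for b in beta], round(r2, 4), round(r2_adj, 4), round(adj_check, 3))
```

The last line applies the adjustment formula to the reported $r^2$ with $n=506$ and 14 estimated coefficients; the result agrees with the reported $r^2_{adj}$ up to rounding.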

Again we see that the variations of $\widetilde X_{14}$ are mostly explained by (in decreasing order of the absolute value of the $t$-statistic) $\widetilde X_{13}, \widetilde X_{8},\widetilde X_{11}, \widetilde X_{10}, \widetilde X_{12},\widetilde X_{6},\widetilde X_9,\widetilde X_4$ and $\widetilde X_5$. The other variables $\widetilde X_1, \widetilde X_2, \widetilde X_3$ and $\widetilde X_7$ seem to have little influence on the variations of $\widetilde X_{14}$. This will be confirmed by the testing procedures that will be developed in Chapter 7.
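The ordering above can be reproduced directly from the $t$-statistics of Table 3.5. In the sketch below the cutoff of 1.96 is an assumption (the two-sided 5% normal critical value); the formal tests are those announced for Chapter 7.

```python
# t-statistics copied from Table 3.5; key j refers to the transformed variable j.
t_stats = {1: -1.254, 2: 0.247, 3: -0.570, 4: 3.002, 5: -2.688, 6: 3.822,
           7: 1.317, 8: -4.977, 9: 3.042, 10: -4.167, 11: -4.946,
           12: 3.882, 13: -16.320}

# Sort the variable indices by decreasing |t|.
ranked = sorted(t_stats, key=lambda j: abs(t_stats[j]), reverse=True)

threshold = 1.96  # assumed cutoff, not taken from the text
influential = [j for j in ranked if abs(t_stats[j]) > threshold]
weak = sorted(j for j in ranked if abs(t_stats[j]) <= threshold)

print("influential:", influential)
print("weak:", weak)
```

With this cutoff the split into influential and weak variables coincides with the two groups named in the text.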