7.3 Boston Housing

Returning to the Boston housing data set, we are now in a position to test whether the means of the variables differ according to location, for example, between districts with high-valued houses and the remaining districts. In Chapter 1, we built two groups of observations according to whether the value of $X_{14}$ is less than or equal to the median of $X_{14}$ (a group of 256 districts) or greater than the median (a group of 250 districts). In what follows, we use the transformed variables motivated in Section 1.8.

Testing the equality of the means from the two groups was proposed in a multivariate setup, so we restrict the analysis to the variables $X_1, X_5, X_8, X_{11}$, and $X_{13}$ to see if the differences between the two groups that were identified in Chapter 1 can be confirmed by a formal test. As in Test Problem 8, the hypothesis to be tested is

\begin{displaymath}
H_0:\ \mu_1=\mu_2,\textrm{ where }\mu_1\in \mathbb{R}^5, n_1=256,\textrm{ and }n_2=250.
\end{displaymath}

$\Sigma$ is not known. The $F$-statistic given in (7.13) is equal to 126.30, which is much higher than the critical value $F_{0.95;5,500}=2.23$. Therefore, we reject the hypothesis of equal means.
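As a rough illustration of how such a statistic is computed, the sketch below evaluates a two-sample $F$-statistic on synthetic data. It assumes that (7.13), which is not reproduced in this section, has the standard two-sample Hotelling form with the pooled covariance $S=(n_1S_1+n_2S_2)/(n_1+n_2)$. The function name `two_sample_f` and the simulated groups are illustrative only and do not reproduce the Boston housing values.

```python
import numpy as np


def two_sample_f(x1, x2):
    """F-statistic for H0: mu1 = mu2 with a common, unknown covariance.

    Assumed form (standard two-sample Hotelling test):
        F = n1*n2*(n1+n2-p-1) / (p*(n1+n2)^2) * d' S^{-1} d,
    where d is the difference of the group means and
    S = (n1*S1 + n2*S2) / (n1+n2) pools the empirical covariances.
    Under H0, F follows an F distribution with (p, n1+n2-p-1) d.f.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, p = x1.shape
    n2 = x2.shape[0]
    d = x1.mean(axis=0) - x2.mean(axis=0)
    # Empirical covariances with divisor n (bias=True), then pooled.
    s1 = np.atleast_2d(np.cov(x1, rowvar=False, bias=True))
    s2 = np.atleast_2d(np.cov(x2, rowvar=False, bias=True))
    s = (n1 * s1 + n2 * s2) / (n1 + n2)
    quad = d @ np.linalg.solve(s, d)
    f_stat = n1 * n2 * (n1 + n2 - p - 1) / (p * (n1 + n2) ** 2) * quad
    return f_stat, (p, n1 + n2 - p - 1)


# Synthetic example with the same group sizes and dimension as in the text.
rng = np.random.default_rng(0)
group1 = rng.normal(loc=1.0, size=(256, 5))  # means shifted by 1
group2 = rng.normal(loc=0.0, size=(250, 5))
f_stat, dof = two_sample_f(group1, group2)
print(f"F = {f_stat:.2f} with d.f. {dof}")
```

The returned statistic is then compared with the appropriate $F$ quantile, here $F_{0.95;5,500}=2.23$ as quoted above.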

To see which component, $X_1, X_5, X_8, X_{11}$, or $X_{13}$, is responsible for this rejection, take a look at the simultaneous confidence intervals defined in (7.14):

\begin{eqnarray*}
\delta_1 & \in & (1.4020,2.5499) \\
\delta_5 & \in & (0.1315,...
... \in & (1.0375,1.7384) \\
\delta_{13} & \in & (1.1577,1.5818).
\end{eqnarray*}



These confidence intervals confirm that all of the $\delta_j$ are significantly different from zero (note there is a negative effect for $X_8$: weighted distances to employment centers); see MVAsimcibh.xpl.
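Intervals of this kind can be computed along the following lines. The sketch assumes that (7.14), which is not reproduced in this section, has the standard form $\delta_j \in (\bar x_{1j}-\bar x_{2j}) \pm \sqrt{c\,F_{1-\alpha;p,n_1+n_2-p-1}\,s_{jj}}$ with $c=p(n_1+n_2)^2/\{n_1n_2(n_1+n_2-p-1)\}$. The function name `simultaneous_ci` and the synthetic data are illustrative. The critical value is passed in by hand, using the value $2.23$ quoted in the text.

```python
import numpy as np


def simultaneous_ci(x1, x2, f_crit):
    """Simultaneous confidence intervals for delta_j = mu1_j - mu2_j.

    Assumed form: d_j +/- sqrt(c * f_crit * s_jj), where
    c = p*(n1+n2)^2 / (n1*n2*(n1+n2-p-1)), s_jj is the j-th diagonal
    element of the pooled covariance S = (n1*S1 + n2*S2)/(n1+n2),
    and f_crit is the F quantile with (p, n1+n2-p-1) d.f.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, p = x1.shape
    n2 = x2.shape[0]
    d = x1.mean(axis=0) - x2.mean(axis=0)
    # Only the diagonal of S is needed: pooled variances with divisor n.
    s_diag = (n1 * x1.var(axis=0) + n2 * x2.var(axis=0)) / (n1 + n2)
    c = p * (n1 + n2) ** 2 / (n1 * n2 * (n1 + n2 - p - 1))
    half_width = np.sqrt(c * f_crit * s_diag)
    return np.column_stack([d - half_width, d + half_width])


# Synthetic data: only the first two population means differ between groups.
rng = np.random.default_rng(1)
shift = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
group1 = rng.normal(size=(256, 5)) + shift
group2 = rng.normal(size=(250, 5))
intervals = simultaneous_ci(group1, group2, f_crit=2.23)  # F_{0.95;5,500}
for j, (lo, hi) in enumerate(intervals, start=1):
    verdict = "excludes 0" if lo > 0 or hi < 0 else "contains 0"
    print(f"component {j}: ({lo:.4f}, {hi:.4f}) {verdict}")
```

A component whose interval excludes zero contributes to the rejection of $H_0$, and the sign of the interval gives the direction of the effect.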

We could also check whether the factor ``being bounded by the river'' (variable $X_4$) has some effect on the other variables. To do this, compare the means of $(X_5, X_8, X_9, X_{12}, X_{13}, X_{14})^\top$. There are two groups: $n_1 =35$ districts bounded by the river and $n_2=471$ districts not bounded by the river. Test Problem 8 ( $H_0:\mu_1=\mu_2$) is applied again with $p=6$. The resulting test statistic, $F=5.81$, is highly significant ( $F_{0.95;6,499}=2.12$). The simultaneous confidence intervals indicate that only $X_{14}$ (the value of the houses) is responsible for the hypothesis being rejected. At a confidence level of 0.95, the intervals are

\begin{eqnarray*}
\delta_5 & \in & (-0.0603 ,0.1919 )\\
\delta_8 & \in & (-0.52...
...n & (-0.8595 ,0.3782 )\\
\delta_{14} & \in & (0.0014 ,0.5084 ).
\end{eqnarray*}
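Reading significance off such intervals amounts to checking whether zero lies inside them. A minimal sketch, using a few of the intervals quoted in this section (the dictionary and function names are illustrative):

```python
# A few of the simultaneous confidence intervals quoted in the text.
value_split = {"delta_1": (1.4020, 2.5499), "delta_13": (1.1577, 1.5818)}
river_split = {"delta_5": (-0.0603, 0.1919), "delta_14": (0.0014, 0.5084)}


def excludes_zero(interval):
    """True if 0 lies outside the interval, i.e. the difference is significant."""
    lower, upper = interval
    return lower > 0 or upper < 0


for name, split in [("median-value split", value_split), ("river split", river_split)]:
    for label, interval in split.items():
        status = "significant" if excludes_zero(interval) else "not significant"
        print(f"{name}: {label} in {interval} -> {status}")
```

The interval for $\delta_{14}$ in the river comparison excludes zero only narrowly, since its lower bound is 0.0014.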



Testing Linear Restrictions

In Chapter 3, a linear model was proposed that explained the variations of the price $X_{14}$ by the variations of the other variables. Using the same procedure that was shown in Test Problem 7, we are in a position to test a set of linear restrictions on the vector of regression coefficients $\beta$.

The model we estimated in Section 3.7 provides the following (MVAlinregbh.xpl):

Variable $\hat\beta_j$ $SE(\hat\beta_j)$ $t$ p-value
constant 4.1769 0.3790 11.020 0.0000
$X_1$ $-$0.0146 0.0117 $-$1.254 0.2105
$X_2$ 0.0014 0.0056 0.247 0.8051
$X_3$ $-$0.0127 0.0223 $-$0.570 0.5692
$X_4$ 0.1100 0.0366 3.002 0.0028
$X_5$ $-$0.2831 0.1053 $-$2.688 0.0074
$X_6$ 0.4211 0.1102 3.822 0.0001
$X_7$ 0.0064 0.0049 1.317 0.1885
$X_8$ $-$0.1832 0.0368 $-$4.977 0.0000
$X_9$ 0.0684 0.0225 3.042 0.0025
$X_{10}$ $-$0.2018 0.0484 $-$4.167 0.0000
$X_{11}$ $-$0.0400 0.0081 $-$4.946 0.0000
$X_{12}$ 0.0445 0.0115 3.882 0.0001
$X_{13}$ $-$0.2626 0.0161 $-$16.320 0.0000

Recall that the estimated residuals $Y-{\data{X}}\widehat\beta$ did not show a big departure from normality, which means that the testing procedure developed above can be used.
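The procedure referred to here is the $F$-test of ${\data{A}}\beta=a$ from Test Problem 7. A minimal sketch on synthetic data follows, assuming the usual form of that test, which compares the residual sums of squares of the restricted and unrestricted fits. The function name `linear_restriction_f` and the simulated design are illustrative and do not reproduce the Boston housing values.

```python
import numpy as np


def linear_restriction_f(x, y, a_mat, a_vec):
    """F-test of H0: A beta = a in the linear model y = X beta + eps.

    Assumed form: F = ((RSS_0 - RSS_1) / q) / (RSS_1 / (n - k)), where
    RSS_1 and RSS_0 are the residual sums of squares of the unrestricted
    and restricted fits, q is the number of restrictions (rows of A) and
    k is the number of columns of X. Under H0, F ~ F(q, n - k).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a_mat = np.atleast_2d(np.asarray(a_mat, dtype=float))
    a_vec = np.atleast_1d(np.asarray(a_vec, dtype=float))
    n, k = x.shape
    q = a_mat.shape[0]
    xtx_inv = np.linalg.inv(x.T @ x)
    beta = xtx_inv @ x.T @ y  # unrestricted OLS estimate
    # Restricted estimate: adjust beta so that A b = a holds exactly.
    middle = np.linalg.inv(a_mat @ xtx_inv @ a_mat.T)
    beta_h0 = beta - xtx_inv @ a_mat.T @ middle @ (a_mat @ beta - a_vec)
    rss1 = np.sum((y - x @ beta) ** 2)
    rss0 = np.sum((y - x @ beta_h0) ** 2)
    f_stat = ((rss0 - rss1) / q) / (rss1 / (n - k))
    return f_stat, (q, n - k), beta_h0


# Synthetic design: intercept plus three regressors; the last has no effect.
rng = np.random.default_rng(2)
n = 200
x = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = x @ np.array([1.0, 2.0, -1.0, 0.0]) + rng.normal(size=n)

# Global test H0: (beta_1, beta_2, beta_3) = 0, i.e. A = (0_3, I_3), a = 0_3.
a_global = np.hstack([np.zeros((3, 1)), np.eye(3)])
f_global, dof_global, beta_global = linear_restriction_f(x, y, a_global, np.zeros(3))
print(f"global test: F = {f_global:.2f}, d.f. = {dof_global}")

# Single restriction H0: beta_3 = 0 with the row vector A = (0, 0, 0, 1).
f_single, dof_single, _ = linear_restriction_f(x, y, [0, 0, 0, 1], [0.0])
print(f"single test: F = {f_single:.4f}, d.f. = {dof_single}")
```

In the global test the restricted estimate reduces to the sample mean of $y$ followed by zeros, and for a single restriction the $F$-statistic equals the square of the usual $t$-statistic.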

  1. First a global test of significance for the regression coefficients is performed,

    \begin{displaymath}H_0:\ (\beta_1,\dots,\beta_{13})=0.\end{displaymath}

    This is obtained by defining ${\data{A}}=(0_{13},{\data{I}}_{13})$ and $a=0_{13}$ so that $H_0$ is equivalent to ${\data{A}}\beta=a$, where $\beta=(\beta_0,\beta_1,\dots,\beta_{13})^\top$. Based on the observed values, the test statistic is $F=123.20$. This is highly significant ( $F_{0.95;13,492}=1.7401$), thus we reject $H_0$. Note that under $H_0$, $\widehat \beta_{H_0}=(3.0345,0,\dots,0)$, where $3.0345=\overline y$.
  2. Since we are interested in the effect that being located close to the river has on the value of the houses, the second test is $H_0:\ \beta_4=0$. This is done by fixing

    \begin{displaymath}
{\data{A}}=(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
\end{displaymath}

    and $a=0$ to obtain the equivalent hypothesis $H_0: {\data{A}}\beta = a$. The result is again significant: $F=9.0125$ ( $F_{0.95;1,492}=3.8604$) with a $p$-value of 0.0028. Note that this is the same $p$-value obtained in the individual test $\beta_4=0$ in Chapter 3, computed using a different setup.
  3. A third test addresses the fact that some of the regressors in the full model (3.57) appear to be insignificant (that is, they have high individual $p$-values). A joint test can check whether the corresponding reduced model, formulated by deleting the insignificant variables, is rejected by the data. We want to test $H_0:\ \beta_1=\beta_2=\beta_3=\beta_7=0$. Hence,

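    \begin{displaymath}
\setcounter{MaxMatrixCols}{14}{\data{A}}=
\begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}\setcounter{MaxMatrixCols}{10}\end{displaymath}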

    and $a=0_4$. The test statistic is 0.9344, which is not significant when compared with the $F_{4,492}$ distribution. Given that the $p$-value is equal to 0.44, we cannot reject the null hypothesis, and hence cannot reject the corresponding reduced model. The value of $\widehat\beta$ under the null hypothesis is

    \begin{displaymath}
\widehat\beta_{H_0}=(4.16, 0, 0, 0, 0.11, -0.31, 0.47, 0, -0.19, 0.05, -0.20, -0.04, 0.05,-0.26)^\top.
\end{displaymath}

    A possible reduced model is

    \begin{displaymath}
X_{14}=\beta_0+\beta_{4}X_{4}+\beta_{5}X_{5}+\beta_{6}X_{6}+\beta_{8}X_{8}+\dots+\beta_{13}X_{13}+\varepsilon.
\end{displaymath}

    Estimating this reduced model using OLS, as was done in Chapter 3, provides the results shown in Table 7.1.

    Table 7.1: Linear Regression for Boston Housing Data Set. MVAlinreg2bh.xpl
    Variable $\hat\beta_j$ $SE$ $t$ p-value
    const 4.1582 0.3628 11.462 0.0000
    $X_4$ 0.1087 0.0362 2.999 0.0028
    $X_5$ $-$0.3055 0.0973 $-$3.140 0.0018
    $X_6$ 0.4668 0.1059 4.407 0.0000
    $X_8$ $-$0.1855 0.0327 $-$5.679 0.0000
    $X_9$ 0.0492 0.0183 2.690 0.0074
    $X_{10}$ $-$0.2096 0.0446 $-$4.705 0.0000
    $X_{11}$ $-$0.0410 0.0078 $-$5.280 0.0000
    $X_{12}$ 0.0481 0.0112 4.306 0.0000
    $X_{13}$ $-$0.2588 0.0149 $-$17.396 0.0000


    Note that the reduced model has $r^2=0.763$, which is very close to the $r^2=0.765$ obtained from the full model. Clearly, including variables $X_1, X_2, X_3$, and $X_7$ does not provide valuable information in explaining the variation of $X_{14}$, the price of the houses.
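Two of the reported statistics can be cross-checked against the tables with elementary arithmetic. The sketch below relies on two standard identities that are not stated in this section and are therefore assumptions here: a single-restriction $F$-statistic equals the square of the corresponding $t$-statistic, and the global $F$-statistic for $q$ regressors can be written as $\frac{r^2/q}{(1-r^2)/(n-q-1)}$. Because the inputs are rounded, only approximate agreement can be expected.

```python
# Values quoted in the text (rounded as reported there).
t_beta4 = 3.002      # t-statistic of X_4 in the full model
f_beta4 = 9.0125     # F-statistic of the test H0: beta_4 = 0
r2_full = 0.765      # r^2 of the full model
f_global = 123.20    # F-statistic of the global test
n_obs, n_regressors = 506, 13  # 256 + 250 districts, X_1 to X_13

# Identity 1 (assumed): for a single restriction, F = t^2.
f_from_t = t_beta4 ** 2

# Identity 2 (assumed): global F from r^2 with (q, n - q - 1) d.f.
df_resid = n_obs - n_regressors - 1  # 492, as in F_{0.95;13,492}
f_from_r2 = (r2_full / n_regressors) / ((1 - r2_full) / df_resid)

print(f"t^2 = {f_from_t:.4f} (reported F = {f_beta4})")
print(f"F from r^2 = {f_from_r2:.2f} (reported F = {f_global})")
```

Both computed values agree with the reported statistics up to rounding, which is consistent with the remark in the second test that the joint-test setup and the individual $t$-test give the same $p$-value.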