3.6 Multiple Linear Model

The simple linear model and the analysis of variance model can be viewed as particular cases of a more general linear model where the variations of one variable $y$ are explained by $p$ explanatory variables $x_1,\ldots,x_p$. Let $y\,(n\times 1)$ be the vector of observations on the response variable and $\data{X}\,(n\times p)$ the data matrix of the $p$ explanatory variables. An important application of the theory developed so far is least squares fitting. The idea is to approximate $y$ by a linear combination $\widehat y$ of the columns of $\data{X}$, i.e., $\widehat y \in C(\data{X})$. The problem is to find $\widehat\beta \in \mathbb{R}^p$ such that $\widehat{y}=\data{X}\widehat\beta$ is the best fit of $y$ in the least-squares sense. The linear model can be written as
\begin{displaymath}
y=\data{X} \beta+\varepsilon,
\end{displaymath} (3.50)

where $\varepsilon$ is the vector of errors. The least squares solution is given by $\widehat\beta$:
\begin{displaymath}
\widehat\beta=\arg\min_\beta\ (y-\data{X}\beta)^{\top}(y-\data{X}\beta)=
\arg\min_\beta\ \varepsilon^{\top} \varepsilon.
\end{displaymath} (3.51)

Suppose that $(\data{X}^{\top}\data{X})$ is of full rank and thus invertible. Minimizing the expression (3.51) with respect to $\beta$ yields:
\begin{displaymath}
\widehat\beta=(\data{X}^{\top}\data{X})^{-1}\data{X}^{\top}y.
\end{displaymath} (3.52)
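To see this, expand the criterion in (3.51) and set its derivative with respect to $\beta$ equal to zero; this yields the normal equations:

\begin{displaymath}
\frac{\partial}{\partial\beta}\,(y-\data{X}\beta)^{\top}(y-\data{X}\beta)
= -2\,\data{X}^{\top}(y-\data{X}\beta) = 0
\quad\Longleftrightarrow\quad
\data{X}^{\top}\data{X}\,\beta = \data{X}^{\top}y,
\end{displaymath}

and solving the normal equations under the full-rank assumption gives (3.52).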

The fitted value $\widehat y=\data{X}\widehat\beta=\data{X}(\data{X}^{\top}\data{X})^{-1}
\data{X}^{\top}y=\data{P}y$ is the projection of $y$ onto $C(\data{X})$ as computed in (2.47).

The least squares residuals are

\begin{displaymath}
e=y-\widehat{y}=y-\data{X}\widehat\beta=\data{Q}y=(\data{I}_n-{\data P})y.
\end{displaymath}

The vector $e$ is the projection of $y$ onto the orthogonal complement of ${C}(\data{X})$.
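The following short sketch (in Python with numpy; the data matrix $\data{X}$ and the response $y$ are simply simulated here for illustration) shows how $\widehat\beta$, the fitted values $\widehat y=\data{P}y$ and the residuals $e=\data{Q}y$ can be computed numerically:

\begin{verbatim}
import numpy as np

# assumed example data: n = 10 observations, p = 2 explanatory variables
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = rng.normal(size=10)

# least squares estimator (3.52), via the normal equations X'X beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection matrix onto C(X)
y_hat = P @ y                          # fitted values, equal to X @ beta_hat
e = y - y_hat                          # residuals e = (I - P) y = Q y

print(np.allclose(X.T @ e, 0))         # residuals are orthogonal to C(X)
\end{verbatim}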

REMARK 3.5   A linear model with an intercept $\alpha$ can also be written in this framework. The approximating equation is:

\begin{displaymath}
y_i = {\alpha} + {\beta_1} x_{i1} + \ldots + {\beta_p} x_{ip} + \varepsilon_i\,;
\quad i=1, \ldots ,n.
\end{displaymath}

This can be written as:

\begin{displaymath}y = \data{X}^{*} {\beta}^{*} + \varepsilon\end{displaymath}

where $\data{X^{*}}=(1_n \; \data{X})$ (we add a column of ones to the data). We have by (3.52):

\begin{displaymath}
{\widehat\beta}^{*} = \left( \begin{array}{c} \widehat\alpha \\ \widehat\beta \end{array} \right) =
(\data{X}^{*\top} \data{X}^*)^{-1} \data{X}^{*\top}y.
\end{displaymath}
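In practice the column of ones is simply appended to the data matrix; a minimal sketch (with a small assumed data set):

\begin{verbatim}
import numpy as np

# assumed small data set; X holds the p explanatory variables column-wise
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([2.0, 1.0, 3.0, 4.0])

X_star = np.column_stack([np.ones(len(y)), X])              # X* = (1_n  X)
beta_star = np.linalg.solve(X_star.T @ X_star, X_star.T @ y)
alpha_hat, beta_hat = beta_star[0], beta_star[1:]           # intercept, slopes
\end{verbatim}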

EXAMPLE 3.15   Let us come back to the ``classic blue'' pullovers example. In Example 3.11, we considered the regression fit of the sales $X_{1}$ on the price $X_{2}$ and concluded that price had only a small influence on sales. A linear model incorporating all three explanatory variables allows us to approximate sales as a linear function of price ($X_{2}$), advertisement ($X_{3}$) and presence of sales assistants ($X_{4}$) simultaneously. Adding a column of ones to the data (in order to estimate the intercept $\alpha$) leads to

\begin{displaymath}\widehat\alpha=65.670 \textrm{ and } \widehat\beta_{1}= -0.216,\ \widehat\beta_{2}=0.485,\
\widehat\beta_{3}=0.844.\end{displaymath}

The coefficient of determination is computed as before in (3.40) and is:

\begin{displaymath}r^2= 1- \frac{e^{\top}e}{\sum {(y_i- \overline y)}^2} = 0.907.\end{displaymath}

We conclude that the variation of $X_{1}$ is well approximated by the linear relation.
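As a sketch, the fit of Example 3.15 could be reproduced as follows, assuming the pullover data have been loaded into numpy arrays with the hypothetical names sales, price, advert and assist:

\begin{verbatim}
import numpy as np

# hypothetical variable names: sales (X1), price (X2), advert (X3) and
# assist (X4) are assumed to hold the ten observations of the pullover data
def fit_pullover(sales, price, advert, assist):
    X_star = np.column_stack([np.ones(len(sales)), price, advert, assist])
    beta_star = np.linalg.lstsq(X_star, sales, rcond=None)[0]
    e = sales - X_star @ beta_star                  # least squares residuals
    r2 = 1 - e @ e / np.sum((sales - np.mean(sales)) ** 2)      # (3.40)
    return beta_star, r2        # (alpha, beta_1, beta_2, beta_3) and r^2
\end{verbatim}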

REMARK 3.6   The coefficient of determination is influenced by the number of regressors. For a given sample size $n$, the $r^2$ value never decreases (and typically increases) when additional regressors are added to the linear model. The value of $r^2$ may therefore be high even if possibly irrelevant regressors are included. A corrected coefficient of determination for $p$ regressors and a constant intercept ($p+1$ parameters) is
\begin{displaymath}
r^2_{\textrm{adj}} = r^2 - \frac{p(1-r^2)}{n-(p+1)}.
\end{displaymath} (3.53)
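In code, (3.53) is a one-line correction of $r^2$; a sketch:

\begin{verbatim}
def r2_adjusted(r2, n, p):
    """Adjusted coefficient of determination (3.53) for p regressors
    and a constant intercept, i.e. p + 1 estimated parameters."""
    return r2 - p * (1 - r2) / (n - (p + 1))
\end{verbatim}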

EXAMPLE 3.16   The corrected coefficient of determination for Example 3.15 is

\begin{eqnarray*}
r^2_{\textrm{adj}}& = & 0.907 - \frac{3(1-0.907)}{10-3-1}\\
& = & 0.861.
\end{eqnarray*}



This means that $86.1\%$ of the variation of the response variable is explained by the explanatory variables.

Note that the linear model (3.50) is very flexible and can model nonlinear relationships between the response $y$ and the explanatory variables $x$. For example, a quadratic relation in one variable $x$ could be included. Then $y_i=\alpha + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$ could be written in matrix notation as in (3.50), $y=\data{X} \beta +\varepsilon$, where

\begin{displaymath}
\data{X}=\left(
\begin{array}{ccc}
1 & x_{1} & x^2_{1}\\
1 & x_{2} & x^2_{2}\\
\vdots & \vdots & \vdots\\
1 & x_{n} & x^2_{n}
\end{array}\right).
\end{displaymath}
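Constructing such a polynomial design matrix is straightforward; a sketch (x is an assumed vector of observations of the single explanatory variable):

\begin{verbatim}
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])            # assumed observations of x
X = np.column_stack([np.ones_like(x), x, x ** 2])  # columns 1, x and x^2
# X can now be used in (3.52) as before; the model remains linear in beta
# even though the fitted relation between y and x is quadratic
\end{verbatim}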



Properties of $\widehat\beta$

When $y_i$ is the $i$-th observation of a random variable $Y$, the errors are also random. Under standard assumptions (independence, zero mean and constant variance $\sigma^2$), inference can be conducted on $\beta$. Using the properties of Chapter 4, it is easy to prove:

\begin{eqnarray*}
&&E(\widehat \beta)=\beta\\
&&\hbox{Var}(\widehat \beta)=\sigma^2(\data{X}^{\top}\data{X})^{-1}.
\end{eqnarray*}
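Both properties follow directly by substituting the model (3.50) into (3.52), treating $\data{X}$ as fixed:

\begin{displaymath}
\widehat\beta=(\data{X}^{\top}\data{X})^{-1}\data{X}^{\top}(\data{X}\beta+\varepsilon)
=\beta+(\data{X}^{\top}\data{X})^{-1}\data{X}^{\top}\varepsilon,
\end{displaymath}

so that $E(\widehat\beta)=\beta$ since $E(\varepsilon)=0$, and

\begin{displaymath}
\hbox{Var}(\widehat\beta)=(\data{X}^{\top}\data{X})^{-1}\data{X}^{\top}\,
\hbox{Var}(\varepsilon)\,\data{X}(\data{X}^{\top}\data{X})^{-1}
=\sigma^2(\data{X}^{\top}\data{X})^{-1}
\end{displaymath}

since $\hbox{Var}(\varepsilon)=\sigma^2\data{I}_n$.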



The analogue of the $t$-test for this multiple linear regression situation is

\begin{displaymath}t= \frac{\widehat{\beta}_j}{SE(\widehat{\beta}_j)}.\end{displaymath}

The standard error $SE(\widehat{\beta}_j)$ of each coefficient $\widehat{\beta}_j$ is given by the square root of the $j$-th diagonal element of the matrix $\hbox{Var}(\widehat{\beta})$. In standard situations, the variance $\sigma^2$ of the error $\varepsilon$ is not known. One may estimate it by

\begin{displaymath}
\hat\sigma^2=\frac{1}{n-(p+1)}(y-\hat{y})^{\top}(y-\hat{y}),
\end{displaymath}

where $(p+1)$ is the number of estimated parameters (the intercept and the $p$ slope coefficients). In testing $\beta_j=0$ we reject the hypothesis at the significance level $\alpha$ if $\vert t\vert\geq t_{1-\alpha/2;n-(p+1)}$. More general issues on testing linear models are addressed in Chapter 7.
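A sketch of the corresponding computations (assuming a design matrix X_star that already contains the column of ones, and a response vector y, both as numpy arrays):

\begin{verbatim}
import numpy as np

def t_statistics(X_star, y):
    """Coefficient estimates, standard errors and t-values for the linear
    model y = X* beta* + eps; X_star is assumed to already contain the
    column of ones for the intercept."""
    n, k = X_star.shape                        # k = p + 1 parameters
    XtX_inv = np.linalg.inv(X_star.T @ X_star)
    beta_hat = XtX_inv @ X_star.T @ y
    resid = y - X_star @ beta_hat
    sigma2_hat = resid @ resid / (n - k)       # estimate of sigma^2
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
    return beta_hat, se, beta_hat / se
\end{verbatim}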

The ANOVA Model in Matrix Notation

The simple ANOVA problem (Section 3.5) may also be rewritten in matrix terms. Recall the definition of a vector of ones from (2.1) and define a vector of zeros as $0_{n}$. Then construct the following $(n\times p)$ matrix (here $p=3$),

\begin{displaymath}
\data{X}=\left(\begin{array}{ccc}
1_{m} & 0_{m} & 0_{m}\\
0_{m} & 1_{m} & 0_{m}\\
0_{m} & 0_{m} & 1_{m}
\end{array}\right),
\end{displaymath} (3.54)

where $m=10$. Equation (3.41) then reads as follows.

The parameter vector is $\beta=(\mu_{1},\mu_{2},\mu_{3})^{\top}$. The data set from Example 3.14 can therefore be written as a linear model $y=\data{X}\beta + \varepsilon$ where $y \in \mathbb{R}^n$ with $n=m\cdot p$ is the stacked vector of the columns of Table 3.1. The projection onto the column space $C(\data{X})$ of (3.54) yields the least-squares estimator $\hat{\beta}=(\data{X}^{\top}\data{X})^{-1}\data{X}^{\top}y$. Note that $(\data{X}^{\top}\data{X})^{-1} =(1/10)\data{I}_{3}$ and that $\data{X}^{\top}y=(106,124,151)^{\top}$ contains the sums $\sum_{k=1}^{m} y_{kj}$ for each factor level, i.e., the 3 column sums of Table 3.1. The least squares estimator is therefore the vector $\hat{\beta}_{H_{1}}=(\hat{\mu}_{1},\hat{\mu}_{2},\hat{\mu}_{3})^{\top}
=(10.6, 12.4, 15.1)^{\top}$ of sample means for each factor level $j=1,2,3$. Under the null hypothesis of equal mean values $\mu_1=\mu_2=\mu_3=\mu$, the parameters are estimated subject to this restriction, which can be put into the form of a linear constraint:

\begin{eqnarray*}
-\mu_{1}+\mu_2 & = & 0 \\
-\mu_1 + \mu_3 & = & 0 .
\end{eqnarray*}



This can be written as $\data{A}\beta=a$, where


\begin{displaymath}a=\left(
\begin{array}{c}
0\\
0
\end{array}\right)
\end{displaymath}

and

\begin{displaymath}\data{A}=\left(
\begin{array}{ccc}
-1 & 1 & 0\\
-1 & 0 & 1
\end{array}\right).
\end{displaymath}

The constrained least-squares solution can be shown (Exercise 3.24) to be given by:
\begin{displaymath}
\hat{\beta}_{H_0}=\hat{\beta}_{H_1}-(\data{X}^{\top}\data{X})^{-1}\data{A}^{\top}
\{\data{A}(\data{X}^{\top}\data{X})^{-1} \data{A}^{\top}\}^{-1}(\data{A}\hat{\beta}_{H_1}-a).
\end{displaymath} (3.55)

It turns out that (3.55) amounts to simply calculating the overall mean $\bar{y}= 12.7$ of the response variable $y$: $\hat{\beta}_{H_0}= ( 12.7, 12.7, 12.7)^{\top}$.

The F-test that has already been applied in Example 3.14 can be written as

\begin{displaymath}
F=\frac{\{\vert\vert y-\data{X}\hat{\beta}_{H_0}\vert\vert^2-\vert\vert y-\data{X}\hat{\beta}_{H_1}\vert\vert^2\}/2}
{\vert\vert y-\data{X}\hat{\beta}_{H_1}\vert\vert^2/27}
\end{displaymath} (3.56)

which gives the same significant value $F=8.78$ as in Example 3.14. Note that we again compare the $RSS_{H_0}$ of the reduced model with the $RSS_{H_1}$ of the full model. This corresponds to comparing the lengths of the projections onto different column spaces. This general approach to testing linear models is described in detail in Chapter 7.
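A compact sketch of the whole computation, from the design matrix (3.54) through the constrained estimator (3.55) to the F statistic (3.56); y is assumed to be the stacked 30-vector formed from the columns of Table 3.1:

\begin{verbatim}
import numpy as np

def anova_f_test(y, m=10, p=3):
    """F-test (3.56) of H0: mu_1 = mu_2 = mu_3 for p = 3 groups of m
    observations each; y is the stacked vector of the columns of Table 3.1."""
    X = np.kron(np.eye(p), np.ones((m, 1)))       # design matrix (3.54)
    XtX_inv = np.linalg.inv(X.T @ X)              # equals (1/m) I_p
    beta_h1 = XtX_inv @ X.T @ y                   # unconstrained group means
    A = np.array([[-1.0, 1.0, 0.0],
                  [-1.0, 0.0, 1.0]])              # linear constraints A beta = a
    a = np.zeros(2)
    beta_h0 = beta_h1 - XtX_inv @ A.T @ np.linalg.solve(
        A @ XtX_inv @ A.T, A @ beta_h1 - a)       # constrained estimator (3.55)
    rss_h0 = np.sum((y - X @ beta_h0) ** 2)
    rss_h1 = np.sum((y - X @ beta_h1) ** 2)
    return ((rss_h0 - rss_h1) / 2) / (rss_h1 / (m * p - p))   # 2 and 27 df
\end{verbatim}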

Summary
$\ast$
The relation $y=\data{X} \beta + \varepsilon$ models a linear relation between a one-dimensional variable $Y$ and a $p$-dimensional variable $X$. $\data{P}y$ gives the best linear regression fit of the vector $y$ onto $C(\data{X})$. The least squares parameter estimator is $\widehat\beta
= (\data{X}^{\top}\data{X})^{-1}\data{X}^{\top}y$.
$\ast$
The simple ANOVA model can be written as a linear model.
$\ast$
The ANOVA model can be tested by comparing the length of the projection vectors.
$\ast$
The test statistic of the F-Test can be written as

\begin{displaymath}
\frac{\{\vert\vert y-\data{X}\hat{\beta}_{H_0}\vert\vert^2-\vert\vert y-\data{X}\hat{\beta}_{H_1}\vert\vert^2\}/\{df(r)-df(f)\}}
{\vert\vert y-\data{X}\hat{\beta}_{H_1}\vert\vert^2/df(f)}.
\end{displaymath}

$\ast$
The adjusted coefficient of determination is

\begin{displaymath}
r^2_{\textrm{adj}} = r^2 - \frac{p(1-r^2)}{n-(p+1)}.
\end{displaymath}