3.4 Linear Model for Two Variables

We have looked many times now at downward- and upward-sloping scatterplots. What does the eye define here as slope? Suppose that we can construct a line corresponding to the general direction of the cloud. The sign of the slope of this line would correspond to the upward and downward directions. Call the variable on the vertical axis $Y$ and the one on the horizontal axis $X$. A slope line is a linear relationship between $X$ and $Y$:
\begin{displaymath}
y_{i}=\alpha+\beta x_{i} + \varepsilon_{i},\ i=1,\ldots,n.
\end{displaymath} (3.27)

Here, $\alpha$ is the intercept and $\beta$ is the slope of the line. The errors (or deviations from the line) are denoted as $\varepsilon_{i}$ and are assumed to have zero mean and finite variance $\sigma^2$. The task of finding $(\alpha,\beta)$ in (3.27) is referred to as a linear adjustment.

In Section 3.6 we shall derive estimators for $\alpha$ and $\beta$ more formally, as well as describe more precisely what a ``good'' estimator is. For now, one may try to find a ``good'' estimator $(\widehat\alpha,\widehat\beta)$ via graphical techniques. A very common numerical and statistical technique is to use those $\widehat\alpha$ and $\widehat\beta$ that minimize:

\begin{displaymath}
(\widehat\alpha,\widehat\beta) = \arg\min_{(\alpha,\beta)}
\sum_{i=1}^{n} (y_{i}-\alpha-\beta x_{i})^2.
\end{displaymath} (3.28)
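Setting the partial derivatives of the sum in (3.28) with respect to $\alpha$ and $\beta$ equal to zero gives the normal equations
\begin{displaymath}
\sum_{i=1}^{n}(y_{i}-\alpha-\beta x_{i}) = 0, \qquad
\sum_{i=1}^{n} x_{i}\,(y_{i}-\alpha-\beta x_{i}) = 0.
\end{displaymath}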

The solutions to this task are the estimators:
\begin{displaymath}
\widehat\beta = \frac{s_{XY}}{s_{XX}},
\end{displaymath} (3.29)
\begin{displaymath}
\widehat\alpha = \overline y -\widehat\beta \overline x.
\end{displaymath} (3.30)
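To reproduce these estimators numerically, the following minimal Python sketch may be used (illustrative only, assuming numpy is available; it is not one of the XploRe quantlets referenced in this chapter). It uses the empirical quantities $s_{XY}$ and $s_{XX}$ with the $1/n$ normalization employed throughout this chapter.
\begin{verbatim}
import numpy as np

def least_squares_line(x, y):
    """Return (alpha_hat, beta_hat) for the simple linear model (3.27)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    s_xy = np.mean((x - x.mean()) * (y - y.mean()))  # empirical covariance s_XY
    s_xx = np.var(x)                                 # empirical variance  s_XX (1/n)
    beta_hat = s_xy / s_xx                           # (3.29)
    alpha_hat = y.mean() - beta_hat * x.mean()       # (3.30)
    return alpha_hat, beta_hat
\end{verbatim}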

The variance of $\widehat\beta$ is:
\begin{displaymath}
Var(\widehat\beta) = \frac{\sigma^2}{n \cdot s_{XX}}.
\end{displaymath} (3.31)

The standard error (SE) of the estimator is the square root of (3.31),
\begin{displaymath}
SE(\widehat\beta) = \{Var(\widehat\beta)\}^{1/2} = \frac{\sigma}{(n \cdot s_{XX})^{1/2}}.
\end{displaymath} (3.32)

We can use this formula to test the hypothesis that $\beta=0$. In an application the variance $\sigma^2$ has to be replaced by an estimator $\widehat\sigma^2$, which will be given below. Under the assumption of normally distributed errors, the $t$-test for the hypothesis $\beta=0$ works as follows.

One computes the statistic

\begin{displaymath}
t = \frac{\widehat\beta}{SE(\widehat\beta)}
\end{displaymath} (3.33)

and rejects the null hypothesis at the 5% significance level if $\vert t \vert \geq t_{0.975;n-2}$, where $t_{0.975;n-2}$ denotes the 97.5% quantile of the Student's $t_{n-2}$ distribution and serves as the critical value of the two-sided test at the 5% level. For $n \geq 30$, this quantile can be replaced by 1.96, the 97.5% quantile of the normal distribution.
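A minimal Python sketch of this test (assuming numpy and scipy are available; the function name t\_test\_slope is ours, not a quantlet). It plugs the estimator $\widehat\sigma^2 = RSS/(n-2)$ defined further below into (3.32).
\begin{verbatim}
import numpy as np
from scipy.stats import t as student_t

def t_test_slope(x, y, level=0.05):
    """Two-sided t-test of H0: beta = 0 in model (3.27)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    s_xx = np.var(x)                                  # empirical variance (1/n)
    s_xy = np.mean((x - x.mean()) * (y - y.mean()))
    beta_hat = s_xy / s_xx                            # (3.29)
    alpha_hat = y.mean() - beta_hat * x.mean()        # (3.30)
    resid = y - (alpha_hat + beta_hat * x)
    sigma2_hat = resid @ resid / (n - 2)              # RSS/(n-2), see (3.37)
    se_beta = np.sqrt(sigma2_hat / (n * s_xx))        # (3.32) with sigma estimated
    t_stat = beta_hat / se_beta                       # (3.33)
    crit = student_t.ppf(1 - level / 2, df=n - 2)     # t_{1-level/2; n-2}
    return t_stat, crit, abs(t_stat) >= crit
\end{verbatim}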

EXAMPLE 3.10   Let us apply the linear regression model (3.27) to the ``classic blue'' pullovers. The sales manager believes that there is a strong dependence of the number of sales on the price. He computes the regression line as shown in Figure 3.5.

Figure 3.5: Regression of sales ($X_{1}$) on price ($X_{2}$) of pullovers. MVAregpull.xpl
\includegraphics[width=1\defpicwidth]{MVAregpull.ps}

How good is this fit? This can be judged via goodness-of-fit measures. Define

\begin{displaymath}
\widehat y_i = \widehat\alpha + \widehat\beta x_i,
\end{displaymath} (3.34)

as the predicted value of $y$ as a function of $x$. With $\widehat y$ the textile shop manager in the above example can predict sales as a function of prices $x$. The variation in the response variable is:
\begin{displaymath}
n s_{YY} = \sum_{i=1}^n (y_{i}-\overline y)^2.
\end{displaymath} (3.35)

The variation explained by the linear regression (3.27) with the predicted values (3.34) is:
\begin{displaymath}
\sum_{i=1}^n (\widehat y_{i}-\overline y)^2.
\end{displaymath} (3.36)

The residual sum of squares, the minimum in (3.28), is given by:
\begin{displaymath}
RSS = \sum_{i=1}^n (y_{i}-\widehat y_{i})^2.
\end{displaymath} (3.37)

An unbiased estimator $\widehat\sigma^2$ of $\sigma^2$ is given by $RSS/(n-2)$.
The following relation holds between (3.35)-(3.37):
\begin{displaymath}
\sum_{i=1}^n (y_{i}-\overline y)^2 = \sum_{i=1}^n (\widehat y_{i}-\overline y)^2 +
\sum_{i=1}^n (y_{i}-\widehat y_{i})^2,
\end{displaymath} (3.38)
\begin{displaymath}
\textrm{\it total variation} = \textrm{\it explained variation} + \textrm{\it unexplained variation}.
\end{displaymath}

The coefficient of determination is $r^2$:
\begin{displaymath}
r^2 = \frac{\sum\limits_{i=1}^n (\widehat y_{i}-\overline y)^2}
{\sum\limits_{i=1}^n (y_{i}-\overline y)^2} =
\frac{\textrm{\it explained variation}}{\textrm{\it total variation}}\cdot
\end{displaymath} (3.39)

The coefficient of determination increases with the proportion of variation explained by the linear relation (3.27). In the extreme case $r^2=1$, all of the variation is explained by the linear regression (3.27). The other extreme, $r^2=0$, occurs when the empirical covariance is $s_{XY}=0$. The coefficient of determination can be rewritten as
\begin{displaymath}
r^2 = 1- \frac{\sum\limits_{i=1}^n (y_{i}- \widehat y_{i})^2}
{\sum\limits_{i=1}^n (y_{i}-\overline y)^2}.
\end{displaymath} (3.40)

From (3.39), it can be seen that in the linear regression (3.27), $r^2 =\ r^2_{XY}$ is the square of the correlation between $X$ and $Y$.
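The decomposition (3.38) and the coefficient of determination (3.40) can be computed directly from the data; a short Python sketch (illustrative, numpy assumed) follows.
\begin{verbatim}
import numpy as np

def r_squared(x, y):
    """Coefficient of determination (3.40) for the fit (3.34)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    beta_hat = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)  # (3.29)
    alpha_hat = y.mean() - beta_hat * x.mean()                       # (3.30)
    y_hat = alpha_hat + beta_hat * x                                 # (3.34)
    rss = np.sum((y - y_hat) ** 2)                                   # (3.37)
    tss = np.sum((y - y.mean()) ** 2)                                # (3.35)
    return 1.0 - rss / tss                                           # (3.40)
\end{verbatim}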

EXAMPLE 3.11   For the above pullover example, we estimate

\begin{displaymath}\widehat\alpha= 210.774 \quad \textrm{ and }\quad \widehat\beta=-0.364.\end{displaymath}

The coefficient of determination is

\begin{displaymath}r^2=0.028.\end{displaymath}

The textile shop manager concludes that sales are not influenced very much by the price (in a linear way).

Figure 3.6: Regression of sales ($X_1$) on price ($X_2$) of pullovers. The overall mean is given by the dashed line. MVAregzoom.xpl
\includegraphics[width=1\defpicwidth]{MVAregzoom.ps}

The geometry of formula (3.38) can be read off Figure 3.6, which shows a section of the linear regression of ``sales'' on ``price'' for the pullovers data. The distance between any observation and the overall mean is decomposed into the distance between the observation and the regression line and the distance between the regression line and the mean. The sums of squares of these two distances are the unexplained variation (distance from the observation to the regression line) and the explained variation (distance from the regression line to the mean); together they add up to the total variation (solid blue lines from the observations to the overall mean).

In general the regression of $Y$ on $X$ is different from that of $X$ on $Y$. We will demonstrate this using once again the Swiss bank notes data.

Figure 3.7: Regression of $X_{5}$ (upper inner frame) on $X_{4}$ (lower inner frame) for genuine bank notes. MVAregbank.xpl
\includegraphics[width=1\defpicwidth]{MVAregbank.ps}

EXAMPLE 3.12   The least squares fit of the variables $X_4$ ($X$) and $X_5$ ($Y$) from the genuine bank notes is calculated. Figure 3.7 shows the fitted line if $X_{5}$ is approximated by a linear function of $X_{4}$. In this case the parameters are

\begin{displaymath}\widehat\alpha= 15.464 \quad \textrm{ and }\quad \widehat\beta= -0.638.\end{displaymath}

If we instead predict $X_{4}$ by a function of $X_{5}$, we arrive at a different intercept and slope:

\begin{displaymath}\widehat\alpha= 14.666 \quad \textrm{ and }\quad \widehat\beta= -0.626.\end{displaymath}

The linear regression of $Y$ on $X$ is given by minimizing (3.28), i.e., the vertical errors $\varepsilon_i$. The linear regression of $X$ on $Y$ does the same but here the errors to be minimized in the least squares sense are measured horizontally. As seen in Example 3.12, the two least squares lines are different although both measure (in a certain sense) the slope of the cloud of points.
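A small simulated illustration of this asymmetry (a Python sketch with artificial data; the variable names and the generating model are ours, chosen only for demonstration):
\begin{verbatim}
import numpy as np

def slope_intercept(u, v):
    """Least squares line of v on u, cf. (3.29)-(3.30)."""
    b = np.mean((u - u.mean()) * (v - v.mean())) / np.var(u)
    return v.mean() - b * u.mean(), b

rng = np.random.default_rng(0)                 # simulated data, illustrative only
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.8, size=100)

a_yx, b_yx = slope_intercept(x, y)             # Y on X: vertical errors
a_xy, b_xy = slope_intercept(y, x)             # X on Y: horizontal errors

# Rewritten in the form y = -a_xy/b_xy + (1/b_xy) x, the second line has
# slope 1/b_xy, which differs from b_yx unless all points lie on one line.
print(b_yx, 1.0 / b_xy)
\end{verbatim}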

As shown in the next example, there is still one other way to measure the main direction of a cloud of points: it is related to the spectral decomposition of covariance matrices.

EXAMPLE 3.13   Suppose that we have the following covariance matrix:

\begin{displaymath}{\Sigma} = \left (\begin{array}{ll} 1 &\rho \\ \rho &1\end{array}\right ).\end{displaymath}

Figure 3.8 shows a scatterplot of a sample of two normal random variables with such a covariance matrix (with $\rho = 0.8$).

Figure 3.8: Scatterplot for a sample of two correlated normal random variables (sample size $n = 150$, $\rho = 0.8$). MVAcorrnorm.xpl
\includegraphics[width=1\defpicwidth]{corrnorm.ps}

The eigenvalues of $\Sigma$ are, as was shown in Example 2.4, solutions to:

\begin{displaymath}\left \vert\begin{array}{cc} 1-\lambda &\rho \\ \rho &1-\lambda
\end{array}\right \vert=0.\end{displaymath}

Hence, $ \lambda _1=1+\rho $ and $\lambda _2=1-\rho $. Therefore $\Lambda = \mathop{\hbox{diag}}(1+\rho ,1-\rho)$. The eigenvector corresponding to $ \lambda _1=1+\rho $ can be computed from the system of linear equations:

\begin{displaymath}\left (\begin{array}{cc} 1&\rho \\ \rho &1\end{array}\right )
\left (\begin{array}{c} x_1\\ x_2\end{array}\right ) = (1+\rho )
\left (\begin{array}{c} x_1\\ x_2\end{array}\right ) \end{displaymath}

or

\begin{displaymath}\begin{array}{rcl}
x_1+\rho x_2&=& x_1+\rho x_1\\
\rho x_1+x_2&=& x_2+\rho x_2
\end{array} \end{displaymath}

and thus

\begin{displaymath}x_1 = x_2.\end{displaymath}

The first (standardized) eigenvector is

\begin{displaymath}\gamma _{\col{1}} = \left (\begin{array}{r} 1\big/\sqrt 2 \\
1\big/\sqrt 2 \end{array}\right ). \end{displaymath}

The direction of this eigenvector is the diagonal in Figure 3.8 and captures the main variation in this direction. We shall come back to this interpretation in Chapter 9. The second eigenvector (orthogonal to $\gamma_{\col{1}}$) is

\begin{displaymath}\gamma _{\col{2}} = \left (\begin{array}{r} 1\big/\sqrt 2\\
-1\big/\sqrt 2 \end{array}\right ). \end{displaymath}

So finally

\begin{displaymath}\Gamma=\left( \gamma_{\col{1}}, \gamma_{\col{2}}
\right) = \left (\begin{array}{rr} 1\big/\sqrt 2 & 1\big/\sqrt 2\\
1\big/\sqrt 2 & -1\big/\sqrt 2 \end{array} \right )\end{displaymath}

and we can check our calculation by

\begin{displaymath}\Sigma = \Gamma\ \Lambda\ \Gamma^{\top} \ . \end{displaymath}
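This check can also be carried out numerically; a short numpy sketch for $\rho = 0.8$, the value used in Figure 3.8 (note that numpy returns the eigenvalues in ascending order and may flip the signs of the eigenvectors):
\begin{verbatim}
import numpy as np

rho = 0.8
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])

lam, Gamma = np.linalg.eigh(Sigma)     # eigenvalues ascending, eigenvectors as columns
Lambda = np.diag(lam)

print(lam)                                            # [0.2, 1.8] = [1-rho, 1+rho]
print(Gamma)                                          # columns along (1,-1) and (1,1)
print(np.allclose(Gamma @ Lambda @ Gamma.T, Sigma))   # Sigma = Gamma Lambda Gamma^T
\end{verbatim}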

The first eigenvector captures the main direction of a point cloud. The linear regressions of $Y$ on $X$ and of $X$ on $Y$ accomplish, in a certain sense, the same thing. In general, however, the direction of the first eigenvector and the least squares slope are different. The reason is that the least squares estimator minimizes either vertical or horizontal errors (in (3.28)), whereas the first eigenvector corresponds to minimizing the errors orthogonal to its own direction (see Chapter 9).
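For the covariance matrix of Example 3.13 this difference is easy to quantify at the population level: since both variances equal one, the least squares slope of $Y$ on $X$ is
\begin{displaymath}
\beta = \frac{\sigma_{XY}}{\sigma_{XX}} = \rho = 0.8,
\end{displaymath}
whereas the first eigenvector $\gamma_{\col{1}} = \left(1\big/\sqrt 2,\ 1\big/\sqrt 2\right)^{\top}$ points along the diagonal and thus has slope $1$.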

Summary
$\ast$
The linear regression $y = \alpha + \beta x + \varepsilon$ models a linear relation between two one-dimensional variables.
$\ast$
The sign of the slope $\widehat\beta$ is the same as that of the covariance and the correlation of $x$ and $y$.
$\ast$
A linear regression predicts values of $Y$ given a possible observation $x$ of $X$.
$\ast$
The coefficient of determination $r^2$ measures the amount of variation in $Y$ which is explained by a linear regression on $X$.
$\ast$
If the coefficient of determination is $r^2$ = 1, then all points lie on one line.
$\ast$
The regression line of $X$ on $Y$ and the regression line of $Y$ on $X$ are in general different.
$\ast$
The $t$-test for the hypothesis $\beta$ = 0 is $ t= \frac{\widehat\beta}{SE(\widehat\beta)}$, where $SE(\widehat\beta)=\frac{\hat{\sigma}}{(n \cdot s_{XX})^{1/2}}$.
$\ast$
The $t$-test rejects the null hypothesis $\beta$ = 0 at the level of significance $\alpha$ if $\vert t \vert \geq t_{1-\alpha/2;n-2}$, where $t_{1-\alpha/2;n-2}$ is the $1-\alpha/2$ quantile of the Student's $t$-distribution with $(n-2)$ degrees of freedom.
$\ast$
The standard error $SE(\widehat\beta)$ increases/decreases with less/more spread in the $X$ variables.
$\ast$
The direction of the first eigenvector of the covariance matrix of a two-dimensional point cloud is in general different from the direction of the least squares regression line.