4.2 Multiple Linear Regression


{b,bse,bstan,bpval} = linreg (x, y {,opt,om})
  estimates the coefficients $ \beta_0,\ldots,\beta_p$ for a linear problem from data x and y and calculates the ANOVA table
{xs,bs,pvalue} = linregfs (x, y {,alpha})
  performs forward selection and estimates the coefficients $ \beta_0,\ldots,\beta_p$ for a linear problem from data x and y
{b,bse,bstan,bpval} = linregfs2 (x, y, colname {,opt})
  performs forward selection, estimates the coefficients $ \beta_0,\ldots,\beta_p$ for a linear problem from data x and y and calculates the ANOVA table
{b,bse,bstan,bpval} = linregbs (x, y, colname {,opt})
  performs backward elimination, estimates the coefficients $ \beta_0,\ldots,\beta_p$ for a linear problem from data x and y and calculates the ANOVA table
{b,bse,bstan,bpval} = linregstep (x, y, colname {,opt})
  performs stepwise selection, estimates the coefficients $ \beta_0,\ldots,\beta_p$ for a linear problem from data x and y and calculates the ANOVA table

In this section, we consider the linear model

$\displaystyle Y=\beta_0+\beta_1 X_1+\ldots+\beta_p X_p\,.$

Looking at this model, we are faced with two problems: the first is to estimate the coefficients $ \beta_0,\ldots,\beta_p$ from the data; the second is to decide which of the variables $ X_1,\ldots,X_p$ should enter the model at all. From the mathematical point of view, the second problem amounts to reducing the dimension of the model. Seen through the eyes of the user, it is about building a parsimonious model. To this end, we have to find a way to handle the selection or removal of variables.

We want to use a simulated data set to demonstrate the solution to these two problems. This example is stored in XLGregr3.xpl, where we generate five uniform $ [0,1]$ distributed variables $ X_1,\ldots,X_5$. Only three of them influence $ Y$:

$\displaystyle Y=2+2\,X_1-10\,X_3+0.5\,X_4+\varepsilon\,.$

Here, $ \varepsilon$ is a normally distributed error term; in the code below it is scaled by 1/10, so its standard deviation is 0.1.
  randomize(1)        ; sets a seed for the random generator
  eps=normal(10)      ; generates 10 standard normal errors
  x1=uniform(10)      ; generates 10 uniformly distributed values
  x2=uniform(10)
  x3=uniform(10)
  x4=uniform(10)
  x5=uniform(10)
  x=x1~x2~x3~x4~x5    ; creates the x data matrix
  y=2+2*x1-10*x3+0.5*x4+eps/10 ; creates y
  z=x~y               ; creates the data matrix z
  z                   ; returns z
This shows
  Contents of z
  [ 1,]  0.98028  0.35235  0.29969  0.85909  0.62176   1.3936 
  [ 2,]  0.83795  0.82747  0.13025  0.79595  0.59754   2.7269 
  [ 3,]  0.15873  0.93534  0.91259  0.72789  0.43156  -6.5193 
  [ 4,]  0.67269  0.67909  0.28156  0.20918  0.19878   0.69022 
  [ 5,]  0.50166  0.97112  0.39945  0.57865  0.19337  -0.66278 
  [ 6,]  0.94527  0.36003  0.77747  0.029797 0.40124  -3.9237 
  [ 7,]  0.18426  0.29004  0.24534  0.44418  0.35116   0.11605 
  [ 8,]  0.36232  0.35453  0.53022  0.4497   0.8062   -2.3026 
  [ 9,]  0.50832  0.00516  0.90669  0.16523  0.75683  -5.9188 
  [10,]  0.76022  0.17825  0.37929  0.093234 0.17747  -0.20187

Let us start with the first problem and use the quantlet linreg to estimate the parameters of the model

$\displaystyle Y=\beta_0+\beta_1X_1+\ldots+\beta_5X_5+\varepsilon$

  {beta,bse,bstan,bpval}=linreg(x,y)  ; computes the linear
                                      ;    regression
produces
  A  N  O  V  A            SS    df     MSS     F-test   P-value
  ______________________________________________________________
  Regression             87.241   5    17.448  4700.763   0.0000
  Residuals               0.015   4     0.004
  Total Variation        87.255   9     9.695
  
  Multiple R      = 0.99991
  R^2             = 0.99983
  Adjusted R^2    = 0.99962
  Standard Error  = 0.06092
  
  
  PARAMETERS        Beta      SE     StandB     t-test   P-value
  ______________________________________________________________
  b[ 0,]=         2.0745    0.0941   0.0000     22.056   0.0000
  b[ 1,]=         1.9672    0.0742   0.1875     26.517   0.0000
  b[ 2,]=         0.0043    0.0995   0.0005      0.043   0.9677
  b[ 3,]=       -10.0887    0.0936  -0.9201   -107.759   0.0000
  b[ 4,]=         0.3991    0.1203   0.0387      3.318   0.0294
  b[ 5,]=         0.0708    0.1355   0.0053      0.523   0.6289

We obtain the ANOVA and parameter tables, which contain the same values as found in the previous section. Substituting the estimated parameters $ \widehat\beta_0,\ldots,\widehat\beta_5$, we get for our generated data set

$\displaystyle \widehat Y(x)= 2.0745+1.9672x_1+0.0043x_2-10.0887x_3+0.3991x_4+0.0708x_5\,.$
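
As a quick plausibility check, plugging the first observation of z (see the data listing above) into this fitted equation gives

$\displaystyle \widehat Y(x_{(1)})= 2.0745+1.9672\cdot 0.98028+0.0043\cdot 0.35235-10.0887\cdot 0.29969+0.3991\cdot 0.85909+0.0708\cdot 0.62176\approx 1.3678\,,$

which is close to the observed value $ y_1=1.3936$; the deviation of about 0.026 is of the order of the error standard deviation 0.1.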

We know that $ X_2$ and $ X_5$ do not have any influence on $ Y$. This is reflected by the fact that the estimates of $ \beta _2$ and $ \beta_5$ are close to zero. Now we reach the point where we are faced with our second problem: how to eliminate these variables. We can get a first impression by considering the parameter estimates and their $ t$-values in the parameter table. A $ t$-value is small if the corresponding variable has no influence. This is reflected in the $ p$-value, which gives the smallest significance level at which the hypothesis that the parameter equals zero would be rejected. From the above table, we can see that only the $ p$-values for the constant, $ X_1$, $ X_3$ and $ X_4$ are smaller than 0.05, the typical significance level for hypothesis testing. The $ p$-values of $ X_2$ and $ X_5$ are much larger than 0.05, which means that these coefficients are not significantly different from zero.
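
Having identified $ X_2$ and $ X_5$ as insignificant, we could of course refit by hand using only the remaining regressors. A minimal sketch with the vectors generated above:
  xr=x1~x3~x4           ; keeps only the significant regressors
  {beta,bse,bstan,bpval}=linreg(xr,y)
                        ; recomputes the regression for the reduced model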

The above way of choosing variables is convenient, but we also want to know whether the elimination or selection of a variable actually improves the result. This leads immediately to the stepwise model selection methods.

Let us first consider forward selection. The idea is to start from one ``good'' variable $ X_j$ and calculate the simple linear regression for

$\displaystyle Y=\beta_0+\beta_j X_j+\varepsilon\,.$

Then we decide stepwise, for each of the remaining variables, whether its inclusion in the model improves the fit. The algorithm is as follows (a minimal XploRe call is sketched after the list):
FS1
  Choose the variable $ X_j$ with the highest $ t$- or $ F$-value and calculate the simple linear regression.
FS2
  Of the remaining variables, add the variable $ X_k$ which fulfills one of the three (equivalent) criteria:
  - $ X_k$ has the highest sample partial correlation.
  - The model with $ X_k$ increases the $ R^2$-value the most.
  - $ X_k$ has the highest $ t$- or $ F$-value.
FS3
  Repeat FS2 until one of the stopping rules applies:
  - The order $ p$ of the model has reached a predetermined $ p^*$.
  - The $ F$-value is smaller than a predetermined value $ F_{\textrm{in}}$.
  - $ X_k$ does not significantly improve the model fit.
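
In XploRe, forward selection is provided by the quantlet linregfs. A minimal call on our simulated data might look as follows; we assume here that the optional argument alpha is the significance level used in the stopping rule FS3:
  {xs,bs,pvalue}=linregfs(x,y,0.05)
                        ; forward selection at level 0.05
  xs                    ; the selected regressors
  bs                    ; their estimated coefficients
  pvalue                ; the corresponding p-values
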
A similar idea leads to backward elimination. We start with the linear regression for the full model and stepwise eliminate variables without influence. The algorithm is as follows (again, a minimal call is sketched after the list):
BE1
  Calculate the linear regression for the full model.
BE2
  Eliminate the variable $ X_k$ with one of the following (equivalent) properties:
  - $ X_k$ has the smallest sample partial correlation among all remaining variables.
  - Removing $ X_k$ causes the smallest change of $ R^2$.
  - Of the remaining variables, $ X_k$ has the smallest $ t$- or $ F$-value.
BE3
  Repeat BE2 until one of the following stopping rules applies:
  - The order $ p$ of the model has reached a predetermined $ p^*$.
  - The $ F$-value is larger than a predetermined value $ F_{\textrm{out}}$.
  - Removing $ X_k$ does not significantly change the model fit.
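
Backward elimination is implemented in the quantlet linregbs. It additionally expects the column names of x; a minimal sketch on our data (the names are built with string, exactly as in the stepwise example below):
  colname=string("X%.f",1:cols(x))
                        ; sets the column names to X1,...,X5
  {b,bse,bstan,bpval}=linregbs(x,y,colname)
                        ; backward elimination from the full model
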
A kind of compromise between forward selection and backward elimination is given by the stepwise selection method. Beginning with one variable, just as in forward selection, we choose at each step one of four alternatives:
  1. Add a variable.
  2. Remove a variable.
  3. Exchange two variables.
  4. Stop the selection.
This is done with the following rules:
ST1
  Add the variable $ X_k$ if one of the forward selection criteria of FS2 is satisfied.
ST2
  Remove the variable $ X_k$ with the smallest $ F$-value if there are variables (possibly more than one) with an $ F$-value smaller than $ F_{\textrm{out}}$.
ST3
  Remove the variable $ X_k$ with the smallest $ F$-value if this removal results in a larger $ R^2$-value than was previously obtained with the same number of variables.
ST4
  Exchange a variable $ X_k$ in the model with a variable $ X_\ell$ not in the model if this increases the $ R^2$-value.
ST5
  Stop the selection if none of ST1-ST4 applies.
Remarks: In XploRe we find the quantlets linregfs and linregfs2 for forward selection, linregbs for backward elimination, and linregstep for stepwise selection. Whereas linregfs only returns the selected regressors $ X_i$, their estimated coefficients $ \beta_i$ and the $ p$-values, the other three quantlets report each step as well as the ANOVA and parameter tables. Because both the syntax and the output format of these three quantlets are the same, we will only illustrate one of them with an example.
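
Of the two forward selection quantlets, only linregfs was sketched above. Its companion linregfs2 follows the common syntax of linregbs and linregstep; a minimal sketch, with colname constructed as below:
  {b,bse,bstan,bpval}=linregfs2(x,y,colname)
                        ; forward selection with step-by-step output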

We use the data set generated above from the model

$\displaystyle Y\,=\,2+2X_1-10X_3+0.5X_4+\varepsilon$

to demonstrate the usage of stepwise selection. Before computing the regression, we need to store the names of the variables in a column vector:
  colname=string("X%.f",1:cols(x))  
                       ; sets the column names to X1,...,X5
  {beta,se,betastan,p} = linregstep(x,y,colname) 
                       ; computes the stepwise selection
linregstep returns the same values as linreg. It shows the following output:
  Contents of EnterandOut

  Stepwise Regression
  -------------------
  F-to-enter 5.19
  probability of F-to-enter 0.96
  F-to-remove 3.26
  probability of F-to-remove 0.90

  Variables entered and dropped in the following Steps:

  Step  Multiple R      R^2        F        SigF       Variable(s)
   1     0.9843       0.9688    248.658    0.000  In : X3
   2     0.9992       0.9984   2121.111    0.000  In : X1
   3     0.9999       0.9998  10572.426    0.000  In : X4
  A  N  O  V  A       SS      df     MSS       F-test   P-value
  _____________________________________________________________
  Regression        87.239     3    29.080   10572.426   0.0000
  Residuals          0.017     6     0.003
  Total Variation   87.255     9     9.695

  Multiple R      = 0.99991
  R^2             = 0.99981
  Adjusted R^2    = 0.99972
  Standard Error  = 0.05245

  Contents of Summary

  Variables in the Equation for Y:

  PARAMETERS    Beta    SE    StandB    t-test P-value Variable
  _____________________________________________________________
  b[ 0,]=     2.0796  0.0742  0.0000   28.0417 0.0000  Constant
  b[ 1,]=     1.9752  0.0630  0.1883   31.3494 0.0000  X1
  b[ 2,]=   -10.0622  0.0690 -0.9177 -145.7845 0.0000  X3
  b[ 3,]=     0.4257  0.0626  0.0413    6.8014 0.0005  X4
First, the quantlet linregstep reports the $ F_{\textrm{in}}$ value as F-to-enter and the $ F_{\textrm{out}}$ value as F-to-remove. Then each step is reported, and we obtain again the ANOVA and parameter tables described in the previous section.

As expected, linregstep selects the variables $ X_1$, $ X_3$ and $ X_4$ and estimates the model as

$\displaystyle \widehat Y(x)= 2.0796+ 1.9752x_1- 10.0622x_3+ 0.4257x_4\,.$

Recall the results of the ordinary regression above. We can see that the accuracy of the estimated parameters has been improved by the selection method (especially for $ \widehat\beta_4$: 0.4257 with standard error 0.0626, versus 0.3991 with standard error 0.1203 in the full model; the true value is 0.5). In addition, we learn which variables can be ignored because the model does not depend on them.