4.2 Multiple Linear Regression
- {b,bse,bstan,bpval} = linreg(x, y {,opt,om})
  estimates the coefficients for a linear problem from data x and y and calculates the ANOVA table
- {xs,bs,pvalue} = linregfs(x, y {,alpha})
  computes the forward selection and estimates the coefficients for a linear problem from data x and y
- {b,bse,bstan,bpval} = linregfs2(x, y, colname {,opt})
  computes the forward selection and estimates the coefficients for a linear problem from data x and y and calculates the ANOVA table
- {b,bse,bstan,bpval} = linregbs(x, y, colname {,opt})
  computes the backward elimination and estimates the coefficients for a linear problem from data x and y and calculates the ANOVA table
- {b,bse,bstan,bpval} = linregstep(x, y, colname {,opt})
  computes the stepwise selection and estimates the coefficients for a linear problem from data x and y and calculates the ANOVA table
In this section, we consider the linear model
$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon.$$
Looking at this model, we are faced with two problems:
- Estimating the parameter vector $\beta = (\beta_0, \beta_1, \dots, \beta_p)^\top$
- Testing the significance of the components $\beta_j$
From the mathematical point of view, the second problem is about
reducing the dimension of the model. From the user's point of view,
it tells us how to build a parsimonious model.
To this end, we have to find a way to handle the
selection or removal of a variable.
We want to use a simulated data set to demonstrate the solution to these two
problems. This example is stored in XLGregr3.xpl, where we generate five
uniformly distributed variables $x_1, \ldots, x_5$. Only three of them
influence $y$:
$$y = 2 + 2\,x_1 - 10\,x_3 + 0.5\,x_4 + \varepsilon.$$
Here, $\varepsilon$ is a normally distributed error term.
randomize(1) ; sets a seed for the random generator
eps=normal(10) ; generates 10 standard normal errors
x1=uniform(10) ; generates 10 uniformly distributed values
x2=uniform(10)
x3=uniform(10)
x4=uniform(10)
x5=uniform(10)
x=x1~x2~x3~x4~x5 ; creates the x data matrix
y=2+2*x1-10*x3+0.5*x4+eps/10 ; creates y
z=x~y ; creates the data matrix z
z ; returns z
This shows
Contents of z
[ 1,] 0.98028 0.35235 0.29969 0.85909 0.62176 1.3936
[ 2,] 0.83795 0.82747 0.13025 0.79595 0.59754 2.7269
[ 3,] 0.15873 0.93534 0.91259 0.72789 0.43156 -6.5193
[ 4,] 0.67269 0.67909 0.28156 0.20918 0.19878 0.69022
[ 5,] 0.50166 0.97112 0.39945 0.57865 0.19337 -0.66278
[ 6,] 0.94527 0.36003 0.77747 0.029797 0.40124 -3.9237
[ 7,] 0.18426 0.29004 0.24534 0.44418 0.35116 0.11605
[ 8,] 0.36232 0.35453 0.53022 0.4497 0.8062 -2.3026
[ 9,] 0.50832 0.00516 0.90669 0.16523 0.75683 -5.9188
[10,] 0.76022 0.17825 0.37929 0.093234 0.17747 -0.20187
Let us start with the first problem and use the quantlet linreg
to estimate the parameters of the model:
{beta,bse,bstan,bpval}=linreg(x,y) ; computes the linear
; regression
produces
A N O V A SS df MSS F-test P-value
______________________________________________________________
Regression 87.241 5 17.448 4700.763 0.0000
Residuals 0.015 4 0.004
Total Variation 87.255 9 9.695
Multiple R = 0.99991
R^2 = 0.99983
Adjusted R^2 = 0.99962
Standard Error = 0.06092
PARAMETERS Beta SE StandB t-test P-value
______________________________________________________________
b[ 0,]= 2.0745 0.0941 0.0000 22.056 0.0000
b[ 1,]= 1.9672 0.0742 0.1875 26.517 0.0000
b[ 2,]= 0.0043 0.0995 0.0005 0.043 0.9677
b[ 3,]= -10.0887 0.0936 -0.9201 -107.759 0.0000
b[ 4,]= 0.3991 0.1203 0.0387 3.318 0.0294
b[ 5,]= 0.0708 0.1355 0.0053 0.523 0.6289
We obtain the ANOVA and parameter tables, which return the same
values as found in the previous section. Substituting the estimated
parameters, we get with our generated data set
$$\widehat{y} = 2.0745 + 1.9672\,x_1 + 0.0043\,x_2 - 10.0887\,x_3 + 0.3991\,x_4 + 0.0708\,x_5.$$
We know that $x_2$ and $x_5$ do not have any influence on $y$. This is
reflected by the fact that the estimates of $\beta_2$ and $\beta_5$ are close
to zero.
Now we reach the point where we are faced with our second problem: how to
eliminate these variables. We can get a first impression by considering the
parameter estimates and their $t$-values in the parameter
table. A $t$-value is small if there is no influence of the corresponding
variable. This is reflected in the $p$-value, which is the significance level
for testing the hypothesis that a parameter equals zero. From the above table,
we can see that only the $p$-values for the constant, $x_1$, $x_3$ and $x_4$
are smaller than 0.05, the typical significance level for hypothesis testing.
The $p$-values of $x_2$ and $x_5$ are much larger than 0.05, which means
that these parameters are not significantly different from zero.
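This can be checked directly on the vector bpval returned by linreg above;
the first entry belongs to the constant. The following line is a minimal
sketch, assuming that XploRe's comparison operators work elementwise:
bpval>0.05 ; equals 1 for parameters that are not significantly
           ; different from zero at the 5% level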
The above way of choosing variables is convenient, but we want to know if the elimination
or selection of a variable improves the result or not. This leads immediately to
the stepwise model selection methods.
Let us first consider forward selection. The idea is to start from one
``good'' variable $x_j$ and calculate the simple linear regression for
$$y = \beta_0 + \beta_j x_j + \varepsilon.$$
Then we decide stepwise for each of the remaining variables whether
its inclusion in the model improves the fit of the model.
The algorithm is as follows (a sketch of the corresponding quantlet call
is given after the list):
- FS1
- Choose the variable $x_j$ with the highest $F$- or $t$-value and calculate the simple linear regression.
- FS2
- Of the remaining variables, add the variable $x_k$ which fulfills one of the three (equivalent) criteria below:
- $x_k$ has the highest sample partial correlation.
- The model with $x_k$ increases the $R^2$-value the most.
- $x_k$ has the highest $F$- or $t$-value.
- FS3
- Repeat FS2 until one of the stopping rules applies:
- The order $p$ of the model has reached a predetermined order $p^*$.
- The $F$-value is smaller than a predetermined value $F_{\mathrm{in}}$ (the F-to-enter).
- Adding $x_k$ does not significantly improve the model fit.
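For illustration, a forward selection run on the simulated data may look as
follows. This is a minimal sketch based on the syntax summary above; treating
the optional third argument of linregfs as the significance level alpha for
entering a variable is an assumption:
{xs,bs,pvalue}=linregfs(x,y,0.05) ; forward selection; 0.05 is assumed
                                  ; to be the entry significance level
xs     ; returns the selected regressors
bs     ; returns the estimated coefficients
pvalue ; returns the corresponding p-values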
A similar idea leads to backward elimination. We start with the
linear regression for the full model and stepwise eliminate the variables
without influence. The algorithm is as follows (a quantlet sketch follows
the list):
- BE1
- Calculate the linear regression for the full model.
- BE2
- Eliminate the variable $x_k$ with one of the following (equivalent) properties:
- $x_k$ has the smallest sample partial correlation among all remaining variables.
- The removal of $x_k$ causes the smallest change of $R^2$.
- Of the remaining variables, $x_k$ has the smallest $F$- or $t$-value.
- BE3
- Repeat BE2 until one of the following stopping rules is valid:
- The order $p$ of the model has reached a predetermined order $p^*$.
- The $F$-value is larger than a predetermined value $F_{\mathrm{out}}$ (the F-to-remove).
- Removing $x_k$ does not significantly change the model fit.
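Backward elimination on the simulated data can be sketched with linregbs,
which shares its syntax and output format with linregstep:
colname=string("X%.f",1:cols(x)) ; sets the column names to X1,...,X5
{b,bse,bstan,bpval}=linregbs(x,y,colname) ; computes the backward elimination
Each elimination step is then reported together with the ANOVA and the
parameter tables.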
A kind of compromise between forward selection and backward elimination
is given by the stepwise selection method. Beginning with one variable,
just as in forward selection, we have to choose one of the four
alternatives:
- Add a variable.
- Remove a variable.
- Exchange two variables.
- Stop the selection.
This can be done with the following rules:
- ST1
- Add the variable $x_k$ if one of the forward selection criteria FS2 is satisfied.
- ST2
- Remove the variable $x_k$ with the smallest $F$-value if there are variables (possibly more than one) with an $F$-value smaller than $F_{\mathrm{out}}$.
- ST3
- Remove the variable $x_k$ with the smallest $F$-value if this removal results in a larger $R^2$-value than was obtained with the same number of variables before.
- ST4
- Exchange the variables $x_k$ in the model and $x_l$ not in the model if this increases the $R^2$-value.
- ST5
- Stop the selection if neither ST1, ST2, ST3 nor ST4 is satisfied.
Remarks:
- The rules ST1, ST2 and ST3 only make sense if
there are two or more variables in the model. That is why they are
applied only in this case.
- Considering ST3, we see the possibility that the same variable can
be added and removed in several steps of the procedure.
In XploRe we find the quantlets linregfs and linregfs2 for
forward selection, linregbs for backward elimination, and
linregstep for stepwise selection. Whereas linregfs only
returns the selected regressors xs, the regression coefficients bs and the
$p$-values, the other three quantlets report each step, the ANOVA and
the parameter tables. Because both the syntax and the output formats of these
three quantlets are the same, we will only illustrate one of them with an example.
We use the data set generated above from the model
$$y = 2 + 2\,x_1 - 10\,x_3 + 0.5\,x_4 + \varepsilon$$
to demonstrate the usage of stepwise selection. Before computing the regression,
we need to store the names of the variables in a column vector:
colname=string("X%.f",1:cols(x))
; sets the column names to X1,...,X5
{beta,se,betastan,p} = linregstep(x,y,colname)
; computes the stepwise selection
linregstep returns the same values as linreg.
It shows the following output:
Contents of EnterandOut
Stepwise Regression
-------------------
F-to-enter 5.19
probability of F-to-enter 0.96
F-to-remove 3.26
probability of F-to-remove 0.90
Variables entered and dropped in the following Steps:
Step Multiple R R^2 F SigF Variable(s)
1 0.9843 0.9688 248.658 0.000 In : X3
2 0.9992 0.9984 2121.111 0.000 In : X1
3 0.9999 0.9998 10572.426 0.000 In : X4
A N O V A SS df MSS F-test P-value
_____________________________________________________________
Regression 87.239 3 29.080 10572.426 0.0000
Residuals 0.017 6 0.003
Total Variation 87.255 9 9.695
Multiple R = 0.99991
R^2 = 0.99981
Adjusted R^2 = 0.99972
Standard Error = 0.05245
Contents of Summary
Variables in the Equation for Y:
PARAMETERS Beta SE StandB t-test P-value Variable
_____________________________________________________________
b[ 0,]= 2.0796 0.0742 0.0000 28.0417 0.0000 Constant
b[ 1,]= 1.9752 0.0630 0.1883 31.3494 0.0000 X1
b[ 2,]= -10.0622 0.0690 -0.9177 -145.7845 0.0000 X3
b[ 3,]= 0.4257 0.0626 0.0413 6.8014 0.0005 X4
First, the quantlet linregstep returns the value $F_{\mathrm{in}}$ as
F-to-enter and $F_{\mathrm{out}}$ as F-to-remove. Then each step is reported,
and we obtain again the ANOVA and parameter tables described in the previous
section.
As expected, linregstep selects the variables $x_1$, $x_3$ and $x_4$
and estimates the model as
$$\widehat{y} = 2.0796 + 1.9752\,x_1 - 10.0622\,x_3 + 0.4257\,x_4.$$
Recall the results of the previous ordinary regression.
We can see that the accuracy of the estimated parameters has
been improved by the selection method (especially for $\beta_4$).
In addition, we obtain information about which variables can
be ignored because the model does not depend on them.
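As a final plausibility check, the fitted values of the selected model can be
recomputed from the returned coefficients. This minimal sketch only uses
operations that already appear in XLGregr3.xpl; beta[1] is the intercept, and
beta[2], beta[3], beta[4] are the coefficients of X1, X3 and X4:
yhat=beta[1]+beta[2]*x1+beta[3]*x3+beta[4]*x4 ; fitted values of the reduced model
y-yhat ; residuals, which should be of the order of eps/10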