2.10 Dummy Variables
Up to now, we have carried out the study of the MLRM on the basis of a set of variables (regressors and the endogenous variable) that are quantitative, i.e. which take real continuous values. However, the MLRM can be applied in a wider framework which allows us to include as regressors non-quantitative factors such as time effects, space effects, qualitative variables or grouped quantitative variables. In order to include these factors in an MLRM, the so-called dummy variables are defined. These variables are included in the matrix of regressors, and they can be thought of as artificial variables which represent the non-quantitative factors. To understand what dummy variables mean, we will consider some common situations which require including this class of factors and, therefore, the use of dummy variables.
- Sometimes the researcher may suspect that a given behavioural relation varies from one period to another. For example, it can be expected that the consumption function in war time takes a lower value than in peace time. Analogously, if we deal with quarterly or monthly data, we may expect some variation between seasons. To analyse how war time, or a given season, affects the corresponding endogenous variable, these are considered as time effects which cannot be included in the model by means of quantitative variables, but require the use of dummy variables.
- In some cases, researchers may expect changes in an economic function between different geographic areas, due to the different economic structures of these places. These non-quantitative space effects are also represented by means of dummy variables.
- Some qualitative variables such as sex, marital status, race,
etc. can be considered as important factors to explain economic
behaviour, so these should be included in the systematic component
of the MLRM. These qualitative variables are represented by means
of dummy variables.
- In some frameworks we study certain behavioural relations on the basis of microeconomic data. In some of these cases it can be useful to group the data into intervals. For this, we also make use of dummy variables.
To sum up, we can define dummy variables as variables constructed by researchers in order to include non-quantitative factors in a regression model. These factors can distinguish two or more categories, in such a way that each dummy variable takes the value one for the category under consideration and zero for the remaining categories. In other words, a sample can be divided into two or more partitions in which some or all of the coefficients may differ.
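As a simple illustration, the following sketch (written in Python with hypothetical data; the computations in this chapter are actually carried out with XploRe quantlets) shows how such a variable is coded:

```python
import numpy as np

# Hypothetical sample: marital status of five individuals (two categories).
category = np.array(["married", "single", "married", "single", "married"])

# Dummy variable: one for the category under consideration, zero for the rest.
D = (category == "married").astype(float)
print(D)  # [1. 0. 1. 0. 1.]
```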
2.10.1 Models with Changes in the Intercept
Suppose the factors reflected by means of dummy variables affect
only the intercept of the relation. In this case, these dummy
variables are included in "additive" form, that is to say, as
another regressor together with its corresponding coefficient.
Some examples of this situation include the following:
Example 1 (time change in the intercept). We want to explain the consumption behaviour of a country during a time period which includes some war years. It is thought that the autonomous consumption in war periods is lower than that in peace periods. Additionally, we assume that the propensity to consume is the same. In other words, what we want to reflect is a possible instability of the intercept during the sample period.
Example 2 (qualitative variables as explanatory factors of the model). Suppose we want to analyze the behaviour of the wages of a set of workers, and it is suspected that, independently of other quantitative factors, the variable sex can be considered as another explanatory factor.
In the above examples it is possible to specify and estimate a different model for every category. That is to say, in Example 1 we could specify one model for the war period and another for the peace period. Analogously, in Example 2 we could specify one model for men and another for women. Alternatively, it is possible to specify only one model which includes dummy variables, in such a way that the behaviour of the endogenous variable is differentiated for each category. In this case we use all the sample information, so the estimation of the parameters is more efficient.
When a regression model is specified, the inclusion of the non-quantitative factor is carried out by defining dummy variables. If we suppose that the non-quantitative factor has only two categories, we can represent them by means of only one dummy variable:

$$ D_i = \begin{cases} 1 & \text{if observation } i \text{ belongs to the first category} \\ 0 & \text{if observation } i \text{ belongs to the second category} \end{cases} $$

where the assignment of zero or one to each category is usually arbitrary.
In general, if we consider a model with $k-1$ quantitative explanatory variables and a non-quantitative factor which distinguishes two categories, the corresponding specification could be:

$$ y_i = \beta_1 + \alpha D_i + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + u_i \qquad (2.219) $$
The inclusion of the term $\alpha D_i$ allows us to differentiate the intercept term in the two categories, or to consider a qualitative factor as a relevant variable. To see this, note that if all the usual regression assumptions hold for (2.219), then:

$$ E(y_i \mid D_i = 1) = (\beta_1 + \alpha) + \beta_2 x_{2i} + \dots + \beta_k x_{ki} $$
$$ E(y_i \mid D_i = 0) = \beta_1 + \beta_2 x_{2i} + \dots + \beta_k x_{ki} \qquad (2.220) $$

Thus, for the first category the intercept is $\beta_1 + \alpha$ rather than $\beta_1$, and the parameter $\alpha$ represents the difference between the intercept values in the two categories. In the framework of Example 2, result (2.220) allows us to see that the coefficient of the dummy variable represents the impact on the expected value of $y$ of an observation belonging to the included group (first category) rather than to the excluded group (second category), holding all the other regressors constant.
Once we have estimated the parameter vector of the model by OLS or ML, we can test several hypotheses about the significance of these parameters. Specifically, the most interesting tests are the following:
- Testing the statistical significance of the intercept corresponding to the first category: $H_0: \beta_1 + \alpha = 0$. To solve this test, we can use the general F-statistic.
- Testing the statistical significance of the intercept corresponding to the second category: $H_0: \beta_1 = 0$.
- Testing whether there is a statistically significant difference between the intercepts of the two categories: $H_0: \alpha = 0$. This can also be thought of as testing the stability of the intercept or, alternatively, as testing the statistical significance of the qualitative variable under consideration. We must note that accepting this null hypothesis in Example 1 implies concluding the stability of the intercept.
The last two tests can be solved by using a t-ratio statistic.
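To make these tests concrete, here is a minimal sketch (Python with simulated data; the coefficient values, sample size and variable names are assumptions, not the book's example) that estimates (2.219) and carries out the tests above:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: one quantitative regressor and a two-category factor.
rng = np.random.default_rng(0)
n = 100
x = rng.normal(10, 2, n)                  # quantitative regressor x_2
D = (rng.random(n) < 0.5).astype(float)   # dummy: 1 = first category
y = 2.0 + 1.5 * D + 0.8 * x + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([D, x]))  # columns: const, D, x
fit = sm.OLS(y, X).fit()

print(fit.tvalues)                    # t-ratios for beta_1, alpha, beta_2
print(fit.f_test("const + x1 = 0"))   # H0: beta_1 + alpha = 0 (first-category intercept)
```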
Alternatively, in the previous examples we could specify the model by defining two dummy variables as follows:

$$ D_{1i} = \begin{cases} 1 & \text{if } i \text{ belongs to the first category} \\ 0 & \text{otherwise} \end{cases} \qquad D_{2i} = \begin{cases} 1 & \text{if } i \text{ belongs to the second category} \\ 0 & \text{otherwise} \end{cases} $$

In this case, the specified model would be:

$$ y_i = \beta_1 + \alpha_1 D_{1i} + \alpha_2 D_{2i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + u_i \qquad (2.221) $$

and analogously to (2.220) we have:

$$ E(y_i \mid D_{1i} = 1, D_{2i} = 0) = (\beta_1 + \alpha_1) + \beta_2 x_{2i} + \dots + \beta_k x_{ki} $$
$$ E(y_i \mid D_{1i} = 0, D_{2i} = 1) = (\beta_1 + \alpha_2) + \beta_2 x_{2i} + \dots + \beta_k x_{ki} \qquad (2.222) $$

which allows us to differentiate the effect of the intercept in the two categories, so we could think that (2.221) is also a correct specification.
However, model (2.221) leads us to the so-called "dummy variable trap". This term means that, if the model to estimate is (2.221), then the matrix $X$ will be:

$$ X = \begin{pmatrix} 1 & x_{21} & \cdots & x_{k1} & D_{11} & D_{21} \\ 1 & x_{22} & \cdots & x_{k2} & D_{12} & D_{22} \\ \vdots & \vdots & & \vdots & \vdots & \vdots \\ 1 & x_{2n} & \cdots & x_{kn} & D_{1n} & D_{2n} \end{pmatrix} \qquad (2.223) $$

where the last two columns reflect the sample observations of the dummy variables. Given this $X$ matrix, it follows that:

$$ D_{1i} + D_{2i} = 1 \qquad \forall i = 1, \dots, n $$

That is to say, the observations of the first column of $X$ are an exact linear combination of those of the last two columns. This means that model (2.221) does not satisfy the full rank property; in other words, there is perfect multicollinearity. Given that the rank of $X$ is less than the number of its columns, the matrix $X^{\top}X$ is singular, so its inverse does not exist and $\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y$ cannot be obtained. This fact allows us to conclude that, for the examples previously defined, the correct specification is (2.219), which includes only one dummy variable.
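The rank deficiency is easy to verify numerically; a small sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data illustrating the dummy variable trap.
n = 6
x2 = np.arange(1.0, n + 1)                 # a quantitative regressor
D1 = np.array([1., 1., 1., 0., 0., 0.])    # first category
D2 = 1.0 - D1                              # D1 + D2 = 1 = column of ones

X = np.column_stack([np.ones(n), x2, D1, D2])
print(np.linalg.matrix_rank(X))   # 3, not 4: X is not of full column rank
print(np.linalg.det(X.T @ X))     # 0 (up to rounding): X'X is singular
```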
Nevertheless, if the model is specified without an intercept, there is no problem in estimating it (the corresponding $X$ matrix then has no column of ones). Including an intercept is usually advisable in order to ensure that certain properties are satisfied and that the meaning of some measures is maintained. Thus, if we specify a model with an intercept and a non-quantitative factor which has $m$ categories, including $m-1$ dummy variables is enough. Although this is the general rule, we can consider the possibility of not including an intercept (remember that some results will then not be satisfied) and including one dummy variable per category, that is, $m$ dummy variables.
Having solved the question about the number of dummy variables to include, we now generalize the way of proceeding to the two following situations:
- a) To represent a factor which has more than two categories. This is the case of the so-called seasonality models.
- b) To represent several non-quantitative factors which have different numbers of categories.
The first case allows us, among other situations, to use dummy variables to deal with seasonality, that is to say, to allow the regression to have some heterogeneity between seasons (quarters or months). In this case, given that the non-quantitative factor distinguishes four categories for quarters (12 for months), and considering that the model has an intercept, we define three dummy variables (11 for months) as follows:

$$ D_{ji} = \begin{cases} 1 & \text{if observation } i \text{ belongs to quarter } j \\ 0 & \text{otherwise} \end{cases} \qquad j = 1, 2, 3 $$
With respect to the second generalization (b), suppose a model in which we want to consider the sex variable (two categories) and the race variable (three categories). Analogously to the previous case, we have to define one dummy variable for sex and two dummies for race.
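A brief sketch of both generalizations (Python with pandas; the data and category labels are hypothetical, and get_dummies is used as a convenient stand-in for constructing the dummies by hand):

```python
import pandas as pd

# Seasonality: three dummies for four quarters (the model keeps its intercept).
quarters = pd.Series([1, 2, 3, 4] * 5, name="quarter")
seasonal = pd.get_dummies(quarters, prefix="Q", drop_first=True).astype(float)
print(seasonal.head())   # columns Q_2, Q_3, Q_4; quarter 1 is the base category

# Several factors: one dummy for sex (two categories),
# two dummies for race (three categories).
df = pd.DataFrame({"sex": ["M", "F", "F", "M"],
                   "race": ["A", "B", "C", "A"]})
dummies = pd.get_dummies(df, drop_first=True).astype(float)
print(dummies)           # columns sex_M, race_B, race_C
```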
2.10.2 Models with Changes in some Slope Parameters
We now consider the situation where the non-quantitative factors affect one or more coefficients of the explanatory variables (slopes). In this situation, the dummy variables are included in "multiplicative" or "interactive" form, that is to say, multiplied by the corresponding regressor. Some examples of this situation are the following:
Example 3 (time variation of a slope). Taking the situation of Example 1, it is now supposed that the war/peace phenomenon affects only the propensity to consume. In this sense, we try to analyze whether the coefficient of a given explanatory variable is homogeneous during the sample period.
Example 4 (interaction between a qualitative and a quantitative variable). Taking the case of Example 2, we again try to analyze the behaviour of the wages of a set of workers. The wage depends, together with other quantitative factors, on the years of experience, and the latter are considered as not independent of sex; that is to say, there is an "interaction effect" between both variables.
Again, these situations could be specified through two models: one model for the war period and another for the peace period in Example 3, and analogously, a model for men and another for women in Example 4. Alternatively, we can specify one unique model which includes dummy variables, allowing a more efficient estimation of the parameters.
The general regression model which allows an interaction effect between a non-quantitative factor (represented by means of a dummy variable $D_i$) and a given quantitative variable (suppose $x_{2i}$) is written as:

$$ y_i = \beta_1 + \beta_2 x_{2i} + \gamma (D_i x_{2i}) + \beta_3 x_{3i} + \dots + \beta_k x_{ki} + u_i \qquad (2.224) $$

where it is supposed that the non-quantitative factor has only two categories. In order to prove that model (2.224) adequately reflects the situations of the previous examples, we calculate:

$$ E(y_i \mid D_i = 1) = \beta_1 + (\beta_2 + \gamma) x_{2i} + \beta_3 x_{3i} + \dots + \beta_k x_{ki} $$
$$ E(y_i \mid D_i = 0) = \beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i} + \dots + \beta_k x_{ki} \qquad (2.225) $$
We suppose that in Example 3 the income variable is $x_{2i}$, so the propensity to consume is given by its corresponding coefficient. For the first category this slope or propensity is $\beta_2 + \gamma$ rather than $\beta_2$, and the parameter $\gamma$ reflects the difference between the slope values in the two categories (war/peace). For Example 4, if we consider that $x_{2i}$ is the variable years of experience, it is obvious that specification (2.224) adequately reflects the interaction effect mentioned earlier.
Again, once the model has been estimated, the most interesting hypothesis tests we can carry out are the following:
- Testing the statistical significance of the coefficient of $x_{2i}$ in the first category: $H_0: \beta_2 + \gamma = 0$. We use an F-statistic to test this hypothesis.
- Testing the statistical significance of the coefficient of $x_{2i}$ in the second category: $H_0: \beta_2 = 0$.
- Testing whether there is a significant difference between the coefficients of $x_{2i}$ for the two categories, in other words, testing the statistical significance of the interaction effect defined earlier: $H_0: \gamma = 0$.
Similarly to the previous subsection, these last two hypotheses can be tested by using a t-ratio statistic.
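A sketch of (2.224) and these tests (Python, simulated data; the numerical values are assumptions):

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: the dummy enters multiplied by the regressor whose
# slope may differ between the two categories.
rng = np.random.default_rng(1)
n = 120
x2 = rng.normal(12, 3, n)                 # e.g. years of experience
D = (rng.random(n) < 0.5).astype(float)   # 1 = first category
y = 3.0 + 0.5 * x2 + 0.4 * D * x2 + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([x2, D * x2]))  # const, x2, D*x2
fit = sm.OLS(y, X).fit()

print(fit.tvalues[2])               # t-ratio for gamma (H0: gamma = 0)
print(fit.f_test("x1 + x2 = 0"))    # H0: beta_2 + gamma = 0 (first-category slope)
```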
2.10.3 Models with Changes in all the Coefficients
In this case, we analyze jointly the situations which have been considered separately in the previous subsections. Thus, we assume that the non-quantitative factor represented by dummy variables affects both the intercept and all the coefficients of the regressors. Obviously, this is the most general situation, and it contains as a particular case the influence of a non-quantitative factor on a subset of the coefficients.
In order to reflect the effect of the non-quantitative factor on
the intercept, the dummy variables should be included in additive
form, while they should be included in multiplicative form to
reflect the effect on the slopes.
Thus, the general specification, under the assumption of having a non-quantitative factor which distinguishes only two categories, is given by:

$$ y_i = \beta_1 + \alpha D_i + \sum_{j=2}^{k} \beta_j x_{ji} + \sum_{j=2}^{k} \gamma_j (D_i x_{ji}) + u_i \qquad (2.226) $$
From (2.226) we obtain the expected value of the endogenous variable conditioned on each of the categories considered, with the aim of verifying that the specification adequately reflects our objective:

$$ E(y_i \mid D_i = 1) = (\beta_1 + \alpha) + \sum_{j=2}^{k} (\beta_j + \gamma_j) x_{ji} $$
$$ E(y_i \mid D_i = 0) = \beta_1 + \sum_{j=2}^{k} \beta_j x_{ji} \qquad (2.227) $$
According to (2.227), $\beta_1 + \alpha$ and $\beta_1$ denote, respectively, the constant terms for the first and second categories, while $\beta_j + \gamma_j$ and $\beta_j$ ($j = 2, \dots, k$) denote the slopes of $x_{ji}$ for the first and second categories. By contrast to the previous cases, when the non-quantitative factor affects all the coefficients of the model, there is no gain in efficiency from specifying (2.226) compared to considering one model for each category. This is due, as Goldberger (1964) argues, to the fact that the normal equations of the estimation of (2.226) are constituted by two blocks of independent equations: those corresponding to the observations of the first category, and those of the second category. However, the t and F-statistics and the $R^2$ of the two ways of dealing with the information are not equivalent.
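Goldberger's point can be checked numerically: with dummies on the intercept and on every slope, the pooled OLS fit reproduces the two per-category fits exactly. A sketch (Python, simulated data; names and values are assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 60
x = rng.normal(0, 1, n)
D = (rng.random(n) < 0.5).astype(float)
y = 1 + 2 * D + (1 + 0.5 * D) * x + rng.normal(0, 1, n)

# Pooled model (2.226): const, D, x, D*x
pooled = sm.OLS(y, sm.add_constant(np.column_stack([D, x, D * x]))).fit()
cat1 = sm.OLS(y[D == 1], sm.add_constant(x[D == 1])).fit()  # first category
cat2 = sm.OLS(y[D == 0], sm.add_constant(x[D == 0])).fit()  # second category

b1, a, b2, g = pooled.params
print(np.allclose([b1 + a, b2 + g], cat1.params))  # True
print(np.allclose([b1, b2], cat2.params))          # True
```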
Having estimated (2.226), the most interesting hypothesis tests we can carry out are the following:
- Testing the significance of the slope of $x_{ji}$ ($j = 2, 3, \dots, k$) corresponding to the first category: $H_0: \beta_j + \gamma_j = 0$. We use the general F-statistic to solve this test.
- Testing the significance of the slope of $x_{ji}$ ($j = 2, 3, \dots, k$) corresponding to the second category: $H_0: \beta_j = 0$.
- Testing whether there is a significant difference between the coefficients of $x_{ji}$ in the two categories: $H_0: \gamma_j = 0$.
- Testing whether there is a significant difference between the intercepts of both categories: $H_0: \alpha = 0$. To solve these last two tests, we will use the t-ratio statistic.
- Testing the joint hypothesis that the endogenous variable behaves identically in the two categories: $H_0: \alpha = \gamma_2 = \gamma_3 = \dots = \gamma_k = 0$. This test is solved by using a general F-statistic. It can also be regarded as a test of stability, given that accepting $H_0$ leads us to conclude that all the coefficients of the model are invariant during the sample period.
Thus, dummy variables provide a method of verifying one of the
classical assumptions of the MLRM: that referring to the stability
of the coefficients. Nevertheless, note that this method requires
knowing (or supposing) the point (observation) of possible change
in the behaviour of the model.
2.10.4 Example
Considering the same example as in previous sections, we now want to illustrate the way of testing for structural change, that is to say, testing the stability of the coefficients. To this end, we suppose that a possible change in consumption behaviour may have occurred in 1986, since it is thought that the incorporation of Spain into the European Union may have affected the coefficients. We are interested in testing the following cases: a) the change affects the autonomous consumption; b) the change affects the slope of the export variable; and c) the change affects the whole model. To carry out these tests, we begin by defining the dummy variable which allows us to differentiate the periods before and after 1986, that is to say:

$$ D_t = \begin{cases} 0 & \text{if } t < 1986 \\ 1 & \text{if } t \geq 1986 \end{cases} $$
Thus, we have to estimate the following three regression models:
- a. $y_t = \beta_1 + \alpha D_t + \beta_2 x_t + u_t$
- b. $y_t = \beta_1 + \beta_2 x_t + \gamma (D_t x_t) + u_t$
- c. $y_t = \beta_1 + \alpha D_t + \beta_2 x_t + \gamma (D_t x_t) + u_t$
where $y_t$ denotes consumption and $x_t$ denotes exports.
In regression a. we test structural change in the intercept, that is to say, $H_0: \alpha = 0$. The second regression allows us to test structural change affecting the coefficient of the export variable by means of the analogous hypothesis $H_0: \gamma = 0$. Both hypotheses are tested using the t-ratio statistic. The third regression allows us to test structural change affecting the whole model, that is to say, $H_0: \alpha = \gamma = 0$, by means of the general F statistic.
The quantlet XEGmlrm08.xpl allows us to estimate these regression models and test the corresponding hypotheses. Furthermore, the corresponding output includes the results of the four regressions. The first regression is used for calculating the restricted residual sum of squares, given that imposing the restriction $\alpha = \gamma = 0$ on regression c. yields the first regression of the output, that is to say, $y_t = \beta_1 + \beta_2 x_t + u_t$. Observing the value of the F statistic (5.0059) and its corresponding p-value, we reject the null hypothesis. This conclusion means that there is structural change affecting the whole model. Note that the t-ratios of the dummy variable coefficients in regressions a. and b. lead to rejecting the null hypotheses $H_0: \alpha = 0$ and $H_0: \gamma = 0$; that is to say, neither parameter is stable.
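A hedged Python sketch of the same procedure (the actual figures above come from XEGmlrm08.xpl; the data below are simulated placeholders, not the Spanish consumption and export series):

```python
import numpy as np
import statsmodels.api as sm

# Placeholder annual data, NOT the series used in the text.
rng = np.random.default_rng(3)
years = np.arange(1970, 1998)
D = (years >= 1986).astype(float)        # dummy for the post-1986 period
x = rng.normal(100, 10, years.size)      # stand-in for exports
y = 20 + 5 * D + (0.6 + 0.1 * D) * x + rng.normal(0, 2, years.size)

# Unrestricted regression c. and restricted regression (alpha = gamma = 0).
fit_u = sm.OLS(y, sm.add_constant(np.column_stack([D, x, D * x]))).fit()
fit_r = sm.OLS(y, sm.add_constant(x)).fit()

# General F statistic from restricted and unrestricted residual sums of squares.
q = 2                                     # number of restrictions
F = ((fit_r.ssr - fit_u.ssr) / q) / (fit_u.ssr / fit_u.df_resid)
print(F)   # compare with the critical value of F(2, n-4)
```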