2.10 Dummy Variables

Up to now, we have studied the MLRM on the basis of a set of variables (regressors and the endogenous variable) that are quantitative, i.e. which take real, continuous values. However, the MLRM can be applied in a wider framework which allows us to include as regressors non-quantitative factors such as time effects, space effects, qualitative variables or grouped quantitative variables. In order to include these factors in an MLRM, the so-called dummy variables are defined. These variables are included in the $ X$ matrix of regressors, and they can be thought of as artificial variables whose aim is to represent the non-quantitative factors. To understand what dummy variables mean, we will consider some common situations which require including this class of factors, in such a way that it becomes necessary to use dummy variables. To sum up, we can define dummy variables as variables constructed by the researcher in order to include non-quantitative factors in a regression model. These factors distinguish two or more categories, in such a way that each dummy variable takes the value one for the category we consider, and zero for the rest of the categories. In other words, the sample can be divided into two or more partitions in which some or all of the coefficients may differ.


2.10.1 Models with Changes in the Intercept

Suppose the factors reflected by means of dummy variables affect only the intercept of the relation. In this case, these dummy variables are included in "additive" form, that is to say, as another regressor together with its corresponding coefficient. Some examples of this situation include the following:

2.10.1.0.1 Example 1.

Time change in the intercept. We want to explain the consumption behaviour in a country during a time period which includes some war years. It is thought that the autonomous consumption in war periods is lower than that in peace periods. Additionally, we assume that the propensity to consume is the same. In other words, what we want to reflect is a possible non-stability of the intercept during the sample period.

2.10.1.0.2 Example 2.

Qualitative variables as explanatory factors of the model. Suppose we want to analyze the behaviour of the wages of a set of workers, and it is suspected that, independently of other quantitative factors, the variable sex can be considered as another explanatory factor.

In the above examples it is possible to specify and estimate a different model for every category. That is to say, in Example 1 we could specify one model for the war period, and another for the peace period. Analogously, in Example 2 we could specify one model for men and another for women. Alternatively, it is possible to specify one single model, by including dummy variables, in such a way that the behaviour of the endogenous variable for each category is differentiated. In this case we use all the sample information, so that the estimation of the parameters is more efficient.

When a regression model is specified, the inclusion of the non-quantitative factor is carried out by defining dummy variables. If we suppose that the non-quantitative factor has only two categories, we can represent them by means of a single dummy variable:

\begin{displaymath}
d_{i}=
\begin{cases}
1 & \forall i \in \text{category 1} \\
0 & \forall i \in \text{category 2}
\end{cases}\end{displaymath}

with $ (i=1,2,\ldots,n)$, where the assignment of zero or one to each category is usually arbitrary.

In general, if we consider a model with $ k-1$ quantitative explanatory variables and a non-quantitative factor which distinguishes two categories, the corresponding specification could be:

\begin{displaymath}\begin{array}{cc} y_{i}=\beta_{1}+\beta_{2}x_{2i}+\ldots+\beta_{k}x_{ki}+\beta_{k+1}d_{i}+u_{i} & (i=1,2,\ldots,n) \end{array}\end{displaymath} (2.219)

The inclusion of the term $ \beta_{k+1}d_{i}$ allows us to differentiate the intercept term in the two categories or to consider a qualitative factor as a relevant variable. To see this, note that if all the usual regression assumptions hold for (2.219) then:

\begin{displaymath}\begin{array}{c} \textrm{E}(y_{i}\vert d_{i}=1)=(\beta_{1}+\beta_{k+1})+\beta_{2}x_{2i}+\ldots+\beta_{k}x_{ki} \\ \textrm{E}(y_{i}\vert d_{i}=0)=\beta_{1}+\beta_{2}x_{2i}+\ldots+\beta_{k}x_{ki} \end{array}\end{displaymath} (2.220)

Thus, for the first category the intercept is $ (\beta_{1}+\beta_{k+1})$ rather than $ \beta_{1}$, and the parameter $ \beta_{k+1}$ represents the difference between the intercept values in the two categories. In the framework of Example 2, result (2.220) allows us to see that the coefficient of the dummy variable represents the impact on the expected value of $ y$ of an observation being in the included group (first category) rather than the excluded group (second category), holding all the other regressors constant.
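Result (2.220) can be checked numerically. The following is a minimal sketch in Python with NumPy (not the XploRe quantlets used elsewhere in this book), with simulated, hypothetical data in the spirit of Example 2: a wage-type equation with one quantitative regressor and a sex dummy. OLS on specification (2.219) recovers the intercept difference $ \beta_{k+1}$:

```python
import numpy as np

# Hypothetical data: one quantitative regressor (years of schooling,
# say) plus a 0/1 dummy, as in Example 2. All values are simulated.
rng = np.random.default_rng(0)
n = 200
school = rng.uniform(8, 18, n)           # quantitative regressor x_2
d = (rng.random(n) < 0.5).astype(float)  # dummy: 1 = first category
# True model: the intercept shifts by 2.0 for the first category
y = 5.0 + 0.8 * school + 2.0 * d + rng.normal(0, 0.5, n)

# OLS on (2.219): columns of X are [1, x_2, d]
X = np.column_stack([np.ones(n), school, d])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# beta_hat[2] estimates beta_{k+1}: the intercept difference between
# the two categories, holding the other regressors constant.
print(beta_hat)
```

The estimate of the dummy coefficient should be close to the true intercept shift (2.0 here), while the slope is common to both categories, exactly as (2.220) describes.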

Once we have estimated the parameter vector of the model by OLS or ML, we can test several hypotheses about the significance of such parameters. Specifically, the most interesting tests concern the individual significance of the intercept $ \beta_{1}$ and of the dummy coefficient $ \beta_{k+1}$, the latter amounting to testing whether the two categories actually differ. These tests can be solved by using a t-ratio statistic.

Alternatively, in the previous examples we could specify the model by defining two dummy variables as follows:

$\displaystyle d_{1i}=\begin{cases}
1 & \forall i \in \text{category 1} \\
0 & \forall i \in \text{category 2}
\end{cases}$

$\displaystyle d_{2i}=\begin{cases}
1 & \forall i \in \text{category 2} \\
0 & \forall i \in \text{category 1}
\end{cases}$

In this case, the specified model should be:

$\displaystyle y_{i}=\beta_{1}+\beta_{2}x_{2i}+\ldots+\beta_{k}x_{ki}+\beta_{k+1}d_{1i}+\beta_{k+2}d_{2i}+u_{i} \quad (i=1,2,\ldots,n)$ (2.221)

and analogously to (2.220) we have:

\begin{displaymath}\begin{array}{c} \textrm{E}(y_{i}\vert d_{1i}=1,d_{2i}=0)=(\beta_{1}+\beta_{k+1})+\beta_{2}x_{2i}+\ldots+\beta_{k}x_{ki} \\ \textrm{E}(y_{i}\vert d_{1i}=0,d_{2i}=1)=(\beta_{1}+\beta_{k+2})+\beta_{2}x_{2i}+\ldots+\beta_{k}x_{ki} \end{array}\end{displaymath} (2.222)

which allows us to differentiate the effect of the intercept in the two categories, so we could think that (2.221) is also a correct specification.

However, model (2.221) leads us to the so-called "dummy variable trap". This term means that, if the model to estimate is (2.221), then matrix $ X$ will be:

$\displaystyle X= \begin{pmatrix}
1 & x_{21} & \ldots & x_{k1} & d_{11} & d_{21} \\
1 & x_{22} & \ldots & x_{k2} & d_{12} & d_{22} \\
\vdots & \vdots & & \vdots & \vdots & \vdots \\
1 & x_{2n} & \ldots & x_{kn} & d_{1n} & d_{2n}
\end{pmatrix} = \begin{pmatrix}x_{1} & x_{2} & \ldots & x_{k} & x_{k+1} & x_{k+2} \end{pmatrix}$ (2.223)

where the last two columns reflect the sample observations of the dummy variables. Given this $ X$ matrix, it follows that:

\begin{displaymath}
x_{1i}=x_{(k+1)i}+x_{(k+2)i} \qquad \forall i=1,2,\ldots,n
\end{displaymath}

That is to say, the observations of the first column in $ X$ are an exact linear combination of those of the last two columns. This means that model (2.221) does not satisfy the full rank property, or in other words, there is perfect multicollinearity. Given that the rank of $ X$ is less than $ k+2$, the matrix $ X^{\top }X$ is singular, so its inverse does not exist and $ \hat{\beta}$ cannot be obtained. This fact allows us to conclude that, for the examples previously defined, the correct specification is (2.219), which includes only one dummy variable.
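The dummy-variable trap can be verified numerically: with a column of ones and both dummies, $ X$ loses full column rank because the first column equals the sum of the last two. A minimal sketch with simulated data (Python/NumPy, not the book's XploRe code):

```python
import numpy as np

# With an intercept and BOTH dummies d1, d2 (d1 + d2 = 1 for every i),
# the column of ones equals d1 + d2, so X has deficient rank and
# X'X is singular: perfect multicollinearity.
rng = np.random.default_rng(1)
n = 50
x2 = rng.normal(size=n)
d1 = (rng.random(n) < 0.5).astype(float)
d2 = 1.0 - d1

X_trap = np.column_stack([np.ones(n), x2, d1, d2])
print(np.linalg.matrix_rank(X_trap))   # 3, not 4: rank deficient

# Dropping one dummy (specification (2.219)) restores full rank:
X_ok = np.column_stack([np.ones(n), x2, d1])
print(np.linalg.matrix_rank(X_ok))     # 3 = number of columns
```

The same full-rank check applies when the intercept is dropped instead of one of the dummies, as noted below.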

Nevertheless, if the model is specified without an intercept, there is no problem in estimating it (given the corresponding matrix $ X$, without the column of ones).

Including an intercept is usually advisable in order to ensure that certain properties are satisfied and the meaning of some measures is maintained. Thus, if we specify a model with an intercept and a non-quantitative factor which has $ m$ categories, including $ m-1$ dummy variables is enough. Although this is the general rule, we can consider the possibility of not including an intercept (remember that some results will then not be satisfied) and including $ m$ dummy variables instead.

Having solved the question about the number of dummy variables to include, we now generalize the way of proceeding for the two following situations:

a)
To represent a factor which has more than two categories. This is the case of the so-called seasonality models.
b)
To represent several non-quantitative factors which have different number of categories.
The first case allows us, among other situations, to use dummy variables to deal with seasonality, that is to say, to allow the regression to have some heterogeneity between seasons (quarters or months). In this case, given that the non-quantitative factor distinguishes four categories for quarters (12 for months), and considering that the model has an intercept, we define three dummy variables (11 for months) as follows:

$\displaystyle d_{ji}=\begin{cases}
1 & \forall i \in \text{quarter (month) } j \\
0 & \text{otherwise}
\end{cases}$

With respect to the second generalization (b), suppose a model in which we want to consider the sex variable (two categories) and the race variable (three categories). Analogously to the previous case, we have to define one dummy variable for sex, and two dummies for race.
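The $ m-1$ rule above can be sketched in code. The following Python/NumPy fragment (the `dummies` helper and the category labels are hypothetical, introduced only for illustration) builds one dummy for a two-category factor and two dummies for a three-category factor, always dropping one reference category:

```python
import numpy as np

def dummies(codes, m):
    """Return an (n, m-1) matrix of 0/1 dummies for a factor with m
    categories, dropping category 0 as the excluded (reference) one."""
    codes = np.asarray(codes)
    return np.column_stack([(codes == j).astype(float) for j in range(1, m)])

# Hypothetical sample of 5 observations: sex has 2 categories,
# race has 3; category 0 is the excluded one in each case.
sex = np.array([0, 1, 1, 0, 1])
race = np.array([0, 2, 1, 1, 2])
D = np.column_stack([dummies(sex, 2), dummies(race, 3)])
print(D.shape)   # (5, 3): one sex dummy plus two race dummies
```

Appending `D` to a regressor matrix that already contains a column of ones keeps $ X$ of full rank, since each factor's dummies sum to less than one for the reference category.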


2.10.2 Models with Changes in some Slope Parameters

We now consider the situation where the non-quantitative factors affect one or more coefficients of the explanatory variables (slopes). In this situation, the dummy variables are included in "multiplicative" or "interactive" mode, that is to say, multiplied by the corresponding regressor. Some examples of this situation are the following:

2.10.2.0.1 Example 3.

Time variation of a slope. Taking the situation of Example 1, it is now supposed that the war/peace phenomenon affects only the propensity to consume. In this sense, we try to analyze whether the coefficient of a given explanatory variable is homogeneous during the sample period.

2.10.2.0.2 Example 4.

Taking the case of Example 2, we now try to analyze the behaviour of the wages of a set of workers. The wage depends, among other quantitative factors, on the years of experience, and these are considered not to be independent of sex, that is to say, there is an "interaction effect" between both variables.

Again, these situations could be specified through two models: one model for the war period and another for the peace period in Example 3, and analogously, a model for men and another for women in Example 4. Alternatively, we can specify one unique model which includes dummy variables, allowing a more efficient estimation of the parameters.

The general regression model which allows an interaction effect between a non quantitative factor (represented by means of a dummy variable $ d_{i}$) and a given quantitative variable (suppose $ x_{2}$), is written as:

\begin{displaymath}\begin{array}{cc} y_{i}=\beta_{1}+\beta_{2}x_{2i}+\ldots+\beta_{k}x_{ki}+\beta_{k+1}(d_{i}x_{2i})+u_{i} & \forall i=1,2,\ldots,n. \end{array}\end{displaymath} (2.224)

where it is supposed that the non-quantitative factor has only two categories. In order to prove that model (2.224) adequately reflects the situations of the previous examples, we calculate:

\begin{displaymath}\begin{array}{c} \textrm{E}(y_{i}\vert d_{i}=1)=\beta_{1}+(\beta_{2}+\beta_{k+1})x_{2i}+\beta_{3}x_{3i}+\ldots+\beta_{k}x_{ki} \\ \textrm{E}(y_{i}\vert d_{i}=0)=\beta_{1}+\beta_{2}x_{2i}+\ldots+\beta_{k}x_{ki} \end{array}\end{displaymath} (2.225)

We suppose that in Example 3 the income variable is $ x_{2}$, so the propensity to consume is given by its corresponding coefficient. For the first category this slope or propensity is $ (\beta_{2}+\beta_{k+1})$ rather than $ \beta_{2}$, and the parameter $ \beta_{k+1}$ reflects the difference between the slope values in the two categories (war/peace). For Example 4, if we consider that $ x_{2}$ is the variable years of experience, it is obvious that specification (2.224) adequately reflects the interaction effect mentioned earlier.
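The multiplicative specification (2.224) can also be illustrated numerically. In the following sketch (Python/NumPy, simulated data; the variable names are hypothetical), the dummy enters only through the interaction term $ d_{i}x_{2i}$, so the two categories share the intercept but differ in the slope of $ x_{2}$:

```python
import numpy as np

# Simulated data for (2.224): the slope of x_2 is 0.6 in the second
# category and 0.6 + 0.3 in the first, the intercept being common.
rng = np.random.default_rng(2)
n = 300
x2 = rng.uniform(0, 10, n)               # e.g. income or experience
d = (rng.random(n) < 0.5).astype(float)
y = 1.0 + 0.6 * x2 + 0.3 * (d * x2) + rng.normal(0, 0.4, n)

# OLS with the interaction regressor d*x2 (multiplicative dummy)
X = np.column_stack([np.ones(n), x2, d * x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta_hat[2] estimates beta_{k+1}: the slope difference between categories
print(beta_hat)
```

The estimated interaction coefficient should be close to the true slope difference (0.3 here), which is exactly the quantity whose significance is tested below.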

Again, once the model has been estimated, the most interesting hypotheses to test concern the individual significance of $ \beta_{2}$ and of the interaction coefficient $ \beta_{k+1}$, the latter amounting to testing the homogeneity of the slope across the two categories. Similarly to the previous subsection, these hypotheses can be tested by using a t-ratio statistic.


2.10.3 Models with Changes in all the Coefficients

In this case, we try to analyze jointly the situations which have been considered separately in the previous subsections. Thus, we assume that the non-quantitative factor represented by dummy variables affects both the intercept and all the coefficients of the regressors. Obviously, this is the most general situation, which contains as a particular case the influence of a non-quantitative factor on a subset of coefficients.

In order to reflect the effect of the non-quantitative factor on the intercept, the dummy variables should be included in additive form, while they should be included in multiplicative form to reflect the effect on the slopes.

Thus, the general specification, under the assumption of having a non quantitative factor which distinguishes only two categories, is given by:

$\displaystyle y_{i}=\beta_{1}+\beta_{2}x_{2i}+\ldots+\beta_{k}x_{ki}+\beta_{k+1}d_{i}+\beta_{k+2}(d_{i}x_{2i})+\ldots+\beta_{2k}(d_{i}x_{ki})+u_{i}$ (2.226)

From (2.226) we obtain the expected value of the endogenous variable conditional on each of the categories considered, with the aim of verifying that the specification adequately reflects our objective:

\begin{displaymath}\begin{array}{c} \textrm{E}(y_{i}\vert d_{i}=1)= (\beta_{1}+\beta_{k+1})+(\beta_{2}+\beta_{k+2})x_{2i}+\ldots+(\beta_{k}+\beta_{2k})x_{ki} \\ \textrm{E}(y_{i}\vert d_{i}=0)=\beta_{1}+\beta_{2}x_{2i}+\ldots+\beta_{k}x_{ki} \end{array}\end{displaymath} (2.227)

According to (2.227), $ (\beta_{1}+\beta_{k+1})$ and $ \beta_{1}$ denote, respectively, the constant terms for the first and second categories, while $ (\beta_{j}+\beta_{k+j})$ and $ \beta_{j}$ ($ j=2,3,\ldots,k$) denote the slopes of $ x_{j}$ for the first and second categories. By contrast to the previous cases, when the non-quantitative factor affects all the coefficients of the model, there is no efficiency gain in specifying (2.226) compared to considering one model for each category. This is due, as Goldberger (1964) argues, to the fact that the normal equations of the estimation of (2.226) split into two independent blocks: one corresponding to the observations of the first category and the other to those of the second. However, the t and F statistics and the $ R^{2}$ of the two ways of dealing with the information are not equivalent.
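Goldberger's observation can be checked numerically: when the dummy shifts all the coefficients, the pooled OLS estimates of (2.226) reproduce exactly the two separate per-category fits. A minimal sketch with simulated data (Python/NumPy; coefficient values are hypothetical), here with a single quantitative regressor for brevity:

```python
import numpy as np

# Simulated data where BOTH intercept and slope differ by category.
rng = np.random.default_rng(3)
n = 100
x2 = rng.normal(size=n)
d = np.repeat([1.0, 0.0], n // 2)
y = 2.0 + 1.0 * x2 + 0.5 * d + 0.7 * d * x2 + rng.normal(0, 0.3, n)

def ols(Xc, yc):
    return np.linalg.lstsq(Xc, yc, rcond=None)[0]

# Pooled model (2.226): columns [1, x2, d, d*x2]
bp = ols(np.column_stack([np.ones(n), x2, d, d * x2]), y)

# Separate regressions, one per category
m1 = d == 1.0
b1 = ols(np.column_stack([np.ones(m1.sum()), x2[m1]]), y[m1])      # cat. 1
b0 = ols(np.column_stack([np.ones((~m1).sum()), x2[~m1]]), y[~m1])  # cat. 2

# Category 2 coefficients equal (bp[0], bp[1]); category 1 coefficients
# equal (bp[0]+bp[2], bp[1]+bp[3]) - the pooled fit is just the two
# separate fits written with dummies.
print(np.allclose(b0, bp[:2]), np.allclose(b1, [bp[0] + bp[2], bp[1] + bp[3]]))
```

The point estimates coincide; only the t and F statistics and $ R^{2}$ differ between the two ways of organizing the information, as noted above.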

Having estimated (2.226), the most interesting hypotheses to test concern the significance of the coefficients of the dummy variable terms, that is, $ H_{0}:\beta_{k+j}=0$ for each $ j=1,2,\ldots,k$. To solve these tests, we will use the t-ratio statistic. Thus, dummy variables provide a method of verifying one of the classical assumptions of the MLRM: that referring to the stability of the coefficients. Nevertheless, note that this method requires knowing (or supposing) the point (observation) of the possible change in the behaviour of the model.


2.10.4 Example

Considering the same example as in previous sections, we now want to illustrate how to test for structural change, that is to say, how to test the stability of the coefficients. To this end, we suppose that a change in consumption behaviour may have occurred in 1986, since the incorporation of Spain into the European Union could have affected the coefficients. We are interested in testing the following cases: a) the change affects the autonomous consumption; b) the change affects the slope of the export variable; and c) the change affects the whole model. To carry out these tests, we begin by defining the dummy variable which allows us to differentiate the periods before and after 1986, that is to say:

$\displaystyle d_{i}=
\begin{cases}
1 & i=1986,\ldots,1997 \\
0 & \text{otherwise}.
\end{cases}$

Thus, we have to estimate the following three regression models:

a.
$ y_{i}=\beta_{1}+\beta_{2}x_{2i}+\beta_{3}x_{3i}+\beta_{4}d_{i}+u_{i}$
b.
$ y_{i}=\beta_{1}+\beta_{2}x_{2i}+\beta_{3}x_{3i}+\beta_{4}(x_{2i}d_{i})+u_{i}$
c.
$ y_{i}=\beta_{1}+\beta_{2}x_{2i}+\beta_{3}x_{3i}+\beta_{4}d_{i}+\beta_{5}(x_{2i}d_{i})+\beta_{6}(x_{3i}d_{i})+u_{i}$

In regression $ a)$ we test for structural change in the intercept, that is to say, $ H_{0}:\beta_{4}=0$. The second regression allows us to test for structural change affecting the coefficient of the export variable, by means of the same hypothesis as in the first regression. Both hypotheses are tested using the t-ratio statistic. The third regression allows us to test for structural change affecting the whole model, that is to say, $ H_{0}:\beta_{4}=\beta_{5}=\beta_{6}=0$, by means of the general F statistic.
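The F test for case c) can be sketched as follows. This is a Python/NumPy illustration with simulated data (the Spanish consumption series used in the text is not reproduced here, and the break point and coefficient values are hypothetical): it compares the restricted residual sum of squares (no dummy terms) with the unrestricted one from regression $ c)$:

```python
import numpy as np

# Simulated series with a mid-sample break in the intercept.
rng = np.random.default_rng(4)
n = 40
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
d = (np.arange(n) >= n // 2).astype(float)   # hypothetical break point
y = 1.0 + 0.5 * x2 + 0.2 * x3 + 0.8 * d + rng.normal(0, 0.3, n)

def rss(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return e @ e

ones = np.ones(n)
X_r = np.column_stack([ones, x2, x3])                     # restricted
X_u = np.column_stack([ones, x2, x3, d, d * x2, d * x3])  # regression c)
q, k_u = 3, X_u.shape[1]                                  # 3 restrictions

# F = [(RSS_r - RSS_u)/q] / [RSS_u/(n - k_u)]
F = ((rss(X_r, y) - rss(X_u, y)) / q) / (rss(X_u, y) / (n - k_u))
print(F)   # compare with the F(q, n - k_u) critical value
```

A value of F above the critical value of the $ F(3, n-6)$ distribution leads to rejecting the stability of the coefficients, as in the empirical results reported below.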

The quantlet XEGmlrm08.xpl allows us to estimate these regression models and test the corresponding hypotheses. Furthermore, the corresponding output includes the results of the four regressions.

XEGmlrm08.xpl

The first regression is used for calculating the restricted residual sum of squares, given that imposing the restriction $ \beta_{4}=\beta_{5}=\beta_{6}=0$ on regression $ c)$ yields the first regression of the output, that is to say $ y_{i}=\beta_{1}+\beta_{2}x_{2i}+\beta_{3}x_{3i}+u_{i}$. Observing the value of the F statistic (5.0059) and its corresponding p-value, we conclude that the null hypothesis is rejected. This conclusion means that there is structural change affecting the whole model. Note that the t-ratio of $ \beta_{4}$ in regressions $ a)$ and $ b)$ also leads to rejecting the null hypothesis $ H_{0}:\beta_{4}=0$, that is to say, neither the intercept nor the slope of the export variable is stable.