In a panel data set, a given sample of individuals is observed at
different time periods and thus provides multiple observations on each
individual in the sample. As a simple example, assume that the
relationship between the earnings and the experience of an employee
is given by the linear model

   y(i,t) = x(i,t)'beta + e(i,t).

We expect, of course, that a skilled worker earns more than an
unskilled worker. However, if there is no information on the educational
background of the employee, we might represent the impact of
unobserved education by introducing an individual specific constant
a(i). Therefore, the model is written as

   y(i,t) = x(i,t)'beta + a(i) + e(i,t).

We may also include a measure of heterogeneity with respect to time. For
example, there may be a common time trend of an unknown form. Hence the
model is augmented by a time specific constant l(t):

   y(i,t) = x(i,t)'beta + a(i) + l(t) + e(i,t).
The effects a(i) and l(t) are assumed to be either deterministic
(fixed effects model) or random (random effects model). The
crucial assumption distinguishing these models is that
the random effects model assumes the effects to be uncorrelated with
the regressors, i.e. E[a(i) x(i,t)] = 0.
We may also mix the assumptions of a random and fixed effects model.
For example, we may assume that a(i) is random and l(t) is
fixed. Moreover, it may be assumed that some regressors are correlated
with the random effects, while others are uncorrelated. Such a
specification is in particular attractive if the set of regressors contains
some variables that are constant in time. We therefore write the
mixed model as

   y(i,t) = x(i,t)'b1 + z(i,t)'b2 + a(i) + e(i,t),          (12.24)

where the regressors x(i,t) may be correlated with the individual effect
a(i), whereas the regressors z(i,t) are assumed to be uncorrelated with it.
A typical (static) panel data analysis usually proceeds as follows:
The data set is assumed to be ordered by the cross-section units. That is,
the complete data of the first individual are given in the first T(1)
rows, then the data of the second individual in the following T(2) rows,
and so on. If the data are an unbalanced data set, the first two columns
must provide the identification number of the cross-section unit and the
time period. If the data are in the form of a balanced panel, it is
sufficient to provide the common number of time periods T to assign the
data to the cross-section units. Most procedures do not use the time
index in the second column. Exceptions are the pantime and the pandyn
quantlets. Thus, in all other cases the user may insert some arbitrary
column (for example, a column generated by the XploRe command
matrix(rows(z),1)).
The data matrix must be organized in the following form: the first two
columns hold the individual index and the time index (they are optional
for a balanced panel), the third column holds the dependent variable,
and columns 4 to 3+k hold the k explanatory variables:

   {i}  {t}     y           x_1          ...   x_k
    1    1      y(1,1)      x1(1,1)      ...   xk(1,1)
    1    2      y(1,2)      x1(1,2)      ...   xk(1,2)
   ...
    1    T(1)   y(1,T(1))   x1(1,T(1))   ...   xk(1,T(1))
    2    1      y(2,1)      x1(2,1)      ...   xk(2,1)
   ...
    N    T(N)   y(N,T(N))   x1(N,T(N))   ...   xk(N,T(N))
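For a balanced panel the two identification columns are fully determined by N and T. As a sketch in numpy (not XploRe; the names here are illustrative), they can be generated like this:

```python
import numpy as np

# Identification columns for a balanced panel with N units observed over
# T periods, ordered by individual (column 1: unit index, column 2: time).
N, T = 100, 5
ids = np.repeat(np.arange(1, N + 1), T)    # 1,1,1,1,1, 2,2,...
time = np.tile(np.arange(1, T + 1), N)     # 1,2,3,4,5, 1,2,...
id_cols = np.column_stack([ids, time])     # the first two matrix columns
```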
An example for such a data set is the file earnings.dat (see Data Sets (B.8)). The first 10 rows look like:
[ 1,]   1   1   2600   20    9    2
[ 2,]   1   2   2500   21    9    2
[ 3,]   1   3   2700   22    9    2
[ 4,]   1   4   3400   23    9    2
[ 5,]   1   5   3354   24    9    2
[ 6,]   2   1   2850   19   13    1
[ 7,]   2   2   4000   20   13    1
[ 8,]   2   3   4000   21   13    1
[ 9,]   2   4   4200   22   13    1
[10,]   2   5   5720   23   13    1
The first column is the individual index, the second column is the time
index and the third column gives the monthly earnings.
The fourth column shows the experience in years and the fifth column
presents the years of schooling. The elements in the final column are 2
for a female and 1 for a male participant. All cross sectional units are
observed at T = 5 periods and, therefore, the first two columns can be
skipped. In this case we specify the common number of time periods
T = 5 when calling the quantlets.
Summing up, the columns of the matrix are defined as follows: column 1
holds the individual index, column 2 the time index, column 3 the
monthly earnings, column 4 the experience in years, column 5 the years
of schooling, and column 6 the gender of the participant.
If a balanced data set is (inappropriately) ordered by time, so that all
observations for period 1 are given first, then the observations of
period 2, and so on, the quantlet pansort can be used to rearrange the
data set:

  z = pansort(z0 {,N})

where z0 is the data set ordered by time periods and N is the number of
cross-section units.
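For a balanced panel, the reordering that pansort performs can be sketched in numpy (an illustrative translation, not the quantlet's actual code):

```python
import numpy as np

def pansort_sketch(z0, N):
    """Reorder a balanced panel stored time by time (all N cross-section
    units for period 1, then period 2, ...) into the individual-by-
    individual order that the panel quantlets expect."""
    rows, k = z0.shape
    T = rows // N
    # view as (T, N, k), swap the time and unit axes, flatten again
    return z0.reshape(T, N, k).transpose(1, 0, 2).reshape(rows, k)

# example: time-ordered data (columns: unit index, time index)
z0 = np.array([[1, 1], [2, 1], [3, 1],
               [1, 2], [2, 2], [3, 2]])
z = pansort_sketch(z0, N=3)
```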
The panstats quantlet provides some summary statistics for the
variables in the data set. It is highly recommended to compute the
summary statistics first, in order to detect possible problems with the
data.
The column Within Var.% gives the fraction of variance
due to the within-group deviations. If this fraction is zero, the
respective variable is constant over time. This is important
information for the estimation of fixed effects models.
For the earnings data we type
  panstats(z)

The output is
[ 1,]
[ 2,] N*T:  500,  N:  100,  Min T(i):  5,  Max T(i):  5
[ 3,] ---------------------------------------------------------
[ 4,]  Minimum    Maximum     Mean    Within Var.%   Std.Error
[ 5,] ---------------------------------------------------------
[ 6,]     1350    1.3e+04     3938       17.87          1513
[ 7,]        5         45    24.25        2.27         9.351
[ 8,]        9         13     9.94      0.2764         1.319
[ 9,]        1          2     1.34           0        0.4742
[10,]

This table indicates that the gender variable (g) is constant over time
(its Within Var.% is zero) and that there are only a few individuals
with a changing schooling variable (its Within Var.% is only 0.28
percent of the total variance). Accordingly, the schooling variable can
be treated as (roughly) constant as well.
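The Within Var.% column can be reproduced directly from its definition, the share of total variance due to within-group deviations. A minimal numpy sketch (the function name is hypothetical, not part of XploRe):

```python
import numpy as np

def within_var_pct(x, ids):
    """Share (in percent) of a variable's total variance that comes from
    within-group deviations, as reported in the Within Var.% column."""
    x = np.asarray(x, dtype=float)
    ids = np.asarray(ids)
    group_mean = np.empty_like(x)
    for g in np.unique(ids):
        mask = ids == g
        group_mean[mask] = x[mask].mean()
    # variance of the deviations from the group means over total variance
    return 100.0 * (x - group_mean).var() / x.var()
```

A variable that is constant within each individual (such as gender) yields 0; a variable whose group means are all equal yields 100.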
When specifying a panel data model, we first have to decide whether fixed time effects are to be included in the regression. Since in most panel data sets the number of time periods is small compared to the number of cross-section units, including time effects does not imply a severe loss of efficiency even if there are no time effects at all. It is therefore recommended to include time effects routinely. After the final model is specified, we may also test for the presence of time effects.
If the data set is a balanced panel, no information about the individuals and time periods is needed. The program simply assumes that the observations are in consecutive order. Therefore, only the common number of time periods T has to be indicated. To remove the time effects the following command can be used:
  z1 = pantime(z {,T})

This will produce a new data set z1 that results from subtracting the
time mean and adding the overall mean of the variables. If the input
parameter T is omitted, the time index in the second column is used to
form the time means.
Applying such a data transformation and estimating with individual
effects gives the same coefficient estimates as estimating the model
with time dummies. However, the variance estimates differ due to the
different degrees of freedom correction.
Whenever the (average) number of time periods T is much smaller than the
number of individuals N, the difference is negligible. More importantly,
the R-square measure is lower than in the case of an explicit inclusion
of time effects because the variance of the dependent variable becomes
smaller after subtracting the time mean. If these differences matter,
time dummies should be included in the variable list instead of using
the pantime quantlet.
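The transformation itself is easy to state in numpy for a balanced panel (a sketch of what pantime computes, not the quantlet's code):

```python
import numpy as np

def pantime_sketch(z, T):
    """Remove fixed time effects from a balanced panel (ordered by
    individual, T consecutive rows per unit): subtract the period mean
    of every column and add back the overall mean, so that the
    variables keep their original level."""
    rows, k = z.shape
    N = rows // T
    panel = np.asarray(z, dtype=float).reshape(N, T, k)
    time_mean = panel.mean(axis=0)        # (T, k): mean across units
    overall = panel.mean(axis=(0, 1))     # (k,): overall mean
    return (panel - time_mean + overall).reshape(rows, k)
```

After the transformation every period mean equals the overall mean, so any common time trend is removed while the level of each variable is preserved.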
A test for the existence of time effects can be computed as follows.
First the model is estimated with fixed individual effects by using the
original data set z. Then, the pantime quantlet is applied to remove the
time effects. The resulting data set z1 is again estimated using the
panfix quantlet. The likelihood-ratio statistic is obtained as
LR = 2(l1 - l0), where l0 is the log-likelihood value for the model
without time effects and l1 is the log-likelihood value of the model
after removing the time effects. The LR statistic has an asymptotic
chi-squared distribution with T-1 degrees of freedom.
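Given the two log-likelihood values from panfix, the statistic and its p-value are a one-liner. The sketch below (illustrative helper names, not XploRe) uses the closed-form chi-squared survival function, which exists for even degrees of freedom; in the earnings example T = 5, so T-1 = 4 is covered:

```python
import math

def chi2_sf(x, df):
    """Chi-squared survival function, closed form for even df:
    exp(-x/2) * sum_{k < df/2} (x/2)**k / k!."""
    assert df % 2 == 0, "this closed form only covers even df"
    return math.exp(-x / 2) * sum((x / 2) ** k / math.factorial(k)
                                  for k in range(df // 2))

def lr_time_effects(ll_no_time, ll_time, T):
    """LR statistic 2*(l1 - l0) and its asymptotic p-value (T-1 df)."""
    lr = 2.0 * (ll_time - ll_no_time)
    return lr, chi2_sf(lr, T - 1)
```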
To test for the existence of individual effects, two different test
statistics can be used. The F-statistic and the LM-statistic are
computed by the quantlet panfix. With respect to the power of the test
it is important to indicate a possible correlation of the regressors
with the individual effects. If it is assumed that all regressors are
correlated with the individual effects (the fixed effects model), then
the F-test is optimal. In the case of the random effects model, the
LM-statistic is preferable. In a mixed specification the LM-statistic is
computed by using the residuals from the instrumental variable (IV)
estimation instead of the ordinary least squares residuals.
The quantlet panfix is called as

  {output, siga, sige} = panfix(z, m {,T})

The string output yields the output table of an estimation assuming that
the first m regressors are correlated with the individual effects.
Besides the test statistic(s) for the existence of individual effects,
the panfix quantlet computes estimates for the error variances sigma_a^2
(siga) and sigma_e^2 (sige), which are used for the Hausman test called
by

  output = panhaus(z, siga, sige, m {,T})
This quantlet computes Hausman's test statistic for the hypothesis that
the first m explanatory variables are correlated with the individual
effects (e.g. Hsiao, 1986). The test is based on the difference of the
between- and the within-group estimator. This version of the Hausman
test is numerically identical to the usual version which is based on a
comparison of the within-group and the GLS estimator. The advantage of
the former version of the Hausman test is that the differences between
the coefficient estimates can be seen as an estimate for the parameter
vector gamma in the ``Mundlak model''

  y(i,t) = x(i,t)'beta + xbar(i)'gamma + a(i) + e(i,t),

where xbar(i) denotes the vector of individual means of the regressors.
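This interpretation can be checked numerically. The simulation below (numpy, with entirely made-up data, not XploRe code) generates an individual effect that depends on the regressor mean, a(i) = gamma * xbar(i) + eta(i); the within-group estimator then stays consistent for beta, the between-group estimator converges to beta + gamma, and their difference estimates the Mundlak parameter gamma:

```python
import numpy as np

rng = np.random.default_rng(42)
N, T, beta, gamma = 500, 5, 1.0, 0.8

x = rng.normal(size=(N, T))
xbar = x.mean(axis=1, keepdims=True)
a = gamma * xbar + 0.1 * rng.normal(size=(N, 1))   # effect correlated with x
y = beta * x + a + 0.1 * rng.normal(size=(N, T))

# within-group estimator: OLS on deviations from the individual means
xw, yw = x - xbar, y - y.mean(axis=1, keepdims=True)
b_within = (xw * yw).sum() / (xw ** 2).sum()

# between-group estimator: OLS on the individual means
xb = xbar.ravel() - xbar.mean()
yb = y.mean(axis=1) - y.mean()
b_between = (xb * yb).sum() / (xb ** 2).sum()

d = b_between - b_within   # estimates gamma
```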
If it is assumed that all variables are correlated with the individual
effects, the ``within-group estimator'' can be applied. This estimator
requires that all variables are time varying, at least for some
cross-section units. The respective command is

  {output, siga, sige} = panfix(z, m {,T})

where the parameters are the same as before. If m equals the number of
regressors, all of them are treated as correlated with the individual
effects.
If it is assumed that at least one variable is uncorrelated with the
individual effects and that the error e(i,t) is homoscedastic and
serially uncorrelated, a more efficient estimate can be obtained using
the panrand quantlet. This estimator employs the error variances
sigma_a^2 (siga) and sigma_e^2 (sige) from the panfix quantlet and
computes the Generalized Least Squares (GLS) (for m = 0) or Generalized
Instrumental Variable (GIV) (for m > 0) estimates:

  {output, siga, sige} = panfix(z, m {,T})
  panrand(z, siga, sige, m {,T})
The panfix quantlet stores the estimates of the variances in the
variables siga and sige. These variance estimates are provided as an
input for the panrand quantlet. This estimation procedure is similar to
the one suggested by Hausman and Taylor (1981). The only difference
between the original estimator of Hausman and Taylor and the one used in
XploRe is that the former uses the individual means zbar(i) and the
deviations z(i,t) - zbar(i) of the uncorrelated regressors as separate
instruments, whereas the panrand quantlet uses z(i,t) as a joint
instrument. The reason for using the latter estimator is that it is
easier to compute and usually no efficiency gain results from splitting
z(i,t) into zbar(i) and z(i,t) - zbar(i).
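Random effects GLS is commonly implemented by quasi-demeaning the data with the factor theta = 1 - sqrt(sige / (sige + T*siga)) and running OLS on the transformed variables. The source does not state panrand's internal formula, so the numpy sketch below is only an illustration of this standard transformation:

```python
import numpy as np

def quasi_demean(v, T, siga, sige):
    """GLS transformation for the random effects model: subtract the
    fraction theta of the individual mean, with
    theta = 1 - sqrt(sige / (sige + T * siga)).
    v is a (N*T,) vector ordered by individual (balanced panel)."""
    theta = 1.0 - np.sqrt(sige / (sige + T * siga))
    panel = np.asarray(v, dtype=float).reshape(-1, T)
    return (panel - theta * panel.mean(axis=1, keepdims=True)).ravel()
```

For siga = 0 we get theta = 0 (plain OLS), and as siga grows theta approaches 1, so GLS approaches the within-group transformation.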
Assume that we want to estimate an earnings function of the form

  log y(i,t) = beta1 E(i,t) + beta2 E(i,t)^2 + beta3 s(i) + beta4 g(i)
               + a(i) + e(i,t),

where y denotes the monthly earnings, E the experience in years, s the
years of schooling and g the gender of the employee. The following code
can be found in the example quantlet XLGmetric07.xpl.
First we run the library and load the data set:

  library("metrics")
  z = read("earnings")

To estimate a fixed effects specification we only include the variables
with a substantial variation in time. Here only the experience variable
and its square qualify:

  z1 = z[,1:2]~log(z[,3])~z[,4]~(z[,4]^2)
To estimate the fixed effects model the panfix quantlet is used:

  {output,siga,sige} = panfix(z1,2)
  output

Since we have set m = 2, all regressors are allowed to be correlated
with the individual effects. The output is
[ 1,] =====================================================
[ 2,] Fixed-Effect Model: y(i,t)=x(i,t)'beta+ a(i) + e(i,t)
[ 3,] =====================================================
[ 4,] PARAMETERS      Estimate    robust SE     t-value
[ 5,] =====================================================
[ 6,] beta[ 1 ]=      0.085408      0.01362       6.271
[ 7,] beta[ 2 ]=   -0.00038595     0.000245      -1.575
[ 8,] CONSTANT  =       6.4021       0.1929      33.185
[ 9,] =====================================================
[10,] Var. of a(i):  0.48222    e(i,t):  0.015594
[11,] AR(1)-test p-val: 0.9967   Autocorr.:   0.1628
[12,] F(no eff.) p-val: 0.0000   R-square:    0.9050
[13,] LM(siga=0) p-val: 0.0000   Log-Like: -2080.423
[14,] =====================================================
A problem with this estimate is that we have no estimates of the coefficients attached to g and s. Such estimates are available using the more restrictive framework of a random effects model. This specification is estimated by using the commands
  z1 = z1~z[,5:6]
  panrand(z1,siga,sige,0)

Setting m = 0 assumes that all regressors are uncorrelated with the
individual effects. The output is
[ 1,] =====================================================
[ 2,] Random-Effect Model: y(i,t)=x(i,t)'beta+ a(i) +e(i,t)
[ 3,] =====================================================
[ 4,] PARAMETERS      Estimate        SE        t-value
[ 5,] =====================================================
[ 6,] beta[ 1 ]=      0.070265      0.01067       6.586
[ 7,] beta[ 2 ]=   -0.00035168    0.0002077      -1.693
[ 8,] beta[ 3 ]=       0.18934      0.04433       4.271
[ 9,] beta[ 4 ]=      -0.32376       0.1471      -2.201
[10,] constant  =        5.298       0.5076      10.436
[11,] =====================================================
[12,] R-square: 0.9979 ,  N = 100 ,  N*T = 500
[13,] =====================================================
A problem with this specification is, however, that the regressors may be correlated with the individual effect. To test this hypothesis, the Hausman test is performed:
  panhaus(z1,siga,sige,2)

The resulting output table is
[ 1,] =====================================================
[ 2,] Hausman Specification Test for the null hypothesis:
[ 3,] 2 Variables x(i,t) are uncorrelated with a(i)
[ 4,] =====================================================
[ 5,] d = beta(between)-beta(within)    SE      t-value
[ 6,] =====================================================
[ 7,] d [ 1 ] =   -0.0630            0.0398     -1.585
[ 8,] d [ 2 ] =    0.0000            0.0008      0.047
[ 9,] =====================================================
[10,] P-Value (RANDOM against FIXED effects):  0.0000
[11,] =====================================================
The test yields some indication of a correlation of the individual
effects with the first variable (experience) but only weak evidence for
the second variable (squared experience). The joint test clearly rejects
the random effects specification.
For the remaining variables, which are assumed to be constant in time,
no test statistic is available. Accordingly, we may adopt a mixed
specification in which the first two variables are allowed to be
correlated with the individual effects, while all other variables are
assumed to be uncorrelated with them. This specification is estimated
in two steps.
First we ignore the random effects structure and perform a simple instrumental variable estimation of the model:
  {output,siga,sige} = panfix(z1,2)
  output
[ 1,] =====================================================
[ 2,] Mixed Specification: y=x(i,t)'b1+z(i,t)'b2+a(i)+e(i,t)
[ 3,] =====================================================
[ 4,] PARAMETERS      Estimate    robust SE     t-value
[ 5,] =====================================================
[ 6,] beta[ 1 ]=      0.083721      0.01378       6.077
[ 7,] beta[ 2 ]=   -0.00036598    0.0002477      -1.478
[ 8,] beta[ 3 ]=       0.19717       0.0449       4.391
[ 9,] beta[ 4 ]=      -0.32609       0.1347      -2.421
[10,] CONSTANT  =       4.9066       0.5434       9.030
[11,] =====================================================
[12,] Var. of a(i):  0.38122    e(i,t):  0.015434
[13,] AR(1)-test p-val: 0.9939   Autocorr.:  0.1535
[14,] LM(siga=0) p-val: 0.0000
[15,] R2: 0.5274
[16,] =====================================================
[17,] The first 2 Variables x(i,t) are assumed to be
[18,] correlated with the individual effects
[19,] ---> IV estimate to compute siga and sige
[20,] =====================================================
Since we ignore the covariance matrix of the errors at this stage, the estimation may be inefficient. However, the standard errors are computed robustly and are thus valid under very general assumptions. To improve the efficiency of the estimator, a GIV estimator is applied in the second step:
panrand(z1,siga,sige,2)
[ 1,] =====================================================
[ 2,] GLS-IV estimation of the Mixed specification
[ 3,] =====================================================
[ 4,] y = x(i,t)'b1 + z(i,t)'b2 + a(i) + e(i,t)
[ 5,] =====================================================
[ 6,] PARAMETERS      Estimate        SE        t-value
[ 7,] =====================================================
[ 8,] beta[ 1 ]=      0.083717      0.01107       7.565
[ 9,] beta[ 2 ]=   -0.00036593    0.0002128      -1.719
[10,] beta[ 3 ]=       0.19769      0.04057       4.872
[11,] beta[ 4 ]=      -0.32608       0.1309      -2.492
[12,] constant  =       4.9015       0.4428      11.069
[13,] =====================================================
[14,] R-square: 0.9970 ,  N = 100 ,  N*T = 500
[15,] =====================================================
[16,] The first 2 Variables x(i,t) are assumed to be
[17,] correlated with the individual effects
[18,] =====================================================