12.4 Panel Data Analysis


z = pansort(z0 {,N})
arranges the data set appropriately
z = panstats(z {,T})
computes summary statistics for the variables
{output, siga, sige} = panfix(z, m {,T})
estimates a fixed effects (or mixed) model
output = panrand(z, siga, sige, m {,T})
estimates a random effects (or mixed) model
z = pantime(z0 {,T})
removes fixed time effects
output = panhaus(z, siga, sige, m {,T})
performs a Hausman specification test

In a panel data set, a given sample of $ N$ individuals is observed at different time periods and thus provides multiple observations on each individual in the sample. As a simple example, assume that the relationship between the earnings and the experience of an employee is given by the linear model

$\displaystyle y = \alpha + x \beta + u,
$

where $ y$ denotes log of total earnings and $ x$ is a measure of experience. Furthermore, assume that there are data for $ i=1,\ldots,N$ individuals at $ t=1,\ldots,T_i$ time periods. If $ T_i=T$ is constant for all cross-section units, then the data set is called a balanced panel. Otherwise the panel is unbalanced.

We expect of course that a skilled worker earns more than an unskilled worker. However, if there is no information on the educational background of the employee, we might represent the impact of unobserved education by introducing an individual specific constant $ \alpha_i$. Therefore, the model is written as

$\displaystyle y_{it}=\alpha_i + x_{it}\beta + u_{it},
$

where $ \alpha_i$ depends on the skills (or ability) of the employee. One may also expect that the parameter $ \beta$ varies across individuals but this would amount to a separate analysis of the $ N$ cross section units. In typical panel data models, individual heterogeneity is represented by an individual specific intercept alone.

We may also include a measure of heterogeneity with respect to time. For example there may be a common time trend of an unknown form. Hence the model is augmented by a time specific constant $ \lambda_t$:

$\displaystyle y_{it}=\alpha_i + \lambda_t + x_{it}\beta + u_{it} \ .
$

Comprehensive overviews of this type of model are given by Hsiao (1986) and Baltagi (1995).

The effects $ \alpha_i$ and $ \lambda_t$ are assumed to be either deterministic (fixed effects model) or random (random effects model). The crucial difference between these models is that the random effects model assumes that

$\displaystyle E(x_{it}\alpha_i) = 0 \ \hbox{ and } \ E(x_{it}\lambda_t) = 0
\ \hbox{ for all } i,t \ ,
$

whereas in a fixed effects model, the regressors may be correlated with the individual and time effects. For the error $ u_{it}$ it is usually assumed that $ E(u_{it})=0$, $ E(u_{it}^2)=\sigma^2$ and that $ u_{it}$ is uncorrelated across $ i$ and $ t$. However, the panfix quantlet does not require such restrictive assumptions.
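The role of this assumption can be illustrated with a small simulation in plain Python (illustrative code, not XploRe; the data are made up). When $ x_{it}$ is correlated with $ \alpha_i$, pooled OLS is biased, whereas the within-group (fixed effects) estimator, which demeans the data within each individual, still recovers the true coefficient:

```python
# Toy balanced panel: N = 2 individuals, T = 3 periods, true beta = 1.
# The individual effect a_i is deliberately correlated with x_it:
# individual 2 has both a larger a_i and larger x values.

def ols_slope(x, y):
    """Slope of a simple OLS regression of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

T = 3
alpha = [0.0, 10.0]                       # individual effects
x = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]     # regressor, correlated with alpha
y = [alpha[i // T] + 1.0 * xi for i, xi in enumerate(x)]  # beta = 1, no noise

# Pooled OLS ignores a_i and is biased because E(x_it a_i) != 0.
pooled = ols_slope(x, y)

# Within-group estimator: subtract each individual's time mean first.
def demean_within(v, T):
    out = []
    for start in range(0, len(v), T):
        m = sum(v[start:start + T]) / T
        out += [vj - m for vj in v[start:start + T]]
    return out

within = ols_slope(demean_within(x, T), demean_within(y, T))
print(pooled, within)   # pooled is far above 1, within equals 1
```

Under the random effects assumption $ E(x_{it}\alpha_i)=0$, pooled OLS would be consistent as well; the gain from GLS would then only be one of efficiency.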

We may also mix the assumptions of a random and fixed effects model. For example, we may assume that $ \alpha_i$ is random and $ \lambda_t$ is fixed. Moreover, it may be assumed that some regressors are correlated with the random effects, while others are uncorrelated. Such a specification is in particular attractive if the set of regressors contains some variables that are constant in time. We therefore write the mixed model as

$\displaystyle y_{it} = x_{it}^T \beta + z_{it}^T \gamma + \alpha_i + \lambda_t + u_{it}$ (12.24)

where $ x_{it}$ contains the regressors that may be correlated with the effects and $ z_{it}$ those assumed to be uncorrelated. We neglect random time effects, because they are of minor practical importance, and prefer a dynamic model to represent random effects in time.

A typical (static) panel data analysis usually proceeds as follows:

  1. Model specification. First, an $ F$-test or an LM-test is used to test the hypothesis that individual effects are significant. Of course, if these tests do not provide any evidence for individual effects, then the model can simply be estimated by ordinary least squares (OLS). Second, the Hausman test is used to decide whether the regressors are correlated with the individual effect. In XploRe these tests can be performed by using the panfix and panhaus quantlets.
  2. Variance decomposition. If at least one variable is assumed to be uncorrelated with the random effects, the generalized least squares (GLS) estimator requires an estimate of the variances of $ \alpha_i$ and $ u_{it}$. These estimates are obtained in XploRe by using the panfix quantlet. If all regressors are assumed to be correlated with the individual effects, panfix gives the final estimates.
  3. GLS and GIV estimation. Using the estimates of the error variances, GLS or generalized instrumental variable (GIV) estimates are computed by the panrand quantlet.
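The variance decomposition in step 2 can be sketched in plain Python (illustrative only; the exact formulas used by panfix may differ). For a model without regressors, $ \sigma_\varepsilon^2$ is estimated from the within-group variation and $ \sigma_\alpha^2$ from the variance of the individual means, corrected for the sampling noise $ \sigma_\varepsilon^2/T$:

```python
# Simple model without regressors: y_it = a_i + u_it (made-up data).
T = 4
y = [[5.0, 6.0, 5.0, 6.0],     # individual 1
     [9.0, 10.0, 9.0, 10.0],   # individual 2
     [1.0, 2.0, 1.0, 2.0]]     # individual 3
N = len(y)

means = [sum(yi) / T for yi in y]   # individual means

# sige: average squared deviation from the individual mean
# (within-group residual variance with N*(T-1) degrees of freedom).
sige = sum((v - means[i]) ** 2
           for i, yi in enumerate(y) for v in yi) / (N * (T - 1))

# siga: variance of the individual means minus the part explained
# by the sampling noise of each mean, sige / T.
grand = sum(means) / N
siga = sum((m - grand) ** 2 for m in means) / (N - 1) - sige / T

print(sige, siga)
```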




12.4.1 The Data Set

The data set is assumed to be ordered by the cross-section units. That is, the complete data of the first individual are given in the first $ T_1$ rows, then the data of the second individual in rows $ T_1+1,\ldots,T_1+T_2$, and so on. If the panel is unbalanced, the first two columns must provide the identification number of the cross-section unit and the time period. If the panel is balanced, it is sufficient to provide the common number of time periods $ T$ to assign the data to the cross-section units. Most procedures do not use the time index in the second column; exceptions are the pantime and pandyn quantlets. Thus, in all other cases the user may insert an arbitrary column (for example, a column generated by the XploRe command matrix(rows(z),1)).

The data matrix must be organized in the following form:

 {1}    {2}         3              4           $ \cdots$      3+$ m$          $ m$+4        $ \cdots$    3+$ m+k$
  1      1      $ y_{11}$      $ x_{11,1}$     $ \cdots$    $ x_{11,m}$    $ z_{11,1}$     $ \cdots$   $ z_{11,k}$
$ \vdots$
  1   $ T_1$   $ y_{1 T_1}$   $ x_{1 T_1,1}$   $ \cdots$   $ x_{1 T_1,m}$  $ z_{1 T_1,1}$  $ \cdots$   $ z_{1 T_1,k}$
  2      1      $ y_{21}$      $ x_{21,1}$     $ \cdots$    $ x_{21,m}$    $ z_{21,1}$     $ \cdots$   $ z_{21,k}$
$ \vdots$

An example for such a data set is the file earnings.dat (see Data Sets (B.8)). The first 10 rows look like:

  [   1,]        1        1     2600       20        9        2 
  [   2,]        1        2     2500       21        9        2 
  [   3,]        1        3     2700       22        9        2 
  [   4,]        1        4     3400       23        9        2 
  [   5,]        1        5     3354       24        9        2 
  [   6,]        2        1     2850       19       13        1 
  [   7,]        2        2     4000       20       13        1 
  [   8,]        2        3     4000       21       13        1 
  [   9,]        2        4     4200       22       13        1 
  [  10,]        2        5     5720       23       13        1

The first column is the individual index, the second column is the time index and the third column gives the monthly earnings. The fourth column shows the experience in years and the fifth column presents the years of schooling. The elements in the final column are 2 for a female and 1 for a male participant. All cross sectional units are observed at $ T_i=T$ periods and, therefore, the first two columns can be skipped. In this case we specify the joint number of time periods $ T=5$ when calling the quantlets.

Summing up, the columns of the matrix are defined as follows:

first column:
Identification number for cross-section units. This column is skipped if a balanced panel is specified by the variable $ T$. The quantlets pantime and pandyn do not use this column for an unbalanced data set and, thus, its entries may be set to arbitrary values.
second column:
Time index. This column is skipped if a balanced panel is specified by the variable $ T$. Only the quantlets pantime and pandyn make use of the time index; for all other quantlets involving panel data, the time index may be set to arbitrary values.
third column:
Dependent variable
next $ m$ columns:
First set of explanatory variables. If a mixed specification is used (as in panfix, panrand, panhaus), these columns give the values of those variables which are assumed to be (i) time varying and (ii) correlated with the individual effects.
next $ k$ columns:
Second set of explanatory variables. For a mixed specification, the second set of explanatory variables is assumed to be uncorrelated with the individual effects. If no distinction between the explanatory variables is made, all explanatory variables are listed in the first set.

If a balanced data set is (inappropriately) ordered by time, so that all observations for period 1 are given first, then those of period 2, and so on, the quantlet pansort can be used to rearrange the data set:

  z = pansort(z0 {,N})
where $ N$ indicates the number of cross-section units in the original data set z0. If the parameter $ N$ is not given, the pansort quantlet sorts the unbalanced data set with respect to the cross-section units and the time periods.
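What pansort does for a time-ordered balanced panel can be mimicked with an ordinary sort on the (individual, time) key. A sketch in plain Python (illustrative, not the XploRe implementation; the rows reuse values from the earnings data shown above):

```python
# Each row: [individual id, time index, y].  The data set is stored
# period by period, i.e. ordered by time first.
time_ordered = [
    [1, 1, 2600], [2, 1, 2850],   # all observations of period 1 ...
    [1, 2, 2500], [2, 2, 4000],   # ... then all of period 2
    [1, 3, 2700], [2, 3, 4000],
]

# Rearrange so that each individual's rows are consecutive,
# sorted by (individual, time) as the panel quantlets expect.
individual_ordered = sorted(time_ordered, key=lambda row: (row[0], row[1]))

for row in individual_ordered:
    print(row)
```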

The panstats quantlet provides some summary statistics for the variables in the data set. It is highly recommended to compute the summary statistics first to detect possible problems with the data. The column Within Var.% gives the fraction of variance due to the within-group deviations. If this fraction is zero, the respective variable is constant over time. This is important information for the estimation of fixed effects models. For the earnings data we type

   panstats(z)
The output is
  [ 1,]  
  [ 2,]  N*T:    500,  N:     100,  Min T(i):   5,  Max T(i):   5
  [ 3,] ---------------------------------------------------------
  [ 4,]   Minimum     Maximum      Mean    Within Var.% Std.Error
  [ 5,] ---------------------------------------------------------
  [ 6,]      1350     1.3e+04        3938       17.87        1513
  [ 7,]         5          45       24.25        2.27       9.351
  [ 8,]         9          13        9.94      0.2764       1.319
  [ 9,]         1           2        1.34           0      0.4742
  [10,]
This table indicates that the gender variable (g) is constant over time (Within Var.% is zero) and that only a few individuals show a change in the schooling variable (Within Var.% is only about 0.28 percent of the total variance). Accordingly, the schooling variable can be treated as (roughly) constant as well.
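The Within Var.% column can be understood as the within-group sum of squares relative to the total sum of squares. A sketch in plain Python (illustrative; the exact normalization used by panstats may differ):

```python
def within_var_pct(values, T):
    """Share (in percent) of total variation due to within-group deviations."""
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    within_ss = 0.0
    for start in range(0, len(values), T):
        group = values[start:start + T]
        gm = sum(group) / T
        within_ss += sum((v - gm) ** 2 for v in group)
    return 100.0 * within_ss / total_ss

T = 5
gender = [2] * 5 + [1] * 5            # constant within each individual
experience = [20, 21, 22, 23, 24,
              19, 20, 21, 22, 23]     # grows over time

print(within_var_pct(gender, T))      # 0.0: time-constant variable
print(within_var_pct(experience, T))  # large: mostly within variation
```

A value of exactly zero flags a variable that cannot be identified by the within-group (fixed effects) estimator.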


12.4.2 Time Effects

When specifying a panel data model, we first have to decide whether fixed time effects are to be included in the regression. Since in most panel data sets the number of time periods is small compared to the number of cross-section units, including time effects does not imply a severe loss in efficiency even if there are no time effects at all. It is therefore recommended to include time effects routinely. After the final model is specified, we may also test for the presence of time effects.

If the data set is a balanced panel, no information about the individuals and time periods is needed. The program simply assumes that the observations are in a consecutive order. Therefore, only the common number of time periods has to be indicated. To remove the time effects the following command can be used:

   z1 = pantime(z {,T})
This produces a new data set z1 that results from subtracting the time mean and adding the overall mean of the variables. If the input parameter $ T$ is given, the data are assumed to be a balanced panel with a common number of time periods, and the columns containing the individual number and time periods are skipped.
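The transformation itself is easy to state: $ y^*_{it} = y_{it} - \bar y_t + \bar y$, where $ \bar y_t$ is the period mean and $ \bar y$ is the overall mean. A sketch in plain Python (illustrative, not the XploRe implementation):

```python
def remove_time_effects(y, N, T):
    """Subtract the period mean and add back the overall mean.

    y is a flat list ordered individual by individual: y[i*T + t].
    """
    overall = sum(y) / len(y)
    time_means = [sum(y[i * T + t] for i in range(N)) / N for t in range(T)]
    return [y[i * T + t] - time_means[t] + overall
            for i in range(N) for t in range(T)]

N, T = 2, 3
y = [1.0, 2.0, 3.0, 5.0, 8.0, 11.0]   # made-up data, two individuals
y_star = remove_time_effects(y, N, T)

# After the transformation every period mean equals the overall mean,
# i.e. the fixed time effects have been removed.
overall = sum(y) / len(y)
period_means = [(y_star[t] + y_star[T + t]) / 2 for t in range(T)]
print(period_means, overall)
```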

Applying such a data transformation and estimating with individual effects gives the same coefficient estimates as if the model were estimated with time dummies. However, the variance estimates differ due to the different degrees-of-freedom corrections. Whenever the (average) number of time periods is much smaller than the number of individuals $ N$, the difference is negligible. More importantly, the $ R^2$ measure is lower than in the case of an explicit inclusion of time effects, because the variance of the dependent variable becomes smaller after subtracting the time mean. If these differences matter, time dummies should be included in the variable list instead of using the pantime quantlet.

A test for the existence of time effects can be computed as follows. First the model is estimated with fixed individual effects by using the original data set z. Then, the pantime quantlet is applied to remove the time effects. The resulting data set z1 is again estimated using the panfix quantlet. The likelihood-ratio statistic is obtained as $ \textrm{LR}_\lambda=2(\ell_1-\ell_0)$, where $ \ell_0$ is the log-likelihood value for the model without time effects and $ \ell_1$ is the log-likelihood value of the model after removing the time effects. The LR statistic has an asymptotic $ \chi^2$ distribution with $ \max(T_i)-1$ degrees of freedom.


12.4.3 Model Specification

To test for the existence of individual effects, two different test statistics can be used. The $ F$-statistic and the LM-statistic are computed by the quantlet panfix. With respect to the power of the test it is important to indicate a possible correlation of the regressors with the individual effect. If it is assumed that all regressors are correlated with the individual effects (the fixed effects model), then the $ F$-test is optimal. In the case of the random effects model, the LM-statistic is preferable. In a mixed specification the LM-statistic is computed by using the residuals from the instrumental variable (IV) estimation instead of the ordinary least squares residuals.

The quantlet panfix is called as

   {output, siga, sige} = panfix(z, m {,T})
The string output yields the output table of an estimation that assumes the first $ m$ explanatory variables to be time varying and correlated with the individual effects. The remaining $ k$ variables are assumed to be uncorrelated with the individual effect. The common time period $ T$ is included in the list of input parameters if the data are a balanced panel. An example for the use of this quantlet is given in the next section.

Besides the test statistic(s) for the existence of individual effects, the panfix quantlet computes estimates for the error variances $ \sigma_\alpha^2$ (siga) and $ \sigma_\varepsilon^2$ (sige), which are used for the Hausman test called by

   output = panhaus(z, siga, sige, m {,T})

This quantlet computes Hausman's test statistic for the null hypothesis that the first $ m$ explanatory variables are uncorrelated with the individual effect $ \alpha_i$ (e.g. Hsiao, 1986). The test is based on the difference between the between-group and the within-group estimator. This version of the Hausman test is numerically identical to the usual version, which is based on a comparison of the within-group and the GLS estimator. The advantage of the former version is that the differences between the coefficient estimates can be seen as an estimate of the parameter vector $ \delta$ in the ``Mundlak model'':

$\displaystyle \alpha_i = \bar x_i^T \delta + \eta_i \ ,
$

where $ \bar x_i$ denotes the time mean for individual $ i$. If the $ j$th element of the vector $ \delta$ is zero, then there is no correlation between the (mean of the) $ j$th variable in $ x_{it}$ and the individual effect. Therefore, the $ t$-statistics for the elements of $ \delta$ can be seen as tests of the correlation with the respective explanatory variables. Accordingly, the test procedure can be used to select mixed specifications.
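The mechanics behind the test can be illustrated in plain Python (made-up data, not XploRe code): when $ x_{it}$ is correlated with $ \alpha_i$, the between-group slope absorbs the individual effect and diverges from the within-group slope, so their difference $ d$ moves away from zero:

```python
def slope(x, y):
    """Simple OLS slope of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

T = 3
x = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
y = [1.0, 2.0, 3.0, 21.0, 22.0, 23.0]   # a_i = (0, 10), true beta = 1

# Between estimator: regression on the individual means.
xbar = [sum(x[s:s + T]) / T for s in range(0, len(x), T)]
ybar = [sum(y[s:s + T]) / T for s in range(0, len(y), T)]
between = slope(xbar, ybar)

# Within estimator: regression on within-individual deviations.
xw = [xi - xbar[i // T] for i, xi in enumerate(x)]
yw = [yi - ybar[i // T] for i, yi in enumerate(y)]
within = slope(xw, yw)

d = between - within   # the quantity the Hausman test evaluates
print(between, within, d)
```

The actual test statistic additionally standardizes $ d$ by its estimated covariance matrix, which is omitted in this sketch.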


12.4.4 Estimation

If it is assumed that all variables are correlated with the individual effect $ \alpha_i$, the ``within-group estimator'' can be applied. This estimator requires that all variables are time varying at least for some cross-section units. The respective command is

    {output, siga, sige} = panfix(z, m {,T})
where the parameters are the same as before. If $ m=0$, the OLS estimator is computed; for $ m>0$ an instrumental variable (IV) estimator is applied. The standard errors of this estimator are estimated in a robust fashion, that is, they are valid for quite general forms of autocorrelation and heteroscedasticity (Arellano, 1987). The usual standard deviations and $ t$-statistics for the within-group estimator can be obtained from the quantlet panrand by setting $ m$ equal to the number of all explanatory variables.

If it is assumed that at least one variable is uncorrelated with the individual effect $ \alpha_i$ and that the error $ u_{it}$ is homoscedastic and serially uncorrelated, a more efficient estimate can be obtained using the panrand quantlet. This estimator employs the error variances $ \sigma_\alpha^2$ (siga) and $ \sigma_\varepsilon^2$ (sige) from the panfix quantlet and computes the Generalized Least Squares (GLS) (for $ m=0$) or Generalized Instrumental Variable (GIV) (for $ m>0$) estimates:

   {outfix, siga, sige} = panfix(z, m {,T}) 
   panrand(z ,siga, sige, m {,T})

The panfix quantlet stores the estimates of the variances in the variables siga and sige. These variance estimates are provided as an input for the panrand quantlet. This estimation procedure is similar to the one suggested by Hausman and Taylor (1981). The only difference between the original estimator of Hausman and Taylor and the one used in XploRe is that the former uses $ \bar z_i$ and $ z_{it}-\bar z_i$ as separate instruments, whereas the panrand quantlet uses $ z_{it}=(z_{it}-\bar z_i)+\bar z_i$ as a joint instrument. The latter estimator is easier to compute, and usually no efficiency gain results from splitting $ z_{it}$ into $ \bar z_i$ and $ z_{it}-\bar z_i$.
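The GLS step amounts to quasi-demeaning each variable with a factor $ \theta = 1-\sqrt{\sigma_\varepsilon^2/(\sigma_\varepsilon^2+T\sigma_\alpha^2)}$, i.e. $ y^*_{it}=y_{it}-\theta\bar y_i$; $ \theta=0$ reproduces pooled OLS and $ \theta\to 1$ approaches the within-group estimator. A sketch in plain Python (the standard textbook formula for a balanced panel; the XploRe implementation may differ in details):

```python
import math

def theta(siga, sige, T):
    """Quasi-demeaning factor of the random effects GLS transformation."""
    return 1.0 - math.sqrt(sige / (sige + T * siga))

# With the variance estimates of the earnings example below
# (siga = 0.48222, sige = 0.015594, T = 5), theta is close to 1,
# so GLS is close to the within-group estimator there.
t_earnings = theta(0.48222, 0.015594, 5)

# Without an individual effect (siga = 0) the transformation vanishes
# and GLS collapses to pooled OLS.
t_none = theta(0.0, 0.015594, 5)
print(t_earnings, t_none)
```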


12.4.5 An Example

Assume that we want to estimate an earnings function of the form

$\displaystyle y_{it} = b_0 + b_1 x_{it} + b_2 x_{it}^2 + b_3 g_i + b_4 s_i +
\alpha_i + \varepsilon_{it}
$

where
$ y_{it}$ is the log of earnings,
$ x_{it}$ is an experience variable,
$ g_i$ is a dummy variable coding the gender and
$ s_i$ is a measure of schooling attainment.
See Berndt (1991) for an introduction to the estimation of an earnings function.

The following code can be found in the example quantlet XLGmetric07.xpl . First we load the library and read the data set:

  library("metrics")
  z = read("earnings")
To estimate a fixed effects specification we only include the variables with a substantial variation in time. Here only the experience variables $ x$ and $ x^2$ are included. The new data set is constructed as follows:
  z1=z[,1:2]~log(z[,3])~z[,4]~(z[,4]^2)

To estimate the fixed effects model the panfix quantlet is used:

  {output,siga,sige} = panfix(z1,2)  
  output

Since we have set $ m=2$, all regressors are allowed to be correlated with the individual effect. The output is

  [ 1,] =====================================================
  [ 2,] Fixed-Effect Model: y(i,t)=x(i,t)'beta+ a(i) + e(i,t)
  [ 3,] =====================================================
  [ 4,] PARAMETERS        Estimate     robust SE      t-value
  [ 5,] =====================================================
  [ 6,] beta[ 1 ]=        0.085408       0.01362        6.271
  [ 7,] beta[ 2 ]=     -0.00038595      0.000245       -1.575
  [ 8,] CONSTANT =          6.4021        0.1929       33.185
  [ 9,] =====================================================
  [10,] Var. of a(i):      0.48222       e(i,t):     0.015594
  [11,] AR(1)-test   p-val: 0.9967       Autocorr.:    0.1628
  [12,] F(no eff.)   p-val: 0.0000       R-square:     0.9050
  [13,] LM(siga=0)   p-val: 0.0000       Log-Like:  -2080.423
  [14,] =====================================================
Comparing the estimate of $ \sigma_\alpha^2$ with the estimate of $ \sigma_\varepsilon^2$, it turns out that the individual effect clearly dominates the remaining error.

A problem with this estimate is that we have no estimates of the coefficients attached to g and s. Such estimates are available using the more restrictive framework of a random effects model. This specification is estimated by using the commands

  z1 = z1~z[,5:6]
  panrand(z1,siga,sige,0)
Setting $ m=0$ implies that no variable is allowed to be correlated with the individual effect. The resulting output table is as follows:
  [ 1,] =====================================================
  [ 2,] Random-Effect Model: y(i,t)=x(i,t)'beta+ a(i) +e(i,t)
  [ 3,] =====================================================
  [ 4,] PARAMETERS        Estimate        SE          t-value
  [ 5,] =====================================================
  [ 6,] beta[ 1 ]=        0.070265       0.01067        6.586
  [ 7,] beta[ 2 ]=     -0.00035168     0.0002077       -1.693
  [ 8,] beta[ 3 ]=         0.18934       0.04433        4.271
  [ 9,] beta[ 4 ]=        -0.32376        0.1471       -2.201
  [10,] constant =           5.298        0.5076       10.436
  [11,] =====================================================
  [12,] R-square:  0.9979 ,    N =    100 ,   N*T =      500 
  [13,] =====================================================

A problem with this specification is, however, that the regressors may be correlated with the individual effect. To test this hypothesis, the Hausman test is performed:

  panhaus(z1,siga,sige,2)
The resulting output table is
  [ 1,] =====================================================
  [ 2,] Hausman Specification Test for the null hypothesis:  
  [ 3,]  2 Variables x(i,t) are uncorrelated with a(i)      
  [ 4,] =====================================================
  [ 5,]   d = beta(between)-beta(within)    SE        t-value
  [ 6,] =====================================================
  [ 7,]   d [ 1 ] =        -0.0630       0.0398        -1.585
  [ 8,]   d [ 2 ] =         0.0000       0.0008         0.047
  [ 9,] =====================================================
  [10,] P-Value (RANDOM against FIXED effects):        0.0000
  [11,] =====================================================

The test yields some indication of a correlation of the individual effects with the first variable (x), but only weak evidence for the second variable (x$ ^2$). The joint test clearly rejects the random effects specification.

For the remaining variables, which are assumed to be constant in time, no test statistic is available. Accordingly, we may specify a mixed specification in which the first variable (or the first two variables) is allowed to be correlated with the individual effect, while all other variables are assumed to be uncorrelated with it. This specification is estimated in two steps.

First we ignore the random effects structure and perform a simple instrumental variable estimation of the model:

  {output,siga,sige} = panfix(z1,2)
  output
  [ 1,] =====================================================
  [ 2,] Mixed Specification: y=x(i,t)'b1+z(i,t)'b2+a(i)+e(i,t)
  [ 3,] =====================================================
  [ 4,] PARAMETERS        Estimate     robust SE      t-value
  [ 5,] =====================================================
  [ 6,] beta[ 1 ]=        0.083721       0.01378        6.077
  [ 7,] beta[ 2 ]=     -0.00036598     0.0002477       -1.478
  [ 8,] beta[ 3 ]=         0.19717        0.0449        4.391
  [ 9,] beta[ 4 ]=        -0.32609        0.1347       -2.421
  [10,] CONSTANT =          4.9066        0.5434        9.030
  [11,] =====================================================
  [12,] Var. of a(i):      0.38122       e(i,t):     0.015434
  [13,] AR(1)-test   p-val: 0.9939       Autocorr.:    0.1535
  [14,] LM(siga=0)   p-val: 0.0000                           
  [15,] R2:                 0.5274                           
  [16,] =====================================================
  [17,] The first  2 Variables x(i,t) are assumed to be      
  [18,] correlated with the individual effects               
  [19,] ---> IV estimate to compute siga and sige            
  [20,] =====================================================

Since we ignore the covariance matrix of the errors at this stage, the estimation may be inefficient. However, the standard errors are estimated robustly and thus are valid under very general assumptions. To improve the efficiency of the estimator, a GIV estimator is applied in the second step:

  panrand(z1,siga,sige,2)
  [ 1,] =====================================================
  [ 2,] GLS-IV estimation of the Mixed specification         
  [ 3,] =====================================================
  [ 4,]      y = x(i,t)'b1 + z(i,t)'b2 + a(i) + e(i,t)       
  [ 5,] =====================================================
  [ 6,] PARAMETERS        Estimate        SE          t-value
  [ 7,] =====================================================
  [ 8,] beta[ 1 ]=        0.083717       0.01107        7.565
  [ 9,] beta[ 2 ]=     -0.00036593     0.0002128       -1.719
  [10,] beta[ 3 ]=         0.19769       0.04057        4.872
  [11,] beta[ 4 ]=        -0.32608        0.1309       -2.492
  [12,] constant =          4.9015        0.4428       11.069
  [13,] =====================================================
  [14,] R-square:  0.9970 ,    N =    100 ,   N*T =      500 
  [15,] =====================================================
  [16,] The first  2 Variables x(i,t) are assumed to be      
  [17,] correlated with the individual effects               
  [18,] =====================================================
The $ t$-statistics of the GIV estimation are substantially larger than the (robust) $ t$-statistics from the first-stage estimation. This reflects the improved efficiency of the estimation.