2.4 Finite Mixture Models

Many econometric issues require models that are richer or more flexible than conventional regression-type models. Several possibilities exist. For example, as explained in Sect. 2.2.3, the logit model is made more realistic by generalizing it to a mixed logit. Many models currently used in econometrics can be generalized in such a way.

In this section, we assume that the univariate or multivariate observations $ \boldsymbol{y}_j$ are draws from the mixture density

$\displaystyle \widetilde{f}(\boldsymbol{y}_j)= \sum_{g=1}^{G} \eta_g\, f(\boldsymbol{y}_j\vert\boldsymbol{\theta}_g)$ (2.39)

with $ \eta_1 + \ldots + \eta_G=1$. The densities $ f(\cdot\vert\boldsymbol{\theta}_g)$ are called component distributions. The observation $ \boldsymbol{y}_j$ comes from one of these component distributions but we do not observe to which component it belongs. The mixture problem involves making inference about the $ \eta_g$'s and the parameters of the component distributions given only a sample from the mixture. The closer the component distributions are to each other, the more difficult this is because of problems of identifiability and computational instability.
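To fix ideas, the following minimal Python sketch (not taken from the cited references) generates draws from a mixture of the form (2.39): each observation first selects a component $g$ with probability $\eta_g$ and is then drawn from $f(\cdot\vert\boldsymbol{\theta}_g)$. The Gaussian components and the particular weights are purely illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) mixture: G = 3 Gaussian components.
eta = np.array([0.2, 0.5, 0.3])                 # mixing weights, sum to one
theta = [(-2.0, 0.5), (0.0, 1.0), (3.0, 0.8)]   # (mean, std) of each component

def draw_mixture(J):
    """Draw J observations from a mixture of the form (2.39)."""
    # Latent component labels: which f(.|theta_g) generated y_j.
    s = rng.choice(len(eta), size=J, p=eta)
    means = np.array([theta[g][0] for g in s])
    stds = np.array([theta[g][1] for g in s])
    return rng.normal(means, stds), s

y, s = draw_mixture(1000)   # in practice only y is observed, never s

The label vector s is exactly the information that is unobserved in the mixture problem.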


2.4.1 Inference and Identification

The structure of (2.39) implies that the likelihood of all $ J$ observations, when expanded into a sum, contains $ G^J$ terms

$\displaystyle L(\boldsymbol{\eta},\boldsymbol{\theta}\vert \boldsymbol{y}) \propto \prod_{j=1}^{J} \left( \sum_{g=1}^{G} \eta_g\, f\left(\boldsymbol{y}_j\vert\boldsymbol{\theta}_g\right) \right)\;,$ (2.40)

where $ \boldsymbol{\eta}=(\eta_1,\ldots,\eta_G)^T$ and $ \boldsymbol{\theta}=(\boldsymbol{\theta}_1,\ldots,\boldsymbol{\theta}_G)^T$ contain all the parameters and $ \boldsymbol {y}$ denotes all the data. Maximum likelihood estimation using numerical optimization techniques, which requires many evaluations of the likelihood function, becomes cumbersome, if not infeasible, for large $ G$ and $ J$. This is even worse for multivariate observations.
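A single evaluation of (2.40) is a product of $J$ sums of $G$ terms and is cheap; the difficulty lies in the very many evaluations required by a numerical optimizer. A minimal sketch of a numerically stable log-likelihood evaluation, again under the illustrative assumption of univariate Gaussian components, with theta a list of (mean, std) pairs:

import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mixture_loglik(y, eta, theta):
    """Log of (2.40): sum_j log( sum_g eta_g f(y_j | theta_g) )."""
    # log f(y_j | theta_g) for every observation j (rows) and component g (columns)
    logf = np.column_stack([norm.logpdf(y, loc=m, scale=s) for m, s in theta])
    # log-sum-exp over the components avoids underflow of very small densities
    return logsumexp(logf + np.log(eta), axis=1).sum()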

Bayesian inference on finite mixture distributions by MCMC sampling is explained in [19]. Gibbs sampling on $ (\boldsymbol{\eta},\boldsymbol{\theta})$ is difficult since the posterior distributions of $ \boldsymbol{\eta}\vert\boldsymbol{\theta}, \boldsymbol{y}$ and $ \boldsymbol{\theta}\vert\boldsymbol{\eta}, \boldsymbol{y}$ are generally unknown. For the same reason as for the probit model in Sect. 2.2.1 and the stochastic volatility model in Sect. 2.3, inference on the finite mixture model is straightforward once the state or group of each observation is known. Data augmentation is therefore an appropriate way to render inference easier. Define the state indicator $ S_j$, which takes the value $ s_j=g$ when $ \boldsymbol{y}_j$ belongs to state or group $ g$, where $ g \in \{1,\ldots,G\}$, and denote by $ \boldsymbol{S}$ the $ J$-dimensional discrete vector containing all the state indicators. To facilitate the inference, prior independence, that is $ \varphi(\boldsymbol{\eta}, \boldsymbol{\theta}, \boldsymbol{S}) = \varphi(\boldsymbol{\eta})\,\varphi(\boldsymbol{\theta})\,\varphi(\boldsymbol{S})$, is usually imposed. As shown in the examples below, the posterior distributions $ \boldsymbol{S}\vert\boldsymbol{\eta},\boldsymbol{\theta}, \boldsymbol{y}$, $ \boldsymbol{\theta}\vert\boldsymbol{\eta}, \boldsymbol{S}, \boldsymbol{y}$ and $ \boldsymbol{\eta}\vert\boldsymbol{\theta}, \boldsymbol{S}, \boldsymbol{y}$ are either known distributions that are easy to sample from, or distributions for which a second, simpler, MCMC sampler can be set up. A Gibbs sampler with three main blocks may therefore be used.
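A minimal sketch of one sweep of such a three-block Gibbs sampler, written out for a univariate Gaussian mixture with known component standard deviations, a normal prior on each component mean and a symmetric Dirichlet prior on $\boldsymbol{\eta}$; the conjugate setting and all hyperparameters are illustrative assumptions, not the specification of [19].

import numpy as np

rng = np.random.default_rng(1)

def gibbs_sweep(y, eta, mu, sigma, alpha0=1.0, m0=0.0, tau0=10.0):
    """One sweep of the three-block Gibbs sampler (illustrative Gaussian mixture)."""
    G, J = len(eta), len(y)
    mu = mu.copy()

    # Block 1: S | eta, theta, y -- independent discrete (multinomial) draws
    logp = (np.log(eta)
            - 0.5 * ((y[:, None] - mu[None, :]) / sigma[None, :]) ** 2
            - np.log(sigma[None, :]))
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    s = np.array([rng.choice(G, p=p[j]) for j in range(J)])

    # Block 2: theta | eta, S, y -- conjugate normal update of each component mean
    for g in range(G):
        yg = y[s == g]
        prec = 1.0 / tau0 ** 2 + len(yg) / sigma[g] ** 2
        mean = (m0 / tau0 ** 2 + yg.sum() / sigma[g] ** 2) / prec
        mu[g] = rng.normal(mean, 1.0 / np.sqrt(prec))

    # Block 3: eta | theta, S, y -- Dirichlet with the group counts added to the prior
    counts = np.bincount(s, minlength=G)
    eta = rng.dirichlet(alpha0 + counts)
    return eta, mu, s

Iterating gibbs_sweep and storing the draws yields an MCMC sample from the (unidentified) posterior; the label switching issue discussed next still has to be dealt with.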

The complete data likelihood of the finite mixture is invariant to a relabeling of the states: we can take the labeling $ \{1,2,\ldots,G \}$ and apply any permutation $ \{\rho(1), \rho(2),\ldots,\rho(G)\}$ without changing the value of the likelihood function. If the prior is also invariant to relabeling, then so is the posterior. As a result, the posterior has potentially $ G!$ different modes. To solve this identification or label switching problem, identification restrictions have to be imposed.

Note that the inference described here is conditional on $ G$, the number of components. There are two modelling approaches to deal with $ G$. First, one can treat $ G$ as an extra parameter of the model, as is done in [44], who make use of reversible jump MCMC methods. In this way, prior information on the number of components can be taken into account explicitly, for example by specifying a Poisson distribution on $ G$ that favors a small number of components. A second approach is to treat the choice of $ G$ as a problem of model selection. By doing so, one separates the choice of $ G$ from estimation with $ G$ fixed. For example, one can take $ G=2$ and $ G=3$ and do the estimation separately for the two models. Bayesian model comparison techniques (see Chap. III.11) can then be applied, for instance by calculating the Bayes factor; see [14] and [9] for more details.


2.4.2 Examples

We review two examples. The first example fits US quarterly GNP data using a Markov switching autoregressive model. The second example is about the clustering of many GARCH models.

2.4.2.1 Markov Switching Autoregressive Model

[23] uses US quarterly real GNP growth data from 1951:2 to 1984:4. This series was initially used by [30] and is displayed in Fig. 2.2. The argument is that contracting and expanding periods are generated by the same model but with different parameters. These models are called state- (or regime-) switching models.

Figure 2.2: US real GNP growth data in percentages (1951:2 to 1984:4)

After some investigation using Bayesian model selection techniques, the adequate specification for the US growth data is found to be the two-state switching AR(2) model

$\displaystyle y_t = \beta_{i,1}\, y_{t-1} + \beta_{i,2}\, y_{t-2} + \beta_{i,3} + \epsilon_{t,i}\;, \qquad \epsilon_{t,i} \sim N\left(0,\sigma_{i}^{2}\right)\;, \qquad i=1,2\;.$ (2.41)
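The following sketch simulates a series from a two-state switching AR(2) of the form (2.41). The Markov transition probabilities of the state process and the parameter values are illustrative assumptions, not the estimates reported below.

import numpy as np

rng = np.random.default_rng(2)

# Illustrative parameters: state i maps to (beta_{i,1}, beta_{i,2}, beta_{i,3}, sigma_i).
params = {1: (0.2, 0.4, -0.5, 0.8),    # state 1: "contraction"
          2: (0.3, -0.1, 1.0, 0.8)}    # state 2: "expansion"
P = np.array([[0.75, 0.25],            # assumed transition matrix,
              [0.10, 0.90]])           # P[i-1, j-1] = Pr(s_t = j | s_{t-1} = i)

def simulate_switching_ar2(T):
    y, s, states = [0.0, 0.0], 2, []
    for t in range(T):
        s = rng.choice([1, 2], p=P[s - 1])    # Markov switching of the state
        b1, b2, b3, sigma = params[s]
        y.append(b1 * y[-1] + b2 * y[-2] + b3 + rng.normal(0.0, sigma))
        states.append(s)
    return np.array(y[2:]), np.array(states)

y, states = simulate_switching_ar2(135)   # roughly the length of the 1951:2-1984:4 sample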

The choice of two states is motivated by the contracting (negative growth) and expanding (positive growth) periods of an economy. The conditional posteriors for the $ \sigma_{i}^{2}$'s are independent inverted gamma distributions. For the $ \beta_{i}$'s, the conditional posteriors are independent normal distributions. Inference for the switching model in (2.41) is done in two steps. The first step is to construct an MCMC sample by running the random permutation sampler. Generally speaking, a draw from the random permutation sampler is obtained as follows:
(1)
Draw from the model, for example by means of the Gibbs sampler.
(2)
Relabel the states randomly.
By doing so, one samples from the unconstrained parameter space with balanced label switching. Note that in step (2) there are $ G!$ possible relabelings when there are $ G$ states; a sketch of this step is given below.
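A minimal sketch of step (2), the random relabeling, applied to one Gibbs draw. The storage layout (one row per state in each parameter array, 0-based state labels) is an assumption made for the illustration.

import numpy as np

rng = np.random.default_rng(3)

def random_relabel(eta, beta, sigma2, s):
    """Step (2) of the random permutation sampler: draw one of the G!
    permutations uniformly and relabel all state-specific quantities."""
    G = len(eta)
    perm = rng.permutation(G)   # new state g receives the parameters of old state perm[g]
    inv = np.argsort(perm)      # maps an old state label to its new label
    return eta[perm], beta[perm], sigma2[perm], inv[s]

The state indicators and every block of state-specific parameters must be permuted consistently, so that the value of the complete data likelihood is unchanged.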

In the second step, this sample is used to identify the model. This is done by visual inspection of the posterior marginal and bivariate densities. Identification restrictions need to be imposed to avoid multimodality of the posterior densities. Once suitable restrictions are found, a final MCMC sample is constructed to obtain the moments of the constrained posterior density. The latter sample is constructed by permutation sampling under the restrictions, which means that step (2) is replaced by the single permutation that maps each draw into the constrained parameter space, as sketched below.
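A sketch of this constrained step for the two-state model: each stored draw is relabeled, if necessary, by the single permutation (here just a swap) that makes it satisfy the chosen restriction, for example $\beta_{1,1} < \beta_{2,1}$. The array layout is again an illustrative assumption.

import numpy as np

def impose_restriction(beta, sigma2, eta, s):
    """Relabel a two-state draw so that beta[0, 0] < beta[1, 0],
    i.e. the restriction beta_{1,1} < beta_{2,1}; beta has one row per state,
    and the state indicators s are assumed to take values 0 and 1."""
    if beta[0, 0] < beta[1, 0]:
        return beta, sigma2, eta, s     # draw already lies in the constrained space
    perm = np.array([1, 0])             # swap the two state labels
    return beta[perm], sigma2[perm], eta[perm], 1 - s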

In the GNP growth data example, two identification restrictions seem possible, namely $ \beta_{1,1} < \beta_{2,1}$ and $ \beta_{1,3} < \beta_{2,3}$; see [23] for details. Table 2.9 provides the posterior means and standard deviations of the $ \beta_{i,j}$'s for both identification restrictions.


Table 2.9: Posterior means (standard deviations in parentheses) of the $ \beta_{i,j}$'s under both identification restrictions

                   Restriction $ \beta_{1,1} < \beta_{2,1}$         Restriction $ \beta_{1,3} < \beta_{2,3}$
                   Contraction           Expansion                  Contraction           Expansion
$ \beta_{i,1}$     $ 0.166$ ($ 0.125$)   $ 0.33$ ($ 0.101$)         $ 0.249$ ($ 0.164$)   $ 0.295$ ($ 0.116$)
$ \beta_{i,2}$     $ 0.469$ ($ 0.156$)   $ -0.129$ ($ 0.093$)       $ 0.462$ ($ 0.164$)   $ -0.114$ ($ 0.098$)
$ \beta_{i,3}$     $ -0.479$ ($ 0.299$)  $ 1.07$ ($ 0.163$)         $ -0.557$ ($ 0.322$)  $ 1.06$ ($ 0.175$)

GNP growth in contraction and expansion periods not only has a different unconditional mean, it is also driven by different dynamics. Both identification restrictions result in similar posterior moments.

2.4.2.2 Clustering of Many GARCH Models

[4] focus on differentiating the component distributions through different conditional heteroskedasticity structures, by the use of GARCH models. In this framework, the observation $ \boldsymbol{y}_j$ is multivariate (an entire time series) and the $ \boldsymbol{\theta}_g$'s are the parameters of GARCH(1,1) models. The purpose is to estimate many GARCH models, of the order of several hundred. Each financial time series belongs to one of the $ G$ groups, but it is not known a priori which series belongs to which cluster.

An additional identification problem arises due to the possibility of empty groups. If a group is empty, then the posterior of $ \boldsymbol{\theta}_g$ is equal to the prior of $ \boldsymbol{\theta}_g$; therefore an improper prior is not allowed for $ \boldsymbol{\theta}_g$. The identification problems are solved by using an informative prior on each $ \boldsymbol{\theta}_g$. The identification restrictions exploit the fact that we work with GARCH models: we select largely non-overlapping supports for the parameters, such that the prior $ \varphi(\boldsymbol{\theta})=\prod_{g=1}^{G} \varphi(\boldsymbol{\theta}_g)$ depends on a labeling choice. Uniform prior densities on each parameter, on finite intervals, possibly subject to stationarity restrictions, are relatively easy to specify.
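A sketch of how such an identifying prior might be encoded: independent uniform densities on finite rectangles for $(\alpha_g, \beta_g)$, with the labeling pinned down by the ordering of the largely non-overlapping intervals. The interval values mirror the prior intervals of Table 2.11; the function itself is an illustration, not the implementation of [4].

import numpy as np

# Prior supports for (alpha_g, beta_g) of each GARCH(1,1) component (cf. Table 2.11);
# the nearly non-overlapping intervals fix the labeling of the groups.
prior_support = {1: {"alpha": (0.001, 0.07), "beta": (0.65, 0.97)},
                 2: {"alpha": (0.07, 0.15),  "beta": (0.45, 0.75)},
                 3: {"alpha": (0.15, 0.25),  "beta": (0.20, 0.60)}}

def log_prior(theta_g, g):
    """Uniform log prior on the rectangle of group g, with a covariance-stationarity check.
    The normalizing constant ignores the (small) truncation by alpha + beta < 1."""
    alpha, beta = theta_g
    lo_a, hi_a = prior_support[g]["alpha"]
    lo_b, hi_b = prior_support[g]["beta"]
    if not (lo_a <= alpha <= hi_a and lo_b <= beta <= hi_b and alpha + beta < 1):
        return -np.inf
    return -np.log((hi_a - lo_a) * (hi_b - lo_b))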

Bayesian inference is done by use of the Gibbs sampler and data augmentation. Table 2.10 summarizes the three blocks of the sampler.


Table 2.10: Summary of conditional posteriors or sampling methods

Parameter                 Conditional posterior or sampling method
$ \boldsymbol{S}$         Multinomial distribution
$ \boldsymbol {\eta }$    Dirichlet distribution
$ \boldsymbol {\theta }$  Griddy-Gibbs sampler

Because of the prior independence of the $ \boldsymbol{\theta}_g$'s, the griddy-Gibbs sampler is applied separately $ G$ times.
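A minimal sketch of one griddy-Gibbs update of a single scalar parameter: the conditional posterior kernel is evaluated on a grid covering its prior support, turned into a numerical CDF by the trapezoidal rule, and inverted at a uniform draw. The kernel is left abstract; in this application it would combine the GARCH(1,1) likelihood of the series currently allocated to group $g$ with the informative prior described above.

import numpy as np

rng = np.random.default_rng(4)

def griddy_gibbs_draw(log_kernel, grid):
    """One griddy-Gibbs draw of a scalar parameter.

    log_kernel : callable returning the log conditional posterior kernel
                 (up to an additive constant) on an array of grid points.
    grid       : increasing 1-d array covering the prior support.
    """
    logk = log_kernel(grid)
    dens = np.exp(logk - logk.max())    # unnormalized density on the grid
    # numerical CDF by the trapezoidal rule
    cdf = np.concatenate(([0.0],
                          np.cumsum(0.5 * (dens[1:] + dens[:-1]) * np.diff(grid))))
    cdf /= cdf[-1]
    # invert the CDF at a uniform draw by linear interpolation
    return float(np.interp(rng.uniform(), cdf, grid))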


Table 2.11: Posterior results on $ \boldsymbol {\eta }$ and $ \boldsymbol {\theta }$ ($ G=3$)

                       $ \eta_1$    $ \eta_2$    $ \eta_3$
True value             $ 0.25$      $ 0.50$      $ 0.25$
Mean                   $ 0.2166$    $ 0.4981$    $ 0.2853$
Standard deviation     $ 0.0555$    $ 0.0763$    $ 0.0692$
Correlation matrix     $ 1$         $ -0.4851$   $ -0.2677$
                       $ -0.4851$   $ 1$         $ -0.7127$
                       $ -0.2677$   $ -0.7127$   $ 1$

                                       $ g=1$            $ g=2$           $ g=3$
True value              $ \alpha_g$    $ 0.04$           $ 0.12$          $ 0.20$
                        $ \beta_g$     $ 0.90$           $ 0.60$          $ 0.40$
Prior interval          $ \alpha_g$    $ (0.001, 0.07)$  $ (0.07, 0.15)$  $ (0.15, 0.25)$
                        $ \beta_g$     $ (0.65, 0.97)$   $ (0.45, 0.75)$  $ (0.20, 0.60)$
Mean                    $ \alpha_g$    $ 0.0435$         $ 0.1041$        $ 0.1975$
                        $ \beta_g$     $ 0.8758$         $ 0.5917$        $ 0.4369$
Standard deviation      $ \alpha_g$    $ 0.0060$         $ 0.0092$        $ 0.0132$
                        $ \beta_g$     $ 0.0238$         $ 0.0306$        $ 0.0350$
Correlation $ (\alpha_g, \beta_g)$     $ -0.7849$        $ -0.71409$      $ -0.7184$

As an illustration, we show posterior results for the following simulated model

$\displaystyle \widetilde{f}(y_j)= \sum_{g=1}^{3} \eta_g f(y_j\vert\boldsymbol{\theta}_g)$ (2.42)

with $ \eta_1=0.25$, $ \eta_2=0.5$ (hence $ \eta_3=0.25$), $ J=100$ and $ T_j=1000$. The components are defined more precisely as

$\displaystyle f(y_j\vert\boldsymbol{\theta}_g) = \prod_{t=1}^{T_j} f(y_{j,t}\vert\boldsymbol{\theta}_g,I_{j,t})$ (2.43)
$\displaystyle y_{j,t}\vert\boldsymbol{\theta}_g,I_{j,t} \sim N(0,h_{j,t})$ (2.44)
$\displaystyle h_{j,t} = (1-\alpha_g - \beta_g) \widetilde{\omega}_j + \alpha_g (y_{j,t-1})^2 + \beta_g h_{j,t-1}\;,$ (2.45)

where $ I_{j,t}$ is the information set up to $ t-1$, containing (at least) $ y_{j,1},\ldots,y_{j,t-1}$ and initial conditions, which are assumed known. For the simulation of the data, $ \widetilde{\omega}_j$ is fixed at one, which implies that the unconditional variance of every generated series is equal to one. However, in the estimation, the constant $ \widetilde{\omega}_j$ in the conditional variance is not subject to inference; rather, it is fixed at the empirical variance of the data. Table 2.11 presents the true values, the prior intervals on the $ \boldsymbol{\theta}_g$'s and posterior results on $ \boldsymbol {\eta }$ and $ \boldsymbol {\theta }$.
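A simulation sketch of this data-generating process, using the true values of Table 2.11; the initialization of $h_{j,0}$ and $y_{j,0}$ is an illustrative choice.

import numpy as np

rng = np.random.default_rng(5)

eta_true = np.array([0.25, 0.50, 0.25])
alpha_true = np.array([0.04, 0.12, 0.20])
beta_true = np.array([0.90, 0.60, 0.40])

def simulate_garch_mixture(J=100, T=1000, omega=1.0):
    """Simulate J series of length T from the mixture (2.42)-(2.45)."""
    s = rng.choice(3, size=J, p=eta_true)   # latent group of each series
    y = np.empty((J, T))
    for j in range(J):
        a, b = alpha_true[s[j]], beta_true[s[j]]
        h, y_prev = omega, 0.0              # assumed initial conditions
        for t in range(T):
            h = (1 - a - b) * omega + a * y_prev ** 2 + b * h
            y_prev = rng.normal(0.0, np.sqrt(h))
            y[j, t] = y_prev
    return y, s

y, s = simulate_garch_mixture()
# In the estimation step the constant omega_tilde_j is not drawn; it is fixed at
# the empirical variance of each series, as described above.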

[4] successfully apply this model to return series of 131 US stocks. Comparing the marginal likelihoods of different models, they find that $ G=3$ is the appropriate choice for the number of component distributions.

Other interesting examples of finite mixture modelling exist in the literature. [Frühwirth-Schnatter and Kaufmann (2002)] develop a regime switching panel data model. Their purpose is to cluster many short time series to capture asymmetric effects of monetary policy on bank lending. [18] develop a finite mixture negative binomial count model to estimate six measures of medical care demand by the elderly. [11] offer a flexible Bayesian analysis of the problem of causal inference in models with non-randomly assigned treatments. Their approach is illustrated using hospice data and hip fracture data.

