Next: 11.5 More Monte Carlo Up: 11. Bayesian Computational Methods Previous: 11.3 Monte Carlo Methods

Subsections

11.4.1 Metropolis-Hastings as Universal Simulator
11.4.2 Gibbs Sampling and Latent Variable Models
11.4.3 Reversible Jump Algorithms for Variable Dimension Models

11.4 Markov Chain Monte Carlo Methods

As described precisely in Chap. III.3 and in [37], MCMC methods try to overcome the limitations of regular Monte Carlo methods by means of a Markov chain whose stationary distribution is the posterior distribution. There exist rather generic ways of producing such chains, including the Metropolis-Hastings and Gibbs algorithms. Besides the fact that stationarity with respect to the target distribution is enough to justify a simulation method based on Markov chain generation, the idea at the core of MCMC algorithms is that local exploration, when properly weighted, can lead to a valid representation of the distribution of interest, as illustrated by the Metropolis-Hastings algorithm.


11.4.1 Metropolis-Hastings as Universal Simulator

The Metropolis-Hastings algorithm, presented in [37] and Chap. II.3, offers a straightforward solution to the problem of simulating from the posterior distribution $ \pi(\theta\vert x)\propto f(x\vert\theta)\,\pi(\theta)$: starting from an arbitrary point $ \theta_0$, the corresponding Markov chain explores the surface of the posterior distribution through a random walk proposal $ q(\theta\vert\theta^{\prime})$ that progressively visits the whole range of possible values of $ \theta$.

Metropolis-Hastings Algorithm

At iteration $ t$

  1. Generate $ \xi_t\sim q( \xi \vert \theta^{(t)} )$, $ u_t\sim\mathcal{U}([0,1])$
  2. Take

    $\displaystyle \theta^{(t+1)} = \begin{cases}\xi_t &\text{if}\quad u_t\le \displaystyle\frac{\pi\left(\xi_t\vert x\right)\,q\left(\theta^{(t)}\vert\xi_t\right)}{\pi\left(\theta^{(t)}\vert x\right)\,q\left(\xi_t\vert\theta^{(t)}\right)}\\ \theta^{(t)} &\text{otherwise} \end{cases}$
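The algorithm above can be sketched in a few lines of Python. This is a minimal illustration only: the proposal is a symmetric random walk, so the $ q$-ratio cancels from the acceptance probability, and a hypothetical one-dimensional Gaussian target stands in for $ \pi(\theta\vert x)$.

```python
import math
import random

def metropolis_hastings(log_post, theta0, scale, n_iter, seed=0):
    """Random-walk Metropolis-Hastings with N(theta^(t), scale^2) proposals.

    `log_post` is the log of the target density, known only up to an
    additive constant; the symmetric proposal makes the q-ratio cancel
    in the acceptance probability.
    """
    rng = random.Random(seed)
    theta = theta0
    chain = []
    for _ in range(n_iter):
        xi = theta + scale * rng.gauss(0.0, 1.0)          # xi ~ q(.|theta^(t))
        log_alpha = log_post(xi) - log_post(theta)        # log pi(xi)/pi(theta)
        if rng.random() < math.exp(min(0.0, log_alpha)):  # accept if u_t <= alpha
            theta = xi
        chain.append(theta)
    return chain

# Hypothetical target: an unnormalised N(3, 1) "posterior".
chain = metropolis_hastings(lambda t: -0.5 * (t - 3.0) ** 2, 0.0, 1.0, 20000)
mean = sum(chain[5000:]) / len(chain[5000:])
```

Working on the log scale avoids underflow of the posterior ratio; after discarding a burn-in, the chain average approximates the posterior mean.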

Example 14 (Continuation of Example 11)   In the case $ p=1$, the probit model defined in Example 11 can also be over-parameterised as

$\displaystyle P(Y_i=1\vert x_i) = 1 - P(Y_i=0\vert x_i) = \Phi\left(x_i\beta/\sigma\right)\;,$    

since it only depends on $ \beta/\sigma$. The Bayesian processing of non-identified models poses no serious difficulty as long as the posterior distribution is well defined. This is the case for a proper prior like

$\displaystyle \pi\left(\beta,\sigma^2\right) \propto \sigma^{-4}\,\exp\left\{-1/\sigma^2\right\}\,\exp\left\{-\beta^2/50\right\}$    

that corresponds to a normal distribution on $ \beta$ and a gamma distribution on $ \sigma^{-2}$. While the posterior distribution on $ (\beta,\sigma)$ is not a standard distribution, it is available up to a normalising constant and can therefore be processed directly by an MCMC algorithm. In this case, we chose a Gibbs sampler that simulates $ \beta$ and $ \sigma ^2$ alternately, from

$\displaystyle \pi(\beta\vert\boldsymbol{x},\boldsymbol{y},\sigma) \propto \prod_{y_i=1} \Phi\left(x_i\beta/\sigma\right) \prod_{y_i=0} \Phi\left(-x_i\beta/\sigma\right) \times\pi(\beta)$

and

$\displaystyle \pi\left(\sigma^2\vert\boldsymbol{x},\boldsymbol{y},\beta\right) \propto \prod_{y_i=1} \Phi\left(x_i\beta/\sigma\right) \prod_{y_i=0} \Phi\left(-x_i\beta/\sigma\right) \times\pi\left(\sigma^2\right)$

respectively. Since both of these conditional distributions are also non-standard, we replace direct simulation with one-dimensional Metropolis-Hastings steps, using normal $ \mathcal{N}(\beta^{(t)},1)$ and log-normal $ \mathcal{L}\mathcal{N}(\log\sigma^{(t)},0.04)$ random walk proposals, respectively. For a simulated dataset of $ 1000$ points, the contour plot of the log-posterior distribution is given in Fig. 11.7, along with the last $ 1000$ points of a corresponding MCMC sample after $ 100{,}000$ iterations. This graph shows a very satisfactory spread of the simulated parameters over the likelihood surface, with higher concentrations near the largest posterior regions. For another simulation, Fig. 11.8 details the first $ 500$ steps, started at $ (\beta ,\sigma ^2)=(0.1,4.0)$. Although each step contains both a $ \beta$ and a $ \sigma $ proposal, some moves are purely horizontal or vertical: these correspond to cases where either the $ \beta$ or the $ \sigma $ proposal has been rejected. Note also the fairly rapid convergence to a modal zone of the posterior distribution in this case.
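This Metropolis-within-Gibbs scheme can be sketched as below, with the proposal scales quoted above; the smaller simulated dataset ($ n=100$ rather than $ 1000$), the seed, and the true value $ \beta/\sigma=1$ are assumptions of the sketch, not the chapter's.

```python
import math
import random

rng = random.Random(1)

def log_phi(t):
    """log of the standard normal cdf, with an asymptotic tail fallback."""
    p = 0.5 * math.erfc(-t / math.sqrt(2.0))
    if p > 0.0:
        return math.log(p)
    return -0.5 * t * t - math.log(-t) - 0.5 * math.log(2.0 * math.pi)

# Simulated probit data with true beta/sigma = 1 (an assumption of this sketch).
n = 100
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [1 if rng.random() < 0.5 * math.erfc(-x / math.sqrt(2.0)) else 0 for x in xs]

def log_post(beta, sig2):
    """Unnormalised log posterior for the over-parameterised probit model."""
    s = math.sqrt(sig2)
    ll = sum(log_phi(x * beta / s) if y == 1 else log_phi(-x * beta / s)
             for x, y in zip(xs, ys))
    # prior: sigma^{-4} exp(-1/sigma^2) exp(-beta^2/50)
    return ll - 2.0 * math.log(sig2) - 1.0 / sig2 - beta * beta / 50.0

beta, sig2 = 0.0, 1.0
cur = log_post(beta, sig2)
ratios = []
for t in range(4000):
    # beta-step: symmetric N(beta^(t), 1) random walk proposal.
    b_prop = beta + rng.gauss(0.0, 1.0)
    lp = log_post(b_prop, sig2)
    if rng.random() < math.exp(min(0.0, lp - cur)):
        beta, cur = b_prop, lp
    # sigma-step: log-normal LN(log sigma^(t), 0.04) random walk; the
    # asymmetric proposal contributes log(s2_prop/sig2) to the log ratio.
    s2_prop = sig2 * math.exp(2.0 * 0.2 * rng.gauss(0.0, 1.0))
    lp = log_post(beta, s2_prop)
    if rng.random() < math.exp(min(0.0, lp - cur + math.log(s2_prop / sig2))):
        sig2, cur = s2_prop, lp
    if t >= 1000:
        ratios.append(beta / math.sqrt(sig2))

mean_ratio = sum(ratios) / len(ratios)
```

Only the identified ratio $ \beta/\sigma$ is monitored here; the pair $ (\beta,\sigma)$ itself drifts along the non-identified direction, constrained only by the prior.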

Obviously, this is only a toy example and more realistic probit models do not fare so well with off-the-shelf random walk Metropolis-Hastings algorithms, as shown for instance in [32] (see also [37], Sect. 10.3.2).11.8

Figure 11.7: Contour plot of the log-posterior distribution for a probit sample of $ 1000$ observations, along with $ 1000$ points of an MCMC sample (Source: [37])
\includegraphics[width=11.4cm]{text/3-11/probit1.eps}

Figure 11.8: First $ 500$ steps of the Metropolis-Hastings algorithm on the probit log-posterior surface, when started at $ (\beta ,\sigma ^2)=(0.1,4.0)$
\includegraphics[width=8.3cm]{text/3-11/probste.eps}

The difficulty inherent to random walk Metropolis-Hastings algorithms is the scaling of the proposal distribution: it must be adapted to the shape of the target distribution so that, in a reasonable number of steps, the whole support of this distribution can be visited. If the scale of the proposal is too small, this will not happen: the algorithm stays ''too local'' and, if there are several modes on the posterior, it may get trapped within one modal region because it cannot reach the other modal regions with jumps of such small magnitude. The larger the dimension $ p$, the harder it is to set up the right scale, because

(a) the curse of dimensionality implies that there are more and more empty regions in the space, that is, regions with zero posterior probability;
(b) the knowledge and intuition about the modal regions get weaker and weaker;
(c) the proper scaling involves a symmetric $ (p,p)$ matrix $ \Xi$ in the proposal $ g((\theta-\theta^{\prime})^{\text{T}} \Xi (\theta-\theta^{\prime}))$. Even when the matrix $ \Xi$ is diagonal, it gets harder to scale as the dimension increases (unless one resorts to a Gibbs-like implementation, where each direction is scaled separately).
Note also that the on-line scaling of the algorithm against the empirical acceptance rate is inherently flawed in that the attraction of a modal region may give a false sense of convergence and lead to a choice of too small a scale, simply because other modes will not be visited during the scaling experiment.
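The scaling issue can be checked numerically on a one-dimensional bimodal toy target (an assumed example, not one from the chapter): a very small scale yields a high acceptance rate but never leaves the starting mode, while a larger scale accepts less often yet visits both modes.

```python
import math
import random

def rw_chain(log_post, theta0, scale, n_iter, seed):
    """Random-walk Metropolis chain; returns (samples, acceptance rate)."""
    rng = random.Random(seed)
    theta, accepted, out = theta0, 0, []
    for _ in range(n_iter):
        xi = theta + scale * rng.gauss(0.0, 1.0)
        if rng.random() < math.exp(min(0.0, log_post(xi) - log_post(theta))):
            theta, accepted = xi, accepted + 1
        out.append(theta)
    return out, accepted / n_iter

# Bimodal target: equal mixture of N(-5, 1) and N(5, 1).
def log_target(t):
    return math.log(0.5 * math.exp(-0.5 * (t + 5.0) ** 2)
                    + 0.5 * math.exp(-0.5 * (t - 5.0) ** 2))

small, acc_small = rw_chain(log_target, -5.0, 0.1, 20000, seed=2)
large, acc_large = rw_chain(log_target, -5.0, 5.0, 20000, seed=2)
mean_small = sum(small) / len(small)   # stuck near -5
mean_large = sum(large) / len(large)   # close to 0, both modes visited
```

The small-scale chain reports a near-perfect acceptance rate while sampling only one of the two modes, which is exactly the false sense of convergence described above.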


11.4.2 Gibbs Sampling and Latent Variable Models

The Gibbs sampler is a particularly attractive algorithm for Bayesian problems because it naturally fits the hierarchical structures so often found in such problems. ''Natural'' being a rather vague notion from a simulation point of view, it routinely happens that other algorithms fare better than the Gibbs sampler. Nonetheless, the Gibbs sampler is often worth a try (possibly with Metropolis-Hastings refinements at a later stage) in well-structured settings such as Bayesian hierarchical models and more general graphical models.

A particularly relevant illustration is provided by latent variable models, where the observational model is itself defined as a mixture model,

$\displaystyle f(x\vert\theta) = \int_{\mathcal{Z}} f(x\vert z,\theta)\,g(z\vert\theta)\,{\text{d}} z\;.$    

Such models were instrumental in promoting the Gibbs sampler in the sense that they make Gibbs sampling appear natural very easily. (See also Chap. II.3.) For instance, [43] wrote a precursor article to [14] that designed specific two-stage Gibbs samplers for a variety of latent variable models, and many of the first applications of Gibbs sampling in the early 1990s were actually for models of that kind. The usual implementation of the Gibbs sampler in this case is to simulate the missing variables $ Z$ conditional on the parameters and reciprocally, as follows:

Latent Variable Gibbs Algorithm

At iteration $ t$

  1. Generate $ z^{(t+1)} \sim g(z\vert\theta^{(t)})$
  2. Generate $ \theta^{(t+1)} \sim \pi(\theta\vert x,z^{(t+1)})$

While we could have used the probit case as an illustration (Example 11), as done in Chap. II.3, we pick the case of mixtures (Example 6) as a better-suited setting.

Example 15 (Continuation of Example 6)   The natural missing data structure of a mixture of distributions is historical. In one of the first mixtures ever studied, by Bertillon in 1863, the bimodal structure of the heights of conscripts in south-eastern France (Doubs) could be explained by the mixing of two populations of military conscripts, one from the plains and one from the mountains (or hills). Therefore, in the analysis of data from distributions of the form

$\displaystyle \sum_{i=1}^k p_i f(x\vert\theta_i)\;,$    

a common missing data representation is to associate with each observation $ x_j$ a missing multinomial variable $ z_j\sim \mathcal{M}_k(1;p_1,\ldots,p_k)$ such that $ x_j\vert z_j=i\sim f(x\vert\theta_i)$. In heterogeneous populations made of several homogeneous subgroups or subpopulations, it makes sense to interpret $ z_j$ as the index of the population of origin of $ x_j$, which has been lost in the observational process.

However, mixtures are also customarily used for density approximation, as a limited-dimension proxy to non-parametric approaches. In such cases, the components of the mixture and even the number $ k$ of components are often meaningless for the problem to be analysed. But this distinction between natural and artificial completion (by the $ z_j$'s) is lost to the MCMC sampler, whose goal is simply to provide a Markov chain that converges to the posterior as its stationary distribution. Completion is thus, from a simulation point of view, a means of generating such a chain.

The most standard Gibbs sampler for mixture models ([11]) is thus based on the successive simulation of the $ z_j$'s and of the $ \theta _i$'s, conditional on one another and on the data:

  1. Generate $ z_j\vert\theta,x_j$ $ (\kern.5pt j=1,\ldots,n)$
  2. Generate $ \theta_i\vert\boldsymbol{x},\boldsymbol{z}$ $ (i=1,\ldots,k)$
Given that the density $ f$ is most often from an exponential family, the simulation of the $ \theta _i$'s is generally straightforward.

As an illustration, consider the case of a normal mixture with two components, with equal known variance and fixed weights,

$\displaystyle p\,\mathcal{N}\left(\mu_1,\sigma^2\right) + (1-p)\,\mathcal{N}\left(\mu_2,\sigma^2\right) \;.$ (11.10)

Assume in addition a normal $ \mathcal{N}(0,10\sigma^2)$ prior on both means $ \mu _1$ and $ \mu _2$. It is easy to see that $ \mu _1$ and $ \mu _2$ are independent, given $ (\boldsymbol{z},\boldsymbol{x})$, and the respective conditional distributions are

$\displaystyle \mathcal{N} \left( \sum_{z_i=j} x_i / \left( 0.1 + n_j\right), \sigma^2 / \left( 0.1 + n_j\right) \right)\;,$    

where $ n_j$ denotes the number of $ z_i$'s equal to $ j$. Even more directly, the conditional posterior distribution of $ \boldsymbol{z}$ given $ (\mu_1,\mu_2)$ is a product of binomials, with

  $\displaystyle P(Z_i=1\vert x_i,\mu_1,\mu_2) = \frac{p \exp\left\{-(x_i-\mu_1)^2/2\sigma^2 \right\} }{ p \exp\left\{-(x_i-\mu_1)^2/2\sigma^2 \right\} + (1-p) \exp\left\{-(x_i-\mu_2)^2/2\sigma^2 \right\} }\;.$

Figure 11.9 illustrates the behavior of the Gibbs sampler in that setting, with a simulated dataset of $ 100$ points from the $ 0.7 \mathcal{N}(0,1) +0.3 \mathcal{N}(2.7,1)$ distribution. The representation of the MCMC sample after $ 5000$ iterations is quite in agreement with the posterior surface, represented via a grid on the $ (\mu_1,\mu_2)$ space and some contours. The sequence of consecutive steps represented on the left graph also shows that the mixing behavior is satisfactory, since the jumps are in scale with the modal region of the posterior.
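The two conditional distributions above translate directly into a two-step Gibbs sampler. The sketch below uses the same assumptions as the experiment just described (known $ p=0.7$, $ \sigma^2=1$, $ 100$ points simulated from $ 0.7\,\mathcal{N}(0,1)+0.3\,\mathcal{N}(2.7,1)$); the seed, starting point, and chain length are arbitrary choices of the sketch.

```python
import math
import random

rng = random.Random(3)
p, sigma2 = 0.7, 1.0

# Simulated sample of 100 points from 0.7 N(0,1) + 0.3 N(2.7,1)
# (data and seed are assumptions of this sketch).
data = [rng.gauss(0.0, 1.0) if rng.random() < 0.7 else rng.gauss(2.7, 1.0)
        for _ in range(100)]

mu = [1.0, 1.0]            # arbitrary starting point for (mu_1, mu_2)
trace = []
for t in range(3000):
    # 1. Allocations: P(Z_i = 1 | x_i, mu_1, mu_2) as in the formula above.
    z = []
    for x in data:
        w1 = p * math.exp(-0.5 * (x - mu[0]) ** 2 / sigma2)
        w2 = (1.0 - p) * math.exp(-0.5 * (x - mu[1]) ** 2 / sigma2)
        z.append(1 if rng.random() < w1 / (w1 + w2) else 2)
    # 2. Means: mu_j | z, x ~ N(sum_{z_i=j} x_i / (0.1+n_j), sigma^2/(0.1+n_j)),
    #    the 0.1 coming from the N(0, 10 sigma^2) prior precision.
    for j in (1, 2):
        nj = sum(1 for zi in z if zi == j)
        sj = sum(x for x, zi in zip(data, z) if zi == j)
        mu[j - 1] = rng.gauss(sj / (0.1 + nj), math.sqrt(sigma2 / (0.1 + nj)))
    if t >= 500:
        trace.append(tuple(mu))

post_mu1 = sum(m[0] for m in trace) / len(trace)
post_mu2 = sum(m[1] for m in trace) / len(trace)
```

The posterior averages of $ \mu_1$ and $ \mu_2$ recover the pair of component means, up to the labeling determined by the mode the chain settles in.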

This experiment gives a false sense of security, though, because it does not point out the fairly strong dependence of the Gibbs sampler on the initial conditions, already signaled in [11] under the name of trapping states. Indeed, the conditioning of $ (\mu_1,\mu_2)$ on $ \boldsymbol{z}$ implies that the new simulations of the means will remain very close to the previous values, especially if there are many observations, and thus that the new allocations $ \boldsymbol{z}$ will not differ much from the previous allocations. In other words, seeing a significant modification of the allocations (and thus of the means) would require a very large number of iterations. Figure 11.10 illustrates this phenomenon for the same sample as in Fig. 11.9, on a wider scale: there always exists a second mode in the posterior distribution, which is much lower than the first mode located around $ (0,2.7)$. Nonetheless, a Gibbs sampler initialized close to this second and lower mode will not be able to leave its vicinity, even after a large number of iterations. The reason is as given above: to jump to the other mode, a majority of $ z_j$'s would need to change simultaneously, and the probability of such a jump is too close to 0 for the event to occur.11.9

Figure 11.9: Gibbs sample of $ 5000$ points for the mixture posterior (left) and path of the last $ 100$ consecutive steps (right) against the posterior surface (Source: [37])
\includegraphics[width=11.7cm]{text/3-11/gib+mix2.eps}

Figure 11.10: Posterior surface and corresponding Gibbs sample for the two mean mixture model, when initialized close to the second and lower mode, based on $ 10{,}000$ iterations (Source: [37])
\includegraphics[width=11.5cm]{text/3-11/wrongmix.eps}

This example illustrates quite convincingly that, while the completion is natural from a model point of view (since it is part of the definition of the model), its utility does not necessarily transfer to the simulation of the posterior. Actually, when the missing variable model allows for a closed-form likelihood, as is the case for mixtures, probit models (Examples 11 and 14) and even hidden Markov models (see [8]), the whole range of MCMC technology can be used as well. The appeal of alternatives like random walk Metropolis-Hastings schemes is that they remain in a lower-dimensional space, since they avoid the completion step(s), and that they are not restricted in the range of their moves.11.10

Example 16 (Continuation of Example 15)   Given that the likelihood of a sample $ (x_1,\ldots,x_n)$ from the mixture distribution (11.10) can be computed in O$ (2n)$ time, a regular random walk Metropolis-Hastings algorithm can be used in this setup. Figure 11.11 shows how quickly this algorithm escapes the attraction of the poor mode, as opposed to the Gibbs sampler of Fig. 11.10: within a few iterations, the chain drifts over the poor mode and converges almost deterministically to the proper region of the posterior surface. The random walk is based on $ \mathcal{N}(\mu_i^{(t)},0.04)$ proposals; other scales would also work but would require more iterations to reach the proper modal regions. For instance, a scale of $ 0.005$ in the normal proposal above needs close to $ 5000$ iterations to attain the main mode.
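A sketch of this random walk alternative is given below, using the closed-form mixture likelihood and the scale $ 0.2$ of Fig. 11.11, with the chain deliberately started near the label-swapped configuration; the simulated data and seed are assumptions of the sketch.

```python
import math
import random

rng = random.Random(4)
p, sigma2 = 0.7, 1.0
data = [rng.gauss(0.0, 1.0) if rng.random() < 0.7 else rng.gauss(2.7, 1.0)
        for _ in range(100)]

def log_post(mu1, mu2):
    """Closed-form mixture log likelihood (O(2n)) plus N(0, 10 sigma^2) log priors."""
    ll = 0.0
    for x in data:
        ll += math.log(p * math.exp(-0.5 * (x - mu1) ** 2 / sigma2)
                       + (1.0 - p) * math.exp(-0.5 * (x - mu2) ** 2 / sigma2))
    return ll - mu1 ** 2 / (20.0 * sigma2) - mu2 ** 2 / (20.0 * sigma2)

# Start close to the lower (label-swapped) mode, as in Fig. 11.10.
mu = [2.7, 0.0]
cur = log_post(*mu)
trace = []
for t in range(10000):
    prop = [m + 0.2 * rng.gauss(0.0, 1.0) for m in mu]   # N(mu_i^(t), 0.04) walk
    lp = log_post(*prop)
    if rng.random() < math.exp(min(0.0, lp - cur)):
        mu, cur = prop, lp
    if t >= 8000:
        trace.append(tuple(mu))
```

Because both means are updated jointly against the full likelihood, the chain is not constrained by frozen allocations and is free to drift across the surface, in contrast with the Gibbs sampler above.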

Figure 11.11: Track of a $ 1000$ iteration random walk Metropolis-Hastings sample on the posterior surface; the starting point is indicated by a cross. (The scale of the random walk is $ 0.2$.)
\includegraphics[width=11.1cm]{text/3-11/rwMix.eps}

The secret of a successful MCMC implementation in such latent variable models is to maintain the distinction between latency in models and latency in simulation (the latter often being called the use of auxiliary variables). When latent variables can be used with adequate mixing of the resulting chain and the likelihood cannot be computed in closed form (as in hidden semi-Markov models, [6]), the Gibbs sampler remains a simple solution that is often easy to implement. Adding well-mixing random walk Metropolis-Hastings steps to the simulation scheme cannot hurt the overall mixing of the chain ([37], Chap. 13), especially when several scales can be used at once (see Sect. 11.5). A final word is that the completion can be carried out in infinitely many ways and that several of these should be tried or used in parallel to increase the chances of success.


11.4.3 Reversible Jump Algorithms for Variable Dimension Models

As described in Sect. 11.2.3, model choice is computationally different from testing in that it considers at once a (much) wider range of models $ \mathfrak{M}_i$ and parameter spaces $ \Theta_i$. Although early approaches could only go through a pedestrian pairwise comparison, a more adequate perspective is to envision the model index $ i$ as part of the parameter to be estimated, as in (11.3). The (computational) difficulty is that we are then dealing with a possibly infinite collection of unrelated parameter spaces: how can we simulate from the corresponding distribution?11.11

The MCMC solution proposed by [20] is called reversible jump MCMC, because it is based on a reversibility constraint on the transitions between the sets $ \Theta_i$. In fact, the only real difficulty compared with previous developments is to validate moves (or jumps) between the $ \Theta_i$'s, since proposals restricted to a given $ \Theta_i$ follow from the usual (fixed-dimensional) theory. Furthermore, reversibility can be imposed at a local level: since the model indicator $ \mu$ is an integer-valued random variable, we can impose reversibility for each pair $ (k_1,k_2)$ of possible values of $ \mu$. The idea at the core of reversible jump MCMC is then to supplement each of the spaces $ \Theta_{k_1}$ and $ \Theta_{k_2}$ with adequate artificial spaces in order to create a bijection between them. For instance, if $ \dim(\Theta_{k_1})>\dim(\Theta_{k_2})$ and if the move from $ {\Theta}_{k_1}$ to $ {\Theta}_{k_2}$ can be represented by a deterministic transformation of $ \theta^{(k_1)}$,

$\displaystyle \theta^{(k_2)}=T_{k_1\rightarrow k_2}\left(\theta^{(k_1)}\right)\;,$    

[20] imposes a dimension-matching condition: the opposite move from $ {\Theta}_{k_2}$ to $ {\Theta}_{k_1}$ must be concentrated on the curve

$\displaystyle \left\{ \theta^{(k_1)}\,:\,\theta^{(k_2)} = T_{k_1\rightarrow k_2}(\theta^{(k_1)}) \right\} \;.$    

In the general case, if $ \theta^{(k_1)}$ is completed by a simulation $ u_1\sim g_1(u_1)$ into $ (\theta^{(k_1)}, u_1)$ and $ \theta^{(k_2)}$ by $ u_2\sim g_2(u_2)$ into $ (\theta^{(k_2)}, u_2)$ so that the mapping between $ (\theta^{(k_1)}, u_1)$ and $ (\theta^{(k_2)}, u_2)$ is a bijection,

$\displaystyle \left(\theta^{(k_2)},u_2\right) = T_{k_1\rightarrow k_2} \left(\theta^{(k_1)},u_1\right)\;,$ (11.11)

the probability of acceptance for the move from model $ \mathfrak{M}_{k_1}$ to model $ \mathfrak{M}_{k_2}$ is then

$\displaystyle \min\left( \frac{\pi\left(k_2,\theta^{(k_2)}\right)\,g_2\left(u_2\right) }{ \pi\left(k_1,\theta^{(k_1)}\right)\,g_1\left(u_1\right)} \left\vert \frac{\partial \left(\theta^{(k_2)},u_2\right)}{\partial \left(\theta^{(k_1)},u_1\right)} \right\vert, 1 \right)\;,$

involving the Jacobian of the transform $ T_{k_1\rightarrow k_2}$. The acceptance probability for the reverse move is based on the inverse ratio when the move from $ \mathfrak{M}_{k_2}$ to $ \mathfrak{M}_{k_1}$ also satisfies (11.11) with $ u_2\sim g_2(u_2)$.11.12

The pseudo-code representation of Green's algorithm is thus as follows:

Green's Algorithm

At iteration $ t$, if $ x^{(t)}=(m,\theta^{(m)})$,

  1. Select model $ \mathfrak{M}_n$ with probability $ \pi_{mn}$
  2. Generate $ u_{mn}\sim\varphi_{mn}(u)$
  3. Set $ (\theta^{(n)},v_{nm})=T_{m\rightarrow n}(\theta^{(m)},u_{mn})$
  4. Take $ x^{(t+1)} =( n,\theta^{(n)})$ with probability

    $\displaystyle \min\left( \frac{\pi\left(n,\theta^{(n)}\right) }{ \pi\left(m,\theta^{(m)}\right)}\, \frac{\pi_{nm}\,\varphi_{nm}\left(v_{nm}\right) }{ \pi_{mn}\,\varphi_{mn}\left(u_{mn}\right)}\, \left\vert \frac{\partial \left(\theta^{(n)},v_{nm}\right) }{ \partial \left(\theta^{(m)},u_{mn}\right)} \right\vert, 1 \right)\;,$

    and take $ x^{(t+1)} =x^{(t)}$ otherwise.
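Green's algorithm can be exercised on a deliberately artificial two-model target (an assumption of this sketch, not an example from the chapter): model 1 carries one parameter and probability $ 0.7$, model 2 two parameters and probability $ 0.3$, each with standard normal components. The jump $ 1\rightarrow 2$ completes $ \theta$ with $ u\sim g=\mathcal{N}(0,1)$ through the identity transform, so the Jacobian equals $ 1$ and the acceptance ratio reduces to the weight ratio.

```python
import math
import random

rng = random.Random(5)

def log_norm(t):
    """log N(t; 0, 1)."""
    return -0.5 * t * t - 0.5 * math.log(2.0 * math.pi)

# Toy target on a variable-dimension space:
#   pi(1, theta)          = 0.7 * N(theta; 0,1)
#   pi(2, theta1, theta2) = 0.3 * N(theta1; 0,1) * N(theta2; 0,1)
model, theta = 1, [0.0]
visits_model1 = 0
n_iter = 50000
for _ in range(n_iter):
    if rng.random() < 0.5:                     # between-model (jump) move
        if model == 1:
            u = rng.gauss(0.0, 1.0)            # completion u ~ g
            # alpha = pi(2, theta, u) / (pi(1, theta) g(u)) = 0.3/0.7
            if rng.random() < 0.3 / 0.7:
                model, theta = 2, [theta[0], u]
        else:
            # reverse move: alpha = pi(1,.) g(theta_2) / pi(2,.) = 0.7/0.3 >= 1
            model, theta = 1, [theta[0]]
    else:                                      # within-model random walk step
        i = rng.randrange(len(theta))
        prop = theta[i] + 0.5 * rng.gauss(0.0, 1.0)
        if rng.random() < math.exp(min(0.0, log_norm(prop) - log_norm(theta[i]))):
            theta[i] = prop
    if model == 1:
        visits_model1 += 1

freq_model1 = visits_model1 / n_iter
```

At stationarity the chain visits model 1 about $ 70\,\%$ of the time, which is exactly the marginal model probability of the target; detailed balance holds because the flows $ 0.7\times 0.5\times 3/7$ and $ 0.3\times 0.5\times 1$ coincide.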

As for the previous methods, the implementation of this algorithm requires a certain skill in picking the right proposals and the appropriate scales. This art of reversible jump MCMC is illustrated in the following two examples, extracted from [37] (Sect. 14.2.3).

Example 17 (Continuation of Example 6)   If we consider for model $ \mathfrak{M}_k$ the $ k$ component normal mixture distribution,

$\displaystyle \sum_{j=1}^k p_{jk} {\cal N}\left(\mu_{jk},\sigma_{jk}^2\right)\;,$    

moves between models involve changing the number of components in the mixture, thus adding new components, removing existing components, or changing several components at once. As in [34], we can restrict the moves from model $ \mathfrak{M}_k$ to models $ \mathfrak{M}_{k+1}$ and $ \mathfrak{M}_{k-1}$ only. The simplest solution is to use a birth-and-death process: the birth step consists in adding a new normal component generated from the prior, and the death step is the opposite, removing one of the $ k$ components at random. In this case, the corresponding birth acceptance probability is

  $\displaystyle \min\left( \frac{ \pi_{(k+1)k}}{\pi_{k(k+1)} }\,\frac{(k+1)!}{k!}\, \frac{\pi_{k+1}(\theta_{k+1})}{\pi_{k}(\theta_{k})\,(k+1)\,\varphi_{k(k+1)}(u_{k(k+1)})}\, ,1 \right)$
  $\displaystyle \quad = \min\left( \frac{ \pi_{(k+1)k}}{\pi_{k(k+1)} }\, \frac{\varrho(k+1)}{\varrho(k)}\, \frac{\ell_{k+1}(\theta_{k+1})\,(1-p_{k+1})^{k-1}}{\ell_{k}(\theta_{k})} ,1 \right)\;,$

where $ \ell_k$ denotes the likelihood of the $ k$ component mixture model $ \mathfrak{M}_k$ and $ \varrho(k)$ is the prior probability of model $ \mathfrak{M}_k$.11.13

While this proposal can work well in some settings, as in [34] when the prior is calibrated against the data, it can also be inefficient, that is, lead to a high rejection rate, if the prior is vague, since the birth proposals are then not tuned properly. A second proposal, central to the solution of [34], is to devise more local jumps between models, called split and combine moves: a new component is created by splitting an existing component into two, under some moment preservation conditions, and the reverse move consists in combining two existing components into one, with symmetric constraints that ensure reversibility. (See, e.g., [37], for details.)

Figures 11.12-11.14 illustrate the implementation of this algorithm for the so-called Galaxy dataset used by [34] (see also [38]), which contains $ 82$ observations on the speed of galaxies. On Fig. 11.12, the MCMC output on the number of components $ k$ is represented as a histogram on $ k$, and the corresponding sequence of $ k$'s. The prior used on $ k$ is a uniform distribution on $ \{1,\ldots,20\}$: as shown by the lower plot, most values of $ k$ are explored by the reversible jump algorithm, but the upper bound does not appear to be restrictive since the $ k^{(t)}$'s hardly ever reach this upper limit. Figure 11.13 illustrates the fact that conditioning the output on the most likely value of $ k$ ($ 3$ here) is possible. The nine graphs in this figure show the joint variation of the three types of parameters, as well as the stability of the Markov chain over the $ 1{,}000{,}000$ iterations: the cumulated averages are quite stable, almost from the start.

The density plotted on top of the histogram in Fig. 11.14 is another good illustration of the inferential possibilities offered by reversible jump algorithms, as a case of model averaging: this density is obtained as the average over iterations $ t$ of

$\displaystyle \sum_{j=1}^{k^{(t)}} p^{(t)}_{jk} \mathcal{N}\left(\mu^{(t)}_{jk},\left(\sigma^{(t)}_{jk}\right)^2\right)\;,$    

which approximates the posterior expectation $ \mathbb{E}[f(y\vert\theta )\vert\boldsymbol{x}]$, where $ \boldsymbol {x}$ denotes the data $ x_1,\ldots,x_{82}$.

Figure 11.12: Histogram and raw plot of $ 100{,}000$ $ k$'s produced by a reversible jump MCMC algorithm for the Galaxy dataset
\includegraphics[width=11.7cm]{text/3-11/galakhist.eps}

Figure 11.13: Reversible jump MCMC output on the parameters of the model $ \mathcal {M}_3$ for the Galaxy dataset, obtained by conditioning on $ k=3$. The left column gives the histogram of the weights, means, and variances; the middle column the scatterplot of the pairs weights-means, means-variances, and variances-weights; the right column plots the cumulated averages (over iterations) for the weights, means, and variances
\includegraphics[width=11.7cm]{text/3-11/galtetahis.eps}

Figure 11.14: Fit of the dataset by the averaged density, $ \mathbb{E}[f(y\vert\theta )\vert\boldsymbol{x}]$
\includegraphics[width=11cm]{text/3-11/galaxfit.eps}

Example 18 (Continuation of Example 3)   For the $ AR(p)$ model of Example 3, the best way to include the stationarity constraints is to use the lag-polynomial representation

$\displaystyle \prod_{i=1}^p \left( 1 - \lambda_i B\right) \,X_t = \epsilon_t \,, \quad \epsilon_t\sim{\mathcal N}\left(0,\sigma^2\right)\;,$    

of model $ \mathfrak{M}_p$, and to constrain the inverse roots, $ \lambda_i$, to stay within the unit circle if complex and within $ [-1,1]$ if real (see, e.g., [35], Sect. 4.5.2). The associated uniform prior for the real and complex roots $ \lambda_j$ is

$\displaystyle \pi_p(\boldsymbol{\lambda}) = \frac{1}{\left\lfloor p/2 \right\rfloor+1} \prod_{\lambda_i\in{\mathbb{R}}} \frac{1}{2}\,{\mathbb{I}}_{\vert\lambda_i\vert<1} \prod_{\lambda_i\not\in{\mathbb{R}}} \frac{1}{\pi}\,{\mathbb{I}}_{\vert\lambda_i\vert<1}\;,$

where $ \left\lfloor p/2 \right\rfloor+1$ is the number of possible values of $ r_p$, the number of real roots. This factor must be included within the posterior distribution when using reversible jump, since it does not vanish in the acceptance probability of a move between models $ \mathfrak{M}_p$ and $ \mathfrak{M}_q$; omitting it would amount to modifying the prior probability of each model.

Once again, a simple choice is to use a birth-and-death scheme, where a birth move creates either a new real root or two conjugate complex roots. As in the birth-and-death proposal of Example 17, the acceptance probability simplifies quite dramatically, being for instance

$\displaystyle \min\left( \frac{\pi_{(p+1)p}}{\pi_{p(p+1)}}\, \frac{(r_{p}+1)!}{r_{p}!}\, \frac{\ell_{p+1}\left(\theta_{p+1}\right)}{\ell_{p}\left(\theta_{p}\right)} ,1 \right)$

in the case of a move from $ \mathfrak{M}_p$ to $ \mathfrak{M}_{p+1}$. (As for the above mixture example, the factorials are related to the possible choices of the created and the deleted roots.)

Figure 11.15 presents some views of the corresponding reversible jump MCMC algorithm. Besides the ability of the algorithm to explore a range of values of $ k$, it also shows that Bayesian inference using these tools is much richer, since it can, for instance, condition on or average over the order $ k$, mix the parameters of different models and run various tests on these parameters. A last remark on this graph is that both the order and the value of the parameters are well estimated, with a characteristic trimodality on the histograms of the $ \theta _i$'s, even when conditioning on $ k$ different from $ 3$, the value used for the simulation.

Figure 11.15: Output of a reversible jump algorithm based on an $ AR(3)$ simulated dataset of $ 530$ points (upper left) with true parameters $ \theta _i$ $ (-0.1,0.3,-0.4)$ and $ \sigma =1$. The first histogram is associated with $ k$; the following histograms are associated with the $ \theta _i$'s, for different values of $ k$, and with $ \sigma ^2$. The penultimate graph plots the evolution over the iterations of $ \theta _1,\theta _2,\theta _3$. The final graph is a scatterplot of the complex roots (for iterations where there were complex roots) (Source: Robert 2003)
\includegraphics[width=11.7cm]{text/3-11/plain.eps}

