11.4 Markov Chain Monte Carlo Methods

As described precisely in Chap. III.3 and in [37], MCMC methods try to overcome the limitation of regular Monte Carlo methods by mean of a Markov chain with stationary distribution the posterior distribution. There exist rather generic ways of producing such chains, including Metropolis-Hastings and Gibbs algorithms. Besides the fact that stationarity of the target distribution is enough to justify a simulation method by Markov chain generation, the idea at the core of MCMC algorithms is that local exploration, when properly weighted, can lead to a valid representation of the distribution of interest, as for instance, the Metropolis-Hastings algorithm.

11.4.1 Metropolis-Hastings as Universal Simulator

The Metropolis-Hastings, presented in [37] and Chap. II.3, offers a straightforward solution to the problem of simulating from the posterior distribution $\pi(\theta\vert x)\propto f(x\vert\theta)\,\pi(\theta)$ : starting from an arbitrary point $\theta_0$ , the corresponding Markov chain explores the surface of this posterior distribution by a random walk proposal $q(\theta\vert\theta^{\prime})$ that progressively visits the whole range of the possible values of $\theta$ .

Example 14 (Continuation of Example 11) In the case , the probit model defined in Example 11 can also be over-parameterised as

$\displaystyle P(Y_i=1\vert x_i) = 1 - P(Y_i=0\vert x_i) = \Phi\left(x_i\beta/\sigma\right)\;,$

since it only depends on $\beta /\sigma$ . The Bayesian processing of non-identified models poses no serious difficulty as long as the posterior distribution is well defined. This is the case for a proper prior like

$\displaystyle \pi\left(\beta,\sigma^2\right) \propto \sigma^{-4}\,\exp\left\{-1/\sigma^2\right\}\,\exp\left\{-\beta^2/50\right\}$

that corresponds to a normal distribution on $\beta$ and a gamma distribution on $\sigma^{-2}$ . While the posterior distribution on $(\beta,\sigma)$ is not a standard distribution, it is available up to a normalising constant. Therefore, it can be directly processed via an MCMC algorithm. In this case, we chose a Gibbs sampler that simulates $\beta$ and $\sigma ^2$ alternatively, from

$\displaystyle \pi(\beta\vert\boldsymbol{x},\boldsymbol{y},\sigma) \propto \prod... .../\sigma\right) \prod_{y_i=0} \Phi\left(-x_i\beta/\sigma\right) \times\pi(\beta)$

and

$\displaystyle \pi\left(\sigma^2\vert\boldsymbol{x},\boldsymbol{y},\beta\right) ... ... \prod_{y_i=0} \Phi\left(-x_i\beta/\sigma\right) \times\pi\left(\sigma^2\right)$

respectively. Since both of these conditional distributions are also non-standard, we replace the direct simulation by a one-dimensional Metropolis-Hastings step, using normal $\mathcal{N}(\beta^{(t)},1)$ and log-normal $\mathcal{L}\mathcal{N}(\log\sigma^{(t)},0.04)$ random walk proposals, respectively. For a simulated dataset of points, the contour plot of the log-posterior distribution is given in Fig. 11.7, along with the last points of a corresponding MCMC sample after iterations. This graph shows a very satisfactory repartition of the simulated parameters over the likelihood surface, with higher concentrations near the largest posterior regions. For another simulation, Fig. 11.8 details the first steps, when started at $(\beta ,\sigma ^2)=(0.1,4.0)$ . Although each step contains both a $\beta$ and a $\sigma$ proposal, some moves are either horizontal or vertical: this corresponds to cases when either the $\beta$ or the $\sigma$ proposals have been rejected. Note also the fairly rapid convergence to a modal zone of the posterior distribution in this case.

Obviously, this is only a toy example and more realistic probit models do not fare so well with down-the-shelf random walk Metropolis-Hastings algorithms, as shown for instance in [32] (see also [37], Sect. 10.3.2).^11.8

**Figure 11.7:** Contour plot of the log-posterior distribution for a probit sample of observations, along with points of an MCMC sample (Source: [37])
$\includegraphics[width=11.4cm]{text/3-11/probit1.eps}$

**Figure 11.8:** First steps of the Metropolis-Hastings algorithm on the probit log-posterior surface, when started at $(\beta ,\sigma ^2)=(0.1,4.0)$
$\includegraphics[width=8.3cm]{text/3-11/probste.eps}$

The difficulty inherent to random walk Metropolis-Hastings algorithms is the scaling of the proposal distribution: it must be adapted to the shape of the target distribution so that, in a reasonable number of steps, the whole support of this distribution can be visited. If the scale of the proposal is too small, this will not happen as the algorithm stays ''too local'' and, if there are several modes on the posterior, the algorithm may get trapped within one modal region because it cannot reach other modal regions with jumps of too small magnitude. The larger the dimension

is, the harder it is to set up the right scale, though, because

11.4.2 Gibbs Sampling and Latent Variable Models

The Gibbs sampler is a definitely attractive algorithm for Bayesian problems because it naturally fits the hierarchical structures so often found in such problems. ''Natural'' being a rather vague notion from a simulation point of view, it routinely happens that other algorithms fare better than the Gibbs sampler. Nonetheless, Gibbs sampler is often worth a try (possibly with other Metropolis-Hastings refinements at a later stage) in well-structured objects like Bayesian hierarchical models and more general graphical models.

A very relevant illustration is made of latent variable models, where the observational model is itself defined as a mixture model,

While we could have used the probit case as an illustration (Example 11), as done in Chap. II.3, we choose to pick the case of mixtures (Example 6) as a better setting.

Example 15 (Continuation of Example 6) The natural missing data structure of a mixture of distribution is historical. In one of the first mixtures to be ever studied by Bertillon, in 1863, a bimodal structure on the height of conscripts in south eastern France (Doubs) can be explained by the mixing of two populations of military conscripts, one from the plains and one from the mountains (or hills). Therefore, in the analysis of data from distributions of the form

$\displaystyle \sum_{i=1}^k p_i f(x\vert\theta_i)\;,$

a common missing data representation is to associate with each observation a missing multinomial variable $z_j\sim \mathcal{M}_k(1;p_1,\ldots,p_k)$ such that $x_j\vert z_j=i\sim f(x\vert\theta_i)$ . In heterogeneous populations made of several homogeneous subgroups or subpopulations, it makes sense to interpret as the index of the population of origin of , which has been lost in the observational process.

However, mixtures are also customarily used for density approximations, as a limited dimension proxy to non-parametric approaches. In such cases, the components of the mixture and even the number of components in the mixture are often meaningless for the problem to be analysed. But this distinction between natural and artificial completion (by the 's) is lost to the MCMC sampler, whose goal is simply to provide a Markov chain that converges to the posterior as stationary distribution. Completion is thus, from a simulation point of view, a mean to generate such a chain.

The most standard Gibbs sampler for mixture models ([11]) is thus based on the successive simulation of the 's and of the $\theta _i$ 's, conditional on one another and on the data:

Generate $z_j\vert\theta,x_j$ $(\kern.5pt j=1,\ldots,n)$
Generate $\theta_i\vert\boldsymbol{x},\boldsymbol{z}$ $(i=1,\ldots,k)$

Given that the density is most often from an exponential family, the simulation of the $\theta _i$ 's is generally straightforward.

As an illustration, consider the case of a normal mixture with two components, with equal known variance and fixed weights,

$\displaystyle p\,\mathcal{N}\left(\mu_1,\sigma^2\right) + (1-p)\,\mathcal{N}\left(\mu_2,\sigma^2\right) \;.$

(11.10)

Assume in addition a normal $\mathcal{N}(0,10\sigma^2)$ prior on both means $\mu _1$ and $\mu _2$ . It is easy to see that $\mu _1$ and $\mu _2$ are independent, given $(\boldsymbol{z},\boldsymbol{x})$ , and the respective conditional distributions are

$\displaystyle \mathcal{N} \left( \sum_{z_i=j} x_i / \left( 0.1 + n_j\right), \sigma^2 / \left( 0.1 + n_j\right) \right)\;,$

where denotes the number of 's equal to . Even more easily, it comes that the conditional posterior distribution of $\boldsymbol{z}$ given $(\mu_1,\mu_2)$ is a product of binomials, with

	$\displaystyle P(Z_i=1\vert x_i,\mu_1,\mu_2)$
	$\displaystyle = \frac{p \exp\left\{-(x_i-\mu_1)^2/2\sigma^2 \right\} }{ p \exp\... ...^2/2\sigma^2 \right\} + (1-p) \exp\left\{-(x_i-\mu_2)^2/2\sigma^2 \right\} }\;.$

Figure 11.9 illustrates the behavior of the Gibbs sampler in that setting, with a simulated dataset of points from the $0.7 \mathcal{N}(0,1) +0.3 \mathcal{N}(2.7,1)$ distribution. The representation of the MCMC sample after iterations is quite in agreement with the posterior surface, represented via a grid on the $(\mu_1,\mu_2)$ space and some contours. The sequence of consecutive steps represented on the left graph also shows that the mixing behavior is satisfactory, since the jumps are in scale with the modal region of the posterior.

This experiment gives a wrong sense of safety, though, because it does not point out the fairly large dependence of the Gibbs sampler to the initial conditions, already signaled in [11] under the name of trapping states. Indeed, the conditioning of $(\mu_1,\mu_2)$ on $\boldsymbol{z}$ implies that the new simulations of the means will remain very close to the previous values, especially if there are many observations, and thus that the new allocations $\boldsymbol{z}$ will not differ much from the previous allocations. In other words, to see a significant modification of the allocations (and thus of the means) would require a very very large number of iterations. Figure 11.10 illustrates this phenomenon for the same sample as in Fig. 11.9, for a wider scale: there always exists a second mode in the posterior distribution, which is much lower than the first mode located around . Nonetheless, a Gibbs sampler initialized close to the second and lower mode will not be able to leave the vicinity of this (irrelevant) mode, even after a large number of iterations. The reason is as given above: to jump to the other mode, a majority of 's would need to change simultaneously and the probability of such a jump is too close to 0 to let the event occur.^11.9

**Figure 11.9:** Gibbs sample of points for the mixture posterior (*left*) and path of the last consecutive steps (*right*) against the posterior surface (Source: [37])
$\includegraphics[width=11.7cm]{text/3-11/gib+mix2.eps}$

**Figure 11.10:** Posterior surface and corresponding Gibbs sample for the two mean mixture model, when initialized close to the second and lower mode, based on iterations (Source: [37])
$\includegraphics[width=11.5cm]{text/3-11/wrongmix.eps}$

This example illustrates quite convincingly that, while the completion is natural from a model point of view (since it is a part of the definition of the model), it does not necessarily transfer its utility for the simulation of the posterior. Actually, when the missing variable model allows for a closed form likelihood, as is the case for mixtures, probit models (Examples 11 and 14) and even hidden Markov models (see [8]), the whole range of the MCMC technology can be used as well. The appeal of alternatives like random walk Metropolis-Hastings schemes is that they remain in a smaller dimension space, since they avoid the completion step(s), and that they are not restricted in the range of their moves.^11.10

**Figure 11.11:** Track of a iteration random walk Metropolis-Hastings sample on the posterior surface, the starting point is indicated by a cross. (The scale of the random walk is .)
$\includegraphics[width=11.1cm]{text/3-11/rwMix.eps}$

The secret of a successful MCMC implementation in such latent variable models is to maintain the distinction between latency in models and latency in simulation (the later being often called use of auxiliary variables). When latent variables can be used with adequate mixing of the resulting chain and when the likelihood cannot be computed in a closed form (as in hidden semi-Markov models, [6]), a Gibbs sampler is a still simple solution that is often easy to simulate from. Adding well-mixing random walk Metropolis-Hastings steps in the simulation scheme cannot hurt the overall mixing of the chain ([37], Chap. 13), especially when several scales can be used at once (see Sect. 11.5). A final word is that the completion can be led in an infinity of ways and that several of these should be tried or used in parallel to increase the chances of success.

11.4.3 Reversible Jump Algorithms for Variable Dimension Models

As described in Sect. 11.2.3, model choice is computationally different from testing in that it considers at once a (much) wider range of models $\mathfrak{M}_i$ and parameter spaces $\Theta_i$ . Although early approaches could only go through a pedestrian pairwise comparison, a more adequate perspective is to envision the model index

as part of the parameter to be estimated, as in (11.3). The (computational) difficulty is that we are then dealing with a possibly infinite space that is the collection of unrelated sets: how can we then simulate from the corresponding distribution?^11.11

The MCMC solution proposed by [20] is called reversible jump MCMC, because it is based on a reversibility constraint on the transitions between the sets $\Theta_i$ . In fact, the only real difficulty compared with previous developments is to validate moves (or jumps) between the $\Theta_i$ 's, since proposals restricted to a given $\Theta_i$ follow from the usual (fixed-dimensional) theory. Furthermore, reversibility can be processed at a local level: since the model indicator $\mu$ is a integer-valued random variable, we can impose reversibility for each pair

of possible values of $\mu$ . The idea at the core of reversible jump MCMC is then to supplement each of the spaces $\Theta_{k_1}$ and $\Theta_{k_2}$ with adequate artificial spaces in order to create a bijection between them. For instance, if $\dim(\Theta_{k_1})>\dim(\Theta_{k_2})$ and if the move from ${\Theta}_{k_1}$ to ${\Theta}_{k_2}$ can be represented by a deterministic transformation of $\theta^{(k_1)}$

As for previous methods, the implementation of this algorithm requires a certain skillfulness in picking the right proposals and the appropriate scales. This art of reversible jump MCMC is illustrated on the two following examples, extracted from Robert and Casella (2004, Sect. 14.2.3).

Example 17 (Continuation of Example 6) If we consider for model $\mathfrak{M}_k$ the component normal mixture distribution,

$\displaystyle \sum_{j=1}^k p_{jk} {\cal N}\left(\mu_{jk},\sigma_{jk}^2\right)\;,$

moves between models involve changing the number of components in the mixture and thus adding new components or removing older components or yet again changing several components. As in [34], we can restrict the moves when in model $\mathfrak{M}_k$ to only models $\mathfrak{M}_{k+1}$ and $\mathfrak{M}_{k-1}$ . The simplest solution is to use a birth-and-death process: The birth step consists in adding a new normal component in the mixture generated from the prior and the death step is the opposite, removing one of the components at random. In this case, the corresponding birth acceptance probability is

	$\displaystyle \min\left( \frac{ \pi_{(k+1)k}}{\pi_{k(k+1)} }\,\frac{(k+1)!}{k!}... ...ta_{k+1})}{\pi_{k}(\theta_{k})\,(k+1)\varphi_{k(k+1)}(u_{k(k+1)})}\, ,1 \right)$
	$\displaystyle \quad = \min\left( \frac{ \pi_{(k+1)k}}{\pi_{k(k+1)} }\, \frac{\v... ...ell_{k+1}(\theta_{k+1})\,(1-p_{k+1})^{k-1}}{\ell_{k}(\theta_{k})} ,1 \right)\;,$

where $\ell_k$ denotes the likelihood of the component mixture model $\mathfrak{M}_k$ and $\varrho(k)$ is the prior probability of model $\mathfrak{M}_k$ .^11.13

While this proposal can work well in some setting, as in Richardson and Green (1997) when the prior is calibrated against the data, it can also be inefficient, that is, leading to a high rejection rate, if the prior is vague, since the birth proposals are not tuned properly. A second proposal, central to the solution of [34], is to devise more local jumps between models, called split and combine moves, since a new component is created by splitting an existing component into two, under some moment preservation conditions, and the reverse move consists in combining two existing components into one, with symmetric constraints that ensure reversibility. (See, e.g., [37], for details.)

Figures 11.12-11.14 illustrate the implementation of this algorithm for the so-called Galaxy dataset used by [34] (see also [38]), which contains observations on the speed of galaxies. On Fig. 11.12, the MCMC output on the number of components is represented as a histogram on , and the corresponding sequence of 's. The prior used on is a uniform distribution on $\{1,\ldots,20\}$ : as shown by the lower plot, most values of are explored by the reversible jump algorithm, but the upper bound does not appear to be restrictive since the $k^{(t)}$ 's hardly ever reach this upper limit. Figure 11.13 illustrates the fact that conditioning the output on the most likely value of ( here) is possible. The nine graphs in this figure show the joint variation of the three types of parameters, as well as the stability of the Markov chain over the iterations: the cumulated averages are quite stable, almost from the start.

The density plotted on top of the histogram in Fig. 11.14 is another good illustration of the inferential possibilities offered by reversible jump algorithms, as a case of model averaging: this density is obtained as the average over iterations of

$\displaystyle \sum_{j=1}^{k^{(t)}} p^{(t)}_{jk} \mathcal{N}\left(\mu^{(t)}_{jk},\left(\sigma^{(t)}_{jk}\right)^2\right)\;,$

which approximates the posterior expectation $\mathbb{E}[f(y\vert\theta )\vert\boldsymbol{x}]$ , where $\boldsymbol {x}$ denotes the data $x_1,\ldots,x_{82}$ .

**Figure 11.12:** Histogram and raw plot of 's produced by a reversible jump MCMC algorithm for the Galaxy dataset
$\includegraphics[width=11.7cm]{text/3-11/galakhist.eps}$

**Figure 11.13:** Reversible jump MCMC output on the parameters of the model $\mathcal {M}_3$ for the Galaxy dataset, obtained by conditioning on . The *left column* gives the histogram of the weights, means, and variances; the *middle column* the scatterplot of the pairs weights-means, means-variances, and variances-weights; the *right column* plots the cumulated averages (over iterations) for the weights, means, and variances
$\includegraphics[width=11.7cm]{text/3-11/galtetahis.eps}$

**Figure 11.14:** Fit of the dataset by the averaged density, $\mathbb{E}[f(y\vert\theta )\vert\boldsymbol{x}]$
$\includegraphics[width=11cm]{text/3-11/galaxfit.eps}$

Example 18 (Continuation of Example 3) For the model of Example 3, the best way to include the stationarity constraints is to use the lag-polynomial representation

$\displaystyle \prod_{i=1}^p \left( 1 - \lambda_i B\right) \,X_t = \epsilon_t \,, \quad \epsilon_t\sim{\mathcal N}\left(0,\sigma^2\right)\;,$

of model $\mathfrak{M}_p$ , and to constrain the inverse roots, $\lambda_i$ , to stay within the unit circle if complex and within if real (see, e.g. [35], Sect. 4.5.2). The associated uniform priors for the real and complex roots $\lambda_j$ is

$\displaystyle \pi_p(\boldsymbol{\lambda}) = \frac{1}{\left\lfloor p/2 \right\rf... ...mbda_i\not\in{\mathbb{R}}} \frac{1}{\pi}{\mathbb{I}}_{\vert\lambda_i\vert<1}\;,$

where $\left\lfloor p/2 \right\rfloor+1$ is the number of different values of . This factor must be included within the posterior distribution when using reversible jump since it does not vanish in the acceptance probability of a move between models $\mathfrak{M}_p$ and $\mathfrak{M}_q$ . Otherwise, this results in a modification of the prior probability of each model.

Once again, a simple choice is to use a birth-and-death scheme where the birth moves either create a real or two conjugate complex roots. As in the birth-and-death proposal for Example 17, the acceptance probability simplifies quite dramatically since it is for instance

$\displaystyle \min\left( \frac{\pi_{(p+1)p}}{\pi_{p(p+1)}}\, \frac{(r_{p}+1)!}{... ...ell_{p+1}\left(\theta_{p+1}\right)}{\ell_{p}\left(\theta_{p}\right)} ,1 \right)$

in the case of a move from $\mathfrak{M}_p$ to $\mathfrak{M}_{p+1}$ . (As for the above mixture example, the factorials are related to the possible choices of the created and the deleted roots.)

Figure 11.15 presents some views of the corresponding reversible jump MCMC algorithm. Besides the ability of the algorithm to explore a range of values of , it also shows that Bayesian inference using these tools is much richer, since it can, for instance, condition on or average over the order , mix the parameters of different models and run various tests on these parameters. A last remark on this graph is that both the order and the value of the parameters are well estimated, with a characteristic trimodality on the histograms of the $\theta _i$ 's, even when conditioning on different from , the value used for the simulation.