Next: 11.5 More Monte Carlo Up: 11. Bayesian Computational Methods Previous: 11.3 Monte Carlo Methods

Subsections

# 11.4 Markov Chain Monte Carlo Methods

As described precisely in Chap. III.3 and in [37], MCMC methods try to overcome the limitation of regular Monte Carlo methods by mean of a Markov chain with stationary distribution the posterior distribution. There exist rather generic ways of producing such chains, including Metropolis-Hastings and Gibbs algorithms. Besides the fact that stationarity of the target distribution is enough to justify a simulation method by Markov chain generation, the idea at the core of MCMC algorithms is that local exploration, when properly weighted, can lead to a valid representation of the distribution of interest, as for instance, the Metropolis-Hastings algorithm.

## 11.4.1 Metropolis-Hastings as Universal Simulator

The Metropolis-Hastings, presented in [37] and Chap. II.3, offers a straightforward solution to the problem of simulating from the posterior distribution : starting from an arbitrary point , the corresponding Markov chain explores the surface of this posterior distribution by a random walk proposal that progressively visits the whole range of the possible values of .

Metropolis-Hastings Algorithm

At iteration

1. Generate ,
2. Take

Example 14 (Continuation of Example 11)   In the case , the probit model defined in Example 11 can also be over-parameterised as

since it only depends on . The Bayesian processing of non-identified models poses no serious difficulty as long as the posterior distribution is well defined. This is the case for a proper prior like

that corresponds to a normal distribution on and a gamma distribution on . While the posterior distribution on is not a standard distribution, it is available up to a normalising constant. Therefore, it can be directly processed via an MCMC algorithm. In this case, we chose a Gibbs sampler that simulates and alternatively, from

and

respectively. Since both of these conditional distributions are also non-standard, we replace the direct simulation by a one-dimensional Metropolis-Hastings step, using normal and log-normal random walk proposals, respectively. For a simulated dataset of  points, the contour plot of the log-posterior distribution is given in Fig. 11.7, along with the last  points of a corresponding MCMC sample after  iterations. This graph shows a very satisfactory repartition of the simulated parameters over the likelihood surface, with higher concentrations near the largest posterior regions. For another simulation, Fig. 11.8 details the first  steps, when started at . Although each step contains both a  and proposal, some moves are either horizontal or vertical: this corresponds to cases when either the or the proposals have been rejected. Note also the fairly rapid convergence to a modal zone of the posterior distribution in this case.

Obviously, this is only a toy example and more realistic probit models do not fare so well with down-the-shelf random walk Metropolis-Hastings algorithms, as shown for instance in [32] (see also [37], Sect. 10.3.2).11.8

The difficulty inherent to random walk Metropolis-Hastings algorithms is the scaling of the proposal distribution: it must be adapted to the shape of the target distribution so that, in a reasonable number of steps, the whole support of this distribution can be visited. If the scale of the proposal is too small, this will not happen as the algorithm stays ''too local'' and, if there are several modes on the posterior, the algorithm may get trapped within one modal region because it cannot reach other modal regions with jumps of too small magnitude. The larger the dimension  is, the harder it is to set up the right scale, though, because

(a)
the curse of dimension implies that there are more and more empty regions in the space, that is, regions with zero posterior probability;
(b)
the knowledge and intuition about the modal regions get weaker and weaker;
(c)
the proper scaling involves a symmetric matrix  in the proposal . Even when the matrix is diagonal, it gets harder to scale as the dimension increases (unless one resorts to a Gibbs like implementation, where each direction is scaled separately).
Note also that the on-line scaling of the algorithm against the empirical acceptance rate is inherently flawed in that the attraction of a modal region may give a false sense of convergence and lead to a choice of too small a scale, simply because other modes will not be visited during the scaling experiment.

## 11.4.2 Gibbs Sampling and Latent Variable Models

The Gibbs sampler is a definitely attractive algorithm for Bayesian problems because it naturally fits the hierarchical structures so often found in such problems. ''Natural'' being a rather vague notion from a simulation point of view, it routinely happens that other algorithms fare better than the Gibbs sampler. Nonetheless, Gibbs sampler is often worth a try (possibly with other Metropolis-Hastings refinements at a later stage) in well-structured objects like Bayesian hierarchical models and more general graphical models.

A very relevant illustration is made of latent variable models, where the observational model is itself defined as a mixture model,

Such models were instrumental in promoting the Gibbs sampler in the sense that they have the potential to make Gibbs sampling sound natural very easily. (See also Chap. II.3.) For instance, [43] wrote a precursor article to [14] that designed specific two-stage Gibbs samplers for a variety of latent variable models. And many of the first applications of Gibbs sampling in the early 90's were actually for models of that kind. The usual implementation of the Gibbs sampler in this case is to simulate the missing variables  conditional on the parameters and reciprocally, as follows:

Latent Variable Gibbs Algorithm

At iteration

1. Generate
2. Generate

While we could have used the probit case as an illustration (Example 11), as done in Chap. II.3, we choose to pick the case of mixtures (Example 6) as a better setting.

Example 15 (Continuation of Example 6)   The natural missing data structure of a mixture of distribution is historical. In one of the first mixtures to be ever studied by Bertillon, in 1863, a bimodal structure on the height of conscripts in south eastern France (Doubs) can be explained by the mixing of two populations of military conscripts, one from the plains and one from the mountains (or hills). Therefore, in the analysis of data from distributions of the form

a common missing data representation is to associate with each observation a missing multinomial variable such that . In heterogeneous populations made of several homogeneous subgroups or subpopulations, it makes sense to interpret as the index of the population of origin of , which has been lost in the observational process.

However, mixtures are also customarily used for density approximations, as a limited dimension proxy to non-parametric approaches. In such cases, the components of the mixture and even the number  of components in the mixture are often meaningless for the problem to be analysed. But this distinction between natural and artificial completion (by the 's) is lost to the MCMC sampler, whose goal is simply to provide a Markov chain that converges to the posterior as stationary distribution. Completion is thus, from a simulation point of view, a mean to generate such a chain.

The most standard Gibbs sampler for mixture models ([11]) is thus based on the successive simulation of the 's and of the 's, conditional on one another and on the data:

1. Generate
2. Generate
Given that the density is most often from an exponential family, the simulation of the 's is generally straightforward.

As an illustration, consider the case of a normal mixture with two components, with equal known variance and fixed weights,

 (11.10)

Assume in addition a normal prior on both means and . It is easy to see that and are independent, given , and the respective conditional distributions are

where denotes the number of 's equal to . Even more easily, it comes that the conditional posterior distribution of given is a product of binomials, with

Figure 11.9 illustrates the behavior of the Gibbs sampler in that setting, with a simulated dataset of  points from the distribution. The representation of the MCMC sample after  iterations is quite in agreement with the posterior surface, represented via a grid on the space and some contours. The sequence of consecutive steps represented on the left graph also shows that the mixing behavior is satisfactory, since the jumps are in scale with the modal region of the posterior.

This experiment gives a wrong sense of safety, though, because it does not point out the fairly large dependence of the Gibbs sampler to the initial conditions, already signaled in [11] under the name of trapping states. Indeed, the conditioning of on implies that the new simulations of the means will remain very close to the previous values, especially if there are many observations, and thus that the new allocations will not differ much from the previous allocations. In other words, to see a significant modification of the allocations (and thus of the means) would require a very very large number of iterations. Figure 11.10 illustrates this phenomenon for the same sample as in Fig. 11.9, for a wider scale: there always exists a second mode in the posterior distribution, which is much lower than the first mode located around . Nonetheless, a Gibbs sampler initialized close to the second and lower mode will not be able to leave the vicinity of this (irrelevant) mode, even after a large number of iterations. The reason is as given above: to jump to the other mode, a majority of 's would need to change simultaneously and the probability of such a jump is too close to 0 to let the event occur.11.9

This example illustrates quite convincingly that, while the completion is natural from a model point of view (since it is a part of the definition of the model), it does not necessarily transfer its utility for the simulation of the posterior. Actually, when the missing variable model allows for a closed form likelihood, as is the case for mixtures, probit models (Examples 11 and 14) and even hidden Markov models (see [8]), the whole range of the MCMC technology can be used as well. The appeal of alternatives like random walk Metropolis-Hastings schemes is that they remain in a smaller dimension space, since they avoid the completion step(s), and that they are not restricted in the range of their moves.11.10

Example 16 (Continuation of Example 15)   Given that the likelihood of a sample from the mixture distribution (11.10) can be computed in O time, a regular random walk Metropolis-Hastings algorithm can be used in this setup. Figure 11.11 shows how quickly this algorithm escapes the attraction of the poor mode, as opposed to the Gibbs sampler of Fig. 11.10: within a few iterations of the algorithm, the chain drifts over the poor mode and converges almost deterministically to the proper region of the posterior surface. The random walk is based on proposals, although other scales would work as well but would require more iterations to reach the proper model regions. For instance, a scale of in the Normal proposal above needs close to iterations to attain the main mode.

The secret of a successful MCMC implementation in such latent variable models is to maintain the distinction between latency in models and latency in simulation (the later being often called use of auxiliary variables). When latent variables can be used with adequate mixing of the resulting chain and when the likelihood cannot be computed in a closed form (as in hidden semi-Markov models, [6]), a Gibbs sampler is a still simple solution that is often easy to simulate from. Adding well-mixing random walk Metropolis-Hastings steps in the simulation scheme cannot hurt the overall mixing of the chain ([37], Chap. 13), especially when several scales can be used at once (see Sect. 11.5). A final word is that the completion can be led in an infinity of ways and that several of these should be tried or used in parallel to increase the chances of success.

## 11.4.3 Reversible Jump Algorithms for Variable Dimension Models

As described in Sect. 11.2.3, model choice is computationally different from testing in that it considers at once a (much) wider range of models and parameter spaces . Although early approaches could only go through a pedestrian pairwise comparison, a more adequate perspective is to envision the model index as part of the parameter to be estimated, as in (11.3). The (computational) difficulty is that we are then dealing with a possibly infinite space that is the collection of unrelated sets: how can we then simulate from the corresponding distribution?11.11

The MCMC solution proposed by [20] is called reversible jump MCMC, because it is based on a reversibility constraint on the transitions between the sets . In fact, the only real difficulty compared with previous developments is to validate moves (or jumps) between the 's, since proposals restricted to a given follow from the usual (fixed-dimensional) theory. Furthermore, reversibility can be processed at a local level: since the model indicator  is a integer-valued random variable, we can impose reversibility for each pair of possible values of . The idea at the core of reversible jump MCMC is then to supplement each of the spaces and with adequate artificial spaces in order to create a bijection between them. For instance, if and if the move from to can be represented by a deterministic transformation of

Green (1995) imposes a dimension matching condition which is that the opposite move from to is concentrated on the curve

In the general case, if is completed by a simulation into and by into so that the mapping between and is a bijection,

 (11.11)

the probability of acceptance for the move from model to model is then

involving
• the Jacobian of the transform ,
• the probability of choosing a jump to while in , and
• , the density of .
The acceptance probability for the reverse move is based on the inverse ratio if the move from to also satisfies (11.11) with .11.12

The pseudo-code representation of Green's algorithm is thus as follows:

Green's Algorithm

At iteration , if ,

1. Select model with probability
2. Generate
3. Set
4. Take with probability

and take otherwise.

As for previous methods, the implementation of this algorithm requires a certain skillfulness in picking the right proposals and the appropriate scales. This art of reversible jump MCMC is illustrated on the two following examples, extracted from Robert and Casella (2004, Sect. 14.2.3).

Example 17 (Continuation of Example 6)   If we consider for model the  component normal mixture distribution,

moves between models involve changing the number of components in the mixture and thus adding new components or removing older components or yet again changing several components. As in [34], we can restrict the moves when in model to only models and . The simplest solution is to use a birth-and-death process: The birth step consists in adding a new normal component in the mixture generated from the prior and the death step is the opposite, removing one of the components at random. In this case, the corresponding birth acceptance probability is

where denotes the likelihood of the component mixture model and is the prior probability of model .11.13

While this proposal can work well in some setting, as in Richardson and Green (1997) when the prior is calibrated against the data, it can also be inefficient, that is, leading to a high rejection rate, if the prior is vague, since the birth proposals are not tuned properly. A second proposal, central to the solution of [34], is to devise more local jumps between models, called split and combine moves, since a new component is created by splitting an existing component into two, under some moment preservation conditions, and the reverse move consists in combining two existing components into one, with symmetric constraints that ensure reversibility. (See, e.g., [37], for details.)

Figures 11.12-11.14 illustrate the implementation of this algorithm for the so-called Galaxy dataset used by [34] (see also [38]), which contains observations on the speed of galaxies. On Fig. 11.12, the MCMC output on the number of components is represented as a histogram on , and the corresponding sequence of 's. The prior used on is a uniform distribution on : as shown by the lower plot, most values of are explored by the reversible jump algorithm, but the upper bound does not appear to be restrictive since the 's hardly ever reach this upper limit. Figure 11.13 illustrates the fact that conditioning the output on the most likely value of  ( here) is possible. The nine graphs in this figure show the joint variation of the three types of parameters, as well as the stability of the Markov chain over the iterations: the cumulated averages are quite stable, almost from the start.

The density plotted on top of the histogram in Fig. 11.14 is another good illustration of the inferential possibilities offered by reversible jump algorithms, as a case of model averaging: this density is obtained as the average over iterations  of

which approximates the posterior expectation , where denotes the data .

Example 18 (Continuation of Example 3)   For the model of Example 3, the best way to include the stationarity constraints is to use the lag-polynomial representation

of model , and to constrain the inverse roots, , to stay within the unit circle if complex and within if real (see, e.g. [35], Sect. 4.5.2). The associated uniform priors for the real and complex roots  is

where is the number of different values of . This factor must be included within the posterior distribution when using reversible jump since it does not vanish in the acceptance probability of a move between models and  . Otherwise, this results in a modification of the prior probability of each model.

Once again, a simple choice is to use a birth-and-death scheme where the birth moves either create a real or two conjugate complex roots. As in the birth-and-death proposal for Example 17, the acceptance probability simplifies quite dramatically since it is for instance

in the case of a move from to . (As for the above mixture example, the factorials are related to the possible choices of the created and the deleted roots.)

Figure 11.15 presents some views of the corresponding reversible jump MCMC algorithm. Besides the ability of the algorithm to explore a range of values of , it also shows that Bayesian inference using these tools is much richer, since it can, for instance, condition on or average over the order , mix the parameters of different models and run various tests on these parameters. A last remark on this graph is that both the order and the value of the parameters are well estimated, with a characteristic trimodality on the histograms of the 's, even when conditioning on different from , the value used for the simulation.

Next: 11.5 More Monte Carlo Up: 11. Bayesian Computational Methods Previous: 11.3 Monte Carlo Methods