As described precisely in Chap. III.3 and in [37], MCMC methods try to overcome the limitation of regular Monte Carlo methods by mean of a Markov chain with stationary distribution the posterior distribution. There exist rather generic ways of producing such chains, including Metropolis-Hastings and Gibbs algorithms. Besides the fact that stationarity of the target distribution is enough to justify a simulation method by Markov chain generation, the idea at the core of MCMC algorithms is that local exploration, when properly weighted, can lead to a valid representation of the distribution of interest, as for instance, the Metropolis-Hastings algorithm.
The Metropolis-Hastings, presented in [37] and
Chap. II.3, offers
a straightforward solution to the problem of simulating from the
posterior distribution
: starting from an arbitrary point
, the corresponding Markov chain explores the surface of
this posterior distribution by a random walk proposal
that progressively visits the whole range of
the possible values of
.
Metropolis-Hastings Algorithm
At iteration
![]() |
![]() |
![]() |
![]() |
![]() |
Obviously, this is only a toy example and more realistic probit models do not fare so well with down-the-shelf random walk Metropolis-Hastings algorithms, as shown for instance in [32] (see also [37], Sect. 10.3.2).11.8
![]() |
![]() |
The difficulty inherent to random walk Metropolis-Hastings algorithms
is the scaling
of the proposal distribution: it must be adapted to the shape of the
target distribution so that, in a reasonable number of steps, the
whole support of this distribution can be visited. If the scale of the
proposal is too small, this will not happen as the algorithm stays
''too local'' and, if there are several modes on the posterior, the
algorithm may get trapped within one modal region because it cannot
reach other modal regions with jumps of too small magnitude. The
larger the dimension is, the harder it is to set up the right
scale, though, because
The Gibbs sampler is a definitely attractive algorithm for Bayesian problems because it naturally fits the hierarchical structures so often found in such problems. ''Natural'' being a rather vague notion from a simulation point of view, it routinely happens that other algorithms fare better than the Gibbs sampler. Nonetheless, Gibbs sampler is often worth a try (possibly with other Metropolis-Hastings refinements at a later stage) in well-structured objects like Bayesian hierarchical models and more general graphical models.
A very relevant illustration is made of latent variable models, where the observational model is itself defined as a mixture model,
![]() |
Latent Variable Gibbs Algorithm
At iteration
While we could have used the probit case as an illustration (Example 11), as done in Chap. II.3, we choose to pick the case of mixtures (Example 6) as a better setting.
![]() |
However, mixtures are also customarily used for density
approximations, as a limited dimension proxy to non-parametric
approaches. In such cases, the components of the mixture and even the
number of components in the mixture are often meaningless for the
problem to be analysed. But this distinction between natural and
artificial completion (by the
's) is lost to the MCMC sampler,
whose goal is simply to provide a Markov chain that converges to the
posterior as stationary distribution. Completion is thus, from
a simulation point of view, a mean to generate such
a chain.
The most standard Gibbs sampler for mixture models
([11]) is thus based on the successive
simulation of the 's and of the
's, conditional on one
another and on the data:
As an illustration, consider the case of a normal mixture with two components, with equal known variance and fixed weights,
![]() |
![]() |
||
![]() |
This experiment gives a wrong sense of safety, though, because it does
not point out the fairly large dependence of the Gibbs sampler to the
initial conditions, already signaled in [11]
under the name of trapping states.
Indeed, the conditioning of
on
implies
that the new simulations of the means will remain very close to the
previous values, especially if there are many observations, and thus
that the new allocations
will not differ much from the
previous allocations. In other words, to see a significant
modification of the allocations (and thus of the means) would require
a very very large number of iterations. Figure 11.10
illustrates this phenomenon for the same sample as in
Fig. 11.9, for a wider scale: there always exists
a second mode in the posterior distribution, which is much lower than
the first mode located around
. Nonetheless, a Gibbs
sampler initialized close to the second and lower mode will not be
able to leave the vicinity of this (irrelevant) mode, even after
a large number of iterations. The reason is as given above: to jump to
the other mode, a majority of
's would need to change
simultaneously and the probability of such a jump is too close to 0
to let the event occur.11.9
![]() |
![]() |
This example illustrates quite convincingly that, while the completion is natural from a model point of view (since it is a part of the definition of the model), it does not necessarily transfer its utility for the simulation of the posterior. Actually, when the missing variable model allows for a closed form likelihood, as is the case for mixtures, probit models (Examples 11 and 14) and even hidden Markov models (see [8]), the whole range of the MCMC technology can be used as well. The appeal of alternatives like random walk Metropolis-Hastings schemes is that they remain in a smaller dimension space, since they avoid the completion step(s), and that they are not restricted in the range of their moves.11.10
![]() |
The secret of a successful MCMC implementation in such latent variable models is to maintain the distinction between latency in models and latency in simulation (the later being often called use of auxiliary variables). When latent variables can be used with adequate mixing of the resulting chain and when the likelihood cannot be computed in a closed form (as in hidden semi-Markov models, [6]), a Gibbs sampler is a still simple solution that is often easy to simulate from. Adding well-mixing random walk Metropolis-Hastings steps in the simulation scheme cannot hurt the overall mixing of the chain ([37], Chap. 13), especially when several scales can be used at once (see Sect. 11.5). A final word is that the completion can be led in an infinity of ways and that several of these should be tried or used in parallel to increase the chances of success.
As described in Sect. 11.2.3, model choice is
computationally different from testing in that it considers at once a
(much) wider range of models
and parameter
spaces
.
Although early approaches could only go through a pedestrian pairwise
comparison, a more adequate perspective is to envision the model index
as part of the parameter to be estimated, as in (11.3).
The (computational) difficulty is that we are then dealing with
a possibly infinite space that is the collection of unrelated sets:
how can we then simulate from the corresponding
distribution?11.11
The MCMC solution proposed
by [20] is called reversible jump MCMC, because
it is based on a reversibility constraint on the transitions
between the sets . In fact, the only real difficulty
compared with previous developments is to validate moves (or
jumps) between the
's, since proposals restricted
to a given
follow from the usual (fixed-dimensional)
theory. Furthermore, reversibility can be processed at
a local level: since the model indicator
is a integer-valued
random variable, we can impose reversibility for each pair
of possible values of
. The idea at the core of reversible jump
MCMC is then to supplement each of the spaces
and
with adequate artificial spaces in order to create
a bijection
between them. For instance, if
and if the move from
to
can be represented by
a deterministic transformation of
![]() |
![]() |
![]() |
The pseudo-code representation of Green's algorithm is thus as follows:
Green's Algorithm
At iteration , if
,
![]() |
As for previous methods, the implementation of this algorithm requires a certain skillfulness in picking the right proposals and the appropriate scales. This art of reversible jump MCMC is illustrated on the two following examples, extracted from Robert and Casella (2004, Sect. 14.2.3).
![]() |
![]() |
||
![]() |
While this proposal can work well in some setting, as in Richardson and Green (1997) when the prior is calibrated against the data, it can also be inefficient, that is, leading to a high rejection rate, if the prior is vague, since the birth proposals are not tuned properly. A second proposal, central to the solution of [34], is to devise more local jumps between models, called split and combine moves, since a new component is created by splitting an existing component into two, under some moment preservation conditions, and the reverse move consists in combining two existing components into one, with symmetric constraints that ensure reversibility. (See, e.g., [37], for details.)
Figures 11.12-11.14 illustrate the
implementation of this algorithm for the so-called Galaxy dataset used
by [34] (see also [38]), which
contains observations on the speed of galaxies. On
Fig. 11.12, the MCMC output on the number of components
is represented as a histogram on
, and the corresponding
sequence of
's. The prior used on
is a uniform distribution on
: as shown by the lower plot, most values of
are
explored by the reversible jump algorithm, but the upper bound does
not appear to be restrictive since the
's hardly ever reach
this upper limit. Figure 11.13 illustrates the fact that
conditioning the output on the most likely value of
(
here) is
possible. The nine graphs in this figure show the joint variation of
the three types of parameters, as well as the stability of the Markov
chain over the
iterations: the cumulated averages are
quite stable, almost from the start.
The density plotted on top of the histogram in Fig. 11.14
is another good illustration of the inferential possibilities offered
by reversible jump algorithms, as a case of model
averaging:
this density is obtained as the average over iterations of
![]() |
![]() |
![]() |
![]() |
![]() |
Once again, a simple choice is to use a birth-and-death scheme where the birth moves either create a real or two conjugate complex roots. As in the birth-and-death proposal for Example 17, the acceptance probability simplifies quite dramatically since it is for instance
![]() |
Figure 11.15 presents some views of the corresponding reversible
jump MCMC algorithm. Besides the ability of the algorithm to explore
a range of values of , it also shows that Bayesian inference using
these tools is much richer, since it can, for instance, condition on
or average over the order
, mix the parameters of different models
and run various tests on these parameters. A last remark on this graph
is that both the order and the value of the parameters are well
estimated, with a characteristic trimodality on the histograms of the
's, even when conditioning on
different from
, the
value used for the simulation.
![]() |