3.1 Introduction

In the past fifteen years computational statistics has been enriched by a powerful, somewhat abstract method of generating variates from a target probability distribution, based on Markov chains whose stationary distribution is the distribution of interest. This class of methods, popularly referred to as Markov chain Monte Carlo methods, or simply MCMC methods, has been influential in the modern practice of Bayesian statistics, where these methods are used to summarize the posterior distributions that arise in the context of the Bayesian prior-posterior analysis [58,23,56,59,3,14,31,57,22,50,5,10,12,18,34,49,25]. MCMC methods have proved useful in practically all aspects of Bayesian inference, for example, in the context of prediction problems and in the computation of quantities, such as the marginal likelihood, that are used for comparing competing Bayesian models.

A central reason for the widespread interest in MCMC methods is that these methods are extremely general and versatile and can be used to sample univariate and multivariate distributions when other methods (for example, classical methods that produce independent and identically distributed draws) either fail or are difficult to implement. The fact that MCMC methods produce dependent draws causes no substantive complications in summarizing the target distribution. For example, if $ \{\boldsymbol{\psi}^{(1)}, \ldots, \boldsymbol{\psi}^{(M)}\}$ are draws from a (say continuous) target distribution $ \pi (\boldsymbol{\psi})$, where $ \boldsymbol{\psi}\in \Re^{d}$, then the expectation of $ h(\boldsymbol{\psi})$ under $ \pi $ can be estimated by the average

$\displaystyle M^{-1}\sum_{j=1}^{M}h\left(\boldsymbol{\psi}^{(j)}\right)\;,$ (3.1)

as in the case of random samples, because suitable laws of large numbers for Markov chains can be used to show that

$\displaystyle M^{-1}\sum_{j=1}^{M}h\left(\boldsymbol{\psi}^{(j)}\right)\rightarrow \int\limits_{\Re^{d}}h(\boldsymbol{\psi})\,\pi(\boldsymbol{\psi})\,\mathrm{d}\boldsymbol{\psi}\;,$

as the simulation sample size $ M$ becomes large.
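To make this concrete, here is a minimal Python sketch (our own illustration; the function name and the use of NumPy are not from the chapter) of the ergodic average in (3.1) applied to stored MCMC output:

```python
import numpy as np

def mcmc_average(draws, h):
    """Estimate the expectation of h under pi by the ergodic
    average (3.1), where draws is an (M, d) array of MCMC output."""
    return np.mean([h(psi) for psi in draws], axis=0)

# Example: estimate E[psi] from draws produced by any sampler below
# mean_est = mcmc_average(draws, lambda psi: psi)
```

The dependence between successive draws affects the variance of this estimate, a point taken up in Sect. 3.2, but not its consistency.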

Another reason for the interest in MCMC methods is that, somewhat surprisingly, it is rather straightforward to construct one or more Markov chains whose limiting invariant distribution is the desired target distribution. One way to construct the appropriate Markov chain is by the Metropolis method, which was introduced by [39] in connection with work related to the hydrogen bomb project. In this method, the Markov chain simulation is constructed by a recursive two-step process. Given the current iterate $ \boldsymbol{\psi}^{(j)}$, a proposal value $ \boldsymbol{\psi}^{\prime}$ is drawn from a distribution $ q(\boldsymbol{\psi}^{(j)},\cdot)$ such that $ \boldsymbol{\psi}^{\prime}$ is symmetrically distributed about the current value $ \boldsymbol{\psi}^{(j)}$. In the second step, this proposal value is accepted as the next iterate $ \boldsymbol{\psi}^{(j+1)}$ of the Markov chain with probability

$\displaystyle \alpha\left(\boldsymbol{\psi}^{(j)},\boldsymbol{\psi}^{\prime}\right)=\min\left\{1,\frac{\pi\left(\boldsymbol{\psi}^{\prime}\right)}{\pi\left(\boldsymbol{\psi}^{(j)}\right)}\right\}\;.$

If the proposal value is rejected, then $ \boldsymbol{\psi}^{(j+1)}$ is taken to be the current value. The method is simple to implement, even in multivariate settings, and was widely used by physicists in computational statistical mechanics and quantum field theory to sample the coordinates of a point in phase space. In those settings, and in subsequent statistical problems, it is helpful that the method can be implemented without knowledge of the normalizing constant of $ \pi $ since that constant cancels in the determination of the probability $ \alpha (\boldsymbol{\psi}^{(j)}, \boldsymbol{\psi}^{\prime})$.
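The following is a minimal Python sketch of a random-walk variant of the Metropolis method (a Gaussian proposal, which is symmetric about the current value; the function names are illustrative and the target need only be known up to its normalizing constant):

```python
import numpy as np

def metropolis(log_pi, psi0, scale, M, rng=None):
    """Random-walk Metropolis sampler.

    log_pi : log of the target density, up to an additive constant
    psi0   : starting value (1-d array of length d)
    scale  : standard deviation of the symmetric Gaussian proposal
    """
    rng = rng or np.random.default_rng()
    psi = np.asarray(psi0, dtype=float)
    draws = np.empty((M, psi.size))
    for j in range(M):
        # symmetric proposal centered at the current value
        prop = psi + scale * rng.standard_normal(psi.size)
        # accept with probability min{1, pi(prop)/pi(psi)};
        # the unknown normalizing constant of pi cancels here
        if np.log(rng.uniform()) < log_pi(prop) - log_pi(psi):
            psi = prop
        draws[j] = psi
    return draws

# Example: 5000 draws from a standard bivariate normal target
# draws = metropolis(lambda p: -0.5 * np.sum(p**2), np.zeros(2), 1.0, 5000)
```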

The requirement that the proposal distribution be symmetric about the current value was relaxed by [32]. The resulting method, commonly called the Metropolis-Hastings (M-H) method, relies on the same two steps as the Metropolis method, except that the probability of move is given by

$\displaystyle \alpha\left(\boldsymbol{\psi}^{(j)},\boldsymbol{\psi}^{\prime}\right)=\min\left\{1,\frac{\pi\left(\boldsymbol{\psi}^{\prime}\right)q\left(\boldsymbol{\psi}^{\prime},\boldsymbol{\psi}^{(j)}\right)}{\pi\left(\boldsymbol{\psi}^{(j)}\right)q\left(\boldsymbol{\psi}^{(j)},\boldsymbol{\psi}^{\prime}\right)}\right\}\;,$

which clearly reduces to the Metropolis probability of move when the proposal distribution is symmetric in its arguments. Starting with an arbitrary value $ \boldsymbol{\psi}^{(0)}$ in the support of the target distribution, iterations of this two-step process produce the (correlated) sequence of values

$\displaystyle \left\{ \boldsymbol{\psi}^{(0)},\boldsymbol{\psi}^{(1)},\boldsymbol{\psi} ^{(2)},\ldots\right\}\;.$    

Typically, a certain number of values (say $ n_{0}$) at the start of this sequence are discarded and the subsequent (say) $ M$ values are used as variates from the target distribution.
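A minimal Python sketch of the M-H method with a general (possibly asymmetric) proposal, including the discarding of the first $ n_{0}$ values, might look as follows (the function names and the log-scale implementation are our own choices):

```python
import numpy as np

def metropolis_hastings(log_pi, draw_q, log_q, psi0, n0, M, rng=None):
    """M-H sampler with proposal q; discards the first n0 draws.

    draw_q(psi, rng) : samples a proposal given the current value
    log_q(psi, prop) : evaluates log q(psi, prop)
    """
    rng = rng or np.random.default_rng()
    psi = np.asarray(psi0, dtype=float)
    draws = []
    for j in range(n0 + M):
        prop = draw_q(psi, rng)
        # log of pi(prop) q(prop, psi) / [pi(psi) q(psi, prop)]
        log_ratio = (log_pi(prop) + log_q(prop, psi)
                     - log_pi(psi) - log_q(psi, prop))
        if np.log(rng.uniform()) < log_ratio:
            psi = prop
        if j >= n0:                 # keep only the post burn-in values
            draws.append(np.array(psi))
    return np.array(draws)
```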

In applications where the dimension of $ \boldsymbol{\psi}$ is large, it may be preferable to construct the Markov chain simulation by first grouping the variables in $ \boldsymbol{\psi}$ into smaller blocks. For simplicity, suppose that two blocks are adequate and that $ \boldsymbol{\psi}$ is written as $ (\boldsymbol{\psi}_{1},\boldsymbol{\psi}_{2})$, with $ \boldsymbol{\psi}_{k}\in \Omega _{k}\subseteq \Re ^{d_{k}}$. In that case, the M-H chain can be constructed by first updating $ \boldsymbol{\psi}_{1}$, conditioned on the current value of $ \boldsymbol{\psi}_{2}$, through an M-H step with invariant distribution $ \pi (\boldsymbol{\psi}_{1}\vert\boldsymbol{\psi}_{2})$, and then updating $ \boldsymbol{\psi}_{2}$, conditioned on the new value of $ \boldsymbol{\psi}_{1}$, through an M-H step with invariant distribution $ \pi (\boldsymbol{\psi}_{2}\vert\boldsymbol{\psi}_{1})$, which completes one cycle through the two sub-moves. [14], who emphasized and highlighted such M-H chains, refer to them as multiple-block M-H algorithms.
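As an illustrative sketch (ours, not from [14]), a two-block chain with symmetric Gaussian sub-move proposals can be written as follows; each sub-move targets a full conditional, which is proportional to the joint density, so only the joint log density is needed:

```python
import numpy as np

def two_block_mh(log_pi, scale1, scale2, psi1, psi2, M, rng=None):
    """Two-block M-H with symmetric Gaussian sub-move proposals.

    log_pi(psi1, psi2) : joint log target, up to an additive constant;
    each full conditional pi(psi_k | psi_-k) is proportional to it.
    """
    rng = rng or np.random.default_rng()
    psi1 = np.atleast_1d(np.asarray(psi1, dtype=float))
    psi2 = np.atleast_1d(np.asarray(psi2, dtype=float))
    draws = np.empty((M, psi1.size + psi2.size))
    for j in range(M):
        # sub-move 1: update psi1 given the current psi2
        prop1 = psi1 + scale1 * rng.standard_normal(psi1.size)
        if np.log(rng.uniform()) < log_pi(prop1, psi2) - log_pi(psi1, psi2):
            psi1 = prop1
        # sub-move 2: update psi2 given the new psi1
        prop2 = psi2 + scale2 * rng.standard_normal(psi2.size)
        if np.log(rng.uniform()) < log_pi(psi1, prop2) - log_pi(psi1, psi2):
            psi2 = prop2
        draws[j] = np.concatenate([psi1, psi2])
    return draws
```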

Despite the long vintage of the M-H method, the contemporary interest in MCMC methods was sparked by work on a related MCMC method, the Gibbs sampling algorithm. The Gibbs sampling algorithm is one of the simplest Markov chain Monte Carlo algorithms and has its origins in the work of [2] on spatial lattice systems, [26] on the problem of image processing, and [58] on missing data problems. The paper by [23] helped to demonstrate the value of the Gibbs algorithm for a range of problems in Bayesian analysis. In the Gibbs sampling method, the Markov chain is constructed by simulating the conditional distributions that are implied by $ \pi (\boldsymbol{\psi})$. In particular, if $ \boldsymbol{\psi}$ is split into two components $ \boldsymbol{\psi}_{1}$ and $ \boldsymbol{\psi}_{2}$, then the Gibbs method proceeds through the recursive sampling of the conditional distributions $ \pi (\boldsymbol{\psi}_{1}\vert \boldsymbol{\psi}_{2})$ and $ \pi (\boldsymbol{\psi}_{2}\vert\boldsymbol{\psi}_{1})$, where the most recent value of $ \boldsymbol{\psi}_{2}$ is used in the first simulation and the most recent value of $ \boldsymbol{\psi}_{1}$ in the second. This method is simplest to implement when each conditional distribution is a known distribution that is easy to sample. As we show below, the Gibbs sampling method is a special case of the multiple-block M-H algorithm.
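For a standard textbook illustration (not an example from this chapter), consider a bivariate normal target with unit variances and correlation $ \rho $, for which both full conditionals are known normal distributions that can be sampled directly:

```python
import numpy as np

def gibbs_bivariate_normal(rho, M, rng=None):
    """Gibbs sampler for a bivariate normal with correlation rho.

    The full conditionals are psi1 | psi2 ~ N(rho * psi2, 1 - rho**2)
    and symmetrically for psi2, so each step is a direct draw.
    """
    rng = rng or np.random.default_rng()
    psi1, psi2 = 0.0, 0.0
    sd = np.sqrt(1.0 - rho**2)
    draws = np.empty((M, 2))
    for j in range(M):
        psi1 = rng.normal(rho * psi2, sd)   # draw from pi(psi1 | psi2)
        psi2 = rng.normal(rho * psi1, sd)   # uses the new value of psi1
        draws[j] = (psi1, psi2)
    return draws
```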

3.1.1 Organization

The rest of the chapter is organized as follows. In Sect. 3.2 we summarize the relevant Markov chain theory that justifies simulation by MCMC methods. In particular, we provide the conditions under which discrete-time and continuous state space Markov chains satisfy a law of large numbers and a central limit theorem. The M-H algorithm is discussed in Sect. 3.3 followed by the Gibbs sampling algorithm in Sect. 3.4. Section 3.5 deals with MCMC methods with latent variables and Sect. 3.6 with ways of estimating the marginal densities based on the MCMC output. Issues related to sampler performance are considered in Sect. 3.7 and strategies for improving the mixing of the Markov chains in Sect. 3.8. Section 3.9 concludes with brief comments about new and emerging directions in MCMC methods.

