Next: 3.4 The Gibbs Sampling Up: 3. Markov Chain Monte Previous: 3.2 Markov Chains

Subsections

# 3.3 Metropolis-Hastings Algorithm

This powerful algorithm provides a general approach for producing a correlated sequence of draws from the target density that may be difficult to sample by a classical independence method. The goal is to simulate the -dimensional distribution , that has density with respect to some dominating measure. To define the algorithm, let denote a source density for a candidate draw given the current value  in the sampled sequence. The density is referred to as the proposal or candidate generating density. Then, the M-H algorithm is defined by two steps: a first step in which a proposal value is drawn from the candidate generating density and a second step in which the proposal value is accepted as the next iterate in the Markov chain according to the probability , where

 (3.10)

If the proposal value is rejected, then the next sampled value is taken to be the current value. In algorithmic form, the simulated values are obtained by the following recursive procedure.

Algorithm 1 (Metropolis-Hastings)
1. Specify an initial value :

2. Repeat for .
a)
Propose

b)
Let

3. Return the values .

Typically, a certain number of values (say ) at the start of this sequence are discarded after which the chain is assumed to have converged to it invariant distribution and the subsequent draws are taken as approximate variates from . Because theoretical calculation of the burn-in is not easy it is important that the proposal density is chosen to ensure that the chain makes large moves through the support of the invariant distribution without staying at one place for many iterations. Generally, the empirical behavior of the M-H output is monitored by the autocorrelation time of each component of and by the acceptance rate, which is the proportion of times a move is made as the sampling proceeds.

One should observe that the target density appears as a ratio in the probability and therefore the algorithm can be implemented without knowledge of the normalizing constant of . Furthermore, if the candidate-generating density is symmetric, i.e. , the acceptance probability only contains the ratio ; hence, if , the chain moves to , otherwise it moves with probability given by . The latter is the algorithm originally proposed by [39]. This version of the algorithm is illustrated in Fig. 3.1.

Different proposal densities give rise to specific versions of the M-H algorithm, each with the correct invariant distribution . One family of candidate-generating densities is given by . The candidate is thus drawn according to the process , where follows the distribution . Since the candidate is equal to the current value plus noise, this case is called a random walk M-H chain. Possible choices for include the multivariate normal density and the multivariate-. The random walk M-H chain is perhaps the simplest version of the M-H algorithm (and was the one used by [39]) and popular in applications. One has to be careful, however, in setting the variance of  ; if it is too large it is possible that the chain may remain stuck at a particular value for many iterations while if it is too small the chain will tend to make small moves and move inefficiently through the support of the target distribution. In both cases the generated draws that will be highly serially correlated. Note that when is symmetric, and the probability of move only contains the ratio . As mentioned earlier, the same reduction occurs if .

[32] considers a second family of candidate-generating densities that are given by the form . [59] refers to this as an independence M-H chain because, in contrast to the random walk chain, the candidates are drawn independently of the current location  . In this case, the probability of move becomes

where is the ratio of the target and proposal densities. For this method to work and not get stuck in the tails of , it is important that the proposal density have thicker tails than . A similar requirement is placed on the importance sampling function in the method of importance sampling ([27]). In fact, [38] show that if is uniformly bounded then the resulting Markov chain is ergodic.

Chib and Greenberg (1994, 1995) discuss a way of formulating proposal densities in the context of time series autoregressive-moving average models that has a bearing on the choice of proposal density for the independence M-H chain. They suggest matching the proposal density to the target at the mode by a multivariate normal or multivariate- distribution with location given by the mode of the target and the dispersion given by inverse of the Hessian evaluated at the mode. Specifically, the parameters of the proposal density are taken to be

 and (3.11)

where is a tuning parameter that is adjusted to control the acceptance rate. The proposal density is then specified as , where is some multivariate density. This may be called a tailored M-H chain.

Another way to generate proposal values is through a Markov chain version of the accept-reject method. In this version, due to [59], and considered in detail by [14], a pseudo accept-reject step is used to generate candidates for an M-H algorithm. Suppose is a known constant and a source density. Let denote the set of value for which dominates the target density and assume that this set has high probability under . Given , the next value is obtained as follows: First, a candidate value is obtained, independent of the current value  , by applying the accept-reject algorithm with as the ''pseudo dominating'' density. The candidates that are produced under this scheme have density . If we let then it can be shown that the M-H probability of move is given by

 (3.12)

## 3.3.1 Convergence Results

In the M-H algorithm the transition kernel of the chain is given by

 (3.13)

where if and 0 otherwise and

Thus, transitions from to ( ) are made according to the density

while transitions from to occur with probability  . In other words, the density function implied by this transition kernel is of mixed type,

 (3.14)

having both a continuous and discrete component, where now, with change of notation, is the Dirac delta function defined as for and .

[14] provide a way to derive and interpret the probability of move . Consider the proposal density . This proposal density  is not likely to be reversible for  (if it were then we would be done and M-H sampling would not be necessary). Without loss of generality, suppose that implying that the rate of transitions from  to exceed those in the reverse direction. To reduce the transitions from  to  one can introduce a function such that . Solving for yields the probability of move in the M-H algorithm. This calculation reveals the important point that the function is reversible by construction, i.e., it satisfies the condition

 (3.15)

It immediately follows, therefore, from the argument in (3.6) that the M-H kernel has as its invariant density.

It is not difficult to provide conditions under which the Markov chain generated by the M-H algorithm satisfies the conditions of Propositions 1-2. The conditions of Proposition 1 are satisfied by the M-H chain if is positive for and continuous and the set is connected. In addition, the conditions of Proposition 2 are satisfied if is not reversible (which is the usual situation) which leads to a chain that is aperiodic. Conditions for ergodicity, required for use of the central limit theorem, are satisfied if in addition is bounded. Other similar conditions are provided by [50].

## 3.3.2 Example

To illustrate the M-H algorithm, consider the binary response data in Table 3.1, taken from Fahrmeir and Tutz (1997), on the occurrence or non-occurrence of infection following birth by caesarean section. The response variable  is one if the caesarean birth resulted in an infection, and zero if not. There are three covariates: , an indicator of whether the caesarean was non-planned; , an indicator of whether risk factors were present at the time of birth and , an indicator of whether antibiotics were given as a prophylaxis. The data in the table contains information from  births. Under the column of the response, an entry such as means that there were deliveries with covariates of whom developed an infection and did not.

 0 0 0 0 0 0 0 0 0 0 0

Suppose that the probability of infection for the th birth is

 (3.16) (3.17)

where is the covariate vector, is the vector of unknown coefficients, is the cdf of the standard normal random variable and is the four-dimensional identity matrix. The target posterior density, under the assumption that the outcomes are conditionally independent, is

where is the density of the distribution.

### 3.3.2.1 Random Walk Proposal Density

To define the proposal density, let

be the MLE found using the Newton-Raphson algorithm and let

be the symmetric matrix obtained by inverting the negative of the Hessian matrix (the matrix of second derivatives) of the log-likelihood function evaluated at  . Now generate the proposal values by the random walk:

 N (3.18)

which leads to the original Metropolis method. From a run of  iterations of the algorithm beyond a burn-in of a  iterations we get the prior-posterior summary that is reported in Table 3.2, which contains the first two moments of the prior and posterior and the th (lower) and th (upper) percentiles of the marginal densities of  .

 Prior Posterior Mean Std dev Mean Std dev Lower Upper

As expected, both the first and second covariates increase the probability of infection while the third covariate (the antibiotics prophylaxis) reduces the probability of infection.

To get an idea of the form of the posterior density we plot in Fig. 3.1 the four marginal posterior densities. The density plots are obtained by smoothing the histogram of the simulated values with a Gaussian kernel. In the same plot we also report the autocorrelation functions (correlation against lag) for each of the sampled parameter values. The autocorrelation plots provide information of the extent of serial dependence in the sampled values. Here we see that the serial correlations start out high but decline to almost zero by lag twenty.

### 3.3.2.2 Tailored Proposal Density

 Prior Posterior Mean Std dev Mean Std dev Lower Upper

To see the difference in results, the M-H algorithm is next implemented with a tailored proposal density. In this scheme one utilizes both and that were defined above. We let the proposal density be , a multivariate- density with fifteen degrees of freedom. This proposal density is similar to the random-walk proposal except that the distribution is centered at the fixed point . The prior-posterior summary based on draws of the M-H algorithm with this proposal density is given in Table 3.3. We see that the marginal posterior moments are similar to those in Table 3.1. The marginal posterior densities are reported in the top panel of Fig. 3.2. These are virtually identical to those computed using the random-walk M-H algorithm. The most notable difference is in the serial correlation plots which decline much more quickly to zero indicating that the algorithm is mixing well. The same information is revealed by the inefficiency factors which are much closer to one than those from the previous algorithm.

The message from this analysis is that the two proposal densities produce similar results, with the differences appearing only in the autocorrelation plots (and inefficiency factors) of the sampled draws.

## 3.3.3 Multiple-Block M-H Algorithm

In applications when the dimension of is large, it can be difficult to construct a single block M-H algorithm that converges rapidly to the target density. In such cases, it is helpful to break up the variate space into smaller blocks and to then construct a Markov chain with these smaller blocks. Suppose, for illustration, that is split into two vector blocks . For example, in a regression model, one block may consist of the regression coefficients and the other block may consist of the error variance. Next, for each block, let

denote the corresponding proposal density. Here each proposal density is allowed to depend on the data and the current value of the remaining block. Also define (by analogy with the single-block case)

 (3.19)

and

 (3.20)

as the probability of move for block conditioned on the other block. The conditional densities

 and

that appear in these functions are called the full conditional densities. By Bayes theorem each is proportional to the joint density. For example,

and, therefore, the probabilities of move in (3.19) and (3.20) can be expressed equivalently in terms of the kernel of the joint posterior density because the normalizing constant of the full conditional density (the norming constant in the latter expression) cancels in forming the ratio.

With these inputs, one sweep of the multiple-block M-H algorithm is completed by updating each block, say sequentially in fixed order, using a M-H step with the above probabilities of move, given the most current value of the other block.

Algorithm 2 (Multiple-Block Metropolis-Hastings)
1. Specify an initial value :

2. Repeat for .

a)
Repeat for

I.
Propose a value for the th block, conditioned on the previous value of th block, and the current value of the other block :

II.
Calculate the probability of move

III.
Update the th block as

3. Return the values

The extension of this method to more than two blocks is straightforward.

The transition kernel of the resulting Markov chain is given by the product of transition kernels

 (3.21)

This transition kernel is not reversible, as can be easily checked, because under fixed sequential updating of the blocks updating in the reverse order never occurs. The multiple-block M-H algorithm, however, satisfies the weaker condition of invariance. To show this, we utilize the fact that each sub-move satisfies local reversibility ([17]) and therefore the transition kernel has as its local invariant distribution with density , i.e.,

 (3.22)

Similarly, the conditional transition kernel has as its invariant distribution, for a given value of . Then, the kernel formed by multiplying the conditional kernels is invariant for :

where the third line follows from (3.22), the fourth from Bayes theorem, the sixth from assumed invariance of , and the last from the law of total probability.

The implication of this result is that it allows us to take draws in succession from each of the kernels, instead of having to run each to convergence for every value of the conditioning variable.

Remark 1   Versions of either random-walk or tailored proposal densities can be used in this algorithm, analogous to the single-block case. For example, [14] determine the proposal densities  by tailoring to in which case the proposal density is not fixed but varies across iterations. An important special case occurs if each proposal density is taken to be the full conditional density of that block. Specifically, if we set

and

then an interesting simplification occurs. The probability of move (for the first block) becomes

and similarly for the second block, implying that if proposal values are drawn from their full conditional densities then the proposal values are accepted with probability one. This special case of the multiple-block M-H algorithm (in which each block is proposed using its full conditional distribution) is called the Gibbs sampling algorithm.

Next: 3.4 The Gibbs Sampling Up: 3. Markov Chain Monte Previous: 3.2 Markov Chains