3.2 Markov Chains

Markov chain Monte Carlo is a method to sample a given multivariate distribution  $ \pi ^{\ast}$ by constructing a suitable Markov chain whose limiting invariant distribution is the target distribution  $ \pi ^{\ast}$. In most problems of interest, the distribution  $ \pi ^{\ast}$ is absolutely continuous and, as a result, the theory of MCMC methods is based on that of Markov chains on continuous state spaces, outlined, for example, in [42] and [40]. [59] is the fundamental reference for drawing the connections between this elaborate Markov chain theory and MCMC methods. The goal of the analysis is to specify conditions under which the constructed Markov chain converges to the invariant distribution, and conditions under which sample path averages based on the output of the Markov chain satisfy a law of large numbers and a central limit theorem.

3.2.1 Definitions and Results

A Markov chain is a collection of random variables (or vectors) $ \boldsymbol{\Phi}=\{\boldsymbol{\Phi}_{i}:i\in T\}$ where $ T=\{0,1,2,\ldots\}$. The evolution of the Markov chain on a space $ \Omega \subseteq \Re ^{p}$ is governed by the transition kernel

$\displaystyle P(\boldsymbol{x},A)$ $\displaystyle \equiv \Pr (\boldsymbol{\Phi}_{i+1}\in A\vert\boldsymbol{\Phi}_{i}= \boldsymbol{x},\boldsymbol{\Phi}_{j},\,j<i)$    
  $\displaystyle =\Pr (\boldsymbol{\Phi}_{i+1}\in A\vert\boldsymbol{\Phi}_{i}=\boldsymbol{x})\;, \quad\boldsymbol{x}\in \Omega \;,\quad A\subset \Omega \;,$    

where the second line embodies the Markov property that the distribution of each succeeding state in the sequence, given the current and the past states, depends only on the current state.

Generally, the transition kernel in Markov chain simulations has both a continuous and a discrete component. For some function $ p(\boldsymbol{x},\boldsymbol{y}):\Omega \times \Omega \rightarrow \Re ^{+}$, the kernel can be expressed as

$\displaystyle P(\boldsymbol{x},d\boldsymbol{y})=p(\boldsymbol{x},\boldsymbol{y})d\boldsymbol{y}+r(\boldsymbol{x} )\delta _{\boldsymbol{x}}(d\boldsymbol{y})\;,$ (3.2)

where $ p(\boldsymbol{x},\boldsymbol{x})=0$, $ \delta _{\boldsymbol{x}}(d\boldsymbol{y})=1$ if $ \boldsymbol{x}\in d\boldsymbol{y}$ and $ 0$ otherwise, and $ r(\boldsymbol{x})=1-\int_\Omega p(\boldsymbol{x},\boldsymbol{y})\,d\boldsymbol{y}$. This transition kernel specifies that transitions from $ \boldsymbol {x}$ to $ \boldsymbol {y}$ occur according to the density $ p(\boldsymbol{x},\boldsymbol{y})$ and transitions from $ \boldsymbol {x}$ to $ \boldsymbol {x}$ occur with probability  $ r(\boldsymbol{x})$.

The transition kernel is thus the distribution of $ \boldsymbol{\Phi}_{i+1}$ given that $ \boldsymbol{\Phi}_{i}=\boldsymbol{x}$. The $ n$-step-ahead transition kernel is given by

$\displaystyle P^{(n)}(\boldsymbol{x},A)=\int\limits_{\Omega}P(\boldsymbol{x},d\boldsymbol{y})\,P^{(n-1)}(\boldsymbol{y},A)\;,$    

where $ P^{(1)}(\boldsymbol{x},d\boldsymbol{y})=P(\boldsymbol{x},d\boldsymbol{y})$ and

$\displaystyle P(\boldsymbol{x},A)=\int\limits_{A}P(\boldsymbol{x},d\boldsymbol{y})\;.$ (3.3)

The goal is to find conditions under which the $ n$th iterate of the transition kernel converges to the invariant distribution $ \pi ^{\ast}$ as $ n\rightarrow \infty $. The invariant distribution is one that satisfies

$\displaystyle \pi ^{\ast}(d\boldsymbol{y})=\int\limits_{\Omega}P(\boldsymbol{x},d\boldsymbol{y})\pi ( \boldsymbol{x})\,d\boldsymbol{x}\;,$ (3.4)

where $ \pi $ is the density of $ \pi ^{\ast}$ with respect to the Lebesgue measure. The invariance condition states that if $ \boldsymbol{\Phi}_{i}$ is distributed according to $ \pi ^{\ast}$, then all subsequent elements of the chain are also distributed as $ \pi ^{\ast}$. Markov chain samplers are invariant by construction and therefore the existence of the invariant distribution does not have to be checked.
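On a finite state space, (3.4) reduces to the matrix identity $ \pi P=\pi $, which can be checked directly. A minimal sketch (assuming NumPy; the 3-state matrix is an arbitrary example, not from the text):

import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])                 # row-stochastic transition matrix

# invariant distribution: left eigenvector of P for eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()

assert np.allclose(pi @ P, pi)                  # invariance: pi P = pi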

A Markov chain is reversible if the function $ p(\boldsymbol{x},\boldsymbol{y})$ in (3.2) satisfies

$\displaystyle f(\boldsymbol{x})p(\boldsymbol{x},\boldsymbol{y})=f(\boldsymbol{y})p(\boldsymbol{y},\boldsymbol{x})\;,$ (3.5)

for a density $ f(\cdot )$. If this condition holds, it can be shown that $ f(\cdot )=\pi (\cdot)$ and that the chain has $ \pi ^{\ast}$ as an invariant distribution ([59]). To verify this, we evaluate the right hand side of (3.4):

$\displaystyle \int P(\boldsymbol{x},A)\pi (\boldsymbol{x})\,d\boldsymbol{x}$ $\displaystyle =\int \left\{ \int_A p(\boldsymbol{x},\boldsymbol{y})\,d\boldsymbol{y}\right\} \pi (\boldsymbol{x})\,d\boldsymbol{x}+\int r(\boldsymbol{x})\delta _{\boldsymbol{x}}(A)\pi (\boldsymbol{x})\,d\boldsymbol{x}$    
  $\displaystyle =\int_A\left\{ \int p(\boldsymbol{x},\boldsymbol{y})\pi (\boldsymbol{x})\,d\boldsymbol{x}\right\} d\boldsymbol{y}+\int_A r(\boldsymbol{x})\pi (\boldsymbol{x})\,d\boldsymbol{x}$    
  $\displaystyle =\int_A\left\{ \int p(\boldsymbol{y},\boldsymbol{x})\pi (\boldsymbol{y})\,d\boldsymbol{x}\right\} d\boldsymbol{y}+\int_A r(\boldsymbol{x})\pi (\boldsymbol{x})\,d\boldsymbol{x}$    
  $\displaystyle =\int_A\left(1-r(\boldsymbol{y})\right)\pi (\boldsymbol{y})\,d\boldsymbol{y}+\int_A r(\boldsymbol{x})\pi (\boldsymbol{x})\,d\boldsymbol{x}$    
  $\displaystyle =\int_A\pi (\boldsymbol{y})\,d\boldsymbol{y}\;.$ (3.6)
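The finite-state analogue of (3.5) is $ \pi (x)P(x,y)=\pi (y)P(y,x)$. The sketch below (an illustrative example, assuming NumPy) builds a reversible kernel by a Metropolis-type construction, with the rejection mass playing the role of $ r(\boldsymbol{x})$, and verifies both reversibility and invariance:

import numpy as np

pi = np.array([0.2, 0.3, 0.5])               # target probabilities
n = len(pi)

Q = np.full((n, n), 1.0 / n)                 # symmetric proposal
P = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if x != y:
            P[x, y] = Q[x, y] * min(1.0, pi[y] / pi[x])
    P[x, x] = 1.0 - P[x].sum()               # rejection mass: stay at x

assert np.allclose(pi[:, None] * P, pi[None, :] * P.T)  # detailed balance
assert np.allclose(pi @ P, pi)                          # hence invariance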

A minimal requirement on the Markov chain for it to satisfy a law of large numbers is $ \pi ^{\ast}$-irreducibility: the chain must be able to visit all sets with positive probability under $ \pi ^{\ast}$ from any starting point in $ \Omega$. Formally, a Markov chain is said to be $ \pi ^{\ast}$-irreducible if for every $ \boldsymbol{x} \in \Omega$,

$\displaystyle \pi ^{\ast}(A)>0\Rightarrow P(\boldsymbol{\Phi}_i\in A\vert\boldsymbol{\Phi}_0={\boldsymbol{x}}) > 0$    

for some $ i\geq 1$. If the space $ \Omega$ is connected and the function $ p(\boldsymbol{x},\boldsymbol{y})$ is positive and continuous, then the Markov chain with transition kernel given by (3.3) and invariant distribution $ \pi ^{\ast}$ is $ \pi ^{\ast}$-irreducible.

Another important property of a chain is aperiodicity, which ensures that the chain does not cycle through a finite number of sets. A Markov chain is aperiodic if there exists no partition of $ \Omega =(D_0,D_1,\ldots,D_{p-1})$ for some $ p\geq 2$ such that $ P(\boldsymbol{\Phi}_i\in D_{i\,\text{mod}\,(p)}\vert\boldsymbol{\Phi}_0\in D_0)=1$ for all $ i$.
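The simplest failure of aperiodicity is the two-state chain that alternates deterministically: $ \pi ^{\ast}=(1/2,1/2)$ is invariant, yet $ P^n(\boldsymbol{x},\cdot )$ oscillates and never converges. A small sketch (assuming NumPy):

import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])                   # deterministic alternation, period 2

print(np.linalg.matrix_power(P, 10))         # identity matrix: back where we started
print(np.linalg.matrix_power(P, 11))         # swap matrix: P^n has no limit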

These definitions allow us to state the following results from [59], which form the basis for Markov chain Monte Carlo methods. The first result gives conditions under which a strong law of large numbers holds; the second gives conditions under which the distribution of the $ M$th iterate of the Markov chain converges in total variation to its unique invariant distribution.

Theorem 1   Suppose $ \{\boldsymbol{\Phi}_i\}$ is a $ \pi ^{\ast}$-irreducible Markov chain with transition kernel $ P(\cdot ,\cdot )$ and invariant distribution  $ \pi ^{\ast}$. Then $ \pi ^{\ast}$ is the unique invariant distribution of $ P(\cdot ,\cdot )$ and, for all $ \pi ^{\ast}$-integrable real-valued functions $ h$,

$\displaystyle \frac 1M\sum_{i=1}^Mh(\boldsymbol{\Phi}_i)\rightarrow \int h(\boldsymbol{x})\pi ( \boldsymbol{x})d\boldsymbol{x}$   as$\displaystyle \quad M \rightarrow \infty\;,\;a.s.$    
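Theorem 1 can be observed on a chain with a known invariant distribution. The stationary Gaussian AR(1) chain below (a hypothetical illustration, not from the text; its invariant distribution is $ N(0,1)$) has an ergodic average of $ h(x)=x^{2}$ that approaches E$ \,h=1$:

import numpy as np

rng = np.random.default_rng(1)
rho, M = 0.9, 200_000
phi = np.empty(M)
phi[0] = rng.normal()                        # start in stationarity
for i in range(1, M):
    phi[i] = rho * phi[i - 1] + np.sqrt(1 - rho**2) * rng.normal()

print(np.mean(phi**2))                       # ergodic average of h(x) = x^2; close to 1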

Theorem 2   Suppose $ \{\boldsymbol{\Phi}_i\}$ is a $ \pi ^{\ast}$-irreducible, aperiodic Markov chain with transition kernel $ P(\cdot ,\cdot )$ and invariant distribution  $ \pi ^{\ast}$. Then for $ \pi ^{\ast}$-almost every $ \boldsymbol{x} \in \Omega$, and all sets $ A$

$\displaystyle \parallel P^M(\boldsymbol{x},A)-\pi ^{\ast}(A)\parallel \rightarrow 0$   as$\displaystyle \quad M\rightarrow \infty\;,$    

where $ \parallel \cdot \parallel$ denotes the total variation distance.
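For a finite-state chain, the total variation convergence of Theorem 2 can be watched directly by powering the transition matrix (a minimal sketch, assuming NumPy and reusing the arbitrary 3-state kernel from the earlier invariance check):

import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()

for M in (1, 5, 20, 50):
    PM = np.linalg.matrix_power(P, M)
    tv = 0.5 * np.abs(PM - pi).sum(axis=1).max()  # worst case over starting states
    print(M, tv)                                  # decreases toward 0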

A further strengthening of the conditions is required to obtain a central limit theorem for sample-path averages. A key requirement is that of an ergodic chain, i.e., a chain that is irreducible, aperiodic and positive Harris-recurrent (for a definition of the latter, see [59]). In addition, one needs the notion of geometric ergodicity. An ergodic Markov chain with invariant distribution $ \pi ^{\ast}$ is geometrically ergodic if there exists a non-negative real-valued function $ C(\boldsymbol{x})$ (bounded in expectation under $ \pi ^{\ast}$) and a positive constant $ r<1$ such that

$\displaystyle \parallel P^n(\boldsymbol{x},A)-\pi ^{\ast}(A)\parallel \leq C(\boldsymbol{x})r^n$    

for all $ \boldsymbol {x}$, all $ n$, and all sets $ A$. [8] show that if the Markov chain is ergodic, has invariant distribution  $ \pi ^{\ast}$, and is geometrically ergodic, then for all $ L^2$ measurable functions $ h$, taken to be scalar-valued for simplicity, and any initial distribution, the distribution of $ \sqrt{M}(\hat{h}_M-\text{E}\,h)$ converges weakly to a normal distribution with mean zero and variance $ \sigma _h^2\geq 0$, where

$\displaystyle \hat{h}_M =\frac 1M\sum_{i=1}^Mh(\boldsymbol{\Phi}_i)$    
E$\displaystyle h =\int h(\boldsymbol{\Phi})\pi (\boldsymbol{\Phi})d\boldsymbol{\Phi}\,$    

and

$\displaystyle \sigma _h^2=$Var$\displaystyle \,h(\boldsymbol{\Phi}_0)+2\sum_{k=1}^\infty$   Cov$\displaystyle \, \left[ h(\boldsymbol{\Phi}_0),h(\boldsymbol{\Phi}_k)\right]\;.$ (3.7)
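As a concrete check of (3.7), consider the stationary Gaussian AR(1) chain with autocorrelation $ \rho _k=\rho ^k$ and $ h(x)=x$ (an illustrative example, not from the text). Then Var$ \,h(\boldsymbol{\Phi}_0)=1$ and

$\displaystyle \sigma _h^2=1+2\sum_{k=1}^\infty \rho ^k=1+\frac{2\rho }{1-\rho }=\frac{1+\rho }{1-\rho }\;,$    

so a chain with $ \rho ={0.9}$ has an asymptotic variance $ 19$ times that of independent sampling.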

3.2.2 Computation of Numerical Accuracy and Inefficiency Factor

The square root of $ \sigma _{h}^{2}$ is the numerical standard error of $ \hat{h}_{M}$. To describe estimators of $ \sigma _{h}^{2}$ that are consistent in $ M$, let $ Z_{i}=h(\boldsymbol{\Phi}_{i})$ $ (i\leq M)$. Then, because $ \{Z_{i}\}$ is a dependent sequence,

Var$\displaystyle \,\left(\hat{h}_{M}\right)$ $\displaystyle = M^{-2}\sum_{j,k}$Cov$\displaystyle \,(Z_{j},Z_{k})$    
  $\displaystyle = s^{2}M^{-2}\sum_{j,k=1}^{M}\rho _{\vert j-k\vert}$    
  $\displaystyle = s^{2}M^{-1}\left\{ 1+2\sum_{s=1}^{M}\left(1-\frac{s}{M}\right)\rho _{s}\right\}\;,$    

where $ s^{2}$ is the sample variance of $ \{Z_{i}\}$ and $ \rho _{s}$ is the estimated autocorrelation at lag $ s$ (see [46], Chap. 6). If $ \rho _{s}>0$ for each $ s$, then this variance is larger than $ s^{2}/M$, the variance under independence. Another estimate of the variance can be found by consistently estimating the spectral density $ f$ of $ \{Z_{i}\}$ at frequency zero and using the fact that Var$ \,(\hat{h}_{M})=\tau ^{2}/M $, where $ \tau^{2}=2\pi f(0)$. Finally, a traditional approach to finding the variance is the method of batch means. In this approach, the data $ (Z_{1},\ldots,Z_{M})$ are divided into $ k$ batches of length $ m$ with means $ B_{i}=m^{-1}\left[Z_{(i-1)m+1}+\ldots+Z_{im}\right]$, and the variance of $ \hat{h}_{M}$ is estimated as

Var$\displaystyle \,\left(\hat{h}_{M}\right)= \frac{1}{k(k-1)}\sum_{i=1}^{k}\left(B_{i}-\bar{B}\right)^{2}\;,$ (3.8)

where the batch size $ m$ is chosen to ensure that the first-order serial correlation of the batch means is less than $ {0.05}$.
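A minimal batch-means implementation of (3.8) might look as follows (a sketch, assuming NumPy; doubling $ m$ is one simple way to enforce the $ {0.05}$ criterion):

import numpy as np

def batch_means_var(z, m):
    # Estimate Var(h_M) by (3.8) from batches of length m.
    k = len(z) // m
    B = z[:k * m].reshape(k, m).mean(axis=1)       # batch means B_1, ..., B_k
    return np.sum((B - B.mean())**2) / (k * (k - 1)), B

def choose_batch_size(z, m0=10):
    # Double m until the first-order serial correlation of the batch means is < 0.05.
    m = m0
    while True:
        var, B = batch_means_var(z, m)
        r1 = np.corrcoef(B[:-1], B[1:])[0, 1]
        if r1 < 0.05 or len(z) // m < 10:          # stop; keep at least 10 batches
            return m, var
        m *= 2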

Given the numerical variance, it is common to calculate the inefficiency factor, also called the autocorrelation time, defined as

$\displaystyle \kappa _{\hat{h}}=\frac{\text{Var}\,\left(\hat{h}_{M}\right)}{s^{2}/M}\;.$ (3.9)

This quantity is interpreted as the ratio of the numerical variance of $ \hat{h}_{M}$ to the variance of $ \hat{h}_{M}$ based on independent draws; its inverse is the relative numerical efficiency defined in [28]. Because independence sampling produces an autocorrelation time that is theoretically equal to one, while Markov chain sampling produces autocorrelation times greater than one, the inefficiency factor quantifies the relative efficiency loss in the computation of $ \hat{h}_{M}$ from correlated versus independent samples.
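With the batch-means estimate in hand, (3.9) is immediate. The sketch below (assuming the AR(1) draws phi and the functions choose_batch_size defined in the earlier sketches) computes $ \kappa _{\hat{h}}$ for $ h(x)=x$, for which the theoretical value is $ (1+\rho )/(1-\rho )=19$ at $ \rho ={0.9}$:

m, var_hM = choose_batch_size(phi)                  # phi: AR(1) draws from above
kappa = var_hM / (np.var(phi, ddof=1) / len(phi))   # inefficiency factor (3.9)
print(m, kappa)                                     # kappa near 19 for rho = 0.9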

