11.5 More Monte Carlo Methods

While MCMC algorithms considerably expanded the range of applications of Bayesian analysis, they are not, by any means, the end of the story! Further developments are taking place, either at the fringe of the MCMC realm or far away from it. We indicate below a few of the directions taken by Bayesian computational Statistics, omitting many others that are also of interest...


11.5.1 Adaptivity for MCMC Algorithms

Given the range of situations where MCMC applies, it is unrealistic to hope for a generic MCMC sampler that would function in every possible setting. The more generic proposals, like random-walk Metropolis-Hastings algorithms, are known to fail in large dimensions and with disconnected supports, because they take too long to explore the space of interest ([31]). The reason for this impossibility theorem is that, in realistic problems, the complexity of the distribution to simulate is the very reason why MCMC is used! It is thus difficult to ask for a prior opinion about this distribution, its support, or the parameters of the proposal distribution used in the MCMC algorithm: intuition is close to useless in most of these problems.

However, the performance of off-the-shelf algorithms like the random-walk Metropolis-Hastings scheme brings information about the distribution of interest and, as such, should be incorporated in the design of better and more powerful algorithms. The problem is that we usually lack the time to train the algorithm on these previous performances, looking instead for the Holy Grail of automated MCMC procedures! While it is natural to think that the information brought by the first steps of an MCMC algorithm should be used in later steps, there is a severe catch: using the whole past of the ''chain'' implies that it is no longer a Markov chain. Therefore, the usual convergence theorems do not apply and the validity of the corresponding algorithms is questionable. Further, in practice, such algorithms may degenerate to point masses because the variance of their proposal decreases too rapidly.

Figure 11.16: Output of the adaptive scheme for the $ t$-distribution posterior with a sample of $ 10$ observations $ x_j\sim \mathcal{T}_3$ and initial variances of (top) 0.1, (middle) 0.5, and (bottom) 2.5. The left column plots the sequence of $ \theta ^{(i)}$'s, while the right column compares its histogram against the true posterior distribution (with a different scale for the upper graph)
\includegraphics[width=11.7cm]{text/3-11/adap.eps}

Example 19 (Continuation of Example 9)   For the $ t$-distribution sample, we could fit a normal proposal from the empirical mean and variance of the previous values of the chain,

$\displaystyle \mu_t = \frac{1}{t}\,\sum_{i=1}^t \theta^{(i)} \quad\text{and}\quad \sigma^2_t = \frac{1}{t}\,\sum_{i=1}^t \left(\theta^{(i)} - \mu_t\right)^2\;.$

This leads to a Metropolis-Hastings algorithm with acceptance probability

$\displaystyle \prod_{j=2}^n \left[ \frac{\nu + \left(x_j-\theta^{(t)}\right)^2}{\nu + \left(x_j-\xi\right)^2}\right]^{(\nu+1)/2} \frac{\exp\left\{-\left(\mu_t-\theta^{(t)}\right)^2/2\sigma^2_t\right\}}{\exp\left\{-\left(\mu_t-\xi\right)^2/2\sigma^2_t\right\}} \wedge 1\;,$

where $ \xi$ is the value proposed from $ \mathcal{N}(\mu_t,\sigma^2_t)$. The invalidity of this scheme (due to its dependence on the whole sequence of $ \theta ^{(i)}$'s up to iteration $ t$) is illustrated in Fig. 11.16: when the range of the initial values is too small, the sequence of $ \theta ^{(i)}$'s cannot converge to the target distribution and concentrates on too small a support. But the problem is deeper, because even when the range of the simulated values is correct, the (long-term) dependence on past values modifies the distribution of the sequence. Figure 11.17 shows that, for an initial variance of $ 2.5$, a bias remains in the histogram even after $ 25{,}000$ iterations and the stabilisation of the empirical mean and variance.

Figure 11.17: Comparison of the distribution of an adaptive scheme sample of $ 25{,}000$ points with initial variance of $ 2.5$ and of the target distribution
\includegraphics[width=8.7cm]{text/3-11/adap25.eps}
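For concreteness, here is a minimal Python sketch of this (invalid) adaptive scheme, assuming the flat-prior Student's $ t$ posterior $ \pi(\theta)\propto\prod_j\left[\nu+(x_j-\theta)^2\right]^{-(\nu+1)/2}$ of Example 9; the code and all names in it are illustrative, not the implementation behind Fig. 11.16:

  import numpy as np

  rng = np.random.default_rng(0)
  nu = 3
  x = rng.standard_t(nu, size=10)            # 10 observations x_j ~ T_3

  def log_post(theta):
      # log of the flat-prior posterior: -(nu+1)/2 * sum_j log(nu + (x_j - theta)^2)
      return -(nu + 1) / 2 * np.log(nu + (x - theta) ** 2).sum()

  T = 25_000
  theta = np.zeros(T)
  mu, sigma2 = 0.0, 2.5                      # initial proposal N(mu, sigma2), cf. Fig. 11.16 (bottom)
  for t in range(1, T):
      xi = rng.normal(mu, np.sqrt(sigma2))   # propose xi ~ N(mu_t, sigma_t^2)
      # log acceptance ratio: posterior ratio times the independent-proposal correction
      log_alpha = (log_post(xi) - log_post(theta[t - 1])
                   - (mu - theta[t - 1]) ** 2 / (2 * sigma2)
                   + (mu - xi) ** 2 / (2 * sigma2))
      theta[t] = xi if np.log(rng.uniform()) < log_alpha else theta[t - 1]
      # adaptive step: refit the proposal on the WHOLE past; this is precisely
      # what destroys Markovianity and the usual convergence guarantees
      mu, sigma2 = theta[: t + 1].mean(), max(theta[: t + 1].var(), 1e-6)

The variance floor ($ 10^{-6}$) is only there to keep the sketch numerically safe; it does not cure the degeneracy discussed above.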

Even though the Markov chain converges in distribution to the target distribution when a proper, i.e. time-homogeneous, updating scheme is used, using past simulations to create a non-parametric approximation to the target distribution does not work either. Figure 11.18 shows, for instance, the output of an adaptive scheme in the setting of Example 19 when the proposal distribution is a Gaussian kernel estimate based on earlier simulations. Even a very large number of iterations is not sufficient to reach an acceptable approximation of the target distribution.

Figure 11.18: Sample produced by $ 50{,}000$ iterations of a nonparametric adaptive MCMC scheme and comparison of its distribution with the target distribution
\includegraphics[width=10cm]{text/3-11/nopada.eps}
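A sketch of such a kernel-based proposal may again be useful; the bandwidth h and all names are illustrative, with the proposal density being the Gaussian kernel estimate built on the past draws:

  import numpy as np

  rng = np.random.default_rng(1)

  def kernel_proposal(past, h):
      # Gaussian kernel estimate on past draws: returns a sampler and its log-density
      def draw():
          centre = rng.choice(past)          # pick a past value uniformly at random
          return rng.normal(centre, h)       # jitter it with bandwidth h
      def log_q(y):
          # log of the mixture density (1/m) sum_k N(y; past_k, h^2)
          z = -(y - past) ** 2 / (2 * h ** 2)
          return (np.logaddexp.reduce(z) - np.log(past.size)
                  - 0.5 * np.log(2 * np.pi * h ** 2))
      return draw, log_q

In a Metropolis-Hastings step, log_q enters the acceptance ratio exactly as the normal proposal density did in Example 19, and the same non-Markovian caveat applies.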

The overall message is thus that one should not constantly adapt the proposal distribution to the past performances of the simulated chain. Either the adaptation must cease after a burn-in period (which is then excluded from the computation of expectations and other quantities related to the target distribution), or the adaptive scheme must be theoretically assessed in its own right. The latter path is not easy and only a few examples can be found (so far) in the literature. See, e.g., [17], who use regeneration to create block independence and preserve Markovianity on the paths rather than on the values; Haario et al. ([22],[23]), who derive a proper adaptation scheme in the spirit of Example 19 by using a ridge-like correction to the empirical variance; and [4], who propose a more general framework of valid adaptivity based on stochastic optimisation and the Robbins-Monro algorithm. (The latter actually embeds the chain of interest $ \theta^{(t)}$ in a larger chain $ (\theta^{(t)},\xi^{(t)},\partial^{(t)})$ that also includes the parameter of the proposal distribution as well as the gradient of a performance criterion.)
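To fix ideas on this second path, here is a rough one-dimensional rendering of an adaptation in the spirit of Haario et al.'s scheme, where a small ridge term eps is added to the empirical variance so that the proposal cannot degenerate; this is a simplified sketch under our own conventions, not the authors' exact algorithm:

  import numpy as np

  rng = np.random.default_rng(2)

  def adaptive_metropolis(log_post, theta0, T, eps=0.01, s=2.4 ** 2):
      # s = 2.4^2/d is the classical random-walk scaling, here with dimension d = 1
      theta = np.zeros(T)
      theta[0] = theta0
      for t in range(1, T):
          # random-walk variance tracks the ridge-corrected empirical variance
          v = s * (theta[:t].var() + eps)
          xi = theta[t - 1] + rng.normal(0.0, np.sqrt(v))
          # symmetric proposal: no correction term in the acceptance ratio
          if np.log(rng.uniform()) < log_post(xi) - log_post(theta[t - 1]):
              theta[t] = xi
          else:
              theta[t] = theta[t - 1]
      return theta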


11.5.2 Population Monte Carlo

To reach acceptable adaptive algorithms, while avoiding an extended study of their theoretical properties, a better alternative is to leave the structure of Markov chains and to consider sequential or population Monte Carlo methods ([25], [6]) that have much more in common with importance sampling than with MCMC. They are inspired by the particle systems that were introduced to handle rapidly changing target distributions like those found in signal processing and imaging ([19,39,13]), but they primarily handle fixed, albeit complex, target distributions by building a sequence of increasingly better proposal distributions.11.14 Each iteration of the population Monte Carlo (PMC) algorithm thus produces a sample approximately simulated from the target distribution, but the iterative structure allows for adaptivity toward the target distribution. Since the validation is based on importance sampling principles, dependence on the past samples can be arbitrary and the approximation to the target is valid (unbiased) at each iteration, requiring neither convergence times nor stopping rules.

If $ t$ indexes the iteration and $ i$ the sample point, consider proposal distributions $ q_{it}$ that simulate the $ x_i^{(t)}$'s and associate with each $ x_i^{(t)}$ an importance weight

$\displaystyle \varrho_i^{(t)} = \pi\left(x_i^{(t)}\right) \big/ q_{it}\left(x_i^{(t)}\right)\;, \quad i=1,\ldots,n\;.$    

Approximations of the form

$\displaystyle \mathfrak{I}_t = \frac{1}{n}\,\sum_{i=1}^n \varrho_i^{(t)} \,h\left(x_i^{(t)}\right)$    

are then unbiased estimators of $ \mathbb{E}^{\pi}[h(X)]$, even when the importance distribution $ q_{it}$ depends on the entire past of the experiment. Indeed, if $ \zeta$ denotes the vector of past random variates that contribute to $ q_{it}$, and $ g(\zeta)$ its arbitrary distribution, we have

$\displaystyle \iint \frac{\pi(x)}{q_{it}(x\vert\zeta)}\, h(x)\, q_{it}(x\vert\zeta)\, {\text{d}}x\, g(\zeta)\, {\text{d}}\zeta = \iint h(x)\, \pi(x)\, {\text{d}}x\, g(\zeta)\, {\text{d}}\zeta = \mathbb{E}^{\pi}[h(X)]\;.$

Furthermore, assuming that the variances

$\displaystyle {\text{var}}\left(\varrho_i^{(t)} h\left(x_i^{(t)}\right)\right)$    

exist for every $ 1\le i\le n$, we have

$\displaystyle {\text{var}}\left( \mathfrak{I}_t \right) = \frac{1}{n^2}\ \sum_{i=1}^n {\text{var}}\left(\varrho_i^{(t)} h\left(x_i^{(t)}\right)\right) \;,$    

due to the canceling effect of the weights $ \varrho_i^{(t)}$.

Since the density $ \pi $ is usually known only up to a normalizing constant, we use instead

$\displaystyle \varrho_i^{(t)} \propto \frac{\pi\left(x_i^{(t)}\right)}{q_{it}\big(x_i^{(t)}\big)}\;, \quad i=1,\ldots,n\;,$    

scaled so that the $ \varrho_i^{(t)}$'s sum up to $ 1$. In this case, unbiasedness is lost, although it holds approximately. In fact, the estimation of the normalizing constant of $ \pi $ improves with each iteration $ t$, since the overall average

$\displaystyle \varpi_t = \frac{1}{tn}\,\sum_{\tau=1}^t\,\sum_{i=1}^n\, \frac{\pi\left(x_i^{(\tau)}\right)}{q_{i\tau}\big(x_i^{(\tau)}\big)}$    

is convergent. Therefore, as $ t$ increases, $ \varpi_t$ contributes less and less to the variability of  $ \mathfrak{I}_t$.

Since the above establishes that a simulation scheme based on sample-dependent proposals is fundamentally a specific kind of importance sampling, the following algorithm is validated by the same principles as regular importance sampling:

Population Monte Carlo Algorithm

For $ t=1,\ldots,T$

  1. For $ i=1,\ldots,n$,
     (1) Select the generating distribution $ q_{it}(\cdot)$
     (2) Generate $ x_i^{(t)}\sim q_{it}(x)$
     (3) Compute $ \varrho_i^{(t)} = \pi(x_i^{(t)})/q_{it}(x_i^{(t)})$
  2. Normalize the $ \varrho_i^{(t)}$'s to sum up to $ 1$
  3. Resample $ n$ values from the $ x_i^{(t)}$'s with replacement, using the weights $ \varrho_i^{(t)}$, to create the sample $ (x_1^{(t)},\ldots,x_n^{(t)})$

Step (1) is singled out because it is the central property of the PMC algorithm: adaptivity can be extended to the individual level, and the $ q_{it}$'s can be picked based on the performances of the previous $ q_{i(t-1)}$'s, or even on all the previously simulated samples if storage allows. For instance, the $ q_{it}$'s can include large-tail proposals, as in the defensive sampling strategy of [24], to ensure a finite variance. Similarly, Warnes' (2001) non-parametric Gaussian kernel approximation can be used as a proposal.11.15 (See also the smooth bootstrap of [42] as an earlier example of a PMC algorithm.)

The major difference between the PMC algorithm and earlier proposals in the particle system literature is that past-dependent moves such as those of [16] remain within the MCMC framework, relying on Markov transition kernels with stationary distribution equal to $ \pi $.
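A compact rendering of the PMC loop may help at this point; the sketch below assumes a one-dimensional target known up to a constant, leaves the choice of the $ q_{it}$'s in step (1) as a plug-in, and uses illustrative names throughout:

  import numpy as np

  rng = np.random.default_rng(3)

  def pmc(log_pi, select_q, n, T):
      # select_q(i, t, xs) returns a (sampler, log-density) pair for particle i
      xs = rng.normal(0.0, 1.0, size=n)                     # arbitrary initial population
      for t in range(T):
          samplers = [select_q(i, t, xs) for i in range(n)]      # step (1): pick q_it
          xs = np.array([draw() for draw, _ in samplers])        # step (2): x_i ~ q_it
          log_w = np.array([log_pi(xi) - log_q(xi)               # step (3): weights
                            for xi, (_, log_q) in zip(xs, samplers)])
          w = np.exp(log_w - log_w.max())                   # stabilized exponentiation
          w /= w.sum()                                      # normalize to sum to 1
          xs = rng.choice(xs, size=n, p=w)                  # multinomial resampling
      return xs

  # example plug-in: an individual random-walk proposal centred at particle i
  def rw_select(i, t, xs, v=1.0):
      m = xs[i]
      return (lambda: rng.normal(m, np.sqrt(v)),
              lambda y: -(y - m) ** 2 / (2 * v) - 0.5 * np.log(2 * np.pi * v))

Because the validation is by importance sampling, select_q may inspect the whole history of the population without invalidating the weighted estimators.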

Example 20 (Continuation of Example 15)   We consider here the implementation of the PMC algorithm in the case of the normal mixture (11.10). As in Example 16, a PMC sampler can be efficiently implemented without the (Gibbs) augmentation step, using normal random walk proposals based on the previous sample of $ \boldsymbol{\mu}=(\mu_1,\mu_2)$'s. Moreover, the difficulty inherent in random walks, namely the selection of a ''proper'' scale, can be bypassed thanks to the adaptivity of the PMC algorithm. Indeed, the proposals can be associated with a range of variances $ v_k$ $ (1\le k\le K)$ ranging from, e.g., $ 10^3$ down to $ 10^{-3}$. At each step of the algorithm, the new variances can be selected proportionally to the performances of the scales $ v_k$ in the previous iterations. For instance, a scale can be chosen proportionally to its non-degeneracy rate in the previous iteration, that is, the percentage of points generated with the scale $ v_k$ that survived after resampling.11.16 The weights are then of the form

$\displaystyle \varrho_j\propto\frac{f\left(\boldsymbol{x}\left\vert\left(\mu_1\right)_j^{(i)},\left(\mu_2\right)_j^{(i)}\right.\right)\, \pi\left(\left(\mu_1\right)_j^{(i)},\left(\mu_2\right)_j^{(i)}\right)}{\varphi\left(\left(\mu_1\right)_j^{(i)}\left\vert\left(\mu_1\right)_j^{(i-1)},v_k\right.\right)\, \varphi\left(\left(\mu_2\right)_j^{(i)}\left\vert\left(\mu_2\right)_j^{(i-1)},v_k\right.\right)}\;,$

where $ \varphi(q\vert s,v)$ is the density of the normal distribution with mean $ s$ and variance $ v$ at the point $ q$.
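The scale-selection mechanism itself can be sketched as follows; the floor term that keeps zero-survival scales in play is our own assumption (the example does not specify how such scales re-enter after a degenerate iteration), and all names are hypothetical:

  import numpy as np

  rng = np.random.default_rng(4)
  v = np.array([5.0, 2.0, 0.1, 0.05, 0.01])      # the scales v_k of Fig. 11.21

  def allocate_scales(survivor_scales, n, floor=0.01):
      # assign a scale index to each of the n particles, proportionally to the
      # non-degeneracy (survival) count of each v_k at the previous iteration
      counts = np.bincount(survivor_scales, minlength=len(v)).astype(float)
      probs = counts + floor * n                 # small floor keeps every scale alive
      probs /= probs.sum()
      return rng.choice(len(v), size=n, p=probs)

  # usage: scales[j] is the index k of the variance v_k used to move particle j
  scales = allocate_scales(np.array([0, 0, 2, 3, 4]), n=1050)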

Compared with an MCMC algorithm in the same setting (see Examples 15 and 16), the main feature of this algorithm is its ability to deal with multiscale proposals in an unsupervised manner. The upper row of Fig. 11.21 shows the frequencies of the five variances $ v_k$ used in the proposals across iterations: the two largest variances $ v_k$ most often have a zero survival rate, but sometimes experience bursts of survival. In fact, too large a variance mostly produces points that are irrelevant for the posterior distribution, but once in a while a point $ \theta_j^{(t)}$ gets close to one of the modes of the posterior. When this occurs, the corresponding $ \varrho_j$ is large and $ \theta_j^{(t)}$ is thus heavily resampled. The upper right graph shows that the other proposals are rather evenly sampled across iterations. The influence of the variation in the proposals on the estimation of the means $ \mu _1$ and $ \mu _2$ can be seen in the middle and lower panels of Fig. 11.21. First, the cumulative averages quickly stabilize over iterations, by virtue of the general importance sampling properties. Second, the corresponding variances take longer to stabilize, but this is to be expected, given the regular reappearance of subsamples with large variances.

In comparison with Figs. 11.10 and 11.11, Fig. 11.19 shows that the sample produced by the PMC algorithm agrees quite well with the modal zone of the posterior distribution. The second mode, which is much lower, is not preserved in the sample after the first iteration. Figure 11.20 also shows that the weights are quite similar, with no single weight overwhelming the sample.

Figure 11.19: Representation of the log-posterior distribution with the PMC weighted sample after $ 30$ iterations (the weights are proportional to the circles at each point) (Source: [6])
\includegraphics[width=10.5cm]{text/3-11/pmc2mix.eps}

Figure 11.20: Histograms of the PMC sample: sample at iteration $ 5$ (left) before resampling and (right) after resampling
\includegraphics[width=10.5cm]{text/3-11/hpmc2mix.eps}

Figure 11.21: Performances of the mixture PMC algorithm for $ 1000$ observations from a  $ 0.2 \mathcal{N}(0,1)+0.8\mathcal{N}(2,1)$ distribution, with $ \theta =1$, $ \lambda =0.1$, $ v_k=5,2,0.1,0.05,0.01$, and a population of $ 1050$ particles: (upper left) Number of resampled points for the variances $ v_1=5$ (darker) and $ v_2=2$; (upper right) Number of resampled points for the other variances, $ v_3=0.1$ being the darkest one; (middle left) Variance of the simulated $ \mu _1$'s along iterations; (middle right) Cumulative average of the simulated $ \mu _1$'s over iterations; (lower left) Variance of the simulated $ \mu _2$'s along iterations; (lower right) Cumulative average of the simulated $ \mu _2$'s over iterations (Source: [6])
\includegraphics[width=11cm]{text/3-11/itersamp.eps}

The generality in the choice of the proposal distributions $ q_{it}$ is obviously due to the abandonment of the MCMC framework. This difference with an MCMC framework is not simply a theoretical advantage: as seen in Sect. 11.5.1, proposals based on the whole past of the chain often do not work. Even algorithms validated by MCMC steps may run into difficulties: in one example of [6], a Metropolis-Hastings scheme does not work well, while a PMC algorithm based on the same proposal produces correct answers.

