While MCMC algorithms considerably expanded the range of applications of Bayesian analysis, they are not, by any means, the end of the story! Further developments are taking place, either at the fringe of the MCMC realm or far away from it. We indicate below a few of the directions in Bayesian computational statistics, omitting many others that are also of interest.
Given the range of situations where MCMC applies, it is unrealistic to hope for a generic MCMC sampler that would function in every possible setting. The more generic proposals, like random-walk Metropolis-Hastings algorithms, are known to fail in large dimensions and on disconnected supports, because they take too long to explore the space of interest ([31]). The reason for this impossibility theorem is that, in realistic problems, the complexity of the distribution to simulate is the very reason why MCMC is used! So it is difficult to ask for a prior opinion about this distribution, its support, or the parameters of the proposal distribution used in the MCMC algorithm: intuition is of little help in most of these problems.
However, the performance of off-the-shelf algorithms like the random-walk Metropolis-Hastings scheme brings information about the distribution of interest and, as such, should be incorporated in the design of better and more powerful algorithms. The problem is that we usually lack the time to train the algorithm on these previous performances and are looking for the Holy Grail of automated MCMC procedures! While it is natural to think that the information brought by the first steps of an MCMC algorithm should be used in later steps, there is a severe catch: using the whole past of the "chain" implies that this is not a Markov chain any longer. Therefore, the usual convergence theorems do not apply and the validity of the corresponding algorithms is questionable. Further, it may be that, in practice, such algorithms degenerate to point masses because of a too rapid decrease in the variation of their proposal.
Even though the Markov chain converges in distribution to the target distribution (when a proper, i.e. time-homogeneous, updating scheme is used), using past simulations to create a non-parametric approximation to the target distribution does not work either. Figure 11.18 shows, for instance, the output of an adaptive scheme in the setting of Example 19 when the proposal distribution is the Gaussian kernel based on earlier simulations. Even a very large number of iterations is not sufficient to reach an acceptable approximation of the target distribution.
The overall message is thus that one should not constantly adapt the proposal distribution to the past performance of the simulated chain. Either the adaptation must cease after a burn-in period (not to be taken into account for the computation of expectations and quantities related to the target distribution), or the adaptive scheme must be theoretically assessed in its own right. The latter path is not easy and only a few examples can be found (so far) in the literature. See, e.g., [17], who use regeneration to create block independence and preserve Markovianity on the paths rather than on the values; Haario et al. ([22],[23]), who derive a proper adaptation scheme in the spirit of Example 19 by using a ridge-like correction to the empirical variance; and [4], who propose a more general framework of valid adaptivity based on stochastic optimisation and the Robbins-Monro algorithm. (The latter actually embeds the chain of interest in a larger chain that also includes the parameter of the proposal distribution as well as the gradient of a performance criterion.)
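As a rough illustration of the stochastic-optimisation route, the sketch below tunes the scale of a random-walk Metropolis-Hastings proposal with a Robbins-Monro recursion whose gain $1/t$ vanishes, so the adaptation dies out over time. The target acceptance rate of 0.234 and all names here are choices of this sketch, not a prescription taken from the references above:

```python
import math
import random

def adaptive_rw_metropolis(log_pi, x0, n_iter=5000, target_rate=0.234):
    """Random-walk Metropolis whose proposal scale follows a
    Robbins-Monro recursion toward a target acceptance rate;
    the gain 1/t vanishes, so the adaptation dies out."""
    x, log_s = x0, 0.0            # log of the proposal standard deviation
    chain = []
    for t in range(1, n_iter + 1):
        y = x + math.exp(log_s) * random.gauss(0.0, 1.0)
        alpha = min(1.0, math.exp(log_pi(y) - log_pi(x)))
        if random.random() < alpha:
            x = y
        # Robbins-Monro step: raise the scale when accepting too often
        log_s += (alpha - target_rate) / t
        chain.append(x)
    return chain, math.exp(log_s)
```

For a standard normal target, `adaptive_rw_metropolis(lambda x: -0.5 * x * x, 0.0)` stabilizes the scale near the classical optimum while the diminishing gain preserves the ergodicity argument sketched in [4].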
To reach acceptable adaptive algorithms, while avoiding an extended study of their theoretical properties, a better alternative is to leave the structure of Markov chains and to consider sequential or population Monte Carlo methods ([25], [6]) that have much more in common with importance sampling than with MCMC. They are inspired by the particle systems that were introduced to handle rapidly changing target distributions like those found in signal processing and imaging ([19,39,13]), but they primarily handle fixed yet complex target distributions by building a sequence of increasingly better proposal distributions. Each iteration of the population Monte Carlo (PMC) algorithm thus produces a sample approximately simulated from the target distribution, but the iterative structure allows for adaptivity toward the target distribution. Since the validation is based on importance sampling principles, dependence on the past samples can be arbitrary and the approximation to the target is valid (unbiased) at each iteration, requiring neither convergence times nor stopping rules.
If $t$ indexes the iteration and $i$ the sample point, consider proposal distributions $q_{it}$ that simulate the $x_i^{(t)}$'s and associate to each $x_i^{(t)}$ an importance weight
$$\varrho_i^{(t)} = \pi\left(x_i^{(t)}\right) \Big/ q_{it}\left(x_i^{(t)}\right), \qquad i = 1, \ldots, n.$$
Since, usually, the density $\pi$ is unscaled, we use instead the self-normalized weights
$$\varrho_i^{(t)} \propto \pi\left(x_i^{(t)}\right) \Big/ q_{it}\left(x_i^{(t)}\right), \qquad i = 1, \ldots, n, \qquad \sum_{i=1}^{n} \varrho_i^{(t)} = 1.$$
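These weights turn expectations under $\pi$ into weighted averages over the simulated sample. A minimal, self-contained sketch of this self-normalization step, with an assumed unnormalized Gaussian target and a wider Gaussian proposal, is:

```python
import math
import random

def snis(h, log_target, log_proposal, sample):
    """Self-normalized importance sampling estimate of E_pi[h(X)]:
    log_target may omit the normalizing constant of pi, since any
    constant shared by all weights cancels in the renormalization."""
    log_w = [log_target(x) - log_proposal(x) for x in sample]
    m = max(log_w)                       # guard against exp() underflow
    w = [math.exp(lw - m) for lw in log_w]
    return sum(wi * h(xi) for wi, xi in zip(w, sample)) / sum(w)

# Example: estimate E[X^2] = 1 under an unnormalized N(0,1) target,
# sampling from the wider proposal N(0, 2^2); constants are dropped
# on both sides since they cancel after normalization
random.seed(0)
draws = [random.gauss(0.0, 2.0) for _ in range(20000)]
estimate = snis(lambda x: x * x,
                lambda x: -0.5 * x * x,      # log pi up to a constant
                lambda x: -x * x / 8.0,      # log q up to a constant
                draws)
```

The subtraction of the largest log weight before exponentiating is a standard numerical safeguard, not part of the mathematical definition.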
Since the above establishes that a simulation scheme based on sample-dependent proposals is fundamentally a specific kind of importance sampling, the following algorithm is validated by the same principles as regular importance sampling:
Population Monte Carlo Algorithm

For $t = 1, \ldots, T$:
1. For $i = 1, \ldots, n$:
   (i) Select the generating distribution $q_{it}(\cdot)$;
   (ii) Generate $x_i^{(t)} \sim q_{it}(x)$;
   (iii) Compute $\varrho_i^{(t)} = \pi(x_i^{(t)}) / q_{it}(x_i^{(t)})$;
2. Normalize the $\varrho_i^{(t)}$'s into weights summing to one;
3. Resample $n$ values with replacement from the $x_i^{(t)}$'s, using the normalized weights.
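This loop can be sketched as follows for a toy one-dimensional mixture target. The grid of proposal scales loosely echoes the multiscale setting discussed below, but the target, the scale values, and the uniform selection of scales are illustrative assumptions of this sketch, not the chapter's example:

```python
import math
import random

def log_pi(x):
    """Toy unnormalized target: balanced mixture of N(-3,1) and N(3,1),
    evaluated via a log-sum-exp to avoid underflow in the tails."""
    a = -0.5 * (x - 3.0) ** 2
    b = -0.5 * (x + 3.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def pmc(log_pi, n=200, n_iter=10, scales=(0.1, 0.5, 1.0, 5.0, 25.0)):
    """PMC sketch: each particle proposes from a Gaussian centred at its
    current value, with a scale drawn at random among the candidates
    (the adaptive scale selection of step (1) is left out here)."""
    xs = [random.gauss(0.0, 10.0) for _ in range(n)]      # initial cloud
    for _ in range(n_iter):
        sig = [random.choice(scales) for _ in range(n)]
        ys = [x + s * random.gauss(0.0, 1.0) for x, s in zip(xs, sig)]
        # log pi(y) - log q(y), with q = N(x, s^2); additive constants
        # shared by all particles cancel in the normalization
        log_w = [log_pi(y) + 0.5 * ((y - x) / s) ** 2 + math.log(s)
                 for y, x, s in zip(ys, xs, sig)]
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        xs = random.choices(ys, weights=w, k=n)   # multinomial resampling
    return xs
```

After a handful of iterations the resampled population concentrates around the two modes at $\pm 3$; the weighted samples at each iteration, not the loop's end state, carry the importance sampling validation.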
Step (1) is singled out because it is the central property of the PMC algorithm, namely that adaptivity can be extended to the individual level: the $q_{it}$'s can be picked based on the performance of the previous $q_{i(t-1)}$'s, or even on all the previously simulated samples, if storage allows. For instance, the $q_{it}$'s can include heavy-tailed proposals, as in the defensive sampling strategy of [24], to ensure finite variance. Similarly, Warnes' (2001) non-parametric Gaussian kernel approximation can be used as a proposal. (See also the smooth bootstrap of [42] for an earlier example of a PMC algorithm.)
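One possible reading of the defensive strategy, as a sketch: mix a small fraction $\rho$ of a heavy-tailed (here Cauchy) component into a Gaussian proposal, so that the importance weights $\pi/q$ remain bounded whenever $\pi$ itself has at most Cauchy-like tails. The densities and parameter values below are illustrative assumptions, not the construction of [24]:

```python
import math
import random

def defensive_sample(n, rho=0.1, mu=0.0, sigma=1.0, tail_scale=10.0):
    """Draw from the defensive mixture
    q = (1 - rho) * N(mu, sigma^2) + rho * Cauchy(0, tail_scale)."""
    out = []
    for _ in range(n):
        if random.random() < rho:
            # Cauchy draw via the inverse-cdf transform
            out.append(tail_scale * math.tan(math.pi * (random.random() - 0.5)))
        else:
            out.append(random.gauss(mu, sigma))
    return out

def defensive_logq(x, rho=0.1, mu=0.0, sigma=1.0, tail_scale=10.0):
    """Log density of the same defensive mixture, used in the weights."""
    gauss = (1.0 - rho) / (sigma * math.sqrt(2.0 * math.pi)) \
        * math.exp(-0.5 * ((x - mu) / sigma) ** 2)
    cauchy = rho / (math.pi * tail_scale * (1.0 + (x / tail_scale) ** 2))
    return math.log(gauss + cauchy)
```

Because $q(x) \geq \rho \, q_{\text{Cauchy}}(x)$ everywhere, the weight $\pi(x)/q(x)$ is bounded above by $\pi(x)/(\rho \, q_{\text{Cauchy}}(x))$, which is the finite-variance safeguard the defensive mixture is meant to provide.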
The major difference between the PMC algorithm and earlier proposals in the particle system literature is that past-dependent moves such as those of [16] remain within the MCMC framework, using Markov transition kernels with stationary distribution equal to the target $\pi$.
Compared with an MCMC algorithm in the same setting (see Examples 15 and 16), the main feature of this algorithm is its ability to deal with multiscale proposals in an unsupervised manner. The upper row of Fig. 11.21 shows the frequencies of the five variances used in the proposals along iterations: the two largest variances most often have a zero survival rate, but sometimes experience bursts of survival. In fact, too large a variance mostly produces points that are irrelevant for the posterior distribution, but once in a while a point gets close to one of the modes of the posterior. When this occurs, the corresponding weight is large and the point is thus heavily resampled. The upper right graph shows that the other proposals are rather evenly sampled along iterations. The influence of the variation in the proposals on the estimation of the two means can be seen in the middle and lower panels of Fig. 11.21. First, the cumulative averages quickly stabilize over iterations, by virtue of the general importance sampling properties. Second, the corresponding variances take longer to stabilize, but this is to be expected, given the regular reappearance of subsamples with large variances.
In comparison with Figs. 11.10 and 11.11, Fig. 11.19 shows that the sample produced by the PMC algorithm is in close agreement with the modal zone of the posterior distribution. The second mode, which is much lower, is not preserved in the sample after the first iteration. Figure 11.20 also shows that the weights are quite similar, with no overwhelming weight in the sample.
The generality in the choice of the proposal distributions is obviously due to the abandonment of the MCMC framework. This difference is not simply a theoretical advantage: as seen in Sect. 11.5.1, proposals based on the whole past of the chain often fail. Even algorithms validated by MCMC steps may have difficulties: in one example of [6], a Metropolis-Hastings scheme does not work well, while a PMC algorithm based on the same proposal produces correct answers.