11.2 Bayesian Computational Challenges

Bayesian Statistics being a complete inferential methodology, its scope encompasses the whole range of standard statistical inference (and design), from point estimation to testing, to model selection, and to non-parametrics. In principle, once a prior distribution has been chosen on the proper space, the whole inferential machinery is set and the computation of estimators is usually automatically derived from this setup. Obviously, the practical or numerical derivation of these procedures may be exceedingly difficult or even impossible, as we will see in a few selected examples. Before proceeding with an incomplete typology of the categories of difficulties met by Bayesian inference, let us point out that computational difficulties may originate from one or several of the following items:

(1)
use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;
(2)
use of a complex sampling model with an intractable likelihood, as for instance in missing data and graphical models;

(3)
use of a huge dataset;

(4)
use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample);

(5)
use of a complex inferential procedure.


11.2.1 Bayesian Point Estimation

In a formalised representation of Bayesian inference, the statistician is given (or she selects) a triplet $ (f,\pi,L)$, made of a sampling distribution $ f(x\vert\theta)$, a prior distribution $ \pi(\theta)$ on the parameter $ \theta$, and a loss function $ L(\theta,d)$ measuring the consequences of taking decision $ d$ when the parameter value is $ \theta$.

Given $ (f,\pi,L)$ and an observation $ x$, Bayesian inference is always given as the solution to the minimisation programme

$\displaystyle \min_d\,\int_{\Theta} L(\theta,d)\,f(x\vert\theta)\,\pi(\theta)\,{\text{d}}\theta\;,$    

equivalent to the minimisation programme

$\displaystyle \min_d\,\int_{\Theta} L(\theta,d)\,\pi(\theta\vert x)\,{\text{d}}\theta\;.$    

The corresponding procedure is thus associated, for every $ x$, with the solution of the above programme (see, e.g. [35], Chap. 2).

There are therefore two levels of computational difficulty with this resolution: first, the above integral must be computed; second, it must be minimised in $ d$. For the most standard losses, like the squared error loss,

$\displaystyle L(\theta,d) = \left\vert \theta -d\right\vert^2\;,$    

the solution to the minimisation problem is universally11.1 known. For instance, for the squared error loss, it is the posterior mean,

$\displaystyle \int_{\Theta} \theta \,\pi(\theta\vert x)\,{\text{d}}\theta = \int_{\Theta} \theta \,f(x\vert\theta)\,\pi(\theta)\,{\text{d}}\theta \bigg/ \int_{\Theta} f (x\vert\theta)\,\pi(\theta)\,{\text{d}}\theta \;,$    

which still requires the computation of both integrals, and whose complexity thus depends on the complexity of both $ f(x\vert\theta)$ and  $ \pi(\theta)$.

Example 1   For a normal distribution $ \mathcal{N}(\theta,1)$, the use of a so-called conjugate prior (see, e.g., [35], Chap. 3)

$\displaystyle \theta\sim\mathcal{N}\left(\mu,\epsilon^2\right)\;,$    

leads to a closed-form expression for the posterior mean, since

$\displaystyle \int_{\Theta} \theta \,f(x\vert\theta)\,\pi(\theta)\,{\text{d}}\theta \bigg/ \int_{\Theta} f(x\vert\theta)\,\pi(\theta) \,{\text{d}}\theta = \int_{\mathbb{R}} \theta \exp\frac{1}{2}\left\{ -\theta^2\left(1+\epsilon^{-2}\right) +2\theta\left(x +\epsilon^{-2}\mu\right)\right\} {\text{d}}\theta \bigg/ \int_{\mathbb{R}} \exp\frac{1}{2}\left\{ -\theta^2\left(1+\epsilon^{-2}\right) +2\theta\left(x +\epsilon^{-2}\mu\right)\right\} {\text{d}}\theta = \frac{x+\epsilon^{-2}\mu}{1+\epsilon^{-2}}\;.$    

On the other hand, if we use instead a more involved prior distribution like a poly-$ t$ distribution ([5]),

$\displaystyle \pi(\theta) = \prod_{i=1}^k \left[ \alpha_i+(\theta-\beta_i)^2 \right]^{-\nu_i}\;, \qquad \alpha_i,\nu_i>0\;,$    

the above integrals cannot be computed in closed form anymore. This is not a toy example, in that the problem may occur after a sequence of $ t$ observations, or with a sequence of normal observations whose variance is unknown.
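To make the computational issue concrete, here is a minimal numerical sketch, not taken from the chapter: it approximates the posterior mean under a poly-$ t$ prior by one-dimensional quadrature, with an arbitrary observation $ x$ and arbitrary illustrative values for the hypothetical parameters $ \alpha_i$, $ \beta_i$, $ \nu_i$.

    # Posterior mean under a poly-t prior for a N(theta, 1) likelihood,
    # approximated by one-dimensional numerical quadrature (illustrative values).
    import numpy as np
    from scipy.integrate import quad

    x = 1.3                          # observed value from the N(theta, 1) model
    alpha = np.array([1.0, 2.0])     # hypothetical poly-t parameters (k = 2)
    beta = np.array([-1.0, 1.5])
    nu = np.array([1.0, 0.5])

    def prior(theta):
        # poly-t density, known only up to its normalising constant
        return np.prod((alpha + (theta - beta) ** 2) ** (-nu))

    def likelihood(theta):
        return np.exp(-0.5 * (x - theta) ** 2)

    num, _ = quad(lambda t: t * likelihood(t) * prior(t), -np.inf, np.inf)
    den, _ = quad(lambda t: likelihood(t) * prior(t), -np.inf, np.inf)
    print("approximate posterior mean:", num / den)

In one dimension such quadrature is effortless; the point is that this brute-force route does not extend to the higher-dimensional settings below.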

The above example is one-dimensional, but, obviously, bigger challenges await the Bayesian statistician when she wants to tackle high-dimensional problems.

Example 2   In a generalised linear model, a conditional distribution of $ y\in\mathbb{R}$ given $ x\in\mathbb{R}^p$ is defined via a density from an exponential family

$\displaystyle y\vert x\sim\exp\left\{ y\cdot\theta(x) - \psi(\theta(x)) \right\}$    

whose natural parameter $ \theta(x)$ depends on the conditioning variable $ x$,

$\displaystyle \theta(x) = g\left(\beta^{\text{T}} x\right)\;,\quad\beta\in\mathbb{R}^p$    

that is, linearly modulo the transform $ g$. Obviously, in practical applications like Econometrics, $ p$ can be quite large. Inference on $ \beta$ (which is the true parameter of the model) proceeds through the posterior distribution (where $ \boldsymbol{x}=(x_1,\ldots,x_T)$ and $ \boldsymbol{y}=(y_1,\ldots,y_T)$)

$\displaystyle \pi(\beta\vert\boldsymbol{x},\boldsymbol{y})$ $\displaystyle \propto \prod_{t=1}^T \exp\left\{ y_t\cdot\theta(x_t) - \psi(\theta(x_t)) \right\} \,\pi(\beta)$    
  $\displaystyle = \exp\left\{ \sum_{t=1}^T y_t\cdot\theta(x_t) - \sum_{t=1}^T \psi(\theta(x_t)) \right\} \,\pi(\beta) \;,$    

which is rarely available in closed form. In addition, in some cases $ \psi $ may be costly simply to compute and in others $ T$ may be large or even very large. Take for instance the case of the dataset processed by [1], which covers twenty years of employment histories for over a million workers, with $ x$ including indicator variables for over one hundred thousand companies.
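As a hedged sketch of why pointwise evaluation is easy while integration is not, the following code (with simulated data, not the dataset of [1]) evaluates the unnormalised log-posterior of a Poisson generalised linear model with canonical log link, so that $ \theta(x)=\beta^{\text{T}} x$ and $ \psi(\theta)=\exp(\theta)$, under an assumed $ \mathcal{N}(0,s^2 I)$ prior on $ \beta$.

    # Unnormalised log-posterior of a Poisson GLM with canonical (log) link:
    # theta(x_t) = beta^T x_t and psi(theta) = exp(theta); data are simulated.
    import numpy as np

    rng = np.random.default_rng(0)
    T, p, s2 = 200, 3, 10.0                  # sample size, dimension, prior variance
    X = rng.normal(size=(T, p))
    beta_true = np.array([0.5, -0.3, 0.2])
    y = rng.poisson(np.exp(X @ beta_true))

    def log_posterior(beta):
        theta = X @ beta                     # natural parameters theta(x_t)
        loglik = np.sum(y * theta - np.exp(theta))
        logprior = -0.5 * np.dot(beta, beta) / s2
        return loglik + logprior             # known only up to a constant

    print(log_posterior(beta_true))

Evaluating this function at any given $ \beta$ is cheap (at least for moderate $ T$), but normalising it, or integrating it to obtain posterior means or marginals, requires a $ p$-dimensional integral with no closed form.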

A related, although conceptually different, inferential issue concentrates upon prediction, that is, the approximation of a distribution related to the parameter of interest, say $ g(y\vert\theta)$, based on the observation of $ x\sim f(x\vert\theta)$. The predictive distribution is then defined as

$\displaystyle \pi(y\vert x) = \int_{\Theta} g(y\vert\theta) \pi(\theta\vert x) {\text{d}}\theta\;.$    

A first difference with the standard point estimation perspective is obviously that the parameter $ \theta$ vanishes through the integration. A second and more profound difference is that this parameter is not necessarily well-defined anymore. As will become clearer in a following section, this is a paramount feature in setups where the model is not well-defined and where the statistician hesitates between several (or even an infinity of) models. It is also a case where the standard notion of identifiability is irrelevant, which paradoxically is a ''plus'' from the computational point of view, as seen below in, e.g., Example 14.
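When the model is simple enough, the predictive integral can again be handled by quadrature; the sketch below does so for the conjugate normal model of Example 1, with a future observation $ y\sim\mathcal{N}(\theta,1)$ and arbitrary numerical values, and checks the result against the closed-form answer available in this special case.

    # Predictive density pi(y|x) by quadrature for the conjugate normal model,
    # with a closed-form check (only possible because this toy case is tractable).
    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    x, mu, eps, y = 1.3, 0.0, 2.0, 0.5
    post_var = 1.0 / (1.0 + eps ** -2)
    post_mean = (x + eps ** -2 * mu) * post_var

    pred, _ = quad(lambda th: norm.pdf(y, loc=th) * norm.pdf(th, post_mean, np.sqrt(post_var)),
                   -np.inf, np.inf)
    print("quadrature:", pred)
    print("closed form:", norm.pdf(y, post_mean, np.sqrt(1.0 + post_var)))

Outside such conjugate toy cases, the integral over $ \theta$ (and, as in the next example, over entire model spaces) is precisely what requires more powerful computational machinery.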

Example 3   Recall that an $ AR(p)$ model is given as the auto-regressive representation of a time series,

$\displaystyle x_t = \sum_{i=1}^p \theta_{i} x_{t-i} + \sigma\varepsilon_t \;.$    

It is often the case that the order $ p$ of the $ AR$ model is not fixed a priori, but has to be determined from the data itself. Several models are then competing for the ''best'' fit of the data, but if the prediction of the next value $ x_{t+1}$ is the most important part of the inference, the order $ p$ chosen for the best fit is not really relevant. Therefore, all models can be considered in parallel and aggregated through the predictive distribution

$\displaystyle \pi(x_{t+1}\vert x_t,\ldots,x_1) \propto \int f(x_{t+1}\vert x_t,\ldots,x_{t-p+1},\theta)\, \pi(\theta,p\vert x_t,\ldots,x_1)\, {\text{d}} p\,{\text{d}}\theta\,,$    

which thus amounts to integrating over the parameters of all models, simultaneously:

$\displaystyle \sum_{p=0}^{\infty} \int f\left(x_{t+1}\vert x_t,\ldots,x_{t-p+1},\theta\right) \pi\left(\theta\vert p,x_t,\ldots,x_1\right) \,{\text{d}}\theta \,\pi\left(p\vert x_t,\ldots,x_1\right)\;.$    

Note the multiple layers of complexity in this case:
(1)
if the prior distribution on $ p$ has an infinite support, the integral simultaneously considers an infinity of models, with parameters of unbounded dimensions;

(2)
the parameter $ \theta$ varies from model $ AR(p)$ to model $ AR(p+1)$, so must be evaluated differently from one model to another. In particular, if the stationarity constraint usually imposed in these models is taken into account, the constraint on $ (\theta_1,\ldots,\theta_p)$ varies11.2 between model $ AR(p)$ and model $ AR(p+1)$;
(3)
prediction is usually performed sequentially: every tick/second/hour/day, the next value is predicted based on the past values ( $ x_t,\ldots,x_1$). Therefore, when $ t$ moves to $ t+1$, the entire posterior distribution $ \pi(\theta,p\vert x_t,\ldots,x_1)$ must be re-evaluated, possibly under a very tight time constraint, as for instance in financial or radar applications.

We will discuss this important problem in greater detail after the testing section, as part of the model selection problem.
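As an illustration of the aggregation in Example 3, here is a deliberately crude sketch: the posterior model probabilities $ \pi(p\vert x_t,\ldots,x_1)$ are replaced by BIC-based weights and the within-model posteriors by conditional least-squares fits, purely to show the mechanics of a model-averaged one-step forecast (this approximation is ours for illustration, not the chapter's method).

    # Model-averaged one-step prediction over AR(p) models, p = 1..pmax,
    # using BIC weights as a rough stand-in for posterior model probabilities.
    import numpy as np

    rng = np.random.default_rng(1)
    eps = rng.normal(size=300)
    x = eps.copy()
    for t in range(2, 300):                       # simulate a stationary AR(2) path
        x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + eps[t]

    n, pmax = len(x), 5
    preds, logw = [], []
    for p in range(1, pmax + 1):
        Y = x[p:]
        Z = np.column_stack([x[p - i: n - i] for i in range(1, p + 1)])
        theta = np.linalg.lstsq(Z, Y, rcond=None)[0]
        sigma2 = np.mean((Y - Z @ theta) ** 2)
        bic = len(Y) * np.log(sigma2) + (p + 1) * np.log(len(Y))
        preds.append(x[-1:-p - 1:-1] @ theta)     # forecast of x_{t+1} under AR(p)
        logw.append(-0.5 * bic)                   # crude stand-in for log pi(p | data)

    w = np.exp(np.array(logw) - max(logw))
    w /= w.sum()
    print("approximate model weights:", np.round(w, 3))
    print("model-averaged forecast:", float(np.dot(w, preds)))

A genuinely Bayesian treatment would instead integrate over $ \theta$ and $ p$ under their joint posterior, which is exactly the computation that motivates the simulation methods discussed later in the chapter.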


11.2.2 Testing Hypotheses

A domain where both the philosophy and the implementation of Bayesian inference are at complete odds with the classical approach is the area of testing of hypotheses. At a primary level, this is obvious when opposing the Bayesian evaluation of an hypothesis $ H_0:\theta\in\Theta_0$

$\displaystyle \Pr\nolimits^\pi (\theta\in\Theta_0\vert x)$    

with a Neyman-Pearson $ p$-value

$\displaystyle \sup_{\theta\in\Theta_0}\,\Pr\nolimits_{\theta}(T(X)\ge T(x))\;,$    

where $ T$ is an appropriate statistic, with observed value $ T(x)$. The first quantity involves an integral over the parameter space, while the second provides an evaluation over the observational space. At a secondary level, the two answers may also strongly disagree even when the number of observations goes to infinity, although there exist cases and priors for which they agree to the order $ O(n^{-1})$ or even $ O(n^{-3/2})$. (See [35], Sect. 3.5.5 and Chap. 5, for more details.)

From a computational point of view, most Bayesian evaluations involve marginal distributions

$\displaystyle \int_{\Theta_i} f(x\vert\theta_i) \pi_i(\theta_i)\,{\text{d}}\theta_i\;,$ (11.1)

where $ \Theta_i$ and $ \pi_i$ denote the parameter space and the corresponding prior, respectively, under hypothesis $ H_i$ $ (i=0,1)$. For instance, the Bayes factor is defined as the ratio of the posterior probabilities of the null and the alternative hypotheses over the ratio of the prior probabilities of the null and the alternative hypotheses, i.e.,

$\displaystyle B^{\pi}_{01}(x) = \frac{P(\theta \in \Theta_0\mid x)}{P(\theta \in \Theta_1\mid x)} \bigg/ \frac{\pi(\theta \in \Theta_0)}{\pi(\theta \in \Theta_1)}\;.$    

This quantity is instrumental in the computation of the posterior probability

$\displaystyle P(\theta \in \Theta_ 0\mid x) = \frac{1}{1+B^{\pi}_{10}(x)}$    

under equal prior probabilities for both $ \Theta_0$ and $ \Theta_1$. It is also the central tool in practical (as opposed to decisional) Bayesian testing ([26]) as the Bayesian equivalent of the likelihood ratio.

The first ratio in $ B^{\pi}_{01}(x)$ is then the ratio of integrals of the form (11.1) and it is rather common to face difficulties in the computation of both integrals.11.3
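As a small self-contained illustration (not one of the chapter's examples), the following sketch computes a Bayes factor as a ratio of marginal likelihoods of the form (11.1) by quadrature, for the point null $ H_0:\theta=0$ against $ H_1:\theta\sim\mathcal{N}(0,\tau^2)$ with $ x\sim\mathcal{N}(\theta,1)$; the values of $ x$ and $ \tau$ are arbitrary and the closed-form marginal under $ H_1$ is used only as a check.

    # Bayes factor B01 as a ratio of marginal likelihoods, computed by quadrature.
    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    x, tau = 1.5, 2.0                              # observation and prior scale (arbitrary)

    m0 = norm.pdf(x, loc=0.0, scale=1.0)           # marginal under H0 (no free parameter)
    m1, _ = quad(lambda th: norm.pdf(x, loc=th) * norm.pdf(th, scale=tau),
                 -np.inf, np.inf)                  # marginal under H1, an integral of form (11.1)

    print("B01 by quadrature:", m0 / m1)
    print("closed-form check:", m0 / norm.pdf(x, scale=np.sqrt(1.0 + tau ** 2)))

With one scalar parameter the quadrature is trivial; the difficulty stressed in the text arises when the parameter spaces are high-dimensional, as in the following example.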

Example 4 (Continuation of Example 2)   In the case of the generalised linear model, a standard testing situation is to decide whether or not a factor, $ x_1$ say, is influential on the dependent variable $ y$. This is often translated as testing whether or not the corresponding component of $ \beta$, $ \beta_1$, is equal to 0, i.e.  $ \Theta_0=\{\beta; \beta_1=0\}$. If we denote by $ \beta_{-1}$ the other components of $ \beta$, the Bayes factor for this hypothesis will be

$\displaystyle \int_{\mathbb{R}^p} \exp\left\{ \sum_{t=1}^T y_t\cdot g\left(\beta^{\text{T}} x_t\right) - \sum_{t=1}^T \psi\left(g\left(\beta^{\text{T}} x_t\right)\right) \right\} \,\pi(\beta) \,{\text{d}}\beta \bigg/ \int_{\mathbb{R}^{p-1}} \exp\left\{ \sum_{t=1}^T y_t\cdot g\left(\beta^{\text{T}}_{-1}(x_t)_{-1}\right) - \sum_{t=1}^T \psi\left(g\left(\beta^{\text{T}}_{-1}(x_t)_{-1}\right)\right) \right\} \,\pi_{-1}(\beta_{-1}) \,{\text{d}}\beta_{-1} \;,$    

where $ \pi_{-1}$ is the prior constructed for the null hypothesis and where the prior weights of $ H_0$ and of the alternative are both equal to $ 1/2$. Obviously, apart from the normal conjugate case, neither integral can be computed in closed form.

In a related manner, confidence regions are also mostly intractable, being defined through the solution to an implicit equation. Indeed, the Bayesian confidence region for a parameter $ \theta$ is defined as the highest posterior density region,

$\displaystyle \left\{ \theta; \pi(\theta\vert x) \ge k(x) \right\}\;,$ (11.2)

where $ k(x)$ is determined by the coverage constraint

$\displaystyle \Pr\nolimits^{\pi}(\pi(\theta\vert x) \ge k(x) \vert x ) = \alpha\;,$    

$ \alpha $ being the confidence level. While the normalising constant is not necessary to construct a confidence region, the resolution of the implicit equation (11.2) is rarely straightforward!

Example 5   Consider a binomial observation $ x\sim\mathcal{B}(n,\theta)$ with a conjugate prior distribution, $ \theta\sim\mathcal{B}e(\gamma_1,\gamma_2)$. In this case, the posterior distribution is available in closed form,

$\displaystyle \theta\vert x \sim \mathcal{B}e(\gamma_1+x,\gamma_2+n-x)\;.$    

However, the determination of the $ \theta$'s such that

$\displaystyle \theta^{\gamma_1+x-1}(1-\theta)^{\gamma_2+n-x-1} \ge k(x)$    

with

$\displaystyle \Pr\nolimits^{\pi}\left( \theta^{\gamma_1+x-1}(1-\theta)^{\gamma_2+n-x-1} \ge k(x) \vert x \right) = \alpha$    

is not possible analytically. It actually implies two levels of numerical difficulties:
Step 1
find the solution(s) to $ \theta^{\gamma_1+x-1}(1-\theta)^{\gamma_2+n-x-1}=k$,

Step 2
find the $ k$ corresponding to the right coverage,
and each value of $ k$ examined in Step 2 requires a new resolution of Step 1.
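The two steps above translate directly into nested numerical routines. The following sketch (with arbitrary values of $ n$, $ x$, $ \gamma_1$, $ \gamma_2$) solves Step 1 by root finding on each side of the posterior mode and Step 2 by root finding on $ k$ for the coverage constraint.

    # HPD interval for the Beta(gamma1 + x, gamma2 + n - x) posterior:
    # inner root-finding (Step 1) nested inside an outer search on k (Step 2).
    import numpy as np
    from scipy.stats import beta
    from scipy.optimize import brentq

    n, x, g1, g2, alpha = 20, 14, 1.0, 1.0, 0.95   # arbitrary illustrative values
    post = beta(g1 + x, g2 + n - x)
    mode = (g1 + x - 1) / (g1 + g2 + n - 2)        # the posterior is unimodal here

    def coverage(k):
        # Step 1: solve pi(theta|x) = k on each side of the mode
        lo = brentq(lambda t: post.pdf(t) - k, 1e-9, mode)
        hi = brentq(lambda t: post.pdf(t) - k, mode, 1 - 1e-9)
        return post.cdf(hi) - post.cdf(lo), (lo, hi)

    # Step 2: find the k whose region has coverage alpha
    k = brentq(lambda kk: coverage(kk)[0] - alpha, 1e-6, post.pdf(mode) - 1e-6)
    print("HPD interval:", coverage(k)[1])

Every trial value of $ k$ in the outer search triggers a fresh resolution of Step 1, exactly the nesting pointed out in the example; in higher dimensions neither step remains this easy.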

The setting is usually much more complex when $ \theta$ is a multidimensional parameter, because interest then lies in marginal confidence sets. Example 2 is an illustration of this setting: deriving a confidence region on one component, $ \beta_1$ say, first involves computing the marginal posterior distribution of this component. As in Example 4, the integral

$\displaystyle \int_{\mathbb{R}^{p-1}} \exp\left\{ \sum_{t=1}^T y_t\cdot g\left(\beta^{\text{T}} x_t\right) - \sum_{t=1}^T \psi\left(g\left(\beta^{\text{T}} x_t\right)\right) \right\} \, \pi_{-1}(\beta_{-1})\,{\text{d}}\beta_{-1} \;,$    

which is proportional to $ \pi(\beta_1\vert x)$, is most often intractable.


11.2.3 Model Choice

We distinguish model choice from testing, not only because it leads to further computational difficulties, but also because it encompasses a larger scope of inferential goals than mere testing. Note first that model choice has been the subject of considerable effort in recent years, and has seen many advances, including the coverage of problems of higher complexity and the introduction of new concepts. We stress that such advances owe mostly to the introduction of new computational methods.

As discussed in further detail in Robert (2001, Chap. 7), the inferential action related to model choice does take place on a wider scale: it covers and compares models, rather than parameters, which makes the sampling distribution $ f(x)$ ''more unknown'' than simply depending on an undetermined parameter. In some respects, it is thus closer to estimation than to regular testing. In any case, it requires a more precise evaluation of the consequences of choosing the ''wrong'' model or, equivalently, of deciding which model is the most appropriate for the data at hand. It is thus both broader and less definitive than deciding whether $ H_0:\,\theta_1=0$ is true. Lastly, the larger inferential scope mentioned in the first point means that we are leaving for a while the well-charted domain of solid parametric models.

From a computational point of view, model choice involves more complex structures that, almost systematically, require advanced tools, like simulation methods that can handle collections of parameter spaces (also called spaces of varying dimension), specially designed for model comparison.

Example 6   A mixture of distributions is the representation of a distribution (density) as the weighted sum of standard distributions (densities). For instance, a mixture of Poisson distributions, denoted as

$\displaystyle \sum_{i=1}^k p_i\mathcal{P}(\lambda_i)$    

has the following density:

$\displaystyle \Pr(X=m) = \sum_{i=1}^k p_i\,\frac{\lambda_i^m}{m!}\,\mathrm{e}^{-\lambda_i}\;.$    

This representation of distributions is multi-faceted and can be used for populations with known heterogeneities (in which case a component of the mixture corresponds to a homogeneous part of the population) as well as for the non-parametric modelling of unknown populations. This means that, in some cases, $ k$ is known and, in others, it is both unknown and part of the inferential problem.
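For concreteness, evaluating the mixture density itself is straightforward once $ k$, the weights and the rates are fixed; the short sketch below does so for $ k=3$ hypothetical components (the hard inferential problems, estimating the $ p_i$, the $ \lambda_i$ and possibly $ k$, are another matter entirely).

    # Probability mass function of a 3-component Poisson mixture (illustrative values).
    import numpy as np
    from scipy.stats import poisson

    p = np.array([0.5, 0.3, 0.2])        # mixture weights, summing to one
    lam = np.array([1.0, 4.0, 10.0])     # component rates

    def mixture_pmf(m):
        return float(np.sum(p * poisson.pmf(m, lam)))

    print([round(mixture_pmf(m), 4) for m in range(6)])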

First, consider the setting where several (parametric) models are in competition,

$\displaystyle \mathfrak{M}_i : x \sim f_i(x\vert\theta_i)\;, \quad \theta_i\in\Theta_i\;,\quad i \in I\;,$    

the index set $ I$ being possibly infinite. From a Bayesian point of view, a prior distribution must be constructed for each model $ \mathfrak{M}_i$ as if it were the only and true model under consideration since, in most perspectives except model averaging, one of these models will be selected and used as the only and true model. The parameter space associated with the above set of models can be written as

$\displaystyle \boldsymbol{\Theta} = \bigcup_{i\in I} \{i\}\times\Theta_i \;,$ (11.3)

the model indicator $ \mu\in I$ being now part of the parameters. So, if the statistician allocates probabilities $ p_i$ to the indicator values, that is, to the models $ \mathfrak{M}_i$ $ (i\in I)$, and if she then defines priors $ \pi_i(\theta_i)$ on the parameter subspaces $ \Theta_i$, things fold over by virtue of Bayes's theorem, as usual, since she can compute

$\displaystyle p(\mathfrak{M}_i\vert x) = P(\mu = i \vert x) = \frac{ \displaystyle{p_i \int_{\Theta_i} f_i\left(x\vert\theta_i\right) \pi_i\left(\theta_i\right) {\text{d}}\theta_i} }{ \displaystyle{\sum_j p_j \int_{\Theta_j} f_j\left(x\vert\theta_j\right) \pi_j\left(\theta_j\right) {\text{d}}\theta_j} } \;.$    

While a common solution based on this prior modelling is simply to take the (marginal) MAP estimator of $ \mu$, that is, to determine the model with the largest $ p(\mathfrak{M}_i\vert x)$, or even to use directly the average

$\displaystyle \sum_j p_j \int_{\Theta_j} f_j\left(y\vert\theta_j\right) f_j\left(x\vert\theta_j\right) \pi_j\left(\theta_j\right) {\text{d}}\theta_j \bigg/ \sum_j p_j \int_{\Theta_j} f_j\left(x\vert\theta_j\right) \pi_j\left(\theta_j\right) {\text{d}}\theta_j = \sum_j p\left(\mathfrak{M}_j\vert x\right) \, m_j(y)$    

as a predictive density in $ y$ for model averaging (with $ m_j(y)$ denoting the predictive density of $ y$ under model $ \mathfrak{M}_j$), a deeper decision-theoretic evaluation is often necessary.
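To illustrate the posterior probabilities $ p(\mathfrak{M}_i\vert x)$ in a case where the marginal likelihood integrals can still be handled by quadrature, here is a small sketch with made-up count data and two hypothetical models, a Poisson model with an exponential prior and a Geometric model with a uniform prior, under equal prior weights $ p_1=p_2=1/2$.

    # Posterior model probabilities via marginal likelihoods computed by quadrature.
    import numpy as np
    from scipy.integrate import quad
    from scipy.special import gammaln

    y = np.array([0, 2, 1, 3, 0, 1, 2, 4, 1, 0])   # hypothetical count data
    T, s = len(y), y.sum()

    def lik_poisson(lam):                          # Poisson(lam) likelihood of the sample
        return np.exp(s * np.log(lam) - T * lam - gammaln(y + 1).sum())

    def lik_geometric(q):                          # Geometric(q) on {0, 1, ...}
        return np.exp(s * np.log(1.0 - q) + T * np.log(q))

    m1, _ = quad(lambda lam: lik_poisson(lam) * np.exp(-lam), 1e-12, np.inf)  # Exp(1) prior
    m2, _ = quad(lik_geometric, 1e-12, 1 - 1e-12)                             # Uniform(0,1) prior

    p1 = 0.5 * m1 / (0.5 * m1 + 0.5 * m2)
    print("p(M1|x) =", round(p1, 4), "  p(M2|x) =", round(1 - p1, 4))

With more realistic models, each marginal likelihood is a multidimensional, intractable integral, which is where the simulation-based methods of the following sections take over.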

Example 7 (Continuation of Example 3)   In the setting of the $ AR(p)$ models, when the order $ p$ of the dependence is unknown, model averaging as presented in Example 3 is not always a relevant solution when the statistician wants to estimate this order $ p$ for different purposes. Estimation is then a more appropriate perspective than testing, even though care must be taken because of the discrete nature of $ p$. (For instance, the posterior expectation of $ p$ is not an appropriate estimator!)

Example 8   [40] have developed a Bayesian approach to model choice that appears as an alternative to both Akaike's and Schwarz's information criteria, called DIC (for Deviance Information Criterion). For a model with density $ f(x\vert\theta)$ and a prior distribution $ \pi(\theta)$, the deviance is defined as $ D(\theta) = -2\log(\kern.5pt f(x\vert\theta))$, but this is not a good discriminating measure between models because of its bias toward higher dimensional models. The penalised deviance of [40] is

DIC$\displaystyle = \mathbb{E}\left[D(\theta)\vert x\right] + \left\{ \mathbb{E}[D(\theta)\vert x] - D(\mathbb{E}[\theta\vert x ]) \right\}\;, \nonumber$    

with the ''best'' model associated with the smallest DIC. Obviously, the computation of the posterior expectation $ \mathbb{E}[D(\theta)\vert x]=-2\mathbb{E}[\log(\kern.5pt f(x\vert\theta))\vert x]$ is complex outside exponential families.
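As a small check of the definition, the DIC of the conjugate normal model of Example 1 can be computed directly, since the posterior is available in closed form and $ \mathbb{E}[D(\theta)\vert x]$ reduces to a one-dimensional integral; the numerical values below are arbitrary.

    # DIC for the conjugate normal model: x ~ N(theta, 1), theta ~ N(mu, eps^2).
    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    x, mu, eps = 1.3, 0.0, 2.0
    post_var = 1.0 / (1.0 + eps ** -2)
    post_mean = (x + eps ** -2 * mu) * post_var
    post = norm(post_mean, np.sqrt(post_var))      # closed-form posterior

    def deviance(theta):
        return -2.0 * norm.logpdf(x, loc=theta, scale=1.0)

    Dbar, _ = quad(lambda t: deviance(t) * post.pdf(t), -np.inf, np.inf)
    DIC = Dbar + (Dbar - deviance(post_mean))      # E[D|x] + {E[D|x] - D(E[theta|x])}
    print("DIC =", DIC)

Outside such conjugate settings, $ \mathbb{E}[D(\theta)\vert x]$ has no closed form and is typically approximated from posterior simulations, which again calls for the methods of the next section.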

As stressed earlier in this section, the computation of predictive densities, marginals, Bayes factors, and other quantities related to model choice procedures is generally very involved, with specificities that call for tailor-made solutions.

