

1.5 Bayes Factor

Let $ P(M_\lambda)$ be the prior probability for model $ M_\lambda$. For any two models $ M_{\lambda_1}$ and $ M_{\lambda_2}$, the Bayes factor

$\displaystyle B(\boldsymbol{\lambda}_1,\boldsymbol{\lambda}_2) = \frac{P(M_{\lambda_1}\vert\boldsymbol{y})}{P(M_{\lambda_2}\vert\boldsymbol{y})} \div \frac{P(M_{\lambda_1})}{P(M_{\lambda_2})}$ (1.29)

is the posterior odds in favor of model $ M_{\lambda_1}$ divided by the prior odds in favor of model $ M_{\lambda_1}$ ([31]). The Bayes factor provides a scale of evidence in favor of one model versus another. For example, $ B(\boldsymbol{\lambda}_1,\boldsymbol{\lambda}_2) = 2$ indicates that the data favor model $ M_{\lambda_1}$ over model $ M_{\lambda_2}$ at odds of two to one. Table 1.1 lists a possible interpretation of Bayes factors suggested by [29].


Table 1.1: Jeffreys' scale of evidence for Bayes factors

Bayes factor                                                        Interpretation
$ B(\boldsymbol{\lambda}_1,\boldsymbol{\lambda}_2) < 1/10$          Strong evidence for $ M_{\lambda_2}$
$ 1/10 < B(\boldsymbol{\lambda}_1,\boldsymbol{\lambda}_2) < 1/3$    Moderate evidence for $ M_{\lambda_2}$
$ 1/3 < B(\boldsymbol{\lambda}_1,\boldsymbol{\lambda}_2) < 1$       Weak evidence for $ M_{\lambda_2}$
$ 1 < B(\boldsymbol{\lambda}_1,\boldsymbol{\lambda}_2) < 3$         Weak evidence for $ M_{\lambda_1}$
$ 3 < B(\boldsymbol{\lambda}_1,\boldsymbol{\lambda}_2) < 10$        Moderate evidence for $ M_{\lambda_1}$
$ B(\boldsymbol{\lambda}_1,\boldsymbol{\lambda}_2) > 10$            Strong evidence for $ M_{\lambda_1}$
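For concreteness, here is a minimal Python sketch (the posterior and prior model probabilities are hypothetical placeholders) that computes the Bayes factor (1.29) as posterior odds divided by prior odds and reads off its interpretation on the scale of Table 1.1.

    # Bayes factor (1.29): posterior odds for M_1 over M_2 divided by prior odds.
    def bayes_factor(post_m1, post_m2, prior_m1, prior_m2):
        return (post_m1 / post_m2) / (prior_m1 / prior_m2)

    # Qualitative reading of a Bayes factor according to Table 1.1.
    def jeffreys_interpretation(b):
        if b < 1 / 10: return "strong evidence for M_2"
        if b < 1 / 3:  return "moderate evidence for M_2"
        if b < 1:      return "weak evidence for M_2"
        if b < 3:      return "weak evidence for M_1"
        if b < 10:     return "moderate evidence for M_1"
        return "strong evidence for M_1"

    # Hypothetical posterior probabilities with equal prior probabilities.
    B = bayes_factor(post_m1=0.6, post_m2=0.4, prior_m1=0.5, prior_m2=0.5)
    print(B, jeffreys_interpretation(B))   # 1.5 -> weak evidence for M_1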

The Bayes factor is easy to understand and applicable to a wide range of problems. Methods based on the Bayes factor behave like Occam's razor ([30]). Non-Bayesian analysis typically selects a model and then proceeds as if the data were generated by the chosen model. Because it ignores the fact that the model was selected using the same data, this approach often leads to under-estimation of the uncertainty in quantities of interest, a problem known as model selection bias ([11]). Specifically, the parameter estimates based on the selected model are biased, and their estimated variances are usually too optimistic. The Bayesian approach accounts for model uncertainty through the posterior probability $ P(M_\lambda \vert \boldsymbol{y})$. For example, to predict a new observation $ y^+$, the best prediction under squared loss is

$\displaystyle \notag \mathrm{E}(y^+\vert\boldsymbol{y}) = \sum_{\lambda \in \Lambda} \mathrm{E} (y^+\vert M_\lambda,\boldsymbol{y}) P(M_\lambda \vert \boldsymbol{y}) \,,$

a weighted average of predictions from all models with weights equal to the posterior probabilities. Instead of using a single model, such model averaging incorporates model uncertainty. It also indicates that selecting a single model may not be desirable or necessary for some applications such as prediction ([27]).
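A minimal Python sketch of such a model-averaged prediction, assuming the per-model predictions $ \mathrm{E}(y^+\vert M_\lambda,\boldsymbol{y})$ and posterior model probabilities $ P(M_\lambda\vert\boldsymbol{y})$ have already been computed (the numbers below are hypothetical):

    import numpy as np

    # Hypothetical per-model predictions E(y+ | M_lambda, y) and posterior
    # model probabilities P(M_lambda | y); the probabilities sum to one.
    predictions = np.array([2.1, 2.4, 1.9])
    posterior_probs = np.array([0.5, 0.3, 0.2])

    # Model-averaged prediction: posterior-probability-weighted average.
    y_plus_hat = np.sum(posterior_probs * predictions)   # 2.15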

The practical implementation of Bayesian model selection is, however, far from straightforward. In order to compute the Bayes factor (1.29), one needs to specify the priors $ P(M_{\lambda})$ as well as priors for the parameters in each model. While these priors provide a way to incorporate other information into the model and the model selection, they may be hard to set in practice, and standard non-informative priors for parameters cannot be used ([6,18]). See [31], [12] and [7] for more discussion on the choice of priors.

After deciding on priors, one needs to compute (1.29), which can be re-expressed as

$\displaystyle B(\boldsymbol{\lambda}_1,\boldsymbol{\lambda}_2) = \frac{P(\boldsymbol{y}\vert M_{\lambda_1})}{P(\boldsymbol{y}\vert M_{\lambda_2})}\,,$ (1.30)

where $ P(\boldsymbol{y}\vert M_{\lambda})$ is the marginal likelihood. The marginal likelihood usually involves an integral which can be evaluated analytically only in some special cases. When the marginal likelihood does not have a closed form, several approximation methods are available, including the Laplace approximation, importance sampling, Gaussian quadrature and Markov chain Monte Carlo (MCMC) simulation. Details about these methods are beyond the scope of this chapter; references can be found in [31].
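As an illustration of one of these routes, the following rough Python sketch implements a Laplace approximation to $ \log P(\boldsymbol{y}\vert M_{\lambda})$; the function log_joint, which must return $ \log P(\boldsymbol{y}\vert\boldsymbol{\theta},M_\lambda) + \log P(\boldsymbol{\theta}\vert M_\lambda)$ for a parameter vector $ \boldsymbol{\theta}$, and the starting value theta0 are placeholders to be supplied by the user. The Hessian is obtained by finite differences for simplicity.

    import numpy as np
    from scipy import optimize

    def numerical_hessian(f, x, eps=1e-5):
        # Central finite-difference Hessian of a scalar function f at x.
        d = len(x)
        H = np.zeros((d, d))
        for i in range(d):
            for j in range(d):
                ei, ej = np.zeros(d), np.zeros(d)
                ei[i], ej[j] = eps, eps
                H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                           - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
        return H

    def laplace_log_marginal(log_joint, theta0):
        # Second-order expansion of log_joint around its mode:
        # log P(y|M) ~ log_joint(mode) + (d/2) log(2 pi) - 0.5 log det(H),
        # where H is the negative Hessian of log_joint at the mode.
        res = optimize.minimize(lambda t: -log_joint(t), theta0)
        mode = res.x
        H = numerical_hessian(lambda t: -log_joint(t), mode)
        sign, logdet = np.linalg.slogdet(H)
        return log_joint(mode) + 0.5 * len(mode) * np.log(2 * np.pi) - 0.5 * logdet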

Under certain conditions, [32] showed that

$\displaystyle \notag -2 \log P(\boldsymbol{y}\vert M_{\lambda}) \approx -2 \log (\text{maximum likelihood}) + \vert M_\lambda\vert \log n \,.$

Thus the $ \mathrm{BIC}$ provides an approximation to the Bayes factor: since $ \mathrm{BIC} \approx -2 \log P(\boldsymbol{y}\vert M_{\lambda})$, the difference in $ \mathrm{BIC}$ between two models approximates $ -2 \log B(\boldsymbol{\lambda}_1,\boldsymbol{\lambda}_2)$.
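A minimal Python sketch of this approximation (the maximized log-likelihoods, numbers of parameters and sample size below are hypothetical):

    import numpy as np

    def bic(max_loglik, n_params, n):
        # BIC = -2 log(maximum likelihood) + |M_lambda| log n.
        return -2.0 * max_loglik + n_params * np.log(n)

    def approx_bayes_factor(loglik1, p1, loglik2, p2, n):
        # B(lambda_1, lambda_2) approximated via exp(-(BIC_1 - BIC_2) / 2).
        return np.exp(-(bic(loglik1, p1, n) - bic(loglik2, p2, n)) / 2.0)

    # Hypothetical example: model 1 fits better but uses one extra parameter.
    B = approx_bayes_factor(loglik1=-120.0, p1=4, loglik2=-123.0, p2=3, n=100)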

In the following we discuss selection of the smoothing parameter $ \boldsymbol{\lambda}$ for the periodic spline. Based on (1.30), our goal is to find the $ \boldsymbol{\lambda}$ which maximizes the marginal likelihood $ P(\boldsymbol{y}\vert M_{\lambda})$, or equivalently $ P(\tilde{\boldsymbol{y}}\vert M_{\lambda})$, where $ \tilde{\boldsymbol{y}}$ is the discrete Fourier transform of $ \boldsymbol {y}$. Note that

$\displaystyle \tilde{\boldsymbol{y}} = \boldsymbol{\alpha} + \tilde{\boldsymbol{\epsilon}}\,,$ (1.31)

where $ \tilde{\boldsymbol{\epsilon}}=X_K^{\top} \boldsymbol{\epsilon}/n \sim N(\boldsymbol{0},\sigma^2 I/n)$. Let $ b=\sigma^2/(n \lambda)$. Assume the following prior for $ \boldsymbol {\alpha }$:

$\displaystyle \alpha_1 \propto 1\,,$
$\displaystyle \alpha_{2\nu} \sim N(0,b(2\pi \nu)^{-4}),\qquad \nu=1,\cdots,K\,,$
$\displaystyle \alpha_{2\nu+1} \sim N(0,b(2\pi \nu)^{-4}),\qquad \nu=1,\cdots,K\,,$ (1.32)

where the $ \alpha_i$ are mutually independent and independent of $ \tilde{\boldsymbol{\epsilon}}$. An improper prior is assumed for $ \alpha_1$. It is not difficult to check that $ \mathrm{E}(\boldsymbol{\alpha} \vert \tilde{\boldsymbol{y}})=\widehat{\boldsymbol{\alpha}}$. Thus the posterior means under the Bayes model (1.31) and (1.32) are the same as the periodic spline estimates.
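To make this correspondence concrete, the following Python sketch computes the posterior means under the prior (1.32) by the usual normal-normal shrinkage: with prior variance $ b(2\pi \nu)^{-4}$ and noise variance $ \sigma^2/n = b\lambda$, each coefficient is shrunk by the factor $ (2\pi \nu)^{-4}/((2\pi \nu)^{-4}+\lambda)$, while $ \tilde{y}_1$ is left unshrunk because of the improper prior on $ \alpha_1$. The inputs are placeholders.

    import numpy as np

    def posterior_mean_alpha(y_tilde, lam):
        # Posterior means of alpha under the prior (1.32).
        # y_tilde: Fourier coefficients (y_tilde[0] = y~_1), length n = 2K + 1.
        # lam:     smoothing parameter lambda.
        n = len(y_tilde)
        K = (n - 1) // 2
        alpha_hat = np.empty(n)
        alpha_hat[0] = y_tilde[0]                       # flat prior: no shrinkage
        for nu in range(1, K + 1):
            w = (2 * np.pi * nu) ** (-4) / ((2 * np.pi * nu) ** (-4) + lam)
            alpha_hat[2 * nu - 1] = w * y_tilde[2 * nu - 1]   # y~_{2 nu}
            alpha_hat[2 * nu] = w * y_tilde[2 * nu]           # y~_{2 nu + 1}
        return alpha_hat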

Let $ \boldsymbol{z}=(\tilde{y}_2,\cdots,\tilde{y}_n)^{\top}$ and write $ P(\tilde{\boldsymbol{y}}\vert M_\lambda)=P(\tilde{y}_1\vert M_\lambda) P(\boldsymbol{z}\vert M_\lambda)$. Since $ P(\tilde{y}_1\vert M_\lambda)$ does not depend on $ \lambda $, we estimate $ \lambda $ using the marginal likelihood $ P(\boldsymbol{z}\vert M_\lambda)$. Since $ \tilde{y}_{2\nu}$ and $ \tilde{y}_{2\nu+1}$ are independent $ N(0,b((2\pi \nu)^{-4}+\lambda))$ random variables, the log marginal likelihood of $ \boldsymbol{z}$ is

$\displaystyle \notag l(b,\boldsymbol{\lambda}) = -\frac{n-1}{2} \log 2\pi - \frac{n-1}{2} \log b - \sum_{\nu=1}^K \log \left[ (2\pi \nu)^{-4}+\boldsymbol{\lambda} \right] - \frac{1}{2b} \sum_{\nu=1}^K \frac{\tilde{y}_{2\nu}^2+ \tilde{y}_{2\nu+1}^2}{(2\pi \nu)^{-4}+\boldsymbol{\lambda}}\,.$

Fixing $ \boldsymbol{\lambda}$ and maximizing with respect to $ b$, we have

$\displaystyle \notag \widehat{b} = \frac{1}{n-1} \sum_{\nu=1}^K \frac{\tilde{y}_{2\nu}^2+ \tilde{y}_{2\nu+1}^2} {(2\pi \nu)^{-4}+\boldsymbol{\lambda}}\,.$    

Plugging $ \widehat{b}$ back in, we have

$\displaystyle \notag l(\,\widehat{b},\boldsymbol{\lambda}) = \mathrm{constant} - \frac{n-1}{2} \log \left( \sum_{\nu=1}^K \frac{\tilde{y}_{2\nu}^2+ \tilde{y}_{2\nu+1}^2}{(2\pi \nu)^{-4}+\boldsymbol{\lambda}} \right) - \sum_{\nu=1}^K \log \left[ (2\pi \nu)^{-4}+\boldsymbol{\lambda} \right]\,.$

Thus maximizing the log likelihood is equivalent to minimizing

$\displaystyle M(\lambda) = \frac{\sum_{\nu=1}^K \left(\tilde{y}_{2\nu}^2+\tilde{y}_{2\nu+1}^2\right)\big/\left((2\pi \nu)^{-4}+\boldsymbol{\lambda}\right)}{\left( \prod_{\nu=1}^K \left((2\pi \nu)^{-4}+\boldsymbol{\lambda}\right) \right)^{-2/(n-1)}} \,.$

It is not difficult to check that

$\displaystyle M(\lambda) = \frac{\boldsymbol{y}^{\top} (I-A(\boldsymbol{\lambda})) \boldsymbol{y}} {\left( \mathrm{det}^+ (I-A(\boldsymbol{\lambda})) \right)^{1/(n-1)}} \,,$ (1.33)

where $ \mathrm{det}^+$ is the product of the non-zero eigenvalues. The criterion (1.33) is called the generalized maximum likelihood (GML) method in the smoothing spline literature ([50]). It is the same as the restricted maximum likelihood (REML) method in the mixed-effects literature ([53]). Note that the marginal likelihood is approximated by plugging in $ \widehat{b}$ rather than by averaging over a prior distribution for $ b$.
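To connect these formulas to computation, here is a minimal Python/NumPy sketch (an illustration, not code from the references) that evaluates $ \log M(\boldsymbol{\lambda})$ in the Fourier domain and selects the smoothing parameter by a grid search. It assumes $ n = 2K+1$ is odd, as in the derivation above; the candidate grid of $ \lambda $ values and the FFT-based construction of $ \tilde{\boldsymbol{y}}$ (defined here only up to a constant scale factor, which does not affect the minimizer) are illustrative choices.

    import numpy as np

    def log_gml(lam, y_tilde):
        # log M(lambda) up to an additive constant; smaller is better.
        n = len(y_tilde)
        K = (n - 1) // 2
        nu = np.arange(1, K + 1)
        d = (2 * np.pi * nu) ** (-4.0) + lam                 # (2 pi nu)^-4 + lambda
        pairs = y_tilde[1::2] ** 2 + y_tilde[2::2] ** 2      # y~_{2nu}^2 + y~_{2nu+1}^2
        return np.log(np.sum(pairs / d)) + (2.0 / (n - 1)) * np.sum(np.log(d))

    def select_lambda_gml(y, lambdas=None):
        # Grid search for the GML choice of the smoothing parameter.
        if lambdas is None:
            lambdas = np.logspace(-10, 0, 101)
        n = len(y)
        K = (n - 1) // 2
        c = np.fft.rfft(y) / n                               # real Fourier coefficients
        y_tilde = np.empty(n)
        y_tilde[0] = c[0].real
        y_tilde[1::2] = c[1:K + 1].real
        y_tilde[2::2] = c[1:K + 1].imag
        scores = [log_gml(lam, y_tilde) for lam in lambdas]
        return lambdas[int(np.argmin(scores))]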

For the climate data, the GML scores for the periodic spline and the corresponding fits are plotted in the left and right panels of Fig. 1.7 respectively. The fits with three different choices of the smoothing parameter are very similar.

