

1.3 AIC, BIC, $C_p$ and Their Variations

Let $ \mathrm{GOF}$($ M_\lambda$) and $ \vert M_\lambda\vert$ be measures of the $ \mathrm{GOF}$ and complexity for model $ M_\lambda$. A direct compromise between these two conflicting quantities is

$\displaystyle \mathrm{GOF}(M_\lambda) + \xi \vert M_\lambda\vert\,,$ (1.19)

where the parameter $ \xi$ controls how this compromise should be achieved. Note that the penalized LS (1.10) may be considered as a special case with the LS regarded as a measure of GOF and the squared integral regarded as a measure of complexity.

To be concrete, let us consider the regression model (1.2). For a fixed $ \lambda $, the estimates are linear, $ \widehat{\boldsymbol{f}}_\lambda=H(\lambda)
\boldsymbol{y}$, where $ H(\lambda)=P(\lambda)$ for the trigonometric model and $ H(\lambda)=A(\lambda)$ for the periodic spline. Suppose that LS is used as the measure of GOF and $ \vert M_\lambda\vert=tr(H(\lambda))$. Let us first consider the case when the error variance $ \sigma ^2$ is known. Then the criterion (1.19) can be re-expressed as

$\displaystyle U(\lambda,\theta)=\frac{1}{n} \vert\vert \boldsymbol{y} - \widehat{\boldsymbol{f}}_{\lambda}\vert\vert^2 + \frac{\theta}{n} \sigma^2 tr H(\lambda) \,.$ (1.20)

Criterion (1.20) is known as the final prediction error (FPE) criterion ([1]). Many existing model selection methods correspond to special choices of $ \theta$: $ \theta=2$ for Akaike's AIC and Mallows' $ C_p$, and $ \theta=\log n$ for Schwarz's BIC. The C$ _p$ method is also called the unbiased risk (UBR) method in the smoothing spline literature ([50]). The following simple argument provides the motivation behind the C$ _p$ (UBR) method.

$\displaystyle \mathrm{E} \left(\frac{1}{n} \vert\vert\boldsymbol{y} - \widehat{\boldsymbol{f}}_{\lambda}\vert\vert^2\right) = \mathrm{E} \left( \frac{1}{n} \vert\vert\boldsymbol{y} - \boldsymbol{f}\vert\vert^2 - \frac{2}{n} \boldsymbol{\epsilon}^{\top}\left(\widehat{\boldsymbol{f}}_{\lambda} - \boldsymbol{f}\right) + \frac{1}{n} \vert\vert\boldsymbol{f} - \widehat{\boldsymbol{f}}_{\lambda}\vert\vert^2 \right) = \sigma^2 - \frac{2}{n} \sigma^2 tr H(\lambda) + \mathrm{MSE}(\lambda)\,.$ (1.21)

Therefore, ignoring the constant $ \sigma ^2$, $ U(\lambda,2)$ is an unbiased estimate of $ \mathrm{MSE}$$ (\lambda)$.
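
To make the compromise in (1.20) concrete, the following sketch evaluates the FPE criterion for the trigonometric regression model and selects $ \lambda $ by $ \mathrm{C}_p$ ($ \theta=2$) or BIC ($ \theta=\log n$). It is a minimal Python illustration assuming equally spaced design points on $ [0,1]$, a known error variance and simulated data; the function names (trig_design, fpe_score) and the test function are ours, not part of the original analysis.

\begin{verbatim}
import numpy as np

def trig_design(x, lam):
    # Trigonometric design with frequencies 1..lam: a constant column plus
    # cosine/sine pairs, so tr H(lam) = 2*lam + 1.
    cols = [np.ones_like(x)]
    for nu in range(1, lam + 1):
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * nu * x))
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * nu * x))
    return np.column_stack(cols)

def fpe_score(y, x, lam, theta, sigma2):
    # Criterion (1.20): ||y - f_hat||^2 / n + (theta / n) * sigma^2 * tr H(lam).
    X = trig_design(x, lam)
    H = X @ np.linalg.pinv(X)              # projection ("hat") matrix
    fhat = H @ y
    n = len(y)
    return np.mean((y - fhat) ** 2) + theta / n * sigma2 * np.trace(H)

# C_p / AIC corresponds to theta = 2, BIC to theta = log(n).
rng = np.random.default_rng(0)
n = 64
x = np.arange(n) / n
sigma2 = 0.25
y = np.sin(2 * np.pi * x) + 0.5 * np.cos(4 * np.pi * x) \
    + rng.normal(scale=np.sqrt(sigma2), size=n)
for theta, name in [(2.0, "C_p"), (np.log(n), "BIC")]:
    scores = [fpe_score(y, x, lam, theta, sigma2) for lam in range(11)]
    print(name, "selects lambda =", int(np.argmin(scores)))
\end{verbatim}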

Other choices of $ \theta$ were motivated by different principles: AIC is an estimate of the expected Kullback-Leibler discrepancy, where the second term in (1.20) is a bias correction ([10]), and BIC is an asymptotic Bayes factor (Sect. 1.5). Since the methods were derived from different motivations, it is not surprising that they have quite different theoretical properties ([47]). $ \theta$ in (1.20) can be regarded as a penalty on model complexity: a larger penalty $ \theta$ leads to a simpler model. As a result, AIC and C$ _p$ perform well for ``complex'' true models and poorly for ``simple'' true models, while BIC does just the opposite. In practice the nature of the true model, ``simple'' or ``complex'', is never known. Thus a data-driven choice of the model complexity penalty $ \theta$ is desirable. Several methods have been proposed to estimate $ \theta$ ([41,43,4,42,48]). We now discuss the method of [48] based on the generalized degrees of freedom. We will discuss the cross-validation method ([43]) in the next section.

Now consider both $ \lambda $ and $ \theta$ in (1.20) as unknown parameters. Denote by $ \widehat{\lambda}(\theta)$ the model index selected by (1.20) for a fixed $ \theta$, and by $ \widehat{\boldsymbol{f}}_{\widehat{\lambda}(\theta)}$ the estimate based on the selected model; the dependence on $ \theta$ is made explicit. We now want to find the $ \theta$ that minimizes the $ \mathrm{MSE}$

$\displaystyle \notag \mathrm{MSE}(\theta) = \mathrm{E} \left(\frac{1}{n} \vert\vert\widehat{\boldsymbol{f}}_{\widehat{\lambda}(\theta)}-\boldsymbol{f}\vert\vert^2\right).$

Again, we need to estimate $ \mathrm{MSE}(\theta)$. As in (1.21), we have

$\displaystyle \mathrm{E} \left(\frac{1}{n} \vert\vert\boldsymbol{y} - \widehat{\boldsymbol{f}}_{\widehat{\lambda}(\theta)}\vert\vert^2\right) = \mathrm{E} \left( \frac{1}{n} \vert\vert\boldsymbol{y} - \boldsymbol{f}\vert\vert^2 - \frac{2}{n} \boldsymbol{\epsilon}^{\top}\left(\widehat{\boldsymbol{f}}_{\widehat{\lambda}(\theta)} - \boldsymbol{f}\right) + \frac{1}{n} \vert\vert\boldsymbol{f} - \widehat{\boldsymbol{f}}_{\widehat{\lambda}(\theta)}\vert\vert^2 \right) = \sigma^2 - \frac{2}{n} \sigma^2 g_0(\theta) + \mathrm{MSE}(\theta)\,,$

where

$\displaystyle \notag g_0(\theta) = \mathrm{E}\, \boldsymbol{\epsilon}^{\top} \left(\widehat{\boldsymbol{f}}_{\widehat{\lambda}(\theta)}-\boldsymbol{f}\right) / \sigma^2$

is the generalized degrees of freedom (gdf) defined in [58]. Thus, ignoring a constant $ \sigma ^2$,

$\displaystyle G(\theta) = \frac{1}{n} \vert\vert \boldsymbol{y} - \widehat{\boldsymbol{f}}_{\widehat{\lambda}(\theta)}\vert\vert^2 + \frac{2}{n} \sigma^2 g_0(\theta)$ (1.22)

is an unbiased estimate of $ \mathrm{MSE}(\theta)$. A more rigorous justification can be found in [48]. Note that $ \widehat{\lambda}(\theta)$ depends on $ \boldsymbol{y}$, so $ \widehat{\boldsymbol{f}}_{\widehat{\lambda}(\theta)}=H(\widehat{\lambda}(\theta)) \boldsymbol{y}$ is a non-linear estimator and usually $ g_0(\theta) \ne tr H(\widehat{\lambda}(\theta))$. We therefore need to estimate the gdf $ g_0(\theta)$. It can be shown ([58]) that

$\displaystyle \notag \widehat{g}_0(\theta) = \int \boldsymbol{\delta}^{\top}\, \widehat{\boldsymbol{f}}_{\widehat{\lambda}(\theta)}(\boldsymbol{y}+\boldsymbol{\delta})\, \phi_\tau(\boldsymbol{\delta})\, \mathrm{d} \boldsymbol{\delta} \big/ \tau^2$

is an approximately unbiased estimate of $ g_0(\theta)$, where $ \boldsymbol{\delta} \sim N(\boldsymbol{0},\tau^2 I)$, $ \phi_\tau(\boldsymbol{\delta})$ is the $ n$-dimensional density of $ N(\boldsymbol{0},\tau^2 I)$, and $ \widehat{\boldsymbol{f}}_{\widehat{\lambda}(\theta)}(\boldsymbol{y}+\boldsymbol{\delta})$ is the fit to the perturbed data $ \boldsymbol{y}+\boldsymbol{\delta}$. [58] suggested the following Monte Carlo approach to compute $ \widehat{g}_0(\theta)$: for a fixed $ \theta$,
  1. Draw an $ n$-dimensional sample $ \boldsymbol{\delta} \sim N(\boldsymbol{0},\tau^2 I)$, use the perturbed sample $ \boldsymbol{y}+\boldsymbol{\delta}$ to select a model based on (1.20) with the fixed $ \theta$, and calculate the fits $ \widehat{\boldsymbol{f}}_{\widehat{\lambda}(\theta)}(\boldsymbol{y}+\boldsymbol{\delta})$.
  2. Repeat the above step $ T$ times to obtain $ \boldsymbol{\delta}_t=(\delta_{t1},\cdots,\delta_{tn})^{\top}$ and $ \widehat{\boldsymbol{f}}_{\widehat{\lambda}(\theta)}(\boldsymbol{y}+\boldsymbol{\delta}_t) \mathop{=}\limits^{\triangle} (\,\widehat{f}_{t1},\cdots,\widehat{f}_{tn})^{\top}$, $ t=1,\cdots,T$.
  3. For a fixed $ i$, $ i=1,\cdots,n$, calculate the regression slope $ \widehat{h}_i$ of the following linear model

    $\displaystyle \notag \widehat{f}_{ti} = \mu + \widehat{h}_i \delta_{ti},\qquad t=1,\cdots,T.$    

  4. Estimate $ g_0(\theta)$ by $ \sum_{i=1}^n \widehat{h}_i$.
$ \tau \in [0.5\sigma,\sigma]$ is generally a good choice, and the results are usually insensitive to the choice of $ \tau$ within this range ([58]). A data-driven choice of $ \theta$ is the minimizer of (1.22) with $ g_0(\theta)$ replaced by its estimate $ \sum_{i=1}^n \widehat{h}_i$ ([48]); the whole procedure is sketched below.
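
The sketch below renders steps 1-4 in Python for the trigonometric model, reusing trig_design and fpe_score from the earlier sketch; the helper names and the default $ \tau = 0.75\sigma$ are ours and serve only as an illustration of the procedure in [58] and [48].

\begin{verbatim}
import numpy as np

def fit_trig(y, x, lam):
    # Least squares fit of the trigonometric model with lam frequencies.
    X = trig_design(x, lam)
    return X @ np.linalg.pinv(X) @ y

def select_lambda(y, x, theta, sigma2, max_lam):
    # Model index minimizing the FPE criterion (1.20) for a fixed theta.
    scores = [fpe_score(y, x, lam, theta, sigma2) for lam in range(max_lam + 1)]
    return int(np.argmin(scores))

def gdf_estimate(y, x, theta, sigma2, max_lam, T=1000, tau=None, rng=None):
    # Monte Carlo gdf: perturb y with N(0, tau^2 I) noise, reselect and refit
    # the model, then sum the per-observation regression slopes of the fits
    # on the perturbations (steps 1-4 above).
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    tau = 0.75 * np.sqrt(sigma2) if tau is None else tau
    deltas = rng.normal(scale=tau, size=(T, n))
    fits = np.empty((T, n))
    for t in range(T):
        lam_t = select_lambda(y + deltas[t], x, theta, sigma2, max_lam)
        fits[t] = fit_trig(y + deltas[t], x, lam_t)
    dc = deltas - deltas.mean(axis=0)
    fc = fits - fits.mean(axis=0)
    slopes = (dc * fc).sum(axis=0) / (dc ** 2).sum(axis=0)   # slope h_i for each i
    return slopes.sum()

def G_score(y, x, theta, sigma2, max_lam, **kw):
    # Criterion (1.22) with g_0(theta) replaced by its Monte Carlo estimate.
    lam = select_lambda(y, x, theta, sigma2, max_lam)
    rss = np.mean((y - fit_trig(y, x, lam)) ** 2)
    return rss + 2.0 / len(y) * sigma2 * gdf_estimate(y, x, theta, sigma2, max_lam, **kw)
\end{verbatim}

A data-driven $ \theta$ is then the minimizer of G_score over a grid of $ \theta$ values, e.g. in $ [0,\log n]$.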

When $ \sigma ^2$ is unknown, one may replace $ \sigma ^2$ in (1.20) and (1.22) by a consistent estimate. Many estimators have been proposed in the literature ([44,17,14,23,15]). Rice's estimator is one of the simplest: for model (1.2), [44] proposed to estimate $ \sigma ^2$ by

$\displaystyle \notag \tilde{\sigma}^2 = \frac{1}{2(n-1)} \sum_{i=2}^n (y_i-y_{i-1})^2\,.$    

In the remainder of this chapter, $ \sigma ^2$ is replaced by $ \tilde{\sigma}^2$ whenever necessary.
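
A direct transcription of Rice's estimator, in the same illustrative Python setting as above:

\begin{verbatim}
import numpy as np

def rice_sigma2(y):
    # Rice's first-order difference estimator:
    # sigma^2 ~ sum_{i=2}^n (y_i - y_{i-1})^2 / (2 * (n - 1)).
    y = np.asarray(y, dtype=float)
    return np.sum(np.diff(y) ** 2) / (2.0 * (len(y) - 1))
\end{verbatim}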

Another option, assuming the distribution of the $ y_i$'s is known, is to replace $ \mathrm{GOF}(M_\lambda)$ in (1.19) by $ -2\log(\text{maximum likelihood})$. For regression models with Gaussian random errors, this leads to

$\displaystyle n \log(\vert\vert\boldsymbol{y} - \widehat{\boldsymbol{f}}_{\lambda}\vert\vert^2) + \theta tr H(\lambda)\,.$ (1.23)

Again, $ \theta=2$ and $ \theta=\log n$ correspond to the $ \mathrm{AIC}$ and $ \mathrm{BIC}$ criteria, respectively. The same data-driven procedure discussed above may also be used to select $ \theta$.

Because it is derived from asymptotic arguments, the $ \mathrm{AIC}$ method may lead to over-fitting for small samples ([10,28]). The following $ \mathrm{AIC}_c$ criterion modifies (1.23) with a second-order bias adjustment ([28])

$\displaystyle \notag \mathrm{AIC}_c = n \log(\vert\vert\boldsymbol{y} - \widehat{\boldsymbol{f}}_{\lambda}\vert\vert^2) + 2 tr H(\lambda) \frac{n}{n-tr H(\lambda)-1}\,.$

$ \mathrm{AIC}_c$ should be used when the ratio between $ n$ and the number of parameters in the largest candidate model is small, say less than 40 ([10]). In our trigonometric model, the highest dimension may reach $ n$. Thus we will use $ \mathrm{AIC}_c$ in our computations.
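
For completeness, a small Python sketch of criterion (1.23) and the $ \mathrm{AIC}_c$ adjustment, with df standing for $ tr H(\lambda)$; as before, the function names are ours.

\begin{verbatim}
import numpy as np

def gaussian_ic(y, fhat, df, theta):
    # Criterion (1.23): n * log(RSS) + theta * df,
    # where theta = 2 gives AIC and theta = log(n) gives BIC.
    n = len(y)
    return n * np.log(np.sum((y - fhat) ** 2)) + theta * df

def aic_c(y, fhat, df):
    # AIC with the second-order bias adjustment 2 * df * n / (n - df - 1).
    n = len(y)
    return n * np.log(np.sum((y - fhat) ** 2)) + 2.0 * df * n / (n - df - 1.0)
\end{verbatim}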

Now consider the trigonometric model. It is easy to check that criterion (1.20) reduces to

$\displaystyle \notag \sum_{\nu=\lambda+1}^K \left(\tilde{y}_{2\nu}^2+\tilde{y}_{2\nu+1}^2\right) + \frac{\theta}{n} \sigma^2 (2\lambda+1)\,.$    

Thus adding the $ \lambda $th frequency reduces the RSS by $ \tilde{y}_{2\lambda}^2+\tilde{y}_{2\lambda+1}^2$ and increases the complexity part by $ 2\theta \sigma^2/n$. When $ \tilde{y}_{2\lambda}^2+\tilde{y}_{2\lambda+1}^2$ decreases with increasing $ \lambda $, one should keep adding frequencies until $ \tilde{y}_{2\lambda}^2+\tilde{y}_{2\lambda+1}^2 \le 2\theta \sigma^2/n$. It is not difficult to see that the $ \mathrm{C}_p$ criterion corresponds to applying rule (1.18) with $ \alpha_{2\lambda}^2+\alpha_{2\lambda+1}^2$ replaced by its unbiased estimate $ \tilde{y}_{2\lambda}^2+\tilde{y}_{2\lambda+1}^2-2\sigma^2/n$. Other data-based thresholding methods can be found in [15], [5], [61] and [26].
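
The thresholding form of the criterion suggests the following selection rule, sketched in Python under the assumption that the coefficients $ \tilde{y}_\nu$ are stored in the order (constant, $ \cos$, $ \sin$, $ \cos$, $ \sin$, ...); this illustrates the rule above and is not code from the cited references.

\begin{verbatim}
import numpy as np

def select_freq_by_threshold(y_tilde, theta, sigma2, n):
    # Keep adding frequencies while the paired coefficient energy
    # y~_{2 nu}^2 + y~_{2 nu + 1}^2 exceeds the complexity cost
    # 2 * theta * sigma^2 / n.
    y_tilde = np.asarray(y_tilde, dtype=float)
    K = (len(y_tilde) - 1) // 2
    lam = 0
    for nu in range(1, K + 1):
        energy = y_tilde[2 * nu - 1] ** 2 + y_tilde[2 * nu] ** 2
        if energy <= 2.0 * theta * sigma2 / n:
            break
        lam = nu
    return lam
\end{verbatim}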

Figure 1.6: Left: Scores of the $ \textrm {AIC}_c$, $ \textrm {BIC}$ and $ \textrm {C}_p$ criteria, marked as ``a'', ``b'' and ``c'' respectively. Middle: degrees of freedom (solid line) and estimated gdf (dashed line). Right: RSS (dotted line), model complexity part (dashed line) and the G score (solid line) in (1.22)
\includegraphics[width=11.7cm]{text/3-1/criteria.eps}

Fitting trigonometric models to the climate data, we plot the scores of the $ \mathrm{AIC}_c$, $ \mathrm{BIC}$ and $ \mathrm{C}_p$ criteria as functions of the frequency in the left panel of Fig. 1.6. The $ \mathrm{AIC}_c$ and $ \mathrm{C}_p$ criteria reach their minima at $ \lambda=2$, and the $ \mathrm{BIC}$ criterion reaches its minimum at $ \lambda=1$. For a grid of $ \theta$ in the interval $ [0,\log n]$, we calculate the optimal $ \lambda $, $ \widehat{\lambda}(\theta)$, based on (1.20). We also calculate the estimated gdf using $ T=1000$ and $ \tau=0.75 \tilde{\sigma}$. The middle panel of Fig. 1.6 shows the estimated gdf together with the degrees of freedom of the selected model, $ 2\widehat{\lambda}(\theta)+1$. The gdf is intended to account for the extra cost of estimating $ \lambda $. As expected, the gdf is almost always larger than the degrees of freedom. The gdf is close to the degrees of freedom when $ \theta$ is small or large; for intermediate values of $ \theta$ it can provide substantial corrections. Overall, the gdf smooths out the corners in the discrete degrees of freedom. The RSS, the complexity part $ 2\widehat{g}_0 \tilde{\sigma}^2$ and $ G(\theta)$ are plotted in the right panel of Fig. 1.6. The minimum of $ G(\theta)$ is reached at $ \theta=3.68$, with $ \widehat{\lambda}(3.68)=2$. Trigonometric model fits with $ \lambda=1$ and $ \lambda=2$ are shown in Fig. 1.2.

Fitting periodic spline models to the climate data, we plot the $ \mathrm{C}_p$ (UBR) criterion in the left panel of Fig. 1.7. The fit with the UBR choice of the smoothing parameter is shown in the right panel of Fig. 1.7.

Figure 1.7: Left: Scores of the $ \textrm {C}_p$ (UBR), GCV and GML criteria plotted as dotted, solid and dashed lines respectively. Minimum points are marked with vertical bars. Right: Observations (circles) and fits from the periodic spline with the UBR (dotted line), GCV (solid line) and GML (dashed line) choices of smoothing parameter
\includegraphics[width=11cm]{text/3-1/psfit.eps}

