1.2 Basic Concepts - Trade-Offs

We illustrate in this section that model selection boils down to a compromise between different aspects of a model. Occam's razor has been the guiding principle for this compromise: the model that fits the observations sufficiently well in the least complex way should be preferred. Formalizing this principle is, however, nontrivial.

To make "fits the observations sufficiently well" precise, one needs a quantity that measures how well a model fits the data. This quantity is often called the goodness-of-fit (GOF). It is usually the criterion used for estimation once a model has been chosen. For example, we used least squares (LS) as the measure of the GOF for regression models in Sect. 1.1. Other GOF measures include the likelihood for density estimation problems and the classification error for pattern recognition problems.

To make "the least complex way" precise, one needs a quantity that measures the complexity of a model. For a parametric model, a common measure of model complexity is the number of parameters in the model, often called the degrees of freedom (df). For a non-parametric regression model such as the periodic spline, $ tr A(\lambda)$, a direct extension of its parametric counterpart, is often used as a measure of model complexity ([24]); $ tr A(\lambda)$ will also be referred to as the degrees of freedom. The middle panel of Fig. 1.3 depicts how $ tr A(\lambda)$ for the periodic spline depends on the smoothing parameter $ \lambda $. Since there is a one-to-one correspondence between $ \lambda $ and $ tr A(\lambda)$, both are used as model indices ([24]). See [21] for discussions of some subtle issues concerning model indices for smoothing spline models. For some complicated models such as tree-based regression, there may be no obvious measure of model complexity ([58]). In such cases the generalized degrees of freedom defined in [58] may be used. Section 1.3 contains more details on the generalized degrees of freedom.
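Since the periodic spline estimator passes the constant term through untouched and shrinks the sine and cosine coefficients at frequency $ \nu $ by the filter weight $ 1/(1+\lambda (2\pi \nu )^4)$ (right panel of Fig. 1.3), its degrees of freedom have the closed form $ tr A(\lambda) = 1 + 2\sum_{\nu=1}^K 1/(1+\lambda(2\pi\nu)^4)$. The following minimal sketch evaluates this quantity; it is written in Python with NumPy, the function name periodic_spline_df is ours, and we assume $ n=2K+1$ equally spaced design points as in the climate example.

```python
import numpy as np

def periodic_spline_df(lam, K):
    """tr A(lambda) for the periodic spline, computed from the filter
    weights 1/(1 + lambda (2 pi nu)^4) of Fig. 1.3; the constant term
    always has weight one, and each frequency nu contributes one sine
    and one cosine term."""
    nu = np.arange(1, K + 1)
    w = 1.0 / (1.0 + lam * (2.0 * np.pi * nu) ** 4)
    return 1.0 + 2.0 * w.sum()

# df runs from n = 2K + 1 (interpolation, lambda = 0) down to 1 (lambda -> infinity)
K = 36  # n = 73 as in the climate example
for lam in [0.0, 1e-8, 1e-6, 1e-4, 1e-2]:
    print(f"lambda = {lam:g}: df = {periodic_spline_df(lam, K):.2f}")
```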

To illustrate the interplay between the GOF and model complexity, we fit trigonometric regression models from the smallest model with $ \lambda=0$ to the largest model with $ \lambda=K$. The square root of the residual sum of squares (RSS) is plotted against the degrees of freedom ( $ =2\lambda+1$) as circles in the left panel of Fig. 1.3. Similarly, we fit the periodic spline with a wide range of values for the smoothing parameter $ \lambda $, and again plot the square root of the RSS against the degrees of freedom ( $ =trA(\lambda)$) as the solid line in the left panel of Fig. 1.3. As expected, the RSS decreases to zero (interpolation) as the degrees of freedom increase to $ n$. After the initial big drop, the RSS keeps declining almost linearly. It is quite clear that the constant model does not fit the data well, but it is unclear which model fits the observations sufficiently well.
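The trigonometric part of this computation can be sketched as follows (Python with NumPy). Since the climate series is not reproduced here, the sketch generates stand-in responses from the simulation model $ f(t)=\sin(4\pi t^2)$, $ \sigma=0.5$ used later in this section; the function name trig_design is ours.

```python
import numpy as np

def trig_design(t, lam):
    """Design matrix of the trigonometric model: a constant column plus
    sqrt(2) sin / sqrt(2) cos pairs up to frequency lam."""
    cols = [np.ones_like(t)]
    for nu in range(1, lam + 1):
        cols.append(np.sqrt(2.0) * np.sin(2.0 * np.pi * nu * t))
        cols.append(np.sqrt(2.0) * np.cos(2.0 * np.pi * nu * t))
    return np.column_stack(cols)

n = 73
K = (n - 1) // 2
t = np.arange(1, n + 1) / n
rng = np.random.default_rng(0)
y = np.sin(4.0 * np.pi * t**2) + 0.5 * rng.standard_normal(n)  # stand-in data

for lam in range(K + 1):
    X = trig_design(t, lam)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # LS fit in M_lambda
    rss = np.sum((y - X @ beta) ** 2)
    print(2 * lam + 1, np.sqrt(rss))  # df and sqrt(RSS), the circles in Fig. 1.3
```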

Figure 1.3: Left: square root of RSS from the trigonometric model (circles) and periodic spline (line) plotted against the degrees of freedom. Middle: degrees of freedom of periodic spline plotted against the smoothing parameter on the logarithm base 10 scale. Right: weights of the periodic spline filter, $ 1/(1+\lambda (2\pi \nu )^4)$, plotted as a function of frequency $ \nu $. Six curves from top down correspond to six different $ \lambda $: 0, $ 10^{-8}$, $ 10^{-6}$, $ 10^{-4}$, $ 10^{-2}$ and $ \infty $
\includegraphics[width=\textwidth,clip]{text/3-1/ssr.eps}

The previous example shows that the GOF and complexity are two opposing aspects of a model: the approximation error decreases as the model complexity increases. On the other hand, Occam's razor suggests that simple models should be preferred to more complicated ones, other things being equal. Our goal is to find the "best" model that strikes a balance between these two conflicting aspects. To make the word "best" meaningful, one needs a target criterion that quantifies a model's performance. It is clear that the GOF cannot be used as the target because it would lead to the most complex model. Even though there is no universally accepted measure, some criteria are widely accepted and used in practice. We now discuss one of them, which is commonly used for regression models.

Let $ \widehat{f}_{\lambda}$ be an estimate of the function in model (1.2) based on the model space $ M_{\lambda}$. Let $ \boldsymbol{f} = (\,f(t_1),\cdots,f(t_n))^{\top}$ and $ \widehat{\boldsymbol{f}}_{\lambda} = (\,\widehat{f}_{\lambda}(t_1),\cdots,\widehat{f}_{\lambda}(t_n))^{\top}$. Define the mean squared error (MSE) by

$\displaystyle \mathrm{MSE}(\lambda) = \mathrm{E} \left(\frac{1}{n} \vert\vert\,\widehat{\boldsymbol{f}}_{\lambda}-\boldsymbol{f}\vert\vert^2\right)\,.$

We want the estimate $ \widehat{f}_{\lambda}$ to be as close to the true function $ f$ as possible. The MSE is the expectation of the average squared Euclidean distance between the estimates and the true values. The $ L_2$ distance between $ \widehat{f}_\lambda$ and $ f$ may also be used. The MSE can be decomposed into two components:

$\displaystyle \mathrm{MSE}(\lambda)$ $\displaystyle = \frac{1}{n} \mathrm{E} \vert\vert (\mathrm{E}\,\widehat{\boldsymbol{f}}_{\lambda}-\boldsymbol{f})+(\,\widehat{\boldsymbol{f}}_{\lambda}-\mathrm{E}\,\widehat{\boldsymbol{f}}_{\lambda})\vert\vert^2$
  $\displaystyle = \frac{1}{n} \mathrm{E} \vert\vert \mathrm{E}\,\widehat{\boldsymbol{f}}_{\lambda}-\boldsymbol{f}\vert\vert^2 + \frac{1}{n} \mathrm{E} \vert\vert\,\widehat{\boldsymbol{f}}_{\lambda}-\mathrm{E}\,\widehat{\boldsymbol{f}}_{\lambda}\vert\vert^2$
  $\displaystyle = \frac{1}{n} \vert\vert \mathrm{E}\,\widehat{\boldsymbol{f}}_{\lambda}-\boldsymbol{f}\vert\vert^2 + \frac{1}{n} \mathrm{E} \vert\vert\,\widehat{\boldsymbol{f}}_{\lambda}-\mathrm{E}\,\widehat{\boldsymbol{f}}_{\lambda}\vert\vert^2$
  $\displaystyle \mathop{=}^{\triangle}{}\; \mathrm{Bias}^2 + \mathrm{Variance} \,,$ (1.15)

where the cross term vanishes because $ \mathrm{E}(\,\widehat{\boldsymbol{f}}_{\lambda}-\mathrm{E}\,\widehat{\boldsymbol{f}}_{\lambda})=\boldsymbol{0}$.

The $ \mathrm{Bias}^2$ measures how well the model space $ M_\lambda$ approximates the true function $ f$, and the Variance measures how well the function can be estimated within $ M_\lambda$. Usually a larger model space leads to smaller $ \mathrm{Bias}^2$ but larger Variance. Thus, the MSE represents a trade-off between $ \mathrm{Bias}^2$ and Variance.
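The decomposition is easy to verify numerically. The sketch below (Python with NumPy) estimates both sides of (1.15) by Monte Carlo for one arbitrarily chosen trigonometric model, borrowing the simulation setting $ f(t)=\sin(4\pi t^2)$, $ \sigma=0.5$, $ n=73$ used later in this section.

```python
import numpy as np

# Monte Carlo check of the decomposition (1.15) for one fixed model:
# trigonometric regression with lam = 3 frequencies (a hypothetical choice).
rng = np.random.default_rng(1)
n, sigma, lam = 73, 0.5, 3
t = np.arange(1, n + 1) / n
f = np.sin(4.0 * np.pi * t**2)          # true mean vector (simulation model)

X = np.column_stack(
    [np.ones(n)]
    + [g(2.0 * np.pi * nu * t) * np.sqrt(2.0)
       for nu in range(1, lam + 1) for g in (np.sin, np.cos)])
P = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix of the LS fit

fits = np.array([P @ (f + sigma * rng.standard_normal(n)) for _ in range(5000)])
fbar = fits.mean(axis=0)                # Monte Carlo estimate of E f_hat
mse = np.mean(np.sum((fits - f) ** 2, axis=1)) / n
bias2 = np.sum((fbar - f) ** 2) / n
var = np.mean(np.sum((fits - fbar) ** 2, axis=1)) / n
print(mse, bias2 + var)                 # agree up to Monte Carlo error
```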

Another closely related target criterion is the average predictive squared error (PSE)

$\displaystyle \mathrm{PSE}(\lambda) = \mathrm{E} \left(\frac{1}{n} \vert\vert\boldsymbol{y}^+-\widehat{\boldsymbol{f}}_\lambda\vert\vert^2\right),$ (1.16)

where $ \boldsymbol{y}^+=\boldsymbol{f}+\boldsymbol{\epsilon}^+$ are new observations at the same design points, $ \boldsymbol{\epsilon}^+=(\epsilon_1^+,\cdots,\epsilon_n^+)^{\top}$ are independent of $ \boldsymbol{\epsilon}$, and $ \epsilon_i^+$'s are independent and identically distributed with mean zero and variance $ \sigma ^2$. PSE measures the performance of a model's prediction for new observations. We have

$\displaystyle \mathrm{PSE}(\lambda) = \mathrm{E} \left(\frac{1}{n} \vert\vert\boldsymbol{\epsilon}^+ + (\boldsymbol{f}-\widehat{\boldsymbol{f}}_\lambda)\vert\vert^2\right)=\sigma^2 + \mathrm{MSE}(\lambda)\,,$

since $ \boldsymbol{\epsilon}^+$ has mean zero and is independent of $ \widehat{\boldsymbol{f}}_\lambda$.

Thus the PSE differs from the MSE only by the constant $ \sigma ^2$. When justifying some criteria, including the C$ _p$ in Sect. 1.3, we will ignore a constant $ \sigma ^2$; the targets of these criteria are thus really the PSE rather than the MSE.

To illustrate the bias-variance trade-off, we now calculate MSE for the trigonometric regression and periodic spline models. For notational simplicity, we assume that $ f \in M_K$:

$\displaystyle f(t) = \alpha_1 + \sum_{\nu=1}^K \left(\alpha_{2\nu} \sqrt{2} \sin 2 \pi \nu t + \alpha_{2\nu+1} \sqrt{2} \cos 2 \pi \nu t \right) \,.$ (1.17)

Then $ \boldsymbol{f}=X_K \boldsymbol{\alpha}$, where $ \boldsymbol{\alpha}=(\alpha_1,\cdots,\alpha_n)^{\top}$. From the orthogonality relations (1.6), it is easy to check that $ \boldsymbol{\alpha} = X_K^{\top} \boldsymbol{f}/n$, the discrete Fourier transform of $ \boldsymbol{f}$.

Bias-Variance Trade-Off for the Trigonometric Regression. $ X_\lambda$ consists of the first $ 2\lambda+1$ columns of the orthogonal matrix $ X_K$. Thus $ X_\lambda^{\top} X_K=(nI_{2\lambda+1},\boldsymbol{0})$ and $ \mathrm{E}(\,\widehat{\boldsymbol{\beta}}_\lambda)=X_\lambda^{\top} X_K \boldsymbol{\alpha}/n = \boldsymbol{\alpha}_\lambda$, where $ \boldsymbol{\alpha}_\lambda$ consists of the first $ 2\lambda+1$ elements of $ \boldsymbol {\alpha }$. Thus $ \widehat{\boldsymbol{\beta}}_\lambda$ is unbiased. Furthermore,

$\displaystyle \mathrm{Bias}^2$ $\displaystyle = \frac{1}{n} \vert\vert(I_n-P(\lambda)) \boldsymbol{f}\vert\vert^2$
  $\displaystyle = \frac{1}{n} \left\vert\left\vert \frac{1}{\sqrt{n}} X_K^{\top} \left(I_n-\frac{1}{n}X_\lambda X_\lambda^{\top}\right) X_K \boldsymbol{\alpha} \right\vert\right\vert^2$
  $\displaystyle = \frac{1}{n^2} \left\vert\left\vert \left(nI_n - \left( \begin{array}{cc} nI_{2\lambda+1} & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{array} \right) \right) \boldsymbol{\alpha} \right\vert\right\vert^2$
  $\displaystyle = \sum_{\nu=\lambda+1}^K \left(\alpha_{2\nu}^2+\alpha_{2\nu+1}^2\right),$
$\displaystyle \mathrm{Variance}$ $\displaystyle = \frac{1}{n} \mathrm{E} \vert\vert P(\lambda) (\boldsymbol{y}-\boldsymbol{f})\vert\vert^2 = \frac{\sigma^2}{n}\, tr P^2(\lambda) = \frac{\sigma^2}{n} (2\lambda+1)\,.$

Thus

$\displaystyle \mathrm{MSE}(\lambda) = \sum_{\nu=\lambda+1}^K \left(\alpha_{2\nu}^2+ \alpha_{2\nu+1}^2\right)+ \frac{1}{n} \sigma^2 (2\lambda+1)\,.$    
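This closed form can be evaluated directly once the true coefficients $ \boldsymbol {\alpha }$ are known. A minimal sketch (Python with NumPy; the function name mse_trig is ours), using the true function $ f(t)=\sin(4\pi t^2)$ and $ \sigma=0.5$ from the simulation discussed below:

```python
import numpy as np

n, sigma = 73, 0.5
K = (n - 1) // 2
t = np.arange(1, n + 1) / n
f = np.sin(4.0 * np.pi * t**2)          # true function from the simulation below

# alpha = X_K^T f / n, the discrete Fourier transform of f (orthogonality (1.6))
XK = np.column_stack(
    [np.ones(n)]
    + [g(2.0 * np.pi * nu * t) * np.sqrt(2.0)
       for nu in range(1, K + 1) for g in (np.sin, np.cos)])
alpha = XK.T @ f / n

pairs = alpha[1::2] ** 2 + alpha[2::2] ** 2   # alpha_{2nu}^2 + alpha_{2nu+1}^2

def mse_trig(lam):
    """Bias^2 from the omitted frequencies plus sigma^2 (2 lam + 1) / n."""
    return pairs[lam:].sum() + sigma**2 * (2 * lam + 1) / n

best = min(range(K + 1), key=mse_trig)
print(best)  # should match the optimal nu = 4 reported in the text
```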

1.2.1 Remarks

  1. Adding the $ \lambda $th frequency terms into the model space reduces the $ \mathrm{Bias}^2$ by the amount $ \alpha_{2\lambda}^2+\alpha_{2\lambda+1}^2$ and increases the $ \mathrm{Variance}$ by $ 2 \sigma^2/n$.
  2. The optimal model based on the $ \mathrm{MSE}$ may not be the true model when $ \sigma^2>0$.
  3. Assuming $ \alpha_{2\lambda}^2+\alpha_{2\lambda+1}^2$ decreases with increasing $ \lambda $, one should keep adding frequencies until

    $\displaystyle \alpha_{2\lambda}^2+\alpha_{2\lambda+1}^2 \le 2\sigma^2/n.$ (1.18)

  4. $ \mathrm{Bias}^2$ does not depend on the sample size $ n$, while the $ \mathrm{Variance}$ is inversely proportional to $ n$. Thus, as $ n$ increases, more frequencies should be included.

Bias-Variance Trade-Off for Periodic Spline. For the approximate periodic spline estimator, it is easy to check that $ \mathrm{E}(\tilde{\boldsymbol{y}})=\boldsymbol{\alpha}$, $ \mathrm{Var}(\tilde{\boldsymbol{y}})=\sigma^2 I/n$, $ \mathrm{E}(\,\widehat{\boldsymbol{\alpha}})=D \boldsymbol{\alpha}$, $ \mathrm{Var}(\,\widehat{\boldsymbol{\alpha}})=\sigma^2 D^2/n$, $ \mathrm{E}(\,\widehat{\boldsymbol{f}}_{\lambda})=X_K D \boldsymbol{\alpha}$, and $ \mathrm{Var}(\,\widehat{\boldsymbol{f}}_{\lambda})=\sigma^2 X_K D^2 X_K^{\top}/n$. Thus all coefficients are shrunk toward zero except $ \widehat{\alpha}_1$, which is unbiased. The amount of shrinkage is controlled by the smoothing parameter $ \lambda $. It is straightforward to calculate the $ \mathrm{Bias}^2$ and $ \mathrm{Variance}$ in (1.15):

$\displaystyle \mathrm{Bias}^2$ $\displaystyle = \frac{1}{n} \vert\vert X_K \boldsymbol{\alpha} - X_K D \boldsymbol{\alpha} \vert\vert^2 = \frac{1}{n} \left\vert\left\vert \frac{1}{\sqrt{n}} X_K^{\top} \left(X_K \boldsymbol{\alpha} - X_K D \boldsymbol{\alpha}\right) \right\vert\right\vert^2$
  $\displaystyle = \vert\vert (I-D) \boldsymbol{\alpha} \vert\vert^2 = \sum_{\nu=1}^K \left( \frac{\lambda (2 \pi \nu)^4}{1+\lambda (2 \pi \nu)^4} \right)^2 \left(\alpha_{2\nu}^2+\alpha_{2\nu+1}^2\right) \,,$
$\displaystyle \mathrm{Variance}$ $\displaystyle = \frac{1}{n^2}\, \sigma^2\, tr\left(X_K D^2 X_K^{\top}\right) = \frac{\sigma^2}{n}\, tr D^2 = \frac{\sigma^2}{n} \left( 1+2 \sum_{\nu=1}^K \left(\frac{1}{1+\lambda (2 \pi \nu)^4}\right)^2 \right) .$

Thus

$\displaystyle \mathrm{MSE}(\lambda)=\sum_{\nu=1}^K \left( \frac{\lambda (2 \pi \nu)^4}{1+\lambda (2 \pi \nu)^4} \right)^2 \left(\alpha_{2\nu}^2+\alpha_{2\nu+1}^2\right) + \frac{\sigma^2}{n} \left( 1+2 \sum_{\nu=1}^K \left(\frac{1}{1+\lambda (2 \pi \nu)^4}\right)^2 \right) \,.$

It is easy to see that as $ \lambda $ increases from zero to infinity, the $ \mathrm{Bias}^2$ increases from zero to $ \sum_{\nu=1}^K ( \alpha_{2\nu}^2+\alpha_{2\nu+1}^2 )$ and the $ \mathrm{Variance}$ decreases from $ \sigma ^2$ to $ \sigma^2/n$.
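Both terms, and hence the $ \mathrm{MSE}$, can be traced as functions of $ \lambda $ from these closed forms. A minimal sketch (Python with NumPy; the function name ps_bias_var is ours), using the simulation setting introduced next:

```python
import numpy as np

n, sigma = 73, 0.5
K = (n - 1) // 2
t = np.arange(1, n + 1) / n
XK = np.column_stack(
    [np.ones(n)]
    + [g(2.0 * np.pi * nu * t) * np.sqrt(2.0)
       for nu in range(1, K + 1) for g in (np.sin, np.cos)])
alpha = XK.T @ np.sin(4.0 * np.pi * t**2) / n   # coefficients of the true f
pairs = alpha[1::2] ** 2 + alpha[2::2] ** 2

def ps_bias_var(lam):
    """Closed-form Bias^2 and Variance of the periodic spline."""
    nu = np.arange(1, K + 1)
    w = 1.0 / (1.0 + lam * (2.0 * np.pi * nu) ** 4)   # shrinkage weights
    bias2 = np.sum((1.0 - w) ** 2 * pairs)
    var = sigma**2 / n * (1.0 + 2.0 * np.sum(w**2))
    return bias2, var

# sweep log10(lambda) to trace the curves in the right panel of Fig. 1.5
for loglam in np.linspace(-8.0, 0.0, 9):
    b2, v = ps_bias_var(10.0**loglam)
    print(f"log10(lambda) = {loglam:4.1f}: "
          f"Bias^2 = {b2:.4f}, Var = {v:.4f}, MSE = {b2 + v:.4f}")
```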

To calculate the $ \mathrm{MSE}$, one needs to know the true function. We use the following simulation for illustration. We generate responses from model (1.2) with $ f(t)=\sin (4\pi t^2)$ and $ \sigma=0.5$. The same design points as in the climate data are used: $ t_i=i/n$, $ i=1,\cdots,n$, and $ n=73$. The true function and responses are shown in the left panel of Fig. 1.4. We compute $ b_\nu=\log(a_{2\nu}^2+a_{2\nu+1}^2)$, $ \nu =1,\cdots ,K$, where $ b_\nu $ represents the contribution from frequency $ \nu $. In the right panel of Fig. 1.4, we plot $ b_\nu $ against frequency $ \nu $ with the threshold, $ \log(2\sigma^2/n)$, marked as the dashed line. Except for $ \nu=1$, $ b_\nu $ decreases as $ \nu $ increases. Values of $ b_\nu $ are above the threshold for the first four frequencies. Thus the optimal choice is $ \nu=4$.
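A sketch of this screening rule (Python with NumPy), where we take the coefficients in $ b_\nu $ to be the discrete Fourier coefficients $ \boldsymbol{\alpha} = X_K^{\top}\boldsymbol{f}/n$ of the true function:

```python
import numpy as np

n, sigma = 73, 0.5
K = (n - 1) // 2
t = np.arange(1, n + 1) / n
XK = np.column_stack(
    [np.ones(n)]
    + [g(2.0 * np.pi * nu * t) * np.sqrt(2.0)
       for nu in range(1, K + 1) for g in (np.sin, np.cos)])
alpha = XK.T @ np.sin(4.0 * np.pi * t**2) / n    # coefficients of the true f

b = np.log(alpha[1::2] ** 2 + alpha[2::2] ** 2)  # b_nu, nu = 1, ..., K
threshold = np.log(2.0 * sigma**2 / n)           # from (1.18)
print(np.flatnonzero(b > threshold) + 1)         # frequencies above the threshold
```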

Figure 1.4: Left: true function for the simulation (line) and observations (circles). Right: plot of $ b_\nu $, $ \nu =1,\cdots ,K$, as circles and the threshold, $ \log(2\sigma^2/n)$, as the dashed line
\includegraphics[width=10.9cm,clip]{text/3-1/simdata.eps}

$ \mathrm{Bias}^2$, $ \mathrm{Variance}$ and $ \mathrm{MSE}$ are plotted against frequency ( $ \log_{10}(\lambda)$) for trigonometric regression (periodic spline) in the left (right) panel of Fig. 1.5. Obviously, as the frequency ($ \lambda $) increases (decreases), the $ \mathrm{Bias}^2$ decreases and the $ \mathrm{Variance}$ increases. The $ \mathrm{MSE}$ represents a balance between $ \mathrm{Bias}^2$ and $ \mathrm{Variance}$. For the trigonometric regression, the minimum of the $ \mathrm{MSE}$ is reached at $ \nu=4$, as expected.

Figure 1.5: Bias$ ^2$ (dashed lines), $ \mathrm{Variance}$ (dotted lines) and $ \mathrm{MSE}$ (solid lines) for trigonometric regression on the left and periodic spline on the right
\includegraphics[width=10.9cm]{text/3-1/mse.eps}

After deciding on a target criterion such as the $ \mathrm{MSE}$, ideally one would select the model that minimizes this criterion. This is, however, not practical because the criterion usually involves some unknown quantities. For example, the $ \mathrm{MSE}$ depends on the unknown true function $ f$, which one wants to estimate in the first place. Thus one needs to estimate the criterion from the data and then minimize the estimated criterion. We discuss unbiased and cross-validation estimates of the $ \mathrm{MSE}$ in Sects. 1.3 and 1.4, respectively.

