1.2 Basic Concepts - Trade-Offs

We illustrate in this section that model selection boils down to a compromise between different aspects of a model. Occam's razor has been the guiding principle for this compromise: the model that fits the observations sufficiently well in the least complex way should be preferred. Formalizing this principle is, however, nontrivial.

To make "fits the observations sufficiently well" precise, one needs a quantity that measures how well a model fits the data. This quantity is often called the goodness-of-fit (GOF). It is usually the criterion used for estimation once a model has been chosen. For example, we used least squares (LS) as the measure of the GOF for regression models in Sect. 1.1. Other GOF measures include the likelihood for density estimation problems and the classification error for pattern recognition problems.

To make "the least complex way" precise, one needs a quantity that measures the complexity of a model. For a parametric model, a common measure of model complexity is the number of parameters in the model, often called the degrees of freedom (df). For a non-parametric regression model such as the periodic spline, $ tr A(\lambda)$, a direct extension of its parametric counterpart, is often used as a measure of model complexity ([24]); $ tr A(\lambda)$ will also be referred to as the degrees of freedom. The middle panel of Fig. 1.3 depicts how $ tr A(\lambda)$ for the periodic spline depends on the smoothing parameter $ \lambda $. Since there is a one-to-one correspondence between $ \lambda $ and $ tr A(\lambda)$, both are used as model indices ([24]). See [21] for discussions of some subtle issues concerning model indices for smoothing spline models. For some complicated models such as tree-based regression, there may be no obvious measure of model complexity ([58]). In such cases the generalized degrees of freedom defined in [58] may be used. Section 1.3 contains more details on the generalized degrees of freedom.
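Since the periodic spline estimator passes the constant term through untouched and shrinks the sine and cosine coefficients at frequency $ \nu $ by the filter weight $ 1/(1+\lambda (2\pi \nu )^4)$ (right panel of Fig. 1.3), its degrees of freedom have the closed form $ tr A(\lambda) = 1 + 2\sum_{\nu=1}^K 1/(1+\lambda(2\pi\nu)^4)$. The following minimal sketch evaluates this quantity; it is written in Python with NumPy, the function name periodic_spline_df is ours, and we assume $ n=2K+1$ equally spaced design points as in the climate example.

```python
import numpy as np

def periodic_spline_df(lam, K):
    """tr A(lambda) for the periodic spline, computed from the filter
    weights 1/(1 + lambda (2 pi nu)^4) of Fig. 1.3; the constant term
    always has weight one, and each frequency nu contributes one sine
    and one cosine term."""
    nu = np.arange(1, K + 1)
    w = 1.0 / (1.0 + lam * (2.0 * np.pi * nu) ** 4)
    return 1.0 + 2.0 * w.sum()

# df runs from n = 2K + 1 (interpolation, lambda = 0) down to 1 (lambda -> infinity)
K = 36  # n = 73 as in the climate example
for lam in [0.0, 1e-8, 1e-6, 1e-4, 1e-2]:
    print(f"lambda = {lam:g}: df = {periodic_spline_df(lam, K):.2f}")
```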

To illustrate the interplay between the GOF and model complexity, we fit trigonometric regression models from the smallest model with $ \lambda=0$ to the largest model with $ \lambda=K$. The square root of the residual sum of squares (RSS) is plotted against the degrees of freedom ( $ =2\lambda+1$) as circles in the left panel of Fig. 1.3. Similarly, we fit the periodic spline with a wide range of values for the smoothing parameter $ \lambda $, and again plot the square root of the RSS against the degrees of freedom ( $ =trA(\lambda)$) as the solid line in the left panel of Fig. 1.3. As expected, the RSS decreases to zero (interpolation) as the degrees of freedom increase to $ n$. After the initial big drop, the RSS keeps declining almost linearly. It is quite clear that the constant model does not fit the data well, but it is unclear which model fits the observations sufficiently well.
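The trigonometric part of this computation can be sketched as follows (Python with NumPy). Since the climate series is not reproduced here, the sketch generates stand-in responses from the simulation model $ f(t)=\sin(4\pi t^2)$, $ \sigma=0.5$ used later in this section; the function name trig_design is ours.

```python
import numpy as np

def trig_design(t, lam):
    """Design matrix of the trigonometric model: a constant column plus
    sqrt(2) sin / sqrt(2) cos pairs up to frequency lam."""
    cols = [np.ones_like(t)]
    for nu in range(1, lam + 1):
        cols.append(np.sqrt(2.0) * np.sin(2.0 * np.pi * nu * t))
        cols.append(np.sqrt(2.0) * np.cos(2.0 * np.pi * nu * t))
    return np.column_stack(cols)

n = 73
K = (n - 1) // 2
t = np.arange(1, n + 1) / n
rng = np.random.default_rng(0)
y = np.sin(4.0 * np.pi * t**2) + 0.5 * rng.standard_normal(n)  # stand-in data

for lam in range(K + 1):
    X = trig_design(t, lam)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # LS fit in M_lambda
    rss = np.sum((y - X @ beta) ** 2)
    print(2 * lam + 1, np.sqrt(rss))  # df and sqrt(RSS), the circles in Fig. 1.3
```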

Figure 1.3: Left: square root of RSS from the trigonometric model (circles) and periodic spline (line) plotted against the degrees of freedom. Middle: degrees of freedom of periodic spline plotted against the smoothing parameter on the logarithm base 10 scale. Right: weights of the periodic spline filter, $ 1/(1+\lambda (2\pi \nu )^4)$, plotted as a function of frequency $ \nu $. Six curves from top down correspond to six different $ \lambda $: 0, $ 10^{-8}$, $ 10^{-6}$, $ 10^{-4}$, $ 10^{-2}$ and $ \infty $
\includegraphics[width=\textwidth,clip]{text/3-1/ssr.eps}

The previous example shows that the GOF and complexity are two opposing aspects of a model: the approximation error decreases as the model complexity increases. On the other hand, Occam's razor suggests that simple models should be preferred to more complicated ones, other things being equal. Our goal is to find the "best" model that strikes a balance between these two conflicting aspects. To make the word "best" meaningful, one needs a target criterion that quantifies a model's performance. It is clear that the GOF cannot be used as the target because it would lead to the most complex model. Even though there is no universally accepted measure, some criteria are widely accepted and used in practice. We now discuss one of them, which is commonly used for regression models.

Let $ \widehat{f}_{\lambda}$ be an estimate of the function in model (1.2) based on the model space $ M_{\lambda}$. Let $ \boldsymbol{f} = (\,f(t_1),\cdots,f(t_n))^{\top}$ and $ \widehat{\boldsymbol{f}}_{\lambda} = (\,\widehat{f}_{\lambda}(t_1),\cdots,\widehat{f}_{\lambda}(t_n))^{\top}$. Define the mean squared error (MSE) by

$\displaystyle \mathrm{MSE}(\lambda) = \mathrm{E} \left(\frac{1}{n} \vert\vert\,\widehat{\boldsymbol{f}}_{\lambda}-\boldsymbol{f}\vert\vert^2\right)\,.$

We want the estimate $ \widehat{f}_{\lambda}$ to be as close to the true function $ f$ as possible. The MSE is the expectation of the average squared Euclidean distance between the estimates and the true values. The $ L_2$ distance between $ \widehat{f}_\lambda$ and $ f$ may also be used. The MSE can be decomposed into two components:

$\displaystyle \mathrm{MSE}(\lambda)$ $\displaystyle = \frac{1}{n} \mathrm{E} \vert\vert (\mathrm{E}\,\widehat{\boldsymbol{f}}_{\lambda}-\boldsymbol{f})+(\,\widehat{\boldsymbol{f}}_{\lambda}-\mathrm{E}\,\widehat{\boldsymbol{f}}_{\lambda})\vert\vert^2$
  $\displaystyle = \frac{1}{n} \mathrm{E} \vert\vert \mathrm{E}\,\widehat{\boldsymbol{f}}_{\lambda}-\boldsymbol{f}\vert\vert^2 + \frac{1}{n} \mathrm{E} \vert\vert\,\widehat{\boldsymbol{f}}_{\lambda}-\mathrm{E}\,\widehat{\boldsymbol{f}}_{\lambda}\vert\vert^2$
  $\displaystyle = \frac{1}{n} \vert\vert \mathrm{E}\,\widehat{\boldsymbol{f}}_{\lambda}-\boldsymbol{f}\vert\vert^2 + \frac{1}{n} \mathrm{E} \vert\vert\,\widehat{\boldsymbol{f}}_{\lambda}-\mathrm{E}\,\widehat{\boldsymbol{f}}_{\lambda}\vert\vert^2$
  $\displaystyle \mathop{=}^{\triangle}{}\; \mathrm{Bias}^2 + \mathrm{Variance} \,,$ (1.15)

where the cross term vanishes because $ \mathrm{E}(\,\widehat{\boldsymbol{f}}_{\lambda}-\mathrm{E}\,\widehat{\boldsymbol{f}}_{\lambda})=\boldsymbol{0}$.

The $ \mathrm{Bias}^2$ measures how well the model space $ M_\lambda$ approximates the true function $ f$, and the Variance measures how well the function can be estimated within $ M_\lambda$. Usually a larger model space leads to smaller $ \mathrm{Bias}^2$ but larger Variance. Thus, the MSE represents a trade-off between $ \mathrm{Bias}^2$ and Variance.
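The decomposition is easy to verify numerically. The sketch below (Python with NumPy) estimates both sides of (1.15) by Monte Carlo for one arbitrarily chosen trigonometric model, borrowing the simulation setting $ f(t)=\sin(4\pi t^2)$, $ \sigma=0.5$, $ n=73$ used later in this section.

```python
import numpy as np

# Monte Carlo check of the decomposition (1.15) for one fixed model:
# trigonometric regression with lam = 3 frequencies (a hypothetical choice).
rng = np.random.default_rng(1)
n, sigma, lam = 73, 0.5, 3
t = np.arange(1, n + 1) / n
f = np.sin(4.0 * np.pi * t**2)          # true mean vector (simulation model)

X = np.column_stack(
    [np.ones(n)]
    + [g(2.0 * np.pi * nu * t) * np.sqrt(2.0)
       for nu in range(1, lam + 1) for g in (np.sin, np.cos)])
P = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix of the LS fit

fits = np.array([P @ (f + sigma * rng.standard_normal(n)) for _ in range(5000)])
fbar = fits.mean(axis=0)                # Monte Carlo estimate of E f_hat
mse = np.mean(np.sum((fits - f) ** 2, axis=1)) / n
bias2 = np.sum((fbar - f) ** 2) / n
var = np.mean(np.sum((fits - fbar) ** 2, axis=1)) / n
print(mse, bias2 + var)                 # agree up to Monte Carlo error
```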

Another closely related target criterion is the average predictive squared error (PSE)

$\displaystyle \mathrm{PSE}(\lambda) = \mathrm{E} \left(\frac{1}{n} \vert\vert\boldsymbol{y}^+-\widehat{\boldsymbol{f}}_\lambda\vert\vert^2\right),$ (1.16)

where $ \boldsymbol{y}^+=\boldsymbol{f}+\boldsymbol{\epsilon}^+$ are new observations at the same design points, $ \boldsymbol{\epsilon}^+=(\epsilon_1^+,\cdots,\epsilon_n^+)^{\top}$ are independent of $ \boldsymbol{\epsilon}$, and $ \epsilon_i^+$'s are independent and identically distributed with mean zero and variance $ \sigma ^2$. PSE measures the performance of a model's prediction for new observations. We have

$\displaystyle \mathrm{PSE}(\lambda) = \mathrm{E} \left(\frac{1}{n} \vert\vert\boldsymbol{\epsilon}^+ + (\boldsymbol{f}-\widehat{\boldsymbol{f}}_\lambda)\vert\vert^2\right)=\sigma^2 + \mathrm{MSE}(\lambda)\,,$

since $ \boldsymbol{\epsilon}^+$ has mean zero and is independent of $ \widehat{\boldsymbol{f}}_\lambda$.

Thus the PSE differs from the MSE only by the constant $ \sigma ^2$. When justifying some criteria, including the C$ _p$ in Sect. 1.3, we will ignore a constant $ \sigma ^2$; the targets of these criteria are thus really the PSE rather than the MSE.

To illustrate the bias-variance trade-off, we now calculate MSE for the trigonometric regression and periodic spline models. For notational simplicity, we assume that $ f \in M_K$:

$\displaystyle f(t) = \alpha_1 + \sum_{\nu=1}^K \left(\alpha_{2\nu} \sqrt{2} \sin 2 \pi \nu t + \alpha_{2\nu+1} \sqrt{2} \cos 2 \pi \nu t \right) \,.$ (1.17)

Then $ \boldsymbol{f}=X_K \boldsymbol{\alpha}$, where $ \boldsymbol{\alpha}=(\alpha_1,\cdots,\alpha_n)^{\top}$. From the orthogonality relations (1.6), it is easy to check that $ \boldsymbol{\alpha} = X_K^{\top} \boldsymbol{f}/n$, the discrete Fourier transform of $ \boldsymbol{f}$.

Bias-Variance Trade-Off for the Trigonometric Regression. $ X_\lambda$ consists of the first $ 2\lambda+1$ columns of the orthogonal matrix $ X_K$. Thus $ X_\lambda^{\top} X_K=(nI_{2\lambda+1},\boldsymbol{0})$ and $ \mathrm{E}(\,\widehat{\boldsymbol{\beta}}_\lambda)=X_\lambda^{\top} X_K \boldsymbol{\alpha}/n = \boldsymbol{\alpha}_\lambda$, where $ \boldsymbol{\alpha}_\lambda$ consists of the first $ 2\lambda+1$ elements of $ \boldsymbol {\alpha }$. Thus $ \widehat{\boldsymbol{\beta}}_\lambda$ is unbiased. Furthermore,

$\displaystyle \mathrm{Bias}^2$ $\displaystyle = \frac{1}{n} \vert\vert(I_n-P(\lambda)) \boldsymbol{f}\vert\vert^2$
  $\displaystyle = \frac{1}{n} \left\vert\left\vert \frac{1}{\sqrt{n}} X_K^{\top} \left(I_n-\frac{1}{n}X_\lambda X_\lambda^{\top}\right) X_K \boldsymbol{\alpha} \right\vert\right\vert^2$
  $\displaystyle = \frac{1}{n^2} \left\vert\left\vert \left(nI_n - \left( \begin{array}{cc} nI_{2\lambda+1} & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{array} \right) \right) \boldsymbol{\alpha} \right\vert\right\vert^2$
  $\displaystyle = \sum_{\nu=\lambda+1}^K \left(\alpha_{2\nu}^2+\alpha_{2\nu+1}^2\right),$
$\displaystyle \mathrm{Variance}$ $\displaystyle = \frac{1}{n} \mathrm{E} \vert\vert P(\lambda) (\boldsymbol{y}-\boldsymbol{f})\vert\vert^2 = \frac{\sigma^2}{n}\, tr P^2(\lambda) = \frac{\sigma^2}{n} (2\lambda+1)\,.$

Thus

$\displaystyle \mathrm{MSE}(\lambda) = \sum_{\nu=\lambda+1}^K \left(\alpha_{2\nu}^2+ \alpha_{2\nu+1}^2\right)+ \frac{1}{n} \sigma^2 (2\lambda+1)\,.$    
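This closed form can be evaluated directly once the true coefficients $ \boldsymbol {\alpha }$ are known. A minimal sketch (Python with NumPy; the function name mse_trig is ours), using the true function $ f(t)=\sin(4\pi t^2)$ and $ \sigma=0.5$ from the simulation discussed below:

```python
import numpy as np

n, sigma = 73, 0.5
K = (n - 1) // 2
t = np.arange(1, n + 1) / n
f = np.sin(4.0 * np.pi * t**2)          # true function from the simulation below

# alpha = X_K^T f / n, the discrete Fourier transform of f (orthogonality (1.6))
XK = np.column_stack(
    [np.ones(n)]
    + [g(2.0 * np.pi * nu * t) * np.sqrt(2.0)
       for nu in range(1, K + 1) for g in (np.sin, np.cos)])
alpha = XK.T @ f / n

pairs = alpha[1::2] ** 2 + alpha[2::2] ** 2   # alpha_{2nu}^2 + alpha_{2nu+1}^2

def mse_trig(lam):
    """Bias^2 from the omitted frequencies plus sigma^2 (2 lam + 1) / n."""
    return pairs[lam:].sum() + sigma**2 * (2 * lam + 1) / n

best = min(range(K + 1), key=mse_trig)
print(best)  # should match the optimal nu = 4 reported in the text
```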

1.2.1 Remarks

  1. Adding the $ \lambda $th frequency terms into the model space reduces the $ \mathrm{Bias}^2$ by the amount $ \alpha_{2\lambda}^2+\alpha_{2\lambda+1}^2$ and increases the $ \mathrm{Variance}$ by $ 2 \sigma^2/n$.
  2. The optimal model based on the $ \mathrm{MSE}$ may not be the true model when $ \sigma^2>0$.
  3. Assuming $ \alpha_{2\lambda}^2+\alpha_{2\lambda+1}^2$ decreases with increasing $ \lambda $, one should keep adding frequencies until

    $\displaystyle \alpha_{2\lambda}^2+\alpha_{2\lambda+1}^2 \le 2\sigma^2/n.$ (1.18)

  4. $ \mathrm{Bias}^2$ does not depend on the sample size $ n$, while the $ \mathrm{Variance}$ is inversely proportional to $ n$. Thus, as $ n$ increases, more frequencies should be included.

Bias-Variance Trade-Off for Periodic Spline. For the approximate periodic spline estimator, it is easy to check that $ \mathrm{E}(\tilde{\boldsymbol{y}})=\boldsymbol{\alpha}$, $ \mathrm{Var}(\tilde{\boldsymbol{y}})=\sigma^2 I/n$, $ \mathrm{E}(\,\widehat{\boldsymbol{\alpha}})=D \boldsymbol{\alpha}$, $ \mathrm{Var}(\,\widehat{\boldsymbol{\alpha}})=\sigma^2 D^2/n$, $ \mathrm{E}(\,\widehat{\boldsymbol{f}}_{\lambda})=X_K D \boldsymbol{\alpha}$, and $ \mathrm{Var}(\,\widehat{\boldsymbol{f}}_{\lambda})=\sigma^2 X_K D^2 X_K^{\top}/n$. Thus all coefficients are shrunk toward zero except $ \widehat{\alpha}_1$, which is unbiased. The amount of shrinkage is controlled by the smoothing parameter $ \lambda $. It is straightforward to calculate the $ \mathrm{Bias}^2$ and $ \mathrm{Variance}$ in (1.15):

$\displaystyle \mathrm{Bias}^2$ $\displaystyle = \frac{1}{n} \vert\vert X_K \boldsymbol{\alpha} - X_K D \boldsymbol{\alpha} \vert\vert^2 = \frac{1}{n} \left\vert\left\vert \frac{1}{\sqrt{n}} X_K^{\top} \left(X_K \boldsymbol{\alpha} - X_K D \boldsymbol{\alpha}\right) \right\vert\right\vert^2$
  $\displaystyle = \vert\vert (I-D) \boldsymbol{\alpha} \vert\vert^2 = \sum_{\nu=1}^K \left( \frac{\lambda (2 \pi \nu)^4}{1+\lambda (2 \pi \nu)^4} \right)^2 \left(\alpha_{2\nu}^2+\alpha_{2\nu+1}^2\right) \,,$
$\displaystyle \mathrm{Variance}$ $\displaystyle = \frac{1}{n^2}\, \sigma^2\, tr\left(X_K D^2 X_K^{\top}\right) = \frac{\sigma^2}{n}\, tr D^2 = \frac{\sigma^2}{n} \left( 1+2 \sum_{\nu=1}^K \left(\frac{1}{1+\lambda (2 \pi \nu)^4}\right)^2 \right) .$

Thus

$\displaystyle \mathrm{MSE}(\lambda)=\sum_{\nu=1}^K \left( \frac{\lambda (2 \pi \nu)^4}{1+\lambda (2 \pi \nu)^4} \right)^2 \left(\alpha_{2\nu}^2+\alpha_{2\nu+1}^2\right) + \frac{\sigma^2}{n} \left( 1+2 \sum_{\nu=1}^K \left(\frac{1}{1+\lambda (2 \pi \nu)^4}\right)^2 \right) \,.$

It is easy to see that as $ \lambda $ increases from zero to infinity, the $ \mathrm{Bias}^2$ increases from zero to $ \sum_{\nu=1}^K ( \alpha_{2\nu}^2+\alpha_{2\nu+1}^2 )$ and the $ \mathrm{Variance}$ decreases from $ \sigma ^2$ to $ \sigma^2/n$.
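Both terms, and hence the $ \mathrm{MSE}$, can be traced as functions of $ \lambda $ from these closed forms. A minimal sketch (Python with NumPy; the function name ps_bias_var is ours), using the simulation setting introduced next:

```python
import numpy as np

n, sigma = 73, 0.5
K = (n - 1) // 2
t = np.arange(1, n + 1) / n
XK = np.column_stack(
    [np.ones(n)]
    + [g(2.0 * np.pi * nu * t) * np.sqrt(2.0)
       for nu in range(1, K + 1) for g in (np.sin, np.cos)])
alpha = XK.T @ np.sin(4.0 * np.pi * t**2) / n   # coefficients of the true f
pairs = alpha[1::2] ** 2 + alpha[2::2] ** 2

def ps_bias_var(lam):
    """Closed-form Bias^2 and Variance of the periodic spline."""
    nu = np.arange(1, K + 1)
    w = 1.0 / (1.0 + lam * (2.0 * np.pi * nu) ** 4)   # shrinkage weights
    bias2 = np.sum((1.0 - w) ** 2 * pairs)
    var = sigma**2 / n * (1.0 + 2.0 * np.sum(w**2))
    return bias2, var

# sweep log10(lambda) to trace the curves in the right panel of Fig. 1.5
for loglam in np.linspace(-8.0, 0.0, 9):
    b2, v = ps_bias_var(10.0**loglam)
    print(f"log10(lambda) = {loglam:4.1f}: "
          f"Bias^2 = {b2:.4f}, Var = {v:.4f}, MSE = {b2 + v:.4f}")
```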

To calculate the $ \mathrm{MSE}$, one needs to know the true function. We use the following simulation for illustration. We generate responses from model (1.2) with $ f(t)=\sin (4\pi t^2)$ and $ \sigma=0.5$. The same design points as in the climate data are used: $ t_i=i/n$, $ i=1,\cdots,n$, and $ n=73$. The true function and responses are shown in the left panel of Fig. 1.4. We compute $ b_\nu=\log(a_{2\nu}^2+a_{2\nu+1}^2)$, $ \nu =1,\cdots ,K$, where $ b_\nu $ represents the contribution from frequency $ \nu $. In the right panel of Fig. 1.4, we plot $ b_\nu $ against frequency $ \nu $ with the threshold, $ \log(2\sigma^2/n)$, marked as the dashed line. Except for $ \nu=1$, $ b_\nu $ decreases as $ \nu $ increases. Values of $ b_\nu $ are above the threshold for the first four frequencies. Thus the optimal choice is $ \nu=4$.
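A sketch of this screening rule (Python with NumPy), where we take the coefficients in $ b_\nu $ to be the discrete Fourier coefficients $ \boldsymbol{\alpha} = X_K^{\top}\boldsymbol{f}/n$ of the true function:

```python
import numpy as np

n, sigma = 73, 0.5
K = (n - 1) // 2
t = np.arange(1, n + 1) / n
XK = np.column_stack(
    [np.ones(n)]
    + [g(2.0 * np.pi * nu * t) * np.sqrt(2.0)
       for nu in range(1, K + 1) for g in (np.sin, np.cos)])
alpha = XK.T @ np.sin(4.0 * np.pi * t**2) / n    # coefficients of the true f

b = np.log(alpha[1::2] ** 2 + alpha[2::2] ** 2)  # b_nu, nu = 1, ..., K
threshold = np.log(2.0 * sigma**2 / n)           # from (1.18)
print(np.flatnonzero(b > threshold) + 1)         # frequencies above the threshold
```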

Figure 1.4: Left: true function for the simulation (line) and observations (circles). Right: plot of $ b_\nu $, $ \nu =1,\cdots ,K$, as circles and the threshold, $ \log(2\sigma^2/n)$, as the dashed line
\includegraphics[width=10.9cm,clip]{text/3-1/simdata.eps}

$ \mathrm{Bias}^2$, $ \mathrm{Variance}$ and $ \mathrm{MSE}$ are plotted against frequency ( $ \log_{10}(\lambda)$) for trigonometric regression (periodic spline) in the left (right) panel of Fig. 1.5. Obviously, as the frequency ($ \lambda $) increases (decreases), the $ \mathrm{Bias}^2$ decreases and the $ \mathrm{Variance}$ increases. The $ \mathrm{MSE}$ represents a balance between $ \mathrm{Bias}^2$ and $ \mathrm{Variance}$. For the trigonometric regression, the minimum of the $ \mathrm{MSE}$ is reached at $ \nu=4$, as expected.

Figure 1.5: Bias$ ^2$ (dashed lines), $ \mathrm{Variance}$ (dotted lines) and $ \mathrm{MSE}$ (solid lines) for trigonometric regression on the left and periodic spline on the right
\includegraphics[width=10.9cm]{text/3-1/mse.eps}

After deciding on a target criterion such as the $ \mathrm{MSE}$, ideally one would select the model that minimizes this criterion. This is, however, not practical because the criterion usually involves some unknown quantities. For example, the $ \mathrm{MSE}$ depends on the unknown true function $ f$, which one wants to estimate in the first place. Thus one needs to estimate the criterion from the data and then minimize the estimated criterion. We discuss unbiased and cross-validation estimates of the $ \mathrm{MSE}$ in Sects. 1.3 and 1.4, respectively.

