

1.1 Introduction

The need for model selection arises when a data-based choice among competing models has to be made. For example, for fitting parametric regression (linear, non-linear and generalized linear) models with multiple independent variables, one needs to decide which variables to include in the model (Chapts. III.7, III.8 and III.12); for fitting non-parametric regression (spline, kernel, local polynomial) models, one needs to decide the amount of smoothing (Chapts. III.5 and III.10); for unsupervised learning, one needs to decide the number of clusters (Chapts. III.13 and III.16); and for tree-based regression and classification, one needs to decide the size of a tree (Chapt. III.14).

Model choice is an integral and critical part of data analysis, an activity that has become increasingly important as ever-increasing computing power makes it possible to fit more realistic, flexible and complex models. There is a huge literature concerning this subject ([37,39,10,19]) and we shall restrict this chapter to basic concepts and principles. We will illustrate these basics using a climate data set, a simulation, and two regression models: parametric trigonometric regression and non-parametric periodic splines. We will discuss some commonly used model selection methods such as Akaike's AIC ([2]), Schwarz's BIC ([45]), Mallows' C$ _p$ ([38]), cross-validation (CV) ([49]), generalized cross-validation (GCV) ([13]) and Bayes factors ([31]). We do not intend to provide a comprehensive review. Readers may find additional model selection methods in the following chapters.

Let $ \mathcal{M}=\{ M_{\lambda},~ \lambda \in \Lambda \}$ be a collection of candidate models from which one will select a model for the observed data. $ \lambda $ is the model index belonging to a set $ \Lambda$ which may be finite, countable or uncountable.

Variable selection in multiple regression is perhaps the most common form of model selection in practice. Consider the following linear model

$\displaystyle y_i = \mathbf{x}_i^{\top} \mathbf{\beta} + \epsilon_i, \qquad i=1,2,\cdots,n\,,$ (1.1)

where $ \mathbf{x}_i$ are vectors of $ m$ independent variables, $ \mathbf{\beta}$ is an $ m$-vector of parameters, and $ \epsilon_i$ are random errors. Often a large number of independent variables are investigated in the model (1.1) and it is likely that not all $ m$ variables are important. Statistical inferences can be carried out more efficiently with smaller models. The goal of variable selection is to find a subset of these $ m$ independent variables which is optimal in a certain sense. In this case, $ \Lambda$ is the collection of all $ 2^m$ subsets and $ \lambda $ is any particular subset.
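For concreteness, the following minimal sketch enumerates all $ 2^m$ subsets and fits each candidate model by least squares. The data, variable names and coefficient values are purely illustrative assumptions, not taken from this chapter:

```python
import itertools
import numpy as np

# Simulated data: n observations on m candidate independent variables.
rng = np.random.default_rng(0)
n, m = 73, 4
X = rng.normal(size=(n, m))
beta = np.array([2.0, 0.0, -1.0, 0.0])        # only two variables truly matter
y = X @ beta + rng.normal(scale=0.5, size=n)

# Lambda = all 2^m subsets of {0, ..., m-1}; each lambda is one particular subset.
for lam in itertools.chain.from_iterable(
        itertools.combinations(range(m), k) for k in range(m + 1)):
    X_lam = np.column_stack([np.ones(n)] + [X[:, j] for j in lam])  # intercept + selected variables
    beta_hat = np.linalg.lstsq(X_lam, y, rcond=None)[0]
    rss = np.sum((y - X_lam @ beta_hat) ** 2)
    print(lam, round(float(rss), 2))
```

Criteria that turn these $ 2^m$ candidate fits into a single choice are the subject of Sects. 1.3-1.5.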

For illustration, we will use part of a climate data set downloaded from the Carbon Dioxide Information Analysis Center at http://cdiac.ornl.gov/ftp/ndp070. The data consists of daily maximum temperatures and other climatological variables from $ 1062$ stations across the contiguous United States. We choose daily maximum temperatures from the station in Charleston, South Carolina, which has the longest record, from 1871 to 1997. We use records in the year 1990 as observations. Records from other years provide information about the population. To avoid correlation (see Sect. 1.6) and simplify the presentation, we divided the $ 365$ days in 1990 into $ 73$ five-day periods. The measurement on the third day of each period is selected as an observation. Thus the data we use in our analyses is a subset consisting of every fifth day's record in 1990, and the total number of observations is $ n=73$. For simplicity, we transform the time variable $ t$ into the interval $ [0,1]$. The data is shown in the left panel of Fig. 1.1.

Figure 1.1: Left: $ 73$ observations in the year 1990. Middle: observations on the same $ 73$ days from 1871-1997. Averages are marked as the solid line. Right: $ 73$ observations in the year 1990 (points) and a periodic spline fit (line) to the average temperatures in the middle panel
\includegraphics[width=11.7cm]{text/3-1/data.eps}
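This subsampling takes only a few lines. The sketch below assumes the $ 365$ daily maxima for 1990 are already in an array; the variable name tmax_1990 and the placeholder values are hypothetical:

```python
import numpy as np

# Placeholder stand-in for the 365 daily maximum temperatures of 1990.
tmax_1990 = np.random.default_rng(1).normal(70, 10, size=365)

idx = np.arange(2, 365, 5)      # third day of each of the 73 five-day periods
y = tmax_1990[idx]              # the n = 73 observations used in the analyses
n = len(y)
t = np.arange(1, n + 1) / n     # time rescaled into (0, 1], as in model (1.2) below
```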

Our goal is to investigate how maximum temperature changes over time in a year. Consider a regression model

$\displaystyle y_i = f(t_i) + \epsilon_i, \qquad t_i=i/n,\qquad i=1,\cdots,n\,,$ (1.2)

where $ y_i$ is the observed maximum temperature at time $ t_i$ in Fahrenheit, $ f$ is the mean temperature function and $ \epsilon_i$'s are random fluctuations in the year 1990. We assume that $ \epsilon_i$'s are independent and identically distributed with mean zero and variance $ \sigma ^2$. Note that even though model (1.2) is very general, certain model assumptions (choices) have already been made implicitly. For example, the random fluctuations are assumed to be additive, independent and homogeneous. Violations of these assumptions such as independence may have severe impacts on model selection procedures (Sect. 1.6).

In the middle panel of Fig. 1.1, we plot observations on the same selected $ 73$ days from 1871 to 1997. Assuming model (1.2) is appropriate for all years, the points represent $ 127$ realizations from model (1.2). The averages reflect the true mean function $ f$ and the ranges reflect fluctuations. In the right panel, a smoothed version of the averages is shown, together with the observations in 1990. One may imagine that these observations were generated from the smoothed curve plus random errors. Our goal is to recover $ f$ from the noisy data. Before proceeding to estimation, one needs to decide on a model space for the function $ \,f$. Intuitively, a larger space provides greater potential to recover or approximate $ f$. At the same time, a larger space makes model identification and estimation more difficult ([57]). Thus the greater potential provided by the larger space is more difficult to reach. One should use as much prior knowledge as possible to narrow down the choice of model spaces. Since $ f$ represents the mean maximum temperature in a year, we will assume that $ f$ is a periodic function.

Trigonometric Regression Model. It is common practice to fit the periodic function $ f$ using a trigonometric model up to a certain frequency, say $ \lambda $, where $ 0 \le \lambda \le K$ and $ K=(n-1)/2=36$. Then the model space is

$\displaystyle M_{\lambda} = \mathrm{span} \left\{ 1, \sqrt{2} \sin 2 \pi \nu t, \sqrt{2}\cos 2 \pi \nu t, ~~\nu=1,\cdots,\lambda \right\}\,.$ (1.3)

The order $ \lambda $ is unknown in most applications. Thus one needs to select $ \lambda $ among $ \Lambda=\{ 0, 1, \cdots, K \}$ where $ M_0=\mathrm{span} \{ 1 \}$. For a fixed $ \lambda $, we write the model $ M_\lambda$ as

$\displaystyle y_i = \beta_1 + \sum_{\nu=1}^\lambda \left(\beta_{2\nu} \sqrt{2} \sin 2 \pi \nu t_i + \beta_{2\nu+1} \sqrt{2} \cos 2 \pi \nu t_i\right) + \epsilon_i, \quad i=1,\cdots,n\,,$ (1.4)

or in a matrix form

$\displaystyle \boldsymbol{y} = X_{\lambda} \boldsymbol{\beta}_{\lambda} + \boldsymbol{\epsilon},$    

where $ \mathbf{y}=(y_1,\cdots,y_n)^{\top}$,

$\displaystyle X_{\lambda} = \left( \begin{array}{cccccc} 1 & \sqrt{2} \sin 2 \pi t_1 & \sqrt{2} \cos 2 \pi t_1 & \cdots & \sqrt{2} \sin 2 \pi \lambda t_1 & \sqrt{2} \cos 2 \pi \lambda t_1 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 1 & \sqrt{2} \sin 2 \pi t_n & \sqrt{2} \cos 2 \pi t_n & \cdots & \sqrt{2} \sin 2 \pi \lambda t_n & \sqrt{2} \cos 2 \pi \lambda t_n \end{array} \right)$

is the design matrix, $ \boldsymbol{\beta}_{\lambda}=(\beta_1,\cdots,\beta_{2\lambda+1})^{\top}$ and $ \boldsymbol{\epsilon}=(\epsilon_1,\cdots,\epsilon_n)^{\top}$. The coefficients  $ \boldsymbol{\beta}_\lambda$ are estimated by minimizing the least squares (LS)

$\displaystyle \min _{\boldsymbol{\beta}_{\lambda}} \left\{ \frac{1}{n} \sum_{i=1}^n \left( y_i - \beta_1 - \sum_{\nu=1}^\lambda \left(\beta_{2\nu} \sqrt{2} \sin 2 \pi \nu t_i + \beta_{2\nu+1} \sqrt{2} \cos 2 \pi \nu t_i\right) \right)^2 \right\}\,.$ (1.5)
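A sketch of this fit in code, again under a toy-data caveat (the simulated response below merely stands in for the Charleston series):

```python
import numpy as np

def design_matrix(t, lam):
    """Columns 1, sqrt(2) sin(2 pi nu t), sqrt(2) cos(2 pi nu t) for nu = 1, ..., lam."""
    cols = [np.ones_like(t)]
    for nu in range(1, lam + 1):
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * nu * t))
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * nu * t))
    return np.column_stack(cols)

n = 73
t = np.arange(1, n + 1) / n                          # equally spaced design points t_i = i/n
rng = np.random.default_rng(2)
y = 70 + 15 * np.cos(2 * np.pi * t) + rng.normal(0, 3, size=n)   # simulated periodic response

lam = 3
X_lam = design_matrix(t, lam)
beta_hat = np.linalg.lstsq(X_lam, y, rcond=None)[0]  # minimizer of (1.5)
f_hat = X_lam @ beta_hat                             # fitted values
```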

Since the design points are equally spaced, we have the following orthogonality relations:

$\displaystyle \frac{2}{n} \sum_{i=1}^n \cos 2 \pi \nu t_i \cos 2 \pi \mu t_i$ $\displaystyle = \delta_{\nu,\mu}, \quad 1 \le \nu,\ \mu \le K,$    
$\displaystyle \frac{2}{n} \sum_{i=1}^n \sin 2 \pi \nu t_i \sin 2 \pi \mu t_i$ $\displaystyle = \delta_{\nu,\mu}, \quad 1 \le \nu,\ \mu \le K,$    
$\displaystyle \frac{2}{n} \sum_{i=1}^n \cos 2 \pi \nu t_i \sin 2 \pi \mu t_i$ $\displaystyle = 0, \qquad ~1 \le \nu,\ \mu \le K,$ (1.6)

where $ \delta_{\nu,\mu}$ is the Kronecker delta. Thus the columns of the design matrix are orthogonal. That is, $ X_{\lambda}^{\top} X_{\lambda} = n I_{2\lambda+1}$, where $ I_{2\lambda+1}$ is the identity matrix of size $ 2\lambda+1$. Let $ X_K$ be the design matrix of the largest model $ M_K$. Then $ X_K/\sqrt{n}$ is an orthonormal matrix. Define the discrete Fourier transformation $ \tilde{\boldsymbol{y}}=X_K^{\top} \boldsymbol{y}/n$. The LS estimate of $ \boldsymbol{\beta}_{\lambda}$ is $ \widehat{\boldsymbol{\beta}}_{\lambda} = \left(X_\lambda^{\top} X_\lambda\right)^{-1}X_{\lambda}^{\top} \boldsymbol{y} = X_{\lambda}^{\top} \boldsymbol{y} /n = \tilde{\boldsymbol{y}}_{\lambda}$, where $ \tilde{\boldsymbol{y}}_{\lambda}$ consists of the first $ 2\lambda+1$ elements of $ \tilde{\boldsymbol{y}}$. More explicitly,

$\displaystyle \widehat{\beta}_1$ $\displaystyle = \frac{1}{n} \sum_{i=1}^n y_i = \tilde{y}_1,$
$\displaystyle \widehat{\beta}_{2\nu}$ $\displaystyle = \frac{\sqrt{2}}{n} \sum_{i=1}^n y_i \sin 2 \pi \nu t_i = \tilde{y}_{2\nu}, \qquad 1 \le \nu \le \lambda,$    
$\displaystyle \widehat{\beta}_{2\nu+1}$ $\displaystyle = \frac{\sqrt{2}}{n} \sum_{i=1}^n y_i \cos 2 \pi \nu t_i = \tilde{y}_{2\nu+1}, \quad 1 \le \nu \le \lambda.$ (1.7)
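Continuing the code sketch above (it reuses design_matrix, t, y, lam and beta_hat), both the orthogonality relations (1.6) and the identity between the LS estimates and the discrete Fourier coefficients can be checked numerically:

```python
K = (n - 1) // 2                                        # K = 36 for n = 73
X_K = design_matrix(t, K)
print(np.allclose(X_K.T @ X_K, n * np.eye(2 * K + 1)))  # orthogonality relations (1.6)

y_tilde = X_K.T @ y / n                                 # discrete Fourier transform of y
print(np.allclose(beta_hat, y_tilde[:2 * lam + 1]))     # LS estimates = first 2*lam+1 elements, as in (1.7)
```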

Let $ \widehat{f}_{\lambda}$ be the estimate of $ f$ where the dependence on $ \lambda $ is expressed explicitly. Then the fits

$\displaystyle \widehat{\boldsymbol{f}}_{\lambda} \mathop{=}\limits^{\triangle} \left(\,\widehat{f}_{\lambda}(t_1),\cdots,\widehat{f}_{\lambda}(t_n)\right)^{\top} = X_{\lambda} \widehat{\boldsymbol{\beta}}_{\lambda} = P(\lambda) \boldsymbol{y} \,,$

where

$\displaystyle P(\lambda)=X_\lambda\left(X_\lambda^{\top}X_\lambda\right)^{-1}X_\lambda^{\top}= X_{\lambda}X_{\lambda}^{\top}/n$ (1.8)

is the projection matrix. Note that $ P(K)=I_n$. Thus model $ M_K$ interpolates the data.
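In the code sketch, the projection matrix (1.8) and the interpolation property $ P(K)=I_n$ can be verified directly (reusing the objects defined above):

```python
def projection(t, lam):
    X = design_matrix(t, lam)
    return X @ X.T / len(t)                        # (1.8), using X^T X = n I

print(np.allclose(projection(t, K), np.eye(n)))    # P(K) = I_n: the largest model interpolates
print(np.allclose(projection(t, lam) @ y, f_hat))  # same fit as X_lam @ beta_hat
```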

Fits for several $ \lambda $ (labeled as $ k$ in strips) are shown in the top two rows of Fig. 1.2. Obviously, as $ \lambda $ increases from zero to $ K$, we have a family of models ranging from a constant to interpolation. A natural question is which model ($ \lambda $) gives the ``best'' estimate of $ f$.

Figure 1.2: Circles are observations. Lines are the estimated functions. Top two rows are fits from trigonometric models with frequencies indicated in the strips (we used $ k$ instead of $ \lambda $ for distinction). Bottom two rows are fits from the periodic spline with smoothing parameters indicated in the strips
\includegraphics[width=\textwidth]{text/3-1/fits.eps}

Periodic Spline. In addition to the periodicity, it is often reasonable to assume that $ f$ is a smooth function of $ t \in [0,1]$. Specifically, we assume the following infinite dimensional space ([50,22])

$\displaystyle W_2(per) = \Big\{ \,f: f~\text{and}~f^{\prime}~\text{are absolutely continuous},~ f(0)=f(1),~f^{\prime}(0)=f^{\prime}(1),~ \int_0^1 (\,f^{\prime \prime}(t))^2 \mathrm{d}t < \infty \Big\}$ (1.9)

as the model space for $ f$. A smoothing spline estimate of $ f$ is the minimizer of the following penalized LS ([50,22])

$\displaystyle \min_{f \in W_2(per)} \left\{\frac{1}{n} \sum_{i=1}^n (y_i-f(t_i))^2 + \lambda \int_0^1 (\,f^{\prime \prime}(t))^2 \mathrm{d}t \right\}\,,$ (1.10)

where the first part (LS) measures the goodness-of-fit, the second part penalizes the roughness of the estimate, and $ \lambda $ ( $ 0 \le \lambda \le \infty$) is the so-called smoothing parameter which controls the trade-off between the goodness-of-fit and the roughness. When $ \lambda=0$, there is no penalty on the roughness of the estimate and the spline estimate interpolates the data. When $ \lambda=\infty$, the spline estimate is forced to be a constant. As $ \lambda $ varies from zero to infinity, we have a family of models ranging from interpolation to a constant parametric model. Thus $ \lambda $ can be considered as a model index in $ \Lambda=[0,\infty]$. Fits with several choices of $ \lambda $ are shown in the bottom two rows of Fig. 1.2.

The exact solution to (1.10) can be found in [50]. To simplify the argument, as in [50], we consider the following approximation of the original problem

$\displaystyle \min_{f \in M_K} \left\{\frac{1}{n} \sum_{i=1}^n (y_i-f(t_i))^2 + \lambda \int_0^1 (\,f^{\prime \prime}(t))^2 \mathrm{d}t \right\}\,,$ (1.11)

where $ W_2(per)$ in (1.10) is replaced by $ M_K$, which is defined in (1.3) with $ \lambda=K$. The following discussion holds for the exact solution in the infinite dimensional space ([50,22]). The approximation makes the following argument transparent and provides insight into smoothing.

Let

$\displaystyle \widehat{f}_\lambda(t) = \widehat{\alpha}_1 + \sum_{\nu=1}^K \left( \widehat{\alpha}_{2\nu} \sqrt{2} \sin 2 \pi \nu t + \widehat{\alpha}_{2\nu+1} \sqrt{2} \cos 2 \pi \nu t \right)$

be the solution to (1.11). Then $ \widehat{\boldsymbol{f}}_\lambda \mathop{=}\limits^{\triangle} (\,\widehat{f}_\lambda(t_1),\cdots,\widehat{f}_\lambda(t_n))^{\top}=X_K \widehat{\boldsymbol{\alpha}}$, where $ \widehat{\boldsymbol{\alpha}}=(\,\widehat{\alpha}_1,\cdots,\widehat{\alpha}_{2K+1})^{\top}$. The LS

$\displaystyle \notag \frac{1}{n} \left\vert\left\vert \boldsymbol{y}-\widehat{\boldsymbol{f}}_\lambda \right\vert\right\vert^2 = \frac{1}{n} \left\vert\left\vert \boldsymbol{y}- X_K \widehat{\boldsymbol{\alpha}} \right\vert\right\vert^2 = \left\vert\left\vert \tilde{\boldsymbol{y}}-\widehat{\boldsymbol{\alpha}} \right\vert\right\vert^2 \,.$

Thus (1.11) reduces to the following ridge regression problem

$\displaystyle \left(\,\widehat{\alpha}_1-\tilde{y}_1\right)^2 + \sum_{\nu=1}^K \left( \left(\,\widehat{\alpha}_{2\nu}-\tilde{y}_{2\nu}\right)^2 + \left(\,\widehat{\alpha}_{2\nu+1}-\tilde{y}_{2\nu+1}\right)^2 \right) + \lambda \sum_{\nu=1}^K (2 \pi \nu)^4 \left(\,\widehat{\alpha}_{2\nu}^2 + \widehat{\alpha}_{2\nu+1}^2\right)\,.$ (1.12)

The solutions to (1.12) are

$\displaystyle \widehat{\alpha}_1$ $\displaystyle = \tilde{y}_1,$    
$\displaystyle \widehat{\alpha}_{2\nu}$ $\displaystyle = \tilde{y}_{2\nu} / \left(1+\lambda (2 \pi \nu)^4\right), \qquad \nu=1,\cdots,K,$    
$\displaystyle \widehat{\alpha}_{2\nu+1}$ $\displaystyle = \tilde{y}_{2\nu+1} / \left(1+\lambda (2 \pi \nu)^4\right), \qquad \nu=1,\cdots,K\,.$ (1.13)

Thus the periodic spline with equally spaced design points is essentially a low-pass filter: components at frequency $ \nu $ are down-weighted by a factor of $ 1+\lambda (2 \pi \nu)^4$. The right panel of Fig. 1.3 shows how $ \lambda $ controls the nature of the filter: more high frequencies are filtered out as $ \lambda $ increases. It is clear from (1.7) and (1.13) that selecting an order for the trigonometric model may be viewed as hard thresholding and selecting the smoothing parameter for the periodic spline may be viewed as soft thresholding.
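Continuing the code sketch (it reuses K, X_K and y_tilde), the periodic spline fit reduces to these shrinkage factors applied to the Fourier coefficients; the value of the smoothing parameter below is an arbitrary illustration:

```python
def shrinkage_weights(lmbda, K):
    nu = np.arange(1, K + 1)
    w = 1.0 / (1.0 + lmbda * (2 * np.pi * nu) ** 4)
    return np.concatenate(([1.0], np.repeat(w, 2)))    # weight 1 for the constant, w_nu for each sin/cos pair

lmbda = 1e-6                                           # arbitrary smoothing parameter
alpha_hat = shrinkage_weights(lmbda, K) * y_tilde      # coefficients (1.13)
f_spline = X_K @ alpha_hat                             # periodic spline fit
```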

Let $ D=\mathrm{diag} \left(1,\, 1/\left(1+\lambda (2 \pi)^4\right),\, 1/\left(1+\lambda (2 \pi)^4\right),\, \cdots,\, 1/\left(1+\lambda (2 \pi K)^4\right),\, 1/\left(1+\lambda (2 \pi K)^4\right)\right)$. Then $ \widehat{\boldsymbol{\alpha}}=D \tilde{\boldsymbol{y}}$, and the fit

$\displaystyle \widehat{\boldsymbol{f}}_{\lambda} = X_K \widehat{\boldsymbol{\alpha}}= \frac{1}{n} X_K D X_K^{\top} \boldsymbol{y} =A(\lambda) \boldsymbol{y}\,,$    

where

$\displaystyle A(\lambda)= X_K D X_K^{\top}/n$ (1.14)

is the hat (smoother) matrix.
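In the code sketch, the smoother matrix (1.14) can be formed explicitly and checked against the fit obtained through (1.13):

```python
def smoother_matrix(t, lmbda):
    K = (len(t) - 1) // 2
    X_K = design_matrix(t, K)
    D = np.diag(shrinkage_weights(lmbda, K))
    return X_K @ D @ X_K.T / len(t)          # A(lambda) = X_K D X_K^T / n

A = smoother_matrix(t, lmbda)
print(np.allclose(A @ y, f_spline))          # identical to the fit from (1.13)
```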

We choose the trigonometric regression and periodic spline models for illustration because of their simple model indexing: the first has a finite set of consecutive integers $ \Lambda=\{ 0, 1, \cdots, K \}$ and the second has a continuous interval $ \Lambda=[0,\infty]$.

This chapter is organized as follows. In Sect. 1.2, we discuss the trade-off between goodness-of-fit and model complexity, and the trade-off between bias and variance. We also introduce mean squared error as a target criterion. In Sect. 1.3, we introduce some commonly used model selection methods: AIC, BIC, C$ _p$, AIC$ _c$ and a data-adaptive choice of the penalty. In Sect. 1.4, we discuss the cross-validation and the generalized cross-validation methods. In Sect. 1.5, we discuss the Bayes factor and its approximations. In Sect. 1.6, we illustrate potential effects of heteroscedasticity and correlation on model selection methods. The chapter ends with some general comments in Sect. 1.7.

