The need for model selection arises when a data-based choice among competing models has to be made. For example, for fitting parametric regression (linear, non-linear and generalized linear) models with multiple independent variables, one needs to decide which variables to include in the model (Chapts. III.7, III.8 and III.12); for fitting non-parametric regression (spline, kernel, local polynomial) models, one needs to decide the amount of smoothing (Chapts. III.5 and III.10); for unsupervised learning, one needs to decide the number of clusters (Chapts. III.13 and III.16); and for tree-based regression and classification, one needs to decide the size of a tree (Chapt. III.14).
Model choice is an integral and critical part of data analysis, an activity that has become increasingly important as ever-increasing computing power makes it possible to fit more realistic, flexible and complex models. There is a huge literature on this subject ([37,39,10,19]), and we shall restrict this chapter to basic concepts and principles. We will illustrate these basics using a climate data set, a simulation, and two regression models: parametric trigonometric regression and non-parametric periodic splines. We will discuss some commonly used model selection methods such as Akaike's AIC ([2]), Schwarz's BIC ([45]), Mallows' C_p ([38]), cross-validation (CV) ([49]), generalized cross-validation (GCV) ([13]) and Bayes factors ([31]). We do not intend to provide a comprehensive review. Readers may find additional model selection methods in the following chapters.
Let {M_λ, λ ∈ Λ} be a collection of candidate models from which one will select a model for the observed data. Here λ is the model index, belonging to a set Λ which may be finite, countable or uncountable.
Variable selection in multiple regression is perhaps the most common form of model selection in practice. Consider the following linear model

y_i = β_1 x_i1 + ⋯ + β_p x_ip + ε_i ,  i = 1, …, n ,  (1.1)

where the x_ij are independent variables, the β_j are unknown parameters and the ε_i are random errors. Selecting a subset of the p variables amounts to deciding which of the β_j are zero.
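As a concrete illustration (with simulated data, since criteria for comparing the candidate models are the subject of this chapter), each subset of predictors indexes one candidate model. The sketch below enumerates all 2³ subsets of three hypothetical candidate variables and computes each model's residual sum of squares:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n observations, 3 candidate predictors, only x1 and x3 active.
n = 100
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

def rss(cols):
    """Residual sum of squares of the least squares fit using the given columns."""
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ beta
    return r @ r

# Each subset of {0, 1, 2} indexes one candidate model.
for cols in itertools.chain.from_iterable(
        itertools.combinations(range(3), k) for k in range(4)):
    print(cols, round(rss(cols), 1))
```

Note that the residual sum of squares alone always favors the largest model; the selection criteria discussed later trade goodness-of-fit against model complexity.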
For illustration, we will use part of a climate data set downloaded from the Carbon Dioxide Information Analysis Center at http://cdiac.ornl.gov/ftp/ndp070. The data consist of daily maximum temperatures and other climatological variables from stations across the contiguous United States. We choose daily maximum temperatures from the station in Charleston, South Carolina, which has the longest record, from 1871 to 1997. We use the records in the year 1990 as observations; records from the other years provide information about the population. To avoid correlation (see Sect. 1.6) and to simplify the presentation, we divided the days in 1990 into five-day periods and selected the measurement on the third day of each period as an observation. Thus the data used in our analyses is a subset consisting of every fifth day's record in 1990, and the total number of observations is n = 73. For simplicity, we transform the time variable onto the interval [0, 1]. The data are shown in the left panel of Fig. 1.1.
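The subsetting step can be sketched as follows; the daily series here is a synthetic stand-in, since the real records come from the CDIAC archive described above:

```python
import numpy as np

# Hypothetical stand-in for the 365 daily maximum temperatures of 1990
# (the real records come from the CDIAC station data described above).
daily_tmax = 20 + 15 * np.sin(2 * np.pi * np.arange(365) / 365)

# Split 1990 into five-day periods and keep the third day of each period:
# days 3, 8, 13, ..., 363 (1-based), i.e. indices 2, 7, 12, ... (0-based).
idx = np.arange(2, 365, 5)
y = daily_tmax[idx]
t = (idx + 1) / 365          # transform day-of-year onto the interval [0, 1]

print(len(y))  # 73 observations
```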
Our goal is to investigate how maximum temperature changes over time in a year. Consider the regression model

y_i = f(t_i) + ε_i ,  i = 1, …, n ,  (1.2)

where y_i is the observed maximum temperature at time t_i and the ε_i are independent random errors with zero mean and common variance.
In the middle panel of Fig. 1.1, we plot the observations on the same selected days from 1871 to 1997. Assuming model (1.2) is appropriate for all years, the points represent realizations from model (1.2): the averages reflect the true mean function f, and the ranges reflect random fluctuations. In the right panel, a smoothed version of the averages is shown together with the observations in 1990. One may imagine that these observations were generated from the smoothed curve plus random errors. Our goal is to recover f from the noisy data. Before proceeding to estimation, one needs to decide on a model space for the function f. Intuitively, a larger space provides greater potential to recover or approximate f. At the same time, a larger space makes model identification and estimation more difficult ([57]), so the greater potential provided by the larger space is also more difficult to realize. One should use as much prior knowledge as possible to narrow down the choice of model spaces. Since f represents the mean maximum temperature over a year, we will assume that f is a periodic function.
Trigonometric Regression Model. It is common practice to fit the periodic function f using a trigonometric model up to a certain frequency, say λ, where 0 ≤ λ ≤ K and K = (n − 1)/2 = 36. Then the model space is

M_λ = span{ 1, √2 cos 2πνt, √2 sin 2πνt, ν = 1, …, λ } ,  (1.4)

so M_λ contains 2λ + 1 basis functions, and a function f ∈ M_λ can be written as

f(t) = a_0 + Σ_{ν=1}^{λ} ( a_ν √2 cos 2πνt + b_ν √2 sin 2πνt ) .  (1.5)
Since the design points are equally spaced with spacing 1/n, we have the following orthogonality relations: for 1 ≤ ν, μ ≤ K,

(1/n) Σ_{i=1}^{n} (√2 cos 2πνt_i)(√2 cos 2πμt_i) = δ_νμ ,  (1/n) Σ_{i=1}^{n} (√2 sin 2πνt_i)(√2 sin 2πμt_i) = δ_νμ ,  (1.6)

(1/n) Σ_{i=1}^{n} (√2 cos 2πνt_i)(√2 sin 2πμt_i) = 0 ,  (1.7)

where δ_νμ is the Kronecker delta. Consequently the least squares estimates of the coefficients are simply â_0 = (1/n) Σ y_i, â_ν = (√2/n) Σ y_i cos 2πνt_i and b̂_ν = (√2/n) Σ y_i sin 2πνt_i.
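The orthogonality of the trigonometric basis at equally spaced design points can be verified numerically; the sketch below takes t_i = i/n for illustration, with n = 73 and K = 36 as in the climate data:

```python
import numpy as np

n, K = 73, 36
t = np.arange(1, n + 1) / n                      # equally spaced design points

# Design matrix of the full trigonometric basis: 1, sqrt(2)cos, sqrt(2)sin.
cols = [np.ones(n)]
for nu in range(1, K + 1):
    cols.append(np.sqrt(2) * np.cos(2 * np.pi * nu * t))
    cols.append(np.sqrt(2) * np.sin(2 * np.pi * nu * t))
X = np.column_stack(cols)                        # n x n, since 2K + 1 = 73 = n

# Orthogonality: the empirical Gram matrix (1/n) X'X is the identity,
# so least squares coefficients reduce to inner products (1/n) X'y.
gram = X.T @ X / n
print(np.allclose(gram, np.eye(n)))              # True
```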
Let f̂_λ be the estimate of f under model M_λ, where the dependence on λ is expressed explicitly. Then the fitted values f̂_λ = (f̂_λ(t_1), …, f̂_λ(t_n))′ satisfy

f̂_λ = H(λ) y ,  (1.8)

where y = (y_1, …, y_n)′ and H(λ) is the hat (projection) matrix of model M_λ.
Fits for several choices of λ (labeled as λ in the strips) are shown in the top two rows of Fig. 1.2. Obviously, as λ increases from zero to K, we have a family of models ranging from a constant to an interpolation of the data. A natural question is which model (i.e. which λ) gives the ''best'' estimate of f.
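This family of fits can be sketched with simulated data (the mean function and noise level below are hypothetical): λ = 0 reproduces the sample mean, while λ = K interpolates the observations because 2K + 1 = n.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 73, 36
t = np.arange(1, n + 1) / n
y = 20 + 15 * np.sin(2 * np.pi * t) + rng.normal(scale=3, size=n)  # toy data

def trig_design(lam):
    """Design matrix of the model M_lam (frequencies 0, 1, ..., lam)."""
    cols = [np.ones(n)]
    for nu in range(1, lam + 1):
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * nu * t))
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * nu * t))
    return np.column_stack(cols)

def trig_fit(lam):
    """Least squares fit under model M_lam: the projection H(lam) y."""
    X = trig_design(lam)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# lam = 0 reproduces the sample mean; lam = K (2K + 1 = n) interpolates the data.
print(np.allclose(trig_fit(0), y.mean()))   # True
print(np.allclose(trig_fit(K), y))          # True
```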
Periodic Spline. In addition to periodicity, it is often reasonable to assume that f is a smooth function of t. Specifically, we assume that f belongs to the following infinite-dimensional space ([50,22])

W_2(per) = { f : f and f′ are absolutely continuous, f(0) = f(1), f′(0) = f′(1), ∫_0^1 (f″(t))² dt < ∞ } .  (1.9)

A periodic spline estimate of f is the minimizer of the penalized least squares criterion

min_{f ∈ W_2(per)} { (1/n) Σ_{i=1}^{n} (y_i − f(t_i))² + λ ∫_0^1 (f″(t))² dt } ,  (1.10)

where the smoothing parameter λ ≥ 0 controls the trade-off between goodness-of-fit and smoothness of the estimate.
The exact solution to (1.10) can be found in [50]. To simplify the argument, as in [50], we consider the following approximation of the original problem, in which f is restricted to the trigonometric model space M_K:

min_{f ∈ M_K} { (1/n) Σ_{i=1}^{n} (y_i − f(t_i))² + λ Σ_{ν=1}^{K} (2πν)⁴ (a_ν² + b_ν²) } ,

where a_ν and b_ν are the coefficients of f in the trigonometric basis, so that the penalty equals λ ∫_0^1 (f″)² dt for f ∈ M_K. By the orthogonality of the basis at the design points, the solution shrinks the least squares coefficients:

ã_0 = â_0 ,  ã_ν = â_ν / (1 + λ(2πν)⁴) ,  b̃_ν = b̂_ν / (1 + λ(2πν)⁴) ,  ν = 1, …, K .

Let f̂_λ denote the resulting estimate. Then f̂_λ = (f̂_λ(t_1), …, f̂_λ(t_n))′ is linear in y, and the fit

f̂_λ = A(λ) y ,  (1.14)

where A(λ) is the smoothing matrix.
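The approximate periodic spline fit can be sketched numerically; the data below are simulated, and the computation applies the frequency-wise shrinkage factor 1/(1 + λ(2πν)⁴) implied by the penalty ∫(f″)² in the trigonometric basis:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 73, 36
t = np.arange(1, n + 1) / n
y = 20 + 15 * np.sin(2 * np.pi * t) + rng.normal(scale=3, size=n)  # toy data

def spline_fit(lam):
    """Approximate periodic spline fit: shrink each frequency's least squares
    coefficients by 1 / (1 + lam * (2*pi*nu)**4)."""
    fit = np.full(n, y.mean())                        # a0 is not shrunk
    for nu in range(1, K + 1):
        c = np.sqrt(2) * np.cos(2 * np.pi * nu * t)
        s = np.sqrt(2) * np.sin(2 * np.pi * nu * t)
        a_hat = c @ y / n                             # LS coefficients, using
        b_hat = s @ y / n                             # orthogonality of the basis
        shrink = 1.0 / (1.0 + lam * (2 * np.pi * nu) ** 4)
        fit += shrink * (a_hat * c + b_hat * s)
    return fit

# lam -> 0 recovers the interpolant; large lam shrinks toward the constant fit.
print(np.allclose(spline_fit(0.0), y))                     # True (interpolation)
print(np.allclose(spline_fit(1e12), y.mean(), atol=1e-3))  # True (near-constant)
```

Here the smoothing matrix A(λ) is never formed explicitly; the fit is computed coefficient by coefficient, which is equivalent because the basis is orthonormal at the design points.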
We choose the trigonometric regression and periodic spline models for illustration because of their simple model indexing: the first is indexed by a finite set of consecutive integers, λ ∈ {0, 1, …, K}, and the second by a continuous interval, λ ∈ [0, ∞).
This chapter is organized as follows. In Sect. 1.2, we discuss the trade-off between goodness-of-fit and model complexity, and the trade-off between bias and variance. We also introduce the mean squared error as a target criterion. In Sect. 1.3, we introduce some commonly used model selection methods: AIC, BIC, C_p and a data-adaptive choice of the penalty. In Sect. 1.4, we discuss the cross-validation and generalized cross-validation methods. In Sect. 1.5, we discuss the Bayes factor and its approximations. In Sect. 1.6, we illustrate potential effects of heteroscedasticity and correlation on model selection methods. The chapter ends with some general comments in Sect. 1.7.