We illustrate in this section that model selection boils down to a compromise between different aspects of a model. Occam's razor has been the guiding principle for this compromise: the model that fits the observations sufficiently well in the least complex way should be preferred. Formalizing this principle is, however, nontrivial.
To be precise about ''fits the observations sufficiently well,'' one needs a quantity that measures how well a model fits the data. This quantity is often called the goodness-of-fit (GOF). It is usually the criterion used for estimation once a model has been chosen. For example, we used the LS as a measure of the GOF for regression models in Sect. 1.1. Other GOF measures include the likelihood for density estimation problems and the classification error for pattern recognition problems.
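As a concrete illustration, the sketch below computes two such GOF measures on toy data: the residual sum of squares for a least-squares regression fit and the misclassification rate for a classifier. All data and model choices here are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Two common GOF measures on toy data; all data and models here are
# illustrative assumptions, not taken from the text.

# 1. Least-squares GOF for a regression fit: the residual sum of squares.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 0.1 * np.sin(40.0 * x)          # toy responses
X = np.column_stack([np.ones_like(x), x])     # intercept + slope design
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = float(np.sum((y - X @ beta) ** 2))      # smaller = better fit

# 2. Classification error as a GOF measure for pattern recognition.
labels = np.array([0, 0, 1, 1, 1])
predicted = np.array([0, 1, 1, 1, 0])
misclass_rate = float(np.mean(labels != predicted))  # fraction misclassified
```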
To be precise about ''the least complex way,'' one needs a quantity that measures the complexity of a model. For a parametric model, a common measure of model complexity is the number of parameters in the model, often called the degrees of freedom (df). For a non-parametric regression model such as the periodic spline, the trace of the hat matrix $H(\lambda)$, $\operatorname{tr}(H(\lambda))$, a direct extension from its parametric version, is often used as a measure of model complexity ([24]). The quantity $\operatorname{tr}(H(\lambda))$ will also be referred to as the degrees of freedom. The middle panel of Fig. 1.3 depicts how $\operatorname{tr}(H(\lambda))$ for the periodic spline depends on the smoothing parameter $\lambda$.
Since there is a one-to-one correspondence between $\lambda$ and $\operatorname{tr}(H(\lambda))$, both are used as model indices ([24]). See [21] for discussions of some subtle issues concerning model indices for smoothing spline models. For some complicated models such as tree-based regression, there may not be an obvious measure of model complexity ([58]). In these cases the generalized degrees of freedom defined in [58] may be used. Section 1.3 contains more details on the generalized degrees of freedom.
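For a linear smoother $\hat{\boldsymbol{f}}=H(\lambda)\boldsymbol{y}$, the df can be computed directly as the trace of $H(\lambda)$. A minimal sketch, assuming a shrinkage smoother on an orthonormal Fourier basis; the shrinkage factors $1/(1+\lambda(2\pi\nu)^4)$ and the sample size are illustrative choices, not values from the text.

```python
import numpy as np

def smoother_df(lam, n):
    """tr(H(lam)) for a smoother that keeps the constant term and shrinks the
    sine/cosine pair at frequency nu by 1/(1 + lam*(2*pi*nu)^4) (assumed form).
    In an orthonormal basis H = X diag(s) X^T, so tr(H) is the sum of the
    shrinkage factors."""
    nus = np.arange(1, (n - 1) // 2 + 1)
    shrink = 1.0 / (1.0 + lam * (2.0 * np.pi * nus) ** 4)
    return 1.0 + 2.0 * float(np.sum(shrink))

n = 73  # odd sample size (assumed)
df_interp = smoother_df(0.0, n)    # no penalty: df = n (interpolation)
df_heavy = smoother_df(1e6, n)     # heavy penalty: df close to 1 (constant fit)
```

As the penalty grows, the df moves continuously from $n$ down toward 1, which is what makes it usable as a model index.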
To illustrate the interplay between the GOF and model complexity, we fit trigonometric regression models from the smallest model, the constant model with $K=0$, to the largest model, which interpolates the data. The square roots of the residual sums of squares (RSS) are plotted against the degrees of freedom ($2K+1$) as circles in the left panel of Fig. 1.3. Similarly, we fit the periodic spline with a wide range of values for the smoothing parameter $\lambda$. Again, we plot the square root of the RSS against the degrees of freedom ($\operatorname{tr}(H(\lambda))$) as the solid line in the left panel of Fig. 1.3. Obviously, the RSS decreases to zero (interpolation) as the degrees of freedom increase to the sample size $n$. The RSS keeps declining almost linearly after the initial big drop. It is quite clear that the constant model does not fit the data well, but it is unclear which model fits the observations sufficiently well.
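The RSS-versus-df pattern is easy to reproduce on synthetic data. A small sketch, assuming an equally spaced design, a toy true function, and an odd sample size (none of these choices come from the text):

```python
import numpy as np

# Reproducing the RSS-versus-df pattern on synthetic data. The sample size,
# design, true function, and noise level are illustrative assumptions.
rng = np.random.default_rng(0)
n = 73
t = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t) + rng.normal(0, 0.3, n)

def trig_design(t, K):
    """Design matrix of the trigonometric regression model of order K."""
    cols = [np.ones_like(t)]
    for nu in range(1, K + 1):
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * nu * t))
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * nu * t))
    return np.column_stack(cols)

# df = 2K + 1 runs from 1 (constant model) to n (interpolation, n odd)
rss = []
for K in range((n - 1) // 2 + 1):
    X = trig_design(t, K)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss.append(float(np.sum((y - X @ beta) ** 2)))
```

Because the model spaces are nested, the RSS is nonincreasing in the df and reaches zero at the interpolating fit, so the GOF alone cannot select a model.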
The previous example shows that the GOF and complexity are two opposing aspects of a model: the approximation error decreases as the model complexity increases. On the other hand, Occam's razor suggests that simple models should be preferred to more complicated ones, other things being equal. Our goal is to find the ''best'' model that strikes a balance between these two conflicting aspects. To make the word ''best'' meaningful, one needs a target criterion that quantifies a model's performance. It is clear that the GOF cannot be used as the target because it would lead to the most complex model. Even though there is no universally accepted measure, some criteria are widely accepted and used in practice. We now discuss one of them that is commonly used for regression models.
Let $\hat f$ be an estimate of the function $f$ in model (1.2) based on the chosen model space. Let $\boldsymbol{f}=(f(t_1),\ldots,f(t_n))^T$ and $\hat{\boldsymbol{f}}=(\hat f(t_1),\ldots,\hat f(t_n))^T$. Define the mean squared error (MSE) by
$$
\mathrm{MSE}=\frac{1}{n}\mathrm{E}\|\hat{\boldsymbol{f}}-\boldsymbol{f}\|^2=\frac{1}{n}\sum_{i=1}^{n}\mathrm{E}\{\hat f(t_i)-f(t_i)\}^2.
$$
Another closely related target criterion is the average predictive squared error (PSE)
$$
\mathrm{PSE}=\frac{1}{n}\mathrm{E}\|\boldsymbol{y}^{+}-\hat{\boldsymbol{f}}\|^2, \qquad (1.16)
$$
where $\boldsymbol{y}^{+}=(y_1^{+},\ldots,y_n^{+})^T$ denotes a new set of observations at the same design points, independent of the observations used to compute $\hat{\boldsymbol{f}}$. Since the new errors are independent of $\hat{\boldsymbol{f}}$, we have $\mathrm{PSE}=\sigma^2+\mathrm{MSE}$; that is, the PSE differs from the MSE only by the constant $\sigma^2$.
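The relation $\mathrm{PSE}=\sigma^2+\mathrm{MSE}$ can be checked by simulation. A small sketch, assuming a simple projection smoother, a toy true function, and an arbitrary noise level (none taken from the text):

```python
import numpy as np

# Monte Carlo check of PSE = sigma^2 + MSE for a simple linear smoother.
# The smoother, design, and true function are illustrative assumptions.
rng = np.random.default_rng(0)
n, sigma, M = 50, 0.3, 4000
t = np.arange(1, n + 1) / n
f = np.sin(2 * np.pi * t)

# projection onto span{1, cos(2*pi*t), sin(2*pi*t)} as the estimator
X = np.column_stack([np.ones(n), np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)])
Q, _ = np.linalg.qr(X)
H = Q @ Q.T                                  # hat matrix of the projection

mse_sum, pse_sum = 0.0, 0.0
for _ in range(M):
    y = f + rng.normal(0.0, sigma, n)        # training responses
    y_new = f + rng.normal(0.0, sigma, n)    # independent replicate, same design
    fhat = H @ y
    mse_sum += np.mean((fhat - f) ** 2)
    pse_sum += np.mean((y_new - fhat) ** 2)
mse, pse = mse_sum / M, pse_sum / M          # pse should be near sigma^2 + mse
```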
To illustrate the bias-variance trade-off, we now calculate the MSE for the trigonometric regression and periodic spline models. For notational simplicity, we assume an equally spaced design. The MSE decomposes into a squared bias term and a variance term:
$$
\mathrm{MSE}=\frac{1}{n}\|\mathrm{E}\hat{\boldsymbol{f}}-\boldsymbol{f}\|^2+\frac{1}{n}\mathrm{E}\|\hat{\boldsymbol{f}}-\mathrm{E}\hat{\boldsymbol{f}}\|^2 \triangleq \mathrm{bias}^2+\mathrm{variance}. \qquad (1.17)
$$
Bias-Variance Trade-Off for the Trigonometric Regression. Write the true function as $\boldsymbol{f}=X\boldsymbol{\beta}$, where $X$ is the $n\times n$ orthogonal matrix of the Fourier basis evaluated at the design points and $\boldsymbol{\beta}=X^T\boldsymbol{f}$. The design matrix $X_K$ of the trigonometric regression model of order $K$ consists of the first $2K+1$ columns of the orthogonal matrix $X$. Thus $\hat{\boldsymbol{f}}=X_K\hat{\boldsymbol{\beta}}$ with $\hat{\boldsymbol{\beta}}=X_K^T\boldsymbol{y}$. Then $\mathrm{E}\hat{\boldsymbol{\beta}}=\boldsymbol{\beta}_K$, where $\boldsymbol{\beta}_K$ consists of the first $2K+1$ elements in $\boldsymbol{\beta}$. Thus $\hat{\boldsymbol{\beta}}$ is unbiased. Furthermore,
$$
\mathrm{bias}^2=\frac{1}{n}\|X_K\boldsymbol{\beta}_K-X\boldsymbol{\beta}\|^2=\frac{1}{n}\sum_{\nu=2K+2}^{n}\beta_\nu^2,
$$
$$
\mathrm{variance}=\frac{1}{n}\mathrm{E}\|X_K X_K^T\boldsymbol{\epsilon}\|^2=\frac{(2K+1)\sigma^2}{n}.
$$
Therefore
$$
\mathrm{MSE}=\frac{1}{n}\sum_{\nu=2K+2}^{n}\beta_\nu^2+\frac{(2K+1)\sigma^2}{n}.
$$
The squared bias decreases and the variance increases as the order $K$ increases.
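For a projection estimator, the exact squared bias plus variance should match a Monte Carlo estimate of the MSE. The sketch below checks this under assumed choices of true function, $\sigma$, sample size, and order $K$ (all illustrative, not from the text).

```python
import numpy as np

# Checking the exact bias-variance decomposition of the trigonometric
# projection estimator against a Monte Carlo estimate of the MSE.
# The true function, sigma, and sample size are illustrative assumptions.
rng = np.random.default_rng(1)
n, sigma, K = 73, 0.5, 2
t = np.arange(1, n + 1) / n
f = np.sin(2 * np.pi * t) + 0.3 * np.sin(6 * np.pi * t)  # energy at freqs 1, 3

def trig_design(t, K):
    cols = [np.ones_like(t)]
    for nu in range(1, K + 1):
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * nu * t))
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * nu * t))
    return np.column_stack(cols)

X = trig_design(t, K) / np.sqrt(n)   # orthonormal columns: X^T X = I
P = X @ X.T                          # hat matrix; trace(P) = 2K + 1

# exact squared bias and variance of f_hat = P y
bias2 = float(np.mean((P @ f - f) ** 2))        # energy at excluded frequency 3
var = sigma ** 2 * float(np.trace(P)) / n       # (2K+1) * sigma^2 / n
mse_exact = bias2 + var

# Monte Carlo estimate of the MSE
M = 2000
err = np.empty(M)
for m in range(M):
    y = f + rng.normal(0.0, sigma, n)
    err[m] = np.mean((P @ y - f) ** 2)
mse_mc = float(err.mean())
```

With $K=2$ the fit misses the frequency-3 component, so the bias term is exactly the average energy of that component, while the variance term depends only on $2K+1$.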
Bias-Variance Trade-Off for Periodic Spline. For the approximate periodic spline estimator, it is easy to check that the coefficient estimates are shrunken versions of the LS estimates: the constant term is left unshrunken, while the pair of coefficients at frequency $\nu$ is multiplied by the shrinkage factor
$$
s_\nu=\frac{1}{1+\lambda(2\pi\nu)^4},\qquad \nu=1,\ldots,(n-1)/2,
$$
assuming $n$ is odd. Thus all coefficients are shrunk toward zero except the constant term, which is unbiased. The amount of shrinkage is controlled by the smoothing parameter $\lambda$. It is straightforward to calculate the squared bias and variance in the decomposition (1.17):
$$
\mathrm{bias}^2=\frac{1}{n}\sum_{\nu=1}^{(n-1)/2}(1-s_\nu)^2\big(\beta_{2\nu}^2+\beta_{2\nu+1}^2\big),
$$
$$
\mathrm{variance}=\frac{\sigma^2}{n}\Big(1+2\sum_{\nu=1}^{(n-1)/2}s_\nu^2\Big).
$$
It is easy to see that as $\lambda$ increases from zero to infinity, the squared bias increases from zero to $\frac{1}{n}\sum_{\nu=2}^{n}\beta_\nu^2$ and the variance decreases from $\sigma^2$ to $\sigma^2/n$.
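These limits can be checked numerically. A minimal sketch, assuming the shrinkage factors $1/(1+\lambda(2\pi\nu)^4)$, a toy coefficient sequence, and an arbitrary $\sigma^2$ (all illustrative assumptions):

```python
import numpy as np

# Squared bias and variance of the shrinkage form of the (approximate)
# periodic spline estimator as functions of lambda. The shrinkage factors
# 1/(1 + lambda*(2*pi*nu)^4), the coefficient sequence beta, and sigma^2
# are illustrative assumptions.
n, sigma2 = 73, 0.25
nus = np.arange(1, (n - 1) // 2 + 1)
beta = 1.0 / nus ** 2        # assumed coefficient at each sine and cosine pair

def bias2_var(lam):
    s = 1.0 / (1.0 + lam * (2.0 * np.pi * nus) ** 4)          # shrinkage factors
    bias2 = float(np.sum(2.0 * ((1.0 - s) * beta) ** 2) / n)  # constant unbiased
    var = float(sigma2 * (1.0 + 2.0 * np.sum(s ** 2)) / n)
    return bias2, var

b0, v0 = bias2_var(0.0)       # lambda = 0: bias2 = 0, variance = sigma^2
binf, vinf = bias2_var(1e12)  # lambda large: variance near sigma^2 / n
```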
To calculate the MSE, one needs to know the true function. We use the following simulation for illustration. We generate responses from model (1.2) with a fixed true function $f$ and error standard deviation $\sigma$. The same design points as in the climate data are used. The true function and the simulated responses are shown in the left panel of Fig. 1.4. We compute the signal energy at each frequency, $\xi_\nu=\beta_{2\nu}^2+\beta_{2\nu+1}^2$ for $\nu=1,\ldots,(n-1)/2$, which represents the contribution from frequency $\nu$. From the MSE formula for the trigonometric regression, including frequency $\nu$ reduces the MSE exactly when $\xi_\nu$ exceeds the threshold $2\sigma^2$. In the right panel of Fig. 1.4, we plot $\xi_\nu$ against the frequency $\nu$ with the threshold marked as the dashed line. Overall, $\xi_\nu$ decreases as $\nu$ increases. Values of $\xi_\nu$ are above the threshold for the first four frequencies. Thus the optimal choice is $K=4$.
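The thresholding idea can be sketched directly: compute the energy $\xi_\nu$ of a known function in an orthonormal Fourier basis and keep the frequencies whose energy exceeds $2\sigma^2$. The true function and $\sigma$ below are illustrative assumptions, so the selected set differs from the one in Fig. 1.4.

```python
import numpy as np

# Frequency selection by comparing signal energy with the threshold 2*sigma^2:
# including frequency nu in the trigonometric regression lowers the MSE
# exactly when xi_nu = beta_{2nu}^2 + beta_{2nu+1}^2 > 2*sigma^2.
# The true function and sigma are illustrative assumptions.
n, sigma = 73, 0.2
t = np.arange(1, n + 1) / n
f = (np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t)
     + 0.3 * np.sin(6 * np.pi * t))

K_max = (n - 1) // 2
xi = np.empty(K_max)
for nu in range(1, K_max + 1):
    c = np.sqrt(2) * np.cos(2 * np.pi * nu * t) / np.sqrt(n)  # orthonormal basis
    s = np.sqrt(2) * np.sin(2 * np.pi * nu * t) / np.sqrt(n)
    xi[nu - 1] = (c @ f) ** 2 + (s @ f) ** 2   # signal energy at frequency nu

selected = np.where(xi > 2 * sigma ** 2)[0] + 1  # frequencies worth keeping
```

Here the toy function has energy only at frequencies 1, 2, and 3, all above the threshold, so exactly those frequencies are selected.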
The squared bias, variance, and MSE are plotted against the frequency $K$ (smoothing parameter $\lambda$) for the trigonometric regression (periodic spline) in the left (right) panel of Fig. 1.5. Obviously, as $K$ ($\lambda$) increases (decreases), the squared bias decreases and the variance increases. The MSE represents a balance between the squared bias and the variance. For the trigonometric regression, the minimum of the MSE is reached at $K=4$, as expected.
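The MSE curve for the trigonometric regression can be computed exactly from the decomposition and then minimized over the order. A sketch with an assumed toy spectrum (four active frequencies, each with energy above $2\sigma^2$; none of these values come from the text):

```python
import numpy as np

# Exact MSE(K) curve for the trigonometric projection estimator,
# MSE(K) = (energy excluded from the model)/n + (2K+1)*sigma^2/n,
# with an assumed toy spectrum of four active frequencies.
n, sigma = 73, 0.2
xi = np.zeros((n - 1) // 2)
xi[:4] = [3.0, 1.0, 0.5, 0.2]   # assumed signal energies per frequency

def mse(K):
    bias2 = np.sum(xi[K:]) / n           # energy at frequencies above K
    var = (2 * K + 1) * sigma ** 2 / n
    return bias2 + var

best_K = min(range(len(xi) + 1), key=mse)   # minimized at K = 4 here
```

Since every active frequency carries more energy than the $2\sigma^2$ price of two extra parameters, the minimizer includes exactly the four active frequencies.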
After deciding on a target criterion such as the MSE, ideally one would select the model to minimize this criterion. This is, however, not practical because the criterion usually involves some unknown quantities. For example, the MSE depends on the unknown true function $f$, which one wants to estimate in the first place. Thus one needs to estimate the criterion from the data and then minimize the estimated criterion. We discuss unbiased and cross-validation estimates of such criteria in Sects. 1.3 and 1.4, respectively.