Much empirical research is concerned with estimating conditional mean, median, or hazard functions. For example, labor economists are interested in estimating the mean wages of employed individuals conditional on characteristics such as years of work experience and education. The most frequently used estimation methods assume that the function of interest is known up to a set of constant parameters that can be estimated from data. Models in which the only unknown quantities are a finite set of constant parameters are called parametric. The use of a parametric model greatly simplifies estimation, statistical inference, and interpretation of the estimation results but is rarely justified by theoretical or other a priori considerations. Estimation and inference based on convenient but incorrect assumptions about the form of the conditional mean function can be highly misleading.
As an illustration, the solid line in Fig 10.1 shows an
estimate of the mean of the logarithm of weekly wages, $\log W$,
conditional on years of work experience, EXP, for white males
with
years of education who work full time and live in urban
areas of the North Central U.S. The estimate was obtained by applying
kernel nonparametric regression (see, e.g., Härdle 1990, Fan and
Gijbels 1996) to data from the 1993 Current Population Survey
(CPS). The estimated conditional mean of $\log W$ increases steadily
up to approximately
years of experience and is flat
thereafter. The dashed and dotted lines in Fig 10.1 show
two parametric estimates of the mean of the logarithm of weekly wages
conditional on years of work experience. The dashed line is the
ordinary least squares (OLS) estimate that is obtained by assuming
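that the mean of $\log W$ conditional on EXP is the linear function
$E(\log W \mid \mathit{EXP}) = \beta_0 + \beta_1 \mathit{EXP}$.
The dotted line is the OLS estimate that is obtained by assuming that
$E(\log W \mid \mathit{EXP})$ is the quadratic function
$E(\log W \mid \mathit{EXP}) = \beta_0 + \beta_1 \mathit{EXP} + \beta_2 \mathit{EXP}^2$. The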
nonparametric estimate (solid line) places no restrictions on the
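shape of $E(\log W \mid \mathit{EXP})$. The linear and
quadratic models give misleading estimates of
$E(\log W \mid \mathit{EXP})$. The linear model indicates that
$E(\log W \mid \mathit{EXP})$ increases steadily as experience increases. The
quadratic model indicates that $E(\log W \mid \mathit{EXP})$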
decreases after
years of experience. In contrast, the
nonparametric estimate of $E(\log W \mid \mathit{EXP})$ becomes
nearly flat at approximately
years of experience. Because the
nonparametric estimate does not restrict the conditional mean function
to be linear or quadratic, it is more likely to represent the true
conditional mean function.
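The comparison can be reproduced in miniature. The following Python
sketch is illustrative only and rests on several assumptions that are
not part of the text above: it uses simulated data rather than the CPS
sample, a hypothetical true mean function that rises and then levels
off, a Gaussian-kernel Nadaraya-Watson estimator as one possible form
of kernel regression, a hand-picked bandwidth, and helper names
(`nadaraya_watson`, `ols_polynomial`) invented for the example.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_eval, bandwidth):
    """Kernel (Nadaraya-Watson) estimate of E[y | x] with a Gaussian kernel."""
    # Scaled distances between every evaluation point and every data point.
    u = (x_eval[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * u**2)
    # Locally weighted average of y at each evaluation point.
    return (weights @ y_train) / weights.sum(axis=1)

def ols_polynomial(x_train, y_train, x_eval, degree):
    """OLS fit of a polynomial of the given degree, evaluated at x_eval."""
    design = np.vander(x_train, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return np.vander(x_eval, degree + 1, increasing=True) @ coef

def true_mean(e):
    """Hypothetical conditional mean: rises, then is nearly flat."""
    return 5.5 + 1.0 * (1.0 - np.exp(-e / 8.0))

# Simulated data (an assumption, not the CPS sample).
rng = np.random.default_rng(0)
n = 4000
exp_years = rng.uniform(0.0, 45.0, size=n)
log_wage = true_mean(exp_years) + rng.normal(0.0, 0.3, size=n)

grid = np.linspace(2.0, 43.0, 50)
fits = {
    "kernel": nadaraya_watson(exp_years, log_wage, grid, bandwidth=2.0),
    "linear": ols_polynomial(exp_years, log_wage, grid, degree=1),
    "quadratic": ols_polynomial(exp_years, log_wage, grid, degree=2),
}
# Root mean squared distance between each fit and the true mean on the grid.
rmse = {name: float(np.sqrt(np.mean((fit - true_mean(grid)) ** 2)))
        for name, fit in fits.items()}
for name, value in rmse.items():
    print(f"{name:>9s}: RMSE against the true mean = {value:.3f}")
```

In this simulated setting the linear and quadratic fits are biased
because neither functional form can rise and then stay flat, whereas
the kernel estimate can follow that shape. The bandwidth here is
chosen by hand; in practice it would be selected by a data-driven
method.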
The opportunities for specification error increase if the dependent
variable $Y$ is binary. For example, consider a model of the choice of
travel mode for the trip to work. Suppose that the available modes are
automobile and transit. Let $Y = 1$ if an individual chooses automobile
and $Y = 0$ if the individual chooses transit. Let $X$ be a vector of
explanatory variables such as the travel times and costs by automobile
and transit. Then $E(Y \mid x)$ is the probability that $Y = 1$ (the
probability that the individual chooses automobile) conditional on
$X = x$. This probability will be denoted $P(Y = 1 \mid x)$. In
applications of binary response models, it is often assumed that
$P(Y = 1 \mid x) = G(\beta' x)$, where $\beta$ is a vector of constant
coefficients and $G$ is a known probability distribution
function. Often, $G$ is assumed to be the cumulative standard normal
distribution function, which yields a binary probit model, or
the cumulative logistic distribution function, which yields
a binary logit model. The coefficients $\beta$ can then be
estimated by the method of maximum likelihood (Amemiya 1985). However,
there are now two potential sources of specification error. First, the
dependence of $Y$ on $x$ may not be through the linear index
$\beta' x$. Second, even if the index $\beta' x$ is correct, the
response function $G$ may not be the normal or logistic
distribution function. See Horowitz (1993a, 1998) for examples of
specification errors in binary response models and their consequences.
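The first source of error, a misspecified index, can be illustrated
with a small simulation. The sketch below is a hedged example rather
than a method from the cited references: the data are simulated, the
true index is assumed to be quadratic in a single explanatory
variable, and the function names (`fit_logit`, `design_matrix`,
`true_prob`) are hypothetical. It fits a binary logit model by maximum
likelihood, once with a linear index and once with a quadratic index,
and compares the fitted probabilities with the true ones.

```python
import numpy as np

def logistic_cdf(v):
    """Cumulative logistic distribution function."""
    return 1.0 / (1.0 + np.exp(-v))

def fit_logit(design, y, n_iter=25):
    """Maximum likelihood estimate of a binary logit model (Newton-Raphson)."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = logistic_cdf(design @ beta)
        score = design.T @ (y - p)  # gradient of the log-likelihood
        hessian = -(design * (p * (1.0 - p))[:, None]).T @ design
        beta = beta - np.linalg.solve(hessian, score)  # Newton step
    return beta

def design_matrix(v, degree):
    """Columns 1, v, ..., v**degree."""
    return np.vander(v, degree + 1, increasing=True)

def true_prob(v):
    """Hypothetical true P(Y = 1 | x): logistic in x**2 - 1, not in x."""
    return logistic_cdf(v**2 - 1.0)

# Simulated choices (an assumption made for illustration).
rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(-2.0, 2.0, size=n)
y = (rng.uniform(size=n) < true_prob(x)).astype(float)

grid = np.linspace(-2.0, 2.0, 41)
max_error = {}
for label, degree in [("linear index", 1), ("quadratic index", 2)]:
    beta_hat = fit_logit(design_matrix(x, degree), y)
    fitted = logistic_cdf(design_matrix(grid, degree) @ beta_hat)
    max_error[label] = float(np.max(np.abs(fitted - true_prob(grid))))
    print(f"{label}: largest error in fitted probability = {max_error[label]:.3f}")
```

In this setup the response function is correctly specified in both
fits, so any difference in accuracy is due to the index alone. A wrong
response function with a correct index would be a separate source of
error that the sketch does not examine.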
Many investigators attempt to minimize the risk of specification error by carrying out a specification search in which several different models are estimated and conclusions are based on the one that appears to fit the data best. Specification searches may be unavoidable in some applications, but they have many undesirable properties and their use should be minimized. There is no guarantee that a specification search will include the correct model or a good approximation to it. If the search includes the correct model, there is no guarantee that it will be selected by the investigator's model selection criteria. Moreover, the search process invalidates the statistical theory on which inference is based.
The rest of this chapter describes methods that deal with the problem
of specification error by relaxing the assumptions about functional
form that are made by parametric models. The possibility of
specification error can be essentially eliminated through the use of
nonparametric estimation methods. They assume that the function of
interest is smooth but make no other assumptions about its shape or
functional form. However, nonparametric methods have important
disadvantages that seriously limit their usefulness in
applications. One important problem is that the precision of
a nonparametric estimator decreases rapidly as the dimension of the
explanatory variable increases. This phenomenon is called the
curse of dimensionality. As a result, impracticably
large samples are usually needed to obtain acceptable estimation
precision if the explanatory variable $X$ is multidimensional, as it
often is in applications. For example, a labor economist may want to
estimate mean log wages conditional on years of work experience, years
of education, and one or more indicators of skill levels, thus making
the dimension of $X$ at least 3.
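A rough sense of the magnitudes can be obtained from a standard rate
result that is not stated in the text above and is used here as an
assumption: if the conditional mean function is twice continuously
differentiable and the explanatory variable has dimension $d$, the
estimation error of a nonparametric estimator decreases at best at the
rate $n^{-2/(4+d)}$. Ignoring constants, the following Python sketch
computes the sample size needed in dimension $d$ to match the error
rate achieved with a given sample size in one dimension. The function
name `equivalent_sample_size` is hypothetical.

```python
def equivalent_sample_size(n_one_dim: float, dim: int) -> float:
    """Sample size in `dim` dimensions whose rate n**(-2 / (4 + dim))
    equals the rate n_one_dim**(-2 / 5) attained in one dimension.

    Solving n**(-2 / (4 + dim)) = n_one_dim**(-2 / 5) for n gives
    n = n_one_dim**((4 + dim) / 5). Constants are ignored.
    """
    return n_one_dim ** ((4 + dim) / 5)

for dim in range(1, 6):
    n_needed = equivalent_sample_size(1000, dim)
    print(f"dimension {dim}: about {n_needed:,.0f} observations")
```

Because constants are ignored, the numbers indicate orders of
magnitude only, but they show how quickly the required sample size
grows with the dimension.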
Another problem is that nonparametric estimates can be difficult to
display, communicate, and interpret when $X$ is
multidimensional. Nonparametric estimates do not have simple analytic
forms. If $X$ is one- or two-dimensional, then the estimate of the
function of interest can be displayed graphically as in
Fig 10.1, but only reduced-dimension projections can
be displayed when $X$ has three or more components. Many such displays
and much skill in interpreting them can be needed to fully convey and
comprehend the shape of an estimate.
A further problem with nonparametric estimation is that it does not
permit extrapolation. For example, in the case of a conditional mean
function, it does not provide predictions of $E(Y \mid x)$ at points $x$
that are outside of the support (or range) of the random
variable $X$. This is a serious drawback in policy analysis and
forecasting, where it is often important to predict what might happen
under conditions that do not exist in the available data. Finally, in
nonparametric estimation, it can be difficult to impose restrictions
suggested by economic or other theory. Matzkin (1994) discusses this
issue.
Semiparametric methods offer a compromise. They make assumptions about
functional form that are stronger than those of a nonparametric model
but less restrictive than the assumptions of a parametric model,
thereby reducing (though not eliminating) the possibility of
specification error. Semiparametric methods permit greater estimation
precision than do nonparametric methods when $X$ is
multidimensional. Semiparametric estimates are easier to display and
interpret than nonparametric ones and provide limited capabilities for
extrapolation and for imposing restrictions derived from economic or
other theoretical models. Section 10.2 of this chapter describes some
semiparametric models for conditional mean
functions. Section 10.3 describes semiparametric
estimators for an important class of hazard
models. Section 10.4 is concerned with semiparametric
estimation of a certain binary response model.