
10.1 Introduction

Much empirical research is concerned with estimating conditional mean, median, or hazard functions. For example, labor economists are interested in estimating the mean wages of employed individuals conditional on characteristics such as years of work experience and education. The most frequently used estimation methods assume that the function of interest is known up to a set of constant parameters that can be estimated from data. Models in which the only unknown quantities are a finite set of constant parameters are called parametric. The use of a parametric model greatly simplifies estimation, statistical inference, and interpretation of the estimation results but is rarely justified by theoretical or other a priori considerations. Estimation and inference based on convenient but incorrect assumptions about the form of the conditional mean function can be highly misleading.

**Figure 10.1:** Nonparametric and Parametric Estimates of Mean Log Wages

As an illustration, the solid line in Fig. 10.1 shows an estimate of the mean of the logarithm of weekly wages, $\log W$, conditional on years of work experience, EXP, for white males with years of education who work full time and live in urban areas of the North Central U.S. The estimate was obtained by applying kernel nonparametric regression (see, e.g., Härdle 1990, Fan and Gijbels 1996) to data from the 1993 Current Population Survey (CPS). The estimated conditional mean of $\log W$ increases steadily up to approximately years of experience and is flat thereafter. The dashed and dotted lines in Fig. 10.1 show two parametric estimates of the mean of the logarithm of weekly wages conditional on years of work experience. The dashed line is the ordinary least squares (OLS) estimate that is obtained by assuming that the mean of $\log W$ conditional on EXP is the linear function $E(\log W\vert {EXP})=\beta _0 +\beta _1 {EXP}$. The dotted line is the OLS estimate that is obtained by assuming that $E(\log W\vert {EXP})$ is the quadratic function $E(\log W\vert {EXP})=\beta _0 +\beta _1 {EXP}+\beta _2 {EXP}^2$. The nonparametric estimate (solid line) places no restrictions on the shape of $E(\log W\vert {EXP})$. The linear and quadratic models give misleading estimates of $E(\log W\vert {EXP})$. The linear model indicates that $E(\log W\vert {EXP})$ increases steadily as experience increases. The quadratic model indicates that $E(\log W\vert {EXP})$ decreases after years of experience. In contrast, the nonparametric estimate of $E(\log W\vert {EXP})$ becomes nearly flat at approximately years of experience. Because the nonparametric estimate does not restrict the conditional mean function to be linear or quadratic, it is more likely to represent the true conditional mean function.
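The comparison described above can be imitated on simulated data. The following is a minimal sketch, not the computation behind Fig. 10.1: the data are synthetic rather than drawn from the CPS, and the shape of the true mean function, the noise level, the Gaussian kernel, and the bandwidth of 2 years are all assumptions made for illustration. The kernel estimator used here is the Nadaraya-Watson (local constant) estimator, which is one of several kernel regression methods.

```python
import numpy as np

def nadaraya_watson(x_grid, x, y, bandwidth):
    """Kernel (local constant) estimate of E(y | x) on x_grid, Gaussian kernel."""
    # Scaled distance between every evaluation point and every observation.
    u = (x_grid[:, None] - x[None, :]) / bandwidth
    weights = np.exp(-0.5 * u**2)
    # Weighted average of y, with weights that decay with distance.
    return (weights @ y) / weights.sum(axis=1)

def ols_polynomial(x_grid, x, y, degree):
    """OLS fit of a polynomial of the given degree, evaluated on x_grid."""
    design = np.vander(x, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.vander(x_grid, degree + 1, increasing=True) @ coef

def true_mean(e):
    """Assumed true conditional mean: rises with experience, then flattens."""
    return 5.5 + 0.9 * (1.0 - np.exp(-e / 6.0))

# Synthetic sample (an assumption, not the CPS data).
rng = np.random.default_rng(0)
n = 2000
exp_years = rng.uniform(0.0, 40.0, n)
log_wage = true_mean(exp_years) + rng.normal(0.0, 0.3, n)

grid = np.linspace(2.0, 38.0, 37)
fits = {
    "kernel": nadaraya_watson(grid, exp_years, log_wage, bandwidth=2.0),
    "linear": ols_polynomial(grid, exp_years, log_wage, degree=1),
    "quadratic": ols_polynomial(grid, exp_years, log_wage, degree=2),
}
# Root mean squared distance of each fit from the assumed true mean.
rmse = {
    name: float(np.sqrt(np.mean((fit - true_mean(grid)) ** 2)))
    for name, fit in fits.items()
}
print(rmse)
```

In this simulated design the linear fit keeps rising and the quadratic fit eventually turns down, whereas the kernel estimate tracks the flat portion of the assumed mean function. The ranking of the three fits depends on the assumed design and bandwidth and should not be read as a general result.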

The opportunities for specification error increase if $Y$ is binary. For example, consider a model of the choice of travel mode for the trip to work. Suppose that the available modes are automobile and transit. Let $Y=1$ if an individual chooses automobile and $Y=0$ if the individual chooses transit. Let $x$ be a vector of explanatory variables such as the travel times and costs by automobile and transit. Then $E(Y\vert x)$ is the probability that $Y=1$ (the probability that the individual chooses automobile) conditional on $x$. This probability will be denoted $P(Y=1\vert x)$. In applications of binary response models, it is often assumed that $P(Y=1\vert x)=G(\beta'x)$, where $\beta$ is a vector of constant coefficients and $G$ is a known probability distribution function. Often, $G$ is assumed to be the cumulative standard normal distribution function, which yields a binary probit model, or the cumulative logistic distribution function, which yields a binary logit model. The coefficients $\beta$ can then be estimated by the method of maximum likelihood (Amemiya 1985). However, there are now two potential sources of specification error. First, the dependence of $Y$ on $x$ may not be through the linear index $\beta'x$. Second, even if the index $\beta'x$ is correct, the response function $G$ may not be the normal or logistic distribution function. See Horowitz (1993a, 1998) for examples of specification errors in binary response models and their consequences.
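As a concrete instance of the parametric approach just described, the following sketch estimates a binary logit model by maximum likelihood using Newton's method. It is illustrative only: the data are simulated, the three explanatory variables and the coefficient values in `beta_true` are hypothetical, and the model is correctly specified by construction, so neither of the two sources of specification error is present here.

```python
import numpy as np

def logistic(z):
    """Cumulative logistic distribution function G(z)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit(X, y, n_iter=25):
    """Maximum likelihood for P(Y=1|x) = G(beta'x), logistic G, by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = logistic(X @ beta)
        # Gradient and Hessian of the log-likelihood
        # sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ].
        gradient = X.T @ (y - p)
        hessian = -(X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta - np.linalg.solve(hessian, gradient)
    return beta

# Simulated data (hypothetical): a constant plus two explanatory variables,
# standing in for differences in travel time and travel cost.
rng = np.random.default_rng(1)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([0.5, -1.0, 0.8])
y = (rng.uniform(size=n) < logistic(X @ beta_true)).astype(float)

beta_hat = fit_logit(X, y)
print(beta_hat)
```

If the data were instead generated with a different index or a different response function $G$, the same code would still return an estimate, but it would in general no longer estimate $P(Y=1\vert x)$ consistently.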

Many investigators attempt to minimize the risk of specification error by carrying out a specification search in which several different models are estimated and conclusions are based on the one that appears to fit the data best. Specification searches may be unavoidable in some applications, but they have many undesirable properties and their use should be minimized. There is no guarantee that a specification search will include the correct model or a good approximation to it. If the search includes the correct model, there is no guarantee that it will be selected by the investigator's model selection criteria. Moreover, the search process invalidates the statistical theory on which inference is based.

The rest of this chapter describes methods that deal with the problem of specification error by relaxing the assumptions about functional form that are made by parametric models. The possibility of specification error can be essentially eliminated through the use of nonparametric estimation methods. They assume that the function of interest is smooth but make no other assumptions about its shape or functional form. However, nonparametric methods have important disadvantages that seriously limit their usefulness in applications. One important problem is that the precision of a nonparametric estimator decreases rapidly as the dimension of the explanatory variable increases. This phenomenon is called the curse of dimensionality. As a result of it, impracticably large samples are usually needed to obtain acceptable estimation precision if $x$ is multidimensional, as it often is in applications. For example, a labor economist may want to estimate mean log wages conditional on years of work experience, years of education, and one or more indicators of skill levels, thus making the dimension of $x$ at least three.
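One way to see why precision deteriorates with dimension is to count how many observations fall near a given point, since a nonparametric estimator is essentially a local average. The sketch below is a simplified illustration under assumed conditions: the explanatory variables are uniform on the unit cube, the neighborhood is a cube of fixed half-width 0.05 around the center, and the sample size is 100,000. None of these values come from the text.

```python
import numpy as np

def local_sample_size(n, dim, half_width, rng):
    """Count uniform draws on [0,1]^dim inside a cube around the center."""
    x = rng.uniform(size=(n, dim))
    # A point is "local" only if every coordinate is within half_width of 0.5.
    inside = np.all(np.abs(x - 0.5) <= half_width, axis=1)
    return int(inside.sum())

rng = np.random.default_rng(2)
n, half_width = 100_000, 0.05

# Simulated and expected numbers of local observations for dimensions 1 to 5.
counts = {d: local_sample_size(n, d, half_width, rng) for d in range(1, 6)}
expected = {d: n * (2 * half_width) ** d for d in range(1, 6)}
print(counts)
print(expected)
```

The expected count is $n(2h)^d$ for half-width $h$ and dimension $d$, so with these assumed values each additional dimension reduces the number of local observations by a factor of ten. Keeping the local sample size constant would require either a much larger sample or a wider neighborhood, and a wider neighborhood increases bias.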

Another problem is that nonparametric estimates can be difficult to display, communicate, and interpret when $x$ is multidimensional. Nonparametric estimates do not have simple analytic forms. If $x$ is one- or two-dimensional, then the estimate of the function of interest can be displayed graphically as in Fig. 10.1, but only reduced-dimension projections can be displayed when $x$ has three or more components. Many such displays and much skill in interpreting them can be needed to fully convey and comprehend the shape of an estimate.

A further problem with nonparametric estimation is that it does not permit extrapolation. For example, in the case of a conditional mean function it does not provide predictions of $E(Y\vert x)$ at points $x$ that are outside of the support (or range) of the random variable $X$. This is a serious drawback in policy analysis and forecasting, where it is often important to predict what might happen under conditions that do not exist in the available data. Finally, in nonparametric estimation, it can be difficult to impose restrictions suggested by economic or other theory. Matzkin (1994) discusses this issue.

Semiparametric methods offer a compromise. They make assumptions about functional form that are stronger than those of a nonparametric model but less restrictive than the assumptions of a parametric model, thereby reducing (though not eliminating) the possibility of specification error. Semiparametric methods permit greater estimation precision than do nonparametric methods when $x$ is multidimensional. Semiparametric estimates are easier to display and interpret than nonparametric ones and provide limited capabilities for extrapolation and for imposing restrictions derived from economic or other theoretical models. Section 10.2 of this chapter describes some semiparametric models for conditional mean functions. Section 10.3 describes semiparametric estimators for an important class of hazard models. Section 10.4 is concerned with semiparametric estimation of a certain binary response model.
