
# 10.2 Semiparametric Models for Conditional Mean Functions

The term semiparametric refers to models in which there is an unknown function in addition to an unknown finite-dimensional parameter. For example, the binary response model $Y = I(\beta'X + U > 0)$ is semiparametric if the distribution function of $U$ and the vector of coefficients $\beta$ are both treated as unknown quantities. This section describes two semiparametric models of conditional mean functions that are important in applications. The section also describes a related class of models that has no unknown finite-dimensional parameters but, like semiparametric models, mitigates the disadvantages of fully nonparametric models. Finally, this section describes a class of transformation models that is important in estimation of hazard functions, among other applications. Powell (1994) discusses additional semiparametric models.

## 10.2.1 Single Index Models

In a semiparametric single index model, the conditional mean function has the form

$$E(Y \mid X = x) = G(\beta'x) \qquad (10.1)$$

where $\beta$ is an unknown constant vector and $G$ is an unknown function. The quantity $\beta'x$ is called an index. The inferential problem is to estimate $G$ and $\beta$ from observations of $(Y, X)$. $G$ in (10.1) is analogous to a link function in a generalized linear model, except that in (10.1) $G$ is unknown and must be estimated.

Model (10.1) contains many widely used parametric models as special cases. For example, if $G$ is the identity function, then (10.1) is a linear model. If $G$ is the cumulative normal or logistic distribution function, then (10.1) is a binary probit or logit model. When $G$ is unknown, (10.1) provides a specification that is more flexible than a parametric model but retains many of the desirable features of parametric models, as will now be explained.

One important property of single index models is that they avoid the curse of dimensionality. This is because the index $\beta'x$ aggregates the dimensions of $x$, thereby achieving dimension reduction. Consequently, the difference between the estimator of $G$ and the true function can be made to converge to zero at the same rate that would be achieved if $\beta'x$ were observable. Moreover, $\beta$ can be estimated with the same rate of convergence that is achieved in a parametric model. Thus, in terms of the rates of convergence of estimators, a single index model is as accurate as a parametric model for estimating $\beta$ and as accurate as a one-dimensional nonparametric model for estimating $G$. This dimension reduction feature of single index models gives them a considerable advantage over nonparametric methods in applications where $X$ is multidimensional and the single index structure is plausible.

A single-index model permits limited extrapolation. Specifically, it yields predictions of $E(Y \mid X = x)$ at values of $x$ that are not in the support of $X$ but at which $\beta'x$ is in the support of $\beta'X$. Of course, there is a price that must be paid for the ability to extrapolate. A single index model makes assumptions that are stronger than those of a nonparametric model. These assumptions are testable on the support of $X$ but not outside of it. Thus, extrapolation (unavoidably) relies on untestable assumptions about the behavior of $E(Y \mid X = x)$ beyond the support of $X$.

Before $\beta$ and $G$ can be estimated, restrictions must be imposed that ensure their identification. That is, $\beta$ and $G$ must be uniquely determined by the population distribution of $(Y, X)$. Identification of single index models has been investigated by Ichimura (1993) and, for the special case of binary response models, Manski (1988). It is clear that $\beta$ is not identified if $G$ is a constant function or there is an exact linear relation among the components of $X$ (perfect multicollinearity). In addition, (10.1) is observationally equivalent to the model $E(Y \mid X = x) = G^{*}(\gamma + \delta\beta'x)$, where $\gamma$ and $\delta \ne 0$ are arbitrary and $G^{*}$ is defined by the relation $G^{*}(\gamma + \delta v) = G(v)$ for all $v$ in the support of $\beta'X$. Therefore, $\beta$ and $G$ are not identified unless restrictions are imposed that uniquely specify $\gamma$ and $\delta$. The restriction on $\gamma$ is called location normalization and can be imposed by requiring $X$ to contain no constant (intercept) component. The restriction on $\delta$ is called scale normalization. Scale normalization can be achieved by setting the $\beta$ coefficient of one component of $X$ equal to one. A further identification requirement is that $X$ must include at least one continuously distributed component whose $\beta$ coefficient is non-zero. Horowitz (1998) gives an example that illustrates the need for this requirement. Other, more technical identification requirements are discussed by Ichimura (1993) and Manski (1988).

The main estimation challenge in single index models is estimating $\beta$. Given an estimator $b_n$ of $\beta$, $G$ can be estimated by carrying out the nonparametric regression of $Y$ on $b_n'X$ (e.g., by using kernel estimation). Several estimators of $\beta$ are available. Ichimura (1993) describes a nonlinear least squares estimator. Klein and Spady (1993) describe a semiparametric maximum likelihood estimator for the case in which $Y$ is binary. These estimators are difficult to compute because they require solving complicated nonlinear optimization problems. Powell et al. (1989) describe a density-weighted average derivative estimator (DWADE) that is non-iterative and easily computed. The DWADE applies when all components of $X$ are continuous random variables. It is based on the relation

$$\beta \propto E\left[p(X)\,\frac{\partial E(Y \mid X)}{\partial X}\right] = -2\,E\left[Y\,\frac{\partial p(X)}{\partial X}\right] \qquad (10.2)$$

where $p$ is the probability density function of $X$ and the second equality follows from integrating the first by parts. Thus, $\beta$ can be estimated up to scale by estimating the expression on the right-hand side of the second equality. Powell et al. (1989) show that this can be done by replacing $p$ with a nonparametric estimator and replacing the population expectation with a sample average. Horowitz and Härdle (1996) extend this method to models in which some components of $X$ are discrete. Hristache, Juditsky, and Spokoiny (2001) develop an iterated average derivative estimator that performs well when $X$ is high-dimensional. Ichimura and Lee (1991) and Hristache, Juditsky, Polzehl and Spokoiny (2001) investigate multiple-index generalizations of (10.1).
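The two estimation steps can be sketched numerically. The code below is a minimal illustration, not any author's implementation: the function names (`dwade`, `kernel_regression`), the unnormalized Gaussian kernels, the fixed bandwidths, and the simulated logistic design are all choices of this sketch, and the bandwidth sequences and asymptotic refinements discussed by Powell et al. (1989) are omitted. It forms the sample analog of the right-hand side of (10.2) with a leave-one-out kernel density estimate, then recovers $G$ by a kernel regression of $Y$ on the fitted index.

```python
import numpy as np

def dwade(X, y, h):
    """Density-weighted average derivative estimate of beta, up to scale:
    sample analog of -2 E[Y dp(X)/dx], with p replaced by a leave-one-out
    kernel density estimate (unnormalized Gaussian product kernel)."""
    n, d = X.shape
    u = (X[:, None, :] - X[None, :, :]) / h        # pairwise (X_i - X_j) / h
    k = np.exp(-0.5 * (u ** 2).sum(axis=2))        # product Gaussian kernel
    np.fill_diagonal(k, 0.0)                       # leave out the own observation
    # gradient of the kernel density estimate, evaluated at each X_i
    grad = -(k[:, :, None] * u).sum(axis=1) / ((n - 1) * h ** (d + 1))
    delta = -2.0 * (y[:, None] * grad).mean(axis=0)
    return delta / delta[0]                        # scale normalization

def kernel_regression(v_train, y, v_eval, h):
    """Nadaraya-Watson estimate of G: regression of y on fitted index values."""
    u = (v_eval[:, None] - v_train[None, :]) / h
    w = np.exp(-0.5 * u ** 2)
    return (w @ y) / w.sum(axis=1)

# Simulated single-index data with G logistic and beta = (1, 2), which is
# already in scale-normalized form (first coefficient equal to one).
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))
y = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, 2.0])))) + 0.1 * rng.normal(size=n)

b = dwade(X, y, h=0.5)                                   # beta up to scale
G_hat = kernel_regression(X @ b, y, np.array([-2.0, 0.0, 2.0]), h=0.2)
```

In this design the scale-normalized coefficient vector is $(1, 2)$ and $G(0) = 0.5$, so `b` and `G_hat` should be close to these values for moderate $n$.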

The usefulness of single-index models can be illustrated with an example that is taken from Horowitz and Härdle (1996). The example consists of estimating a model of product innovation by German manufacturers of investment goods. The data, assembled in 1989 by the IFO Institute of Munich, consist of observations on 1100 manufacturers. The dependent variable is $Y = 1$ if a manufacturer realized an innovation during 1989 in a specific product category and $Y = 0$ otherwise. The independent variables are the number of employees in the product category (EMPLP), the number of employees in the entire firm (EMPLF), an indicator of the firm's production capacity utilization (CAP), and a variable DEM, which equals 1 if a firm expected increasing demand in the product category and 0 otherwise. The first three independent variables are standardized so that they have units of standard deviations from their means. Scale normalization was achieved by setting the coefficient of EMPLP equal to 1.

Table 10.1 shows the parameter estimates obtained using a binary probit model and the semiparametric method of Horowitz and Härdle (1996). Figure 10.2 shows a kernel estimate of $G'$. There are two important differences between the semiparametric and probit estimates. First, the semiparametric estimate of the coefficient of EMPLF is small and statistically nonsignificant, whereas the probit estimate is significant at the 0.05 level and similar in size to the coefficient of CAP. Second, in the binary probit model, $G$ is a cumulative normal distribution function, so $G'$ is a normal density function. Figure 10.2 reveals, however, that $G'$ is bimodal. This bimodality suggests that the data may be a mixture of two populations. An obvious next step in the analysis of the data would be to search for variables that characterize these populations. Standard diagnostic techniques for binary probit models would provide no indication that $G'$ is bimodal. Thus, the semiparametric estimate has revealed an important feature of the data that could not easily be found using standard parametric methods.

**Table 10.1.** Estimated coefficients (standard errors in parentheses)

|                      | EMPLP | EMPLF         | CAP           | DEM           |
|----------------------|-------|---------------|---------------|---------------|
| Semiparametric Model | 1     | 0.032 (0.023) | 0.346 (0.078) | 1.732 (0.509) |
| Probit Model         | 1     | 0.516 (0.024) | 0.520 (0.163) | 1.895 (0.387) |

## 10.2.2 Partially Linear Models

In a partially linear model, $X$ is partitioned into two non-overlapping subvectors, $X_1$ and $X_2$. The model has the form

$$E(Y \mid X_1 = x_1,\, X_2 = x_2) = \beta'x_1 + g(x_2) \qquad (10.3)$$

where $\beta$ is an unknown constant vector and $g$ is an unknown function. This model is distinct from the class of single index models. A single index model is not partially linear unless $g$ is a linear function. Conversely, a partially linear model is a single index model only in this case. Stock (1989, 1991) and Engle et al. (1986) illustrate the use of (10.3) in applications. Identification of $\beta$ requires the exclusion restriction that none of the components of $X_1$ are perfectly predictable by components of $X_2$. When $\beta$ is identified, it can be estimated with an $n^{-1/2}$ rate of convergence regardless of the dimensions of $X_1$ and $X_2$. Thus, the curse of dimensionality is avoided in estimating $\beta$.

An estimator of $\beta$ can be obtained by observing that (10.3) implies

$$Y - E(Y \mid X_2) = \beta'[X_1 - E(X_1 \mid X_2)] + U \qquad (10.4)$$

where $U$ is an unobserved random variable satisfying $E(U \mid X_1, X_2) = 0$. Robinson (1988) shows that under regularity conditions, $\beta$ can be estimated by applying OLS to (10.4) after replacing $E(Y \mid X_2)$ and $E(X_1 \mid X_2)$ with nonparametric estimators. The resulting estimator $b_n$ converges at rate $n^{-1/2}$ and is asymptotically normally distributed. $g$ can be estimated by carrying out the nonparametric regression of $Y - b_n'X_1$ on $X_2$. Unlike $b_n$, the estimator of $g$ suffers from the curse of dimensionality; its rate of convergence decreases as the dimension of $X_2$ increases.
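The double-residual idea in (10.4) is easy to sketch. The code below is an illustrative toy, assuming scalar $X_1$ and $X_2$, a Gaussian-kernel Nadaraya-Watson smoother (the helper name `nw` is this sketch's choice), and arbitrary fixed bandwidths; the trimming and technical refinements of Robinson (1988) are omitted.

```python
import numpy as np

def nw(x_train, t, x_eval, h):
    """Nadaraya-Watson regression of targets t on x_train, evaluated at x_eval."""
    u = (x_eval[:, None] - x_train[None, :]) / h
    w = np.exp(-0.5 * u ** 2)
    return (w @ t) / w.sum(axis=1)

rng = np.random.default_rng(1)
n = 2000
x2 = rng.uniform(-2.0, 2.0, size=n)      # enters through the unknown g
x1 = 0.5 * x2 + rng.normal(size=n)       # correlated with x2, so naive OLS of y on x1 is biased
y = 1.5 * x1 + np.sin(x2) + 0.2 * rng.normal(size=n)   # true beta = 1.5, g = sin

# Double-residual estimator: partial out E(y | x2) and E(x1 | x2), then OLS.
ey = y - nw(x2, y, x2, h=0.2)
ex = x1 - nw(x2, x1, x2, h=0.2)
b = (ex @ ey) / (ex @ ex)

# g is then recovered by a nonparametric regression of y - b * x1 on x2.
g_hat = nw(x2, y - b * x1, np.array([0.0, np.pi / 2.0]), h=0.2)
```

With this design, `b` should be close to 1.5, and `g_hat` close to $\sin(0) = 0$ and $\sin(\pi/2) = 1$ at the two evaluation points.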

## 10.2.3 Nonparametric Additive Models

Let $X$ have $d$ continuously distributed components that are denoted $X^1, \ldots, X^d$. In a nonparametric additive model of the conditional mean function,

$$E(Y \mid X = x) = \mu + f_1(x^1) + \cdots + f_d(x^d) \qquad (10.5)$$

where $\mu$ is a constant and $f_1, \ldots, f_d$ are unknown functions that satisfy a location normalization condition such as

$$\int f_k(v)\, w_k(v)\, dv = 0, \qquad k = 1, \ldots, d \qquad (10.6)$$

where $w_k$ is a non-negative weight function. An additive model is distinct from a single index model unless $E(Y \mid X = x)$ is a linear function of $x$. Additive and partially linear models are distinct unless $E(Y \mid X = x)$ is partially linear and $g$ in (10.3) is additive.

An estimator of $f_k$ can be obtained by observing that (10.5) and (10.6) imply

$$\mu + f_k(x^k) = \int E(Y \mid X = x)\, w_{-k}(x^{-k})\, dx^{-k} \qquad (10.7)$$

where $x^{-k}$ is the vector consisting of all components of $x$ except the $k$'th and $w_{-k}$ is a weight function that satisfies $\int w_{-k}(x^{-k})\, dx^{-k} = 1$. The estimator of $f_k$ is obtained by replacing $E(Y \mid X = x)$ on the right-hand side of (10.7) with a nonparametric estimator. Linton and Nielsen (1995) and Linton (1997) present the details of the procedure and extensions of it. Under suitable conditions, the estimator of $f_k$ converges to the true $f_k$ at rate $n^{-2/5}$ regardless of the dimension of $X$. Thus, the additive model provides dimension reduction. It also permits extrapolation of $E(Y \mid X = x)$ within the rectangle formed by the supports of the individual components of $X$. Mammen, Linton, and Nielsen (1999) describe a backfitting procedure that is likely to be more precise than the estimator based on (10.7) when $d$ is large. See Hastie and Tibshirani (1990) for an early discussion of backfitting.
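The marginal-integration formula (10.7) can be illustrated by replacing the weight function with the empirical distribution of the omitted component. Everything in the sketch below — the helper names `nw2` and `marginal_integration`, the bivariate Gaussian kernel, the fixed bandwidth, and the simulated design with $d = 2$ — is an assumption of this illustration rather than part of Linton and Nielsen's (1995) procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1500
X = rng.uniform(-1.0, 1.0, size=(n, 2))
f1 = lambda v: v ** 2 - 1.0 / 3.0     # centered so it integrates to 0 on [-1, 1]
f2 = lambda v: np.sin(np.pi * v)      # also centered on [-1, 1]
mu = 0.5
y = mu + f1(X[:, 0]) + f2(X[:, 1]) + 0.1 * rng.normal(size=n)

def nw2(X, y, pts, h):
    """Bivariate Nadaraya-Watson estimate of E(Y | X = pt) at each row of pts."""
    u = (pts[:, None, :] - X[None, :, :]) / h
    w = np.exp(-0.5 * (u ** 2).sum(axis=2))
    return (w @ y) / w.sum(axis=1)

def marginal_integration(X, y, v_grid, h, k):
    """Estimate mu + f_k on v_grid: fix component k at v and average the full
    regression surface over the empirical distribution of the other component,
    an empirical-measure version of the weight in (10.7)."""
    out = np.empty(len(v_grid))
    for i, v in enumerate(v_grid):
        pts = X.copy()
        pts[:, k] = v                  # hold the k'th component fixed at v
        out[i] = nw2(X, y, pts, h).mean()
    return out

grid = np.array([-0.5, 0.0, 0.5])
m1 = marginal_integration(X, y, grid, h=0.15, k=0)   # estimates mu + f1 on grid
```

Here the true value of $\mu + f_1(0)$ is $0.5 - 1/3 \approx 0.17$, and the quadratic shape of $f_1$ should be visible in `m1` despite the presence of $f_2$.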

Linton and Härdle (1996) describe a generalized additive model whose form is

$$E(Y \mid X = x) = G[\mu + f_1(x^1) + \cdots + f_d(x^d)] \qquad (10.8)$$

where $f_1, \ldots, f_d$ are unknown functions and $G$ is a known, strictly increasing (or decreasing) function. Horowitz (2001) describes a version of (10.8) in which $G$ is unknown. Both forms of (10.8) achieve dimension reduction. When $G$ is unknown, (10.8) nests additive and single index models and, under certain conditions, partially linear models.

The use of the nonparametric additive specification (10.5) can be illustrated by estimating the model $E(\log W \mid \text{EXP}, \text{EDUC}) = \mu + f_1(\text{EXP}) + f_2(\text{EDUC})$, where $W$ and EXP are defined as in Sect. 10.1 and EDUC denotes years of education. The data are taken from the 1993 CPS and are for white males with 12 or fewer years of education who work full time and live in urban areas of the North Central U.S. The results are shown in Fig. 10.3. The unknown functions $f_1$ and $f_2$ are estimated by the method of Linton and Nielsen (1995) and are normalized to satisfy a location condition of the form (10.6). The estimates of $f_1$ (Fig. 10.3a) and $f_2$ (Fig. 10.3b) are nonlinear and differently shaped. Functions $f_1$ and $f_2$ with different shapes cannot be produced by a single index model, and a lengthy specification search might be needed to find a parametric model that produces the shapes shown in Fig. 10.3. Some of the fluctuations of the estimates of $f_1$ and $f_2$ may be artifacts of random sampling error rather than features of $E(\log W \mid \text{EXP}, \text{EDUC})$. However, a more elaborate analysis that takes account of the effects of random sampling error rejects the hypothesis that either function is linear.

## 10.2.4 Transformation Models

A transformation model has the form

$$H(Y) = \beta'X + U \qquad (10.9)$$

where $H$ is an unknown increasing function, $\beta$ is an unknown finite-dimensional vector of constants, and $U$ is an unobserved random variable. It is assumed here that $U$ is statistically independent of $X$. The aim is to estimate $H$ and $\beta$. One possibility is to assume that $H$ is known up to a finite-dimensional parameter. For example, $H$ could be the Box-Cox transformation

$$H(y) = \frac{y^{\lambda} - 1}{\lambda}$$

where $\lambda$ is an unknown parameter. Methods for estimating transformation models in which $H$ is parametric have been developed by Amemiya and Powell (1981) and Foster et al. (2001), among others.
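For concreteness, the Box-Cox family can be coded in a few lines; the log case below is the limiting value as $\lambda \to 0$, and the function name is of course an arbitrary choice of this sketch.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transformation H(y) = (y**lam - 1) / lam for y > 0,
    with the log transformation as the limiting case lam = 0."""
    y = np.asarray(y, dtype=float)
    if lam == 0.0:
        return np.log(y)              # limit of (y**lam - 1)/lam as lam -> 0
    return (y ** lam - 1.0) / lam
```

Note that $\lambda = 1$ gives a shifted identity transformation, so the linear model is a member of the family.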

Another possibility is to assume that $H$ is unknown but that the distribution of $U$ is known. Cheng, Wei, and Ying (1995, 1997) have developed estimators for this version of (10.9). Consider, first, the problem of estimating $\beta$. Let $F$ denote the (known) cumulative distribution function (CDF) of $U$. Let $(Y_i, X_i)$ and $(Y_j, X_j)$ be two distinct, independent observations of $(Y, X)$. Then it follows from (10.9) that

$$P(Y_i > Y_j \mid X_i, X_j) = P[\,U_i - U_j > -(X_i - X_j)'\beta \mid X_i, X_j\,] \qquad (10.10)$$

Let $G_0(v) = P(U_i - U_j > -v)$ for any real $v$. Then

$$G_0(v) = \int [1 - F(u - v)]\, dF(u)$$

is a known function because $F$ is assumed known. Substituting into (10.10) gives

$$P(Y_i > Y_j \mid X_i, X_j) = G_0[(X_i - X_j)'\beta]$$

Define $Z_{ij} = I(Y_i > Y_j)$. Then it follows that $\beta$ satisfies the moment condition

$$E\left\{w[(X_i - X_j)'\beta]\,(X_i - X_j)\,[Z_{ij} - G_0((X_i - X_j)'\beta)]\right\} = 0 \qquad (10.11)$$

where $w$ is a weight function. Cheng, Wei, and Ying (1995) propose estimating $\beta$ by replacing the population moment condition (10.11) with the sample analog

$$\frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \ne i} w[(X_i - X_j)'b]\,(X_i - X_j)\,\left\{Z_{ij} - G_0[(X_i - X_j)'b]\right\} = 0 \qquad (10.12)$$

The estimator of $\beta$, $b_n$, is the solution to (10.12). Equation (10.12) has a unique solution if $w(v) = 1$ for all $v$ and the matrix $\sum_{i}\sum_{j \ne i} (X_i - X_j)(X_i - X_j)'$ is positive definite. It also has a unique solution asymptotically if $w$ is positive everywhere (Cheng, Wei, and Ying 1995). Moreover, $b_n$ converges almost surely to $\beta$. Cheng, Wei, and Ying (1995) also give conditions under which $n^{1/2}(b_n - \beta)$ is asymptotically normally distributed with a mean of 0.
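A toy version of this estimator can be obtained by solving (10.12) with $w = 1$ by Newton's method. The sketch below assumes $U$ is standard normal, so that $G_0(v) = \Phi(v/\sqrt{2})$, and simulates from a log transformation; the sample size, starting value, and iteration count are arbitrary choices of the illustration, not part of Cheng, Wei, and Ying's (1995) proposal.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 2))
beta = np.array([1.0, -0.5])
Y = np.exp(X @ beta + rng.normal(size=n))    # H(y) = log y, U ~ N(0, 1), F known

erf = np.vectorize(math.erf)
def G0(v):
    """G0(v) = P(U_i - U_j > -v) = Phi(v / sqrt(2)) when U ~ N(0, 1)."""
    return 0.5 * (1.0 + erf(v / 2.0))

dX = X[:, None, :] - X[None, :, :]           # pairwise differences X_i - X_j
Z = (Y[:, None] > Y[None, :]).astype(float)  # Z_ij = I(Y_i > Y_j)

# Solve the sample moment condition (10.12) with w = 1 by Newton's method;
# the diagonal i = j terms are harmless because dX_ii = 0.
b = np.zeros(2)
for _ in range(12):
    v = dX @ b
    S = (dX * (Z - G0(v))[:, :, None]).mean(axis=(0, 1))
    g0 = np.exp(-v ** 2 / 4.0) / (2.0 * math.sqrt(math.pi))   # G0'(v)
    J = -(dX[..., :, None] * dX[..., None, :] * g0[..., None, None]).mean(axis=(0, 1))
    b = b - np.linalg.solve(J, S)
```

With $w = 1$, the Jacobian `J` is negative definite whenever the pairwise differences span the space, which is what makes the solution unique here.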

The problem of estimating the transformation function $H$ is addressed by Cheng, Wei, and Ying (1997). Equation (10.9) implies that for any real $y$ and any vector $x$ that is conformable with $\beta$, $E\{I(Y \le y) - F[H(y) - x'\beta] \mid X = x\} = 0$. Cheng, Wei, and Ying (1997) propose estimating $H(y)$ by the solution to the sample analog of this equation. That is, the estimator $H_n(y)$ solves

$$\sum_{i=1}^{n} \left\{I(Y_i \le y) - F[H_n(y) - X_i'b_n]\right\} = 0$$

where $b_n$ is the solution to (10.12). Cheng, Wei, and Ying (1997) show that if $H$ is strictly increasing on its support, then $H_n(y)$ converges to $H(y)$ almost surely, uniformly over any finite interval $[y_1, y_2]$ such that $0 < P(Y \le y_1)$ and $P(Y \le y_2) < 1$. Moreover, $n^{1/2}(H_n - H)$ converges to a mean-zero Gaussian process over this interval.
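Because the left-hand side of the sample equation is monotone in $H_n(y)$, it can be solved by bisection. The sketch below again assumes a standard normal $U$ and, purely for illustration, plugs in the true $\beta$ where the estimator $b_n$ from (10.12) would be used; the bracketing interval and iteration count are arbitrary.

```python
import math
import numpy as np

rng = np.random.default_rng(4)
n = 500
X = rng.normal(size=(n, 2))
beta = np.array([1.0, -0.5])
Y = np.exp(X @ beta + rng.normal(size=n))      # true transformation H(y) = log y

erf = np.vectorize(math.erf)
F = lambda u: 0.5 * (1.0 + erf(u / math.sqrt(2.0)))   # standard normal CDF of U

def H_hat(y, X, Y, b):
    """Solve sum_i { I(Y_i <= y) - F[H - b'X_i] } = 0 for H by bisection.

    The left-hand side is strictly decreasing in H, so the root is unique."""
    target = (Y <= y).mean()
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if F(mid - X @ b).mean() > target:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# True beta is used in place of the estimate b_n purely for illustration.
h_at_1 = H_hat(1.0, X, Y, beta)       # true value H(1) = log 1 = 0
h_at_e = H_hat(math.e, X, Y, beta)    # true value H(e) = 1
```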

A third possibility is to assume that $H$ and the distribution of $U$ are both nonparametric in (10.9). In this case, certain normalizations are needed to make identification of (10.9) possible. First, observe that (10.9) continues to hold if $H$ is replaced by $cH$, $\beta$ is replaced by $c\beta$, and $U$ is replaced by $cU$ for any positive constant $c$. Therefore, a scale normalization is needed to make identification possible. This will be done here by setting $|\beta_1| = 1$, where $\beta_1$ is the first component of $\beta$. Observe, also, that when $H$ and the distribution of $U$ are nonparametric, (10.9) is a semiparametric single-index model. Therefore, identification of $\beta$ requires $X$ to have at least one component whose distribution conditional on the others is continuous and whose $\beta$ coefficient is non-zero. Assume without loss of generality that the components of $X$ are ordered so that the first satisfies this condition.

It can also be seen that (10.9) is unchanged if $H$ is replaced by $H + c$ and $U$ is replaced by $U + c$ for any positive or negative constant $c$. Therefore, a location normalization is also needed to achieve identification when $H$ and the distribution of $U$ are nonparametric. Location normalization will be carried out here by assuming that $H(y_0) = 0$ for some finite $y_0$. With this location normalization, there is no centering assumption on $U$ and no intercept term in $\beta'X$.

Now consider the problem of estimating $H$, $\beta$, and the distribution of $U$. Because (10.9) is a single-index model in this case, $\beta$ can be estimated using the methods described in Sect. 10.2.1. Let $b_n$ denote the estimator of $\beta$. One approach to estimating $H$ is given by Horowitz (1996). To describe this approach, define $Z = \beta'X$. Let $G(y \mid z)$ denote the CDF of $Y$ conditional on $Z = z$. Set $G_y(y \mid z) = \partial G(y \mid z)/\partial y$ and $G_z(y \mid z) = \partial G(y \mid z)/\partial z$. Then it follows from (10.9) that $G(y \mid z) = F[H(y) - z]$, where $F$ is the CDF of $U$, and that

$$H(y) = -\int_{y_0}^{y} \frac{G_y(v \mid z)}{G_z(v \mid z)}\, dv \qquad (10.13)$$

for any $z$ such that the denominator of the integrand is non-zero. Now let $w$ be a scalar-valued, non-negative weight function with compact support $S_w$ such that the denominator of the integrand in (10.13) is bounded away from 0 for all $z \in S_w$ and all $v$ between $y_0$ and $y$. Also assume that

$$\int_{S_w} w(z)\, dz = 1$$

Then

$$H(y) = -\int_{y_0}^{y} \int_{S_w} w(z)\, \frac{G_y(v \mid z)}{G_z(v \mid z)}\, dz\, dv \qquad (10.14)$$

Horowitz (1996) obtains an estimator of $H$ from (10.14) by replacing $G_y$ and $G_z$ with kernel estimators. Specifically, $G_y$ is replaced by a kernel estimator of the probability density function of $Y$ conditional on $b_n'X = z$, and $G_z$ is replaced by a kernel estimator of the derivative with respect to $z$ of the CDF of $Y$ conditional on $b_n'X = z$. Denote these estimators by $G_{ny}$ and $G_{nz}$. Then the estimator of $H$ is

$$H_n(y) = -\int_{y_0}^{y} \int_{S_w} w(z)\, \frac{G_{ny}(v \mid z)}{G_{nz}(v \mid z)}\, dz\, dv \qquad (10.15)$$

Horowitz (1996) gives conditions under which $H_n$ is uniformly consistent for $H$ and $n^{1/2}(H_n - H)$ converges weakly to a mean-zero Gaussian process. Horowitz (1996) also shows how to estimate $F$, the CDF of $U$, and gives conditions under which $n^{1/2}(F_n - F)$ converges to a mean-zero Gaussian process, where $F_n$ is the estimator. Gørgens and Horowitz (1999) extend these results to a censored version of (10.9). Integration over $z$ in (10.14) and (10.15) accelerates the convergence of $H_n(y)$ to $H(y)$. Kernel estimators converge in probability at rates slower than $n^{-1/2}$. Therefore, the ratio $G_{ny}/G_{nz}$ is not $n^{-1/2}$-consistent for $G_y/G_z$. However, integration over $z$ and $v$ in (10.15) creates an averaging effect that causes the integral and, therefore, $H_n$ to converge at the rate $n^{-1/2}$. This is the reason for basing the estimator on (10.14) instead of (10.13).
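A crude numerical version of (10.15) can be built from a kernel estimate of $G(y \mid z)$ alone, approximating $G_y$ and $G_z$ by finite differences. This replaces the analytic kernel derivatives of Horowitz (1996) and is meant only to show the structure of the double integral: in the sketch below, the weight $w(z)$ is taken to be uniform over a short grid in the interior of the support of $Z$, the true index is used in place of $b_n'X$, and all bandwidths and step sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
X = rng.normal(size=(n, 2))
beta = np.array([1.0, -0.5])     # scale normalization: first coefficient is 1
Y = np.exp(X @ beta + rng.normal(size=n))   # true H(y) = log y, so H(1) = 0
Z = X @ beta                      # true index used in place of b_n'X for illustration

def G_hat(y, z, h):
    """Kernel estimate of G(y | z) = P(Y <= y | Z = z)."""
    w = np.exp(-0.5 * ((z - Z) / h) ** 2)
    return (w * (Y <= y)).sum() / w.sum()

def H_hat(y, y0=1.0, h=0.25, dv=0.05, dz=0.2):
    """Numerical version of (10.15): G_y and G_z are approximated by finite
    differences of G_hat, and w(z) is uniform on a short grid inside the
    support of Z, where the denominator stays away from zero."""
    zs = np.linspace(-0.5, 0.5, 5)
    total = 0.0
    for v in np.arange(y0, y, dv):
        ratio = 0.0
        for z in zs:
            Gy = (G_hat(v + dv, z, h) - G_hat(v, z, h)) / dv
            Gz = (G_hat(v, z + dz, h) - G_hat(v, z - dz, h)) / (2.0 * dz)
            ratio += Gy / Gz
        total += (ratio / len(zs)) * dv
    return -total

h1 = H_hat(np.e)     # true value H(e) = 1
h2 = H_hat(2.0)      # true value H(2) = log 2, about 0.69
```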

Other estimators of $H$ when $H$ and the distribution of $U$ are both nonparametric have been proposed by Ye and Duan (1997) and Chen (2002). Chen (2002) uses a rank-based approach that is in some ways simpler than that of Horowitz (1996) and may have better finite-sample performance. To describe this approach, define $d_i(y) = I(Y_i \ge y)$ and $d_j(y_0) = I(Y_j \ge y_0)$, where $y_0$ is the point at which the location normalization $H(y_0) = 0$ is imposed. Let $H_n(y)$ denote the estimator of $H(y)$. Then $E[d_i(y) - d_j(y_0) \mid X_i, X_j] \ge 0$ whenever $(X_i - X_j)'\beta \ge H(y)$. This suggests that if $\beta$ were known, then $H(y)$ could be estimated by

$$H_n(y) = \arg\max_{\tau} \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \ne i} [d_i(y) - d_j(y_0)]\, I[(X_i - X_j)'\beta \ge \tau]$$

Since $\beta$ is unknown, Chen (2002) proposes

$$H_n(y) = \arg\max_{\tau} \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \ne i} [d_i(y) - d_j(y_0)]\, I[(X_i - X_j)'b_n \ge \tau]$$

Chen (2002) gives conditions under which $H_n$ is uniformly consistent for $H$ and $n^{1/2}(H_n - H)$ converges to a mean-zero Gaussian process. Chen (2002) also shows how this method can be extended to a censored version of (10.9).
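The rank-based estimator reduces, for each $y$, to a one-dimensional search over $\tau$, which can be carried out exactly by sorting the pairwise index differences and taking a running sum. The sketch below is an illustration under an assumed log-transformation design, with the true $\beta$ standing in for $b_n$; the function name `chen_H` and the sample size are arbitrary choices of this sketch.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 600
X = rng.normal(size=(n, 2))
beta = np.array([1.0, -0.5])
Y = np.exp(X @ beta + rng.normal(size=n))   # true H(y) = log y; y0 = 1 gives H(y0) = 0

def chen_H(y, y0, X, Y, b):
    """Rank-based estimate of H(y): the tau maximizing the pairwise sum of
    [I(Y_i >= y) - I(Y_j >= y0)] * I[(X_i - X_j)'b >= tau].

    The objective is a step function of tau, so it is maximized exactly by
    sorting the pairwise index differences and taking a cumulative sum."""
    A = (Y >= y).astype(float)
    B = (Y >= y0).astype(float)
    s = A[:, None] - B[None, :]              # pairwise scores s_ij
    d = (X @ b)[:, None] - (X @ b)[None, :]  # pairwise index differences
    mask = ~np.eye(len(Y), dtype=bool)       # drop the i = j terms
    s, d = s[mask], d[mask]
    order = np.argsort(-d)                   # admit pairs from largest d downward
    best = np.argmax(np.cumsum(s[order]))
    return d[order][best]

# True beta is used in place of an estimate b_n purely for illustration.
h_e = chen_H(np.e, 1.0, X, Y, beta)   # true value H(e) = 1
h_1 = chen_H(1.0, 1.0, X, Y, beta)    # true value H(1) = 0
```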
