
10.2 Semiparametric Models for Conditional Mean Functions

The term semiparametric refers to models in which there is an unknown function in addition to an unknown finite-dimensional parameter. For example, the binary response model $P(Y=1\vert x)=G({\beta }'x)$ is semiparametric if the function $G$ and the vector of coefficients $\beta$ are both treated as unknown quantities. This section describes two semiparametric models of conditional mean functions that are important in applications. The section also describes a related class of models that has no unknown finite-dimensional parameters but, like semiparametric models, mitigates the disadvantages of fully nonparametric models. Finally, this section describes a class of transformation models that is important in estimation of hazard functions, among other applications. Powell (1994) discusses additional semiparametric models.

10.2.1 Single Index Models

In a semiparametric single index model, the conditional mean function has the form

$\displaystyle E(Y\vert x)=G({\beta }'x)\;,$

(10.1)

where $\beta$ is an unknown constant vector and $G$ is an unknown function. The quantity ${\beta }'x$ is called an index. The inferential problem is to estimate $G$ and $\beta$ from observations of $(Y,X)$. $G$ in (10.1) is analogous to a link function in a generalized linear model, except that in (10.1) $G$ is unknown and must be estimated.

Model (10.1) contains many widely used parametric models as special cases. For example, if $G$ is the identity function, then (10.1) is a linear model. If $G$ is the cumulative normal or logistic distribution function, then (10.1) is a binary probit or logit model. When $G$ is unknown, (10.1) provides a specification that is more flexible than a parametric model but retains many of the desirable features of parametric models, as will now be explained.

One important property of single index models is that they avoid the curse of dimensionality. This is because the index ${\beta }'x$ aggregates the dimensions of $x$, thereby achieving dimension reduction. Consequently, the difference between the estimator of $G$ and the true function can be made to converge to zero at the same rate that would be achieved if ${\beta }'x$ were observable. Moreover, $\beta$ can be estimated with the same rate of convergence that is achieved in a parametric model. Thus, in terms of the rates of convergence of estimators, a single index model is as accurate as a parametric model for estimating $\beta$ and as accurate as a one-dimensional nonparametric model for estimating $G$. This dimension reduction feature of single index models gives them a considerable advantage over nonparametric methods in applications where $X$ is multidimensional and the single index structure is plausible.

A single-index model permits limited extrapolation. Specifically, it yields predictions of $E(Y\vert x)$ at values of $x$ that are not in the support of $X$ but are in the support of ${\beta }'X$. Of course, there is a price that must be paid for the ability to extrapolate. A single index model makes assumptions that are stronger than those of a nonparametric model. These assumptions are testable on the support of $X$ but not outside of it. Thus, extrapolation (unavoidably) relies on untestable assumptions about the behavior of $E(Y\vert x)$ beyond the support of $X$.

Before $\beta$ and $G$ can be estimated, restrictions must be imposed that ensure their identification. That is, $\beta$ and $G$ must be uniquely determined by the population distribution of $(Y,X)$. Identification of single index models has been investigated by Ichimura (1993) and, for the special case of binary response models, Manski (1988). It is clear that $\beta$ is not identified if $G$ is a constant function or there is an exact linear relation among the components of $X$ (perfect multicollinearity). In addition, (10.1) is observationally equivalent to the model $E(Y\vert x)=G^{\ast} (\gamma +\delta {\beta }'x)$, where $\gamma$ and $\delta \ne 0$ are arbitrary and $G^{\ast}$ is defined by the relation $G^{\ast} (\gamma +\delta v)=G(v)$ for all $v$ in the support of ${\beta }'X$. Therefore, $\beta$ and $G$ are not identified unless restrictions are imposed that uniquely specify $\gamma$ and $\delta$. The restriction on $\gamma$ is called location normalization and can be imposed by requiring $X$ to contain no constant (intercept) component. The restriction on $\delta$ is called scale normalization. Scale normalization can be achieved by setting the $\beta$ coefficient of one component of $X$ equal to one. A further identification requirement is that $X$ must include at least one continuously distributed component whose $\beta$ coefficient is non-zero. Horowitz (1998) gives an example that illustrates the need for this requirement. Other more technical identification requirements are discussed by Ichimura (1993) and Manski (1988).
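The observational equivalence behind the location and scale normalizations can be checked numerically. The following is a minimal sketch, assuming a logistic $G$ and arbitrary values of $\beta$, $\gamma$, and $\delta$ chosen only for illustration; none of these choices comes from the text.

```python
import math


def G(v):
    # Hypothetical link function, used only for illustration.
    return 1.0 / (1.0 + math.exp(-v))


def make_G_star(gamma, delta):
    # G*(gamma + delta * v) = G(v) is equivalent to G*(w) = G((w - gamma) / delta).
    def G_star(w):
        return G((w - gamma) / delta)
    return G_star


def index(b, x):
    # The single index b'x.
    return sum(bk * xk for bk, xk in zip(b, x))


beta = [1.0, -0.5]          # assumed coefficient vector
gamma, delta = 3.0, -2.0    # arbitrary location shift and non-zero scale
G_star = make_G_star(gamma, delta)

points = [(0.0, 0.0), (1.0, 2.0), (-0.7, 0.3)]
max_gap = max(
    abs(G(index(beta, x)) - G_star(gamma + delta * index(beta, x)))
    for x in points
)
print(max_gap)  # the two models give the same conditional mean at every point
```

Because the two parameterizations produce identical conditional means, data on $(Y,X)$ cannot distinguish them, which is why $\gamma$ and $\delta$ have to be fixed by normalization.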

The main estimation challenge in single index models is estimating $\beta$. Given an estimator $b_n$ of $\beta$, $G$ can be estimated by carrying out the nonparametric regression of $Y$ on $b_n ^{\prime} X$ (e.g., by using kernel estimation). Several estimators of $\beta$ are available. Ichimura (1993) describes a nonlinear least squares estimator. Klein and Spady (1993) describe a semiparametric maximum likelihood estimator for the case in which $Y$ is binary. These estimators are difficult to compute because they require solving complicated nonlinear optimization problems. Powell et al. (1989) describe a density-weighted average derivative estimator (DWADE) that is non-iterative and easily computed. The DWADE applies when all components of $X$ are continuous random variables. It is based on the relation

$\displaystyle \beta \propto E\left[\kern.8pt p(X)\partial G({\beta }'X)/\partial X\right] =-2E\left[Y\partial p(X)/\partial X\right]\;,$

(10.2)

where $p$ is the probability density function of $X$ and the second equality follows from integrating the first by parts. Thus, $\beta$ can be estimated up to scale by estimating the expression on the right-hand side of the second equality. Powell et al. (1989) show that this can be done by replacing $p$ with a nonparametric estimator and replacing the population expectation $E$ with a sample average. Horowitz and Härdle (1996) extend this method to models in which some components of $X$ are discrete. Hristache, Juditsky, and Spokoiny (2001) developed an iterated average derivative estimator that performs well when $X$ is high-dimensional. Ichimura and Lee (1991) and Hristache, Juditsky, Polzehl and Spokoiny (2001) investigate multiple-index generalizations of (10.1).
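The sample analog of the right-hand side of (10.2) can be sketched in a few lines. The code below is an illustration under stated assumptions, not the implementation of Powell et al. (1989): it uses a plain Gaussian product kernel with a leave-one-out density estimate, a hand-picked bandwidth `h`, and simulated data with a hypothetical $G(v)=v+0.5\sin v$ and $\beta=(1,2)'$. The bias-reducing kernels and bandwidth conditions required by the asymptotic theory are omitted.

```python
import math
import random


def dwade(X, Y, h):
    """Return -(2/n) * sum_i Y_i * grad p_hat_{-i}(X_i), using a Gaussian product kernel."""
    n, d = len(X), len(X[0])
    norm = (2.0 * math.pi) ** (-d / 2.0)
    scale = 2.0 / (n * (n - 1) * h ** (d + 1))
    delta = [0.0] * d
    for i in range(n):
        for j in range(n):
            if j == i:
                continue  # leave-one-out density estimate
            u = [(X[i][k] - X[j][k]) / h for k in range(d)]
            kern = norm * math.exp(-0.5 * sum(uk * uk for uk in u))
            # The Gaussian kernel has gradient -u * K(u); that minus sign
            # cancels the leading minus sign in (10.2).
            for k in range(d):
                delta[k] += scale * Y[i] * u[k] * kern
    return delta


# Simulated data (all values below are assumptions made for the example).
rng = random.Random(0)
beta = (1.0, 2.0)
X = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(400)]
Y = []
for x in X:
    v = beta[0] * x[0] + beta[1] * x[1]
    Y.append(v + 0.5 * math.sin(v) + 0.1 * rng.gauss(0.0, 1.0))

delta = dwade(X, Y, h=0.6)
b_n = [dk / delta[0] for dk in delta]  # scale normalization: first coefficient set to 1
print(b_n)
```

Since (10.2) identifies $\beta$ only up to scale, the last step divides by the first component, which mirrors the scale normalization described above. The second component of `b_n` should then be close to 2 in this simulated design.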

The usefulness of single-index models can be illustrated with an example that is taken from Horowitz and Härdle (1996). The example consists of estimating a model of product innovation by German manufacturers of investment goods. The data, assembled in 1989 by the IFO Institute of Munich, consist of observations on 1100 manufacturers. The dependent variable is $Y=1$ if a manufacturer realized an innovation during 1989 in a specific product category and 0 otherwise. The independent variables are the number of employees in the product category (EMPLP), the number of employees in the entire firm (EMPLF), an indicator of the firm's production capacity utilization (CAP), and a variable DEM, which is 1 if a firm expected increasing demand in the product category and 0 otherwise. The first three independent variables are standardized so that they have units of standard deviations from their means. Scale normalization was achieved by setting $\beta _{EMPLP} =1$.

Table 10.1 shows the parameter estimates obtained using a binary probit model and the semiparametric method of Horowitz and Härdle (1996). Figure 10.2 shows a kernel estimate of ${G}'(\nu)$. There are two important differences between the semiparametric and probit estimates. First, the semiparametric estimate of $\beta _{EMPLF}$ is small and statistically nonsignificant, whereas the probit estimate is statistically significant and similar in size to $\beta _{CAP}$. Second, in the binary probit model, $G$ is a cumulative normal distribution function, so ${G}'$ is a normal density function. Figure 10.2 reveals, however, that ${G}'$ is bimodal. This bimodality suggests that the data may be a mixture of two populations. An obvious next step in the analysis of the data would be to search for variables that characterize these populations. Standard diagnostic techniques for binary probit models would provide no indication that ${G}'$ is bimodal. Thus, the semiparametric estimate has revealed an important feature of the data that could not easily be found using standard parametric methods.

**Table 10.1:** Estimated Coefficients (Standard Errors) for Model of Product Innovation

| Model | EMPLP | EMPLF | CAP | DEM |
|---|---|---|---|---|
| Semiparametric | 1 | 0.032 (0.023) | 0.346 (0.078) | 1.732 (0.509) |
| Probit | 1 | 0.516 (0.024) | 0.520 (0.163) | 1.895 (0.387) |

**Figure 10.2:** Estimate of ${G}'(\nu)$ for model of product innovation
$\includegraphics[width=9cm]{text/3-10/abb/32}$

10.2.2 Partially Linear Models

In a partially linear model, $X$ is partitioned into two non-overlapping subvectors, $X_{1}$ and $X_{2}$. The model has the form

$\displaystyle E(Y\vert x_1 ,x_2 )={\beta }'x_1 +G(x_2 )\;,$

(10.3)

where $\beta$ is an unknown constant vector and $G$ is an unknown function. This model is distinct from the class of single index models. A single index model is not partially linear unless $G$ is a linear function. Conversely, a partially linear model is a single index model only in this case. Stock (1989, 1991) and Engle et al. (1986) illustrate the use of (10.3) in applications. Identification of $\beta$ requires the exclusion restriction that none of the components of $X_{1}$ are perfectly predictable by components of $X_{2}$. When $\beta$ is identified, it can be estimated with an $n^{-1/2}$ rate of convergence regardless of the dimensions of $X_{1}$ and $X_{2}$. Thus, the curse of dimensionality is avoided in estimating $\beta$.

An estimator of $\beta$ can be obtained by observing that (10.3) implies

$\displaystyle Y-E(Y\vert x_2)={\beta }'\left[X_1 -E(X_1 \vert x_2 )\right]+U\;,$

(10.4)

where $U$ is an unobserved random variable satisfying $E(U\vert x_1 ,x_2 )=0$. Robinson (1988) shows that under regularity conditions, $\beta$ can be estimated by applying OLS to (10.4) after replacing $E(Y\vert x_2 )$ and $E(X_1 \vert x_2 )$ with nonparametric estimators. The estimator of $\beta$, $b_n$, converges at rate $n^{-1/2}$ and is asymptotically normally distributed. $G$ can be estimated by carrying out the nonparametric regression of $Y-b_n ^{\prime} X_1$ on $X_2$. Unlike $b_n$, the estimator of $G$ suffers from the curse of dimensionality; its rate of convergence decreases as the dimension of $X_2$ increases.
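The two-step logic of (10.4) can be sketched as follows. This is a simplified illustration, assuming scalar $X_1$ and $X_2$, Nadaraya-Watson smoothing with a Gaussian kernel, and a hand-picked bandwidth; the trimming and higher-order kernels used in Robinson's (1988) theory are omitted. The simulated design, with $\beta=2$ and $G(x_2)=\sin(2\pi x_2)$, is hypothetical.

```python
import math
import random


def nw_fit(x_eval, x, y, h):
    """Nadaraya-Watson estimate of E(y | x = x_eval) with a Gaussian kernel."""
    w = [math.exp(-0.5 * ((x_eval - xj) / h) ** 2) for xj in x]
    return sum(wj * yj for wj, yj in zip(w, y)) / sum(w)


def robinson_beta(y, x1, x2, h):
    """OLS of Y - E_hat(Y | X2) on X1 - E_hat(X1 | X2), as suggested by (10.4)."""
    ry = [yi - nw_fit(x2i, x2, y, h) for yi, x2i in zip(y, x2)]
    r1 = [x1i - nw_fit(x2i, x2, x1, h) for x1i, x2i in zip(x1, x2)]
    return sum(a * b for a, b in zip(r1, ry)) / sum(a * a for a in r1)


# Simulated data (assumed design, for illustration only).
rng = random.Random(0)
n = 300
x2 = [rng.uniform(0.0, 1.0) for _ in range(n)]
x1 = [x2i + rng.gauss(0.0, 1.0) for x2i in x2]  # not perfectly predictable from X2
y = [2.0 * a + math.sin(2.0 * math.pi * b) + 0.1 * rng.gauss(0.0, 1.0)
     for a, b in zip(x1, x2)]

b_n = robinson_beta(y, x1, x2, h=0.1)

# Second step: regress Y - b_n * X1 on X2 to estimate G at a point.
resid = [yi - b_n * a for yi, a in zip(y, x1)]
G_hat = nw_fit(0.25, x2, resid, h=0.1)  # the true value in this design is sin(pi/2) = 1
print(b_n, G_hat)
```

The estimate of $G$ is visibly more affected by smoothing bias than `b_n`, in line with the different convergence rates of the two estimators.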

10.2.3 Nonparametric Additive Models

Let $X$ have continuously distributed components that are denoted $X_{1}, \ldots, X_{d}$. In a nonparametric additive model of the conditional mean function,

$\displaystyle E(Y\vert x)=\mu +f_1 (x_1 )+\ldots +f_d (x_d )\;,$

(10.5)

where $\mu$ is a constant and $f_1 ,\ldots,f_d$ are unknown functions that satisfy a location normalization condition such as

$\displaystyle \int f_k (v)w_k (v){\text{d}} v=0\;,\quad k=1,\ldots,d \;,$

(10.6)

where $w_k$ is a non-negative weight function. An additive model is distinct from a single index model unless $E(Y\vert x)$ is a linear function of $x$. Additive and partially linear models are distinct unless $E(Y\vert x)$ is partially linear and $G$ in (10.3) is additive.

An estimator of $f_k \,(k=1,\ldots,d)$ can be obtained by observing that (10.5) and (10.6) imply

$\displaystyle f_k (x_k)=\int E (Y\vert x)w_{-k} (x_{-k} ){\text{d}} x_{-k} \;,$

(10.7)

where $x_{-k}$ is the vector consisting of all components of $x$ except the $k$'th and $w_{-k}$ is a weight function that satisfies $\int {w_{-k} (x_{-k} ){\text{d}} x_{-k} =1}$. The estimator of $f_k$ is obtained by replacing $E(Y\vert x)$ on the right-hand side of (10.7) with nonparametric estimators. Linton and Nielsen (1995) and Linton (1997) present the details of the procedure and extensions of it. Under suitable conditions, the estimator of $f_k$ converges to the true $f_k$ at rate $n^{-2/5}$ regardless of the dimension of $X$. Thus, the additive model provides dimension reduction. It also permits extrapolation of $E(Y\vert x)$ within the rectangle formed by the supports of the individual components of $X$. Mammen, Linton, and Nielsen (1999) describe a backfitting procedure that is likely to be more precise than the estimator based on (10.7) when $d$ is large. See Hastie and Tibshirani (1990) for an early discussion of backfitting.
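The marginal integration idea in (10.7) can be sketched for $d=2$. The code below is an illustration under assumptions that are not in the text: the weight $w_{-1}$ is taken to be the empirical distribution of $X_2$ (so the integral becomes a sample average), $E(Y\vert x)$ is estimated by a Nadaraya-Watson estimator with a product Gaussian kernel, and the data are simulated from a hypothetical design with $f_1(x_1)=x_1^2$. In this sketch the component is recovered only up to an additive constant, so differences of the estimate across points are compared with differences of $f_1$.

```python
import math
import random


def nw2(x1, x2, data, h):
    """Nadaraya-Watson estimate of E(Y | x1, x2) with a product Gaussian kernel."""
    num = den = 0.0
    for a, b, y in data:
        w = math.exp(-0.5 * (((x1 - a) / h) ** 2 + ((x2 - b) / h) ** 2))
        num += w * y
        den += w
    return num / den


def marginal_integration_f1(x1, data, h):
    """Average E_hat(Y | x1, X2_j) over the sample values of X2."""
    return sum(nw2(x1, b, data, h) for _, b, _ in data) / len(data)


# Simulated additive model (assumed design, for illustration only).
rng = random.Random(0)
data = []
for _ in range(400):
    a, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    y = 1.0 + a * a + 0.5 * math.sin(math.pi * b) + 0.1 * rng.gauss(0.0, 1.0)
    data.append((a, b, y))

h = 0.2
diff = marginal_integration_f1(0.5, data, h) - marginal_integration_f1(0.0, data, h)
print(diff)  # compare with f_1(0.5) - f_1(0) = 0.25
```

Averaging over the second argument removes the influence of $f_2$, which is what allows each component to be estimated at a one-dimensional rate.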

Linton and Härdle (1996) describe a generalized additive model whose form is

$\displaystyle E(Y\vert x)=G\left[\mu +f_1 (x_1 )+\ldots+f_d (x_d )\right] \;,$

(10.8)

where $f_1 ,\ldots,f_d$ are unknown functions and $G$ is a known, strictly increasing (or decreasing) function. Horowitz (2001) describes a version of (10.8) in which $G$ is unknown. Both forms of (10.8) achieve dimension reduction. When $G$ is unknown, (10.8) nests additive and single index models and, under certain conditions, partially linear models.

**Figure 10.3:** Components of nonparametric, additive wage equation
$\includegraphics[width=9.3cm]{text/3-10/abb/33}$

The use of the nonparametric additive specification (10.5) can be illustrated by estimating the model $E(\log W\vert \text{\textit{EXP}},\text{\textit{EDUC}})=\mu +f_{{EXP}} ({EXP})+f_{\text{\textit{EDUC}}} (\text{\textit{EDUC}})$, where $W$ and EXP are defined as in Sect. 10.1, and EDUC denotes years of education. The data are taken from the 1993 CPS and are for white males with or fewer years of education who work full time and live in urban areas of the North Central U.S. The results are shown in Fig. 10.3. The unknown functions $f_{{EXP}}$ and $f_{\text{\textit{EDUC}}}$ are estimated by the method of Linton and Nielsen (1995) and are normalized so that $f_{{EXP}} (2)=f_{\text{\textit{EDUC}}} (5)=0$. The estimates of $f_{{EXP}}$ (Fig. 10.3a) and $f_{\text{\textit{EDUC}}}$ (Fig. 10.3b) are nonlinear and differently shaped. Functions $f_{{EXP}}$ and $f_{\text{\textit{EDUC}}}$ with different shapes cannot be produced by a single index model, and a lengthy specification search might be needed to find a parametric model that produces the shapes shown in Fig. 10.3. Some of the fluctuations of the estimates of $f_{{EXP}}$ and $f_{\text{\textit{EDUC}}}$ may be artifacts of random sampling error rather than features of $E(\log W\vert \text{\textit{EXP}},\text{\textit{EDUC}})$. However, a more elaborate analysis that takes account of the effects of random sampling error rejects the hypothesis that either function is linear.

10.2.4 Transformation Models

A transformation model has the form

$\displaystyle H(Y)={\beta }'X+U\;,$

(10.9)

where $H$ is an unknown increasing function, $\beta$ is an unknown finite-dimensional vector of constants, and $U$ is an unobserved random variable. It is assumed here that $U$ is statistically independent of $X$. The aim is to estimate $H$ and $\beta$. One possibility is to assume that $H$ is known up to a finite-dimensional parameter. For example, $H$ could be the Box-Cox transformation

$\displaystyle H(y)= \begin{cases}(y^{\tau} -1)/\tau & \text{if}\;\;\tau >0\\ \log y & \text{if}\;\;\tau =0 \\ \end{cases}$

where $\tau$ is an unknown parameter. Methods for estimating transformation models in which $H$ is parametric have been developed by Amemiya and Powell (1981) and Foster et al. (2001), among others.
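As a small illustration of a transformation that is known up to a finite-dimensional parameter, the Box-Cox function and its inverse can be written directly. The function names below are hypothetical, and the sketch assumes $y>0$ and $\tau \ge 0$ as in the display above.

```python
import math


def box_cox(y, tau):
    """Box-Cox transformation H(y) for y > 0 and tau >= 0."""
    if y <= 0 or tau < 0:
        raise ValueError("requires y > 0 and tau >= 0")
    if tau == 0:
        return math.log(y)
    return (y ** tau - 1.0) / tau


def inverse_box_cox(h, tau):
    """Inverse of box_cox in y; requires tau * h + 1 > 0 when tau > 0."""
    if tau == 0:
        return math.exp(h)
    return (tau * h + 1.0) ** (1.0 / tau)


print(box_cox(2.0, 1.0))                         # tau = 1 gives y - 1
print(box_cox(2.0, 1e-8), math.log(2.0))         # the tau > 0 branch approaches log y
print(inverse_box_cox(box_cox(3.0, 0.5), 0.5))   # round trip recovers y
```

The second line of output shows why the $\tau=0$ case is defined as $\log y$: it is the limit of the $\tau>0$ branch, so $H$ varies continuously with the parameter.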

Another possibility is to assume that $H$ is unknown but that the distribution of $U$ is known. Cheng, Wei, and Ying (1995, 1997) have developed estimators for this version of (10.9). Consider, first, the problem of estimating $\beta$. Let $F$ denote the (known) cumulative distribution function (CDF) of $U$. Let $(Y_i ,X_i )$ and $(Y_j ,X_j )$ $(i\ne j)$ be two distinct, independent observations of $(Y,X)$. Then it follows from (10.9) that

$\displaystyle E\left[I(Y_i >Y_j )\vert X_i =x_i ,X_j =x_j \right]= P\left[U_i -U_j >-{\beta }'(x_i -x_j )\right]\;.$

(10.10)

Let $G(z)=P(U_i -U_j >z)$ for any real $z$. Then

$\displaystyle G(z)=\int\limits_{-\infty }^\infty \left[1-F(u+z)\right]{\text{d}} F(u) \;.$

$G$ is a known function because $F$ is assumed known. Substituting $G$ into (10.10) gives

$\displaystyle E\left[I(Y_i >Y_j )\vert X_i =x_i ,X_j =x_j \right]=G\left[-{\beta }'(x_i -x_j )\right]\;.$

Define $X_{ij} =X_i -X_j$ . Then it follows that $\beta$ satisfies the moment condition

$\displaystyle E\left\{w\left({\beta }'X_{ij} \right)X_{ij} \left[I\left(Y_i >Y_j \right)-G\left(-{\beta }'X_{ij} \right)\right]\right\}=0$

(10.11)

where $w$ is a weight function. Cheng, Wei, and Ying (1995) propose estimating $\beta$ by replacing the population moment condition (10.11) with the sample analog

$\displaystyle \sum\limits_{i=1}^n \sum\limits_{j=1}^n \left\{w\left({b}'X_{ij} \right)X_{ij} \left[I\left(Y_i >Y_j \right)-G\left(-{b}'X_{ij} \right)\right]\right\} =0\;.$

(10.12)

The estimator of $\beta$, $b_n$, is the solution to (10.12). Equation (10.12) has a unique solution if $w(z)=1$ for all $z$ and the matrix $\sum\nolimits_i \sum\nolimits_j {X}'_{ij} X_{ij}$ is positive definite. It also has a unique solution asymptotically if $w$ is positive everywhere (Cheng, Wei, and Ying 1995). Moreover, $b_n$ converges almost surely to $\beta$. Cheng, Wei, and Ying (1995) also give conditions under which $n^{1/2}(b_n -\beta )$ is asymptotically normally distributed with a mean of 0.
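Solving (10.12) is straightforward when $X$ is scalar and $w=1$, because the left-hand side is then monotone decreasing in $b$. The following sketch makes several assumptions for illustration: $U$ is standard normal, so that $U_i-U_j \sim N(0,2)$ and $G(z)=\tfrac{1}{2}\,\mathrm{erfc}(z/2)$; the data are simulated with $H=\log$ and $\beta=1$; and the root is found by bisection on a hand-picked bracket. The estimator never uses $H$, only the ordering of the $Y_i$.

```python
import math
import random


def G(z):
    """G(z) = P(U_i - U_j > z) when U is standard normal, so U_i - U_j ~ N(0, 2)."""
    return 0.5 * math.erfc(z / 2.0)


def score(b, pairs):
    """Left-hand side of (10.12) with w = 1 and scalar X, summed over i < j."""
    return sum(xij * (ind - G(-b * xij)) for xij, ind in pairs)


def cwy_beta(x, y, lo=-10.0, hi=10.0, iters=30):
    n = len(x)
    # Pairs (i, j) and (j, i) contribute identical terms when Y has no ties,
    # so summing over i < j leaves the root of (10.12) unchanged.
    pairs = [(x[i] - x[j], 1.0 if y[i] > y[j] else 0.0)
             for i in range(n) for j in range(i + 1, n)]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid, pairs) > 0:  # the score is decreasing in b
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Simulated data: H(Y) = log Y = beta * X + U with beta = 1 (assumed design).
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(300)]
y = [math.exp(1.0 * xi + rng.gauss(0.0, 1.0)) for xi in x]

b_n = cwy_beta(x, y)
print(b_n)
```

For vector-valued $X$ the same equation would be solved with a multivariate root finder; bisection is used here only because the scalar case permits it.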

The problem of estimating the transformation function is addressed by Cheng, Wei, and Ying (1997). Equation (10.9) implies that for any real $y$ and vector $x$ that is conformable with $X$, $E[I(Y\le y)\vert X=x]-F[H(y)-{\beta }'x]=0$. Cheng, Wei, and Ying (1997) propose estimating $H(y)$ by the solution to the sample analog of this equation. That is, the estimator $H_n (y)$ solves

$\displaystyle n^{-1}\sum\limits_{i=1}^n \left\{ I\left(Y_i \le y\right)-F\left[H_n (y)-{b}'_n X_i \right]\right\} =0\;,$

where $b_n$ is the solution to (10.12). Cheng, Wei, and Ying (1997) show that if $F$ is strictly increasing on its support, then $H_n (y)$ converges to $H(y)$ almost surely, uniformly over suitably restricted intervals of $y$. Moreover, $n^{1/2}(H_n -H)$ converges to a mean-zero Gaussian process over such an interval.
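Because $F$ is increasing, the estimating equation for $H_n(y)$ is monotone in $H_n(y)$ and can be solved pointwise by bisection. The sketch below assumes scalar $X$, standard normal $U$, and a coefficient `b` supplied by the user (here set to the true value 1 as a stand-in for $b_n$); the simulated design with $H=\log$ is hypothetical.

```python
import math
import random


def F(u):
    """Standard normal CDF, the assumed-known distribution of U."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def estimate_H(y, x, y_obs, b, lo=-20.0, hi=20.0, iters=60):
    """Solve n^{-1} sum_i { I(Y_i <= y) - F[h - b * X_i] } = 0 for h by bisection."""
    n = len(x)
    target = sum(1.0 for yi in y_obs if yi <= y) / n
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        fitted = sum(F(mid - b * xi) for xi in x) / n
        if fitted < target:  # the fitted average is increasing in h
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Simulated data: log Y = X + U (assumed design), so H(1) = 0 and H(e) = 1.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(500)]
y_obs = [math.exp(xi + rng.gauss(0.0, 1.0)) for xi in x]

H_at_1 = estimate_H(1.0, x, y_obs, b=1.0)
H_at_e = estimate_H(math.e, x, y_obs, b=1.0)
print(H_at_1, H_at_e)
```

If $y$ lies below or above all observed $Y_i$, the empirical frequency is 0 or 1 and the bisection runs to the edge of the bracket, which reflects the fact that $H$ can only be estimated where the data carry information.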

A third possibility is to assume that $H$ and $F$ are both nonparametric in (10.9). In this case, certain normalizations are needed to make identification of (10.9) possible. First, observe that (10.9) continues to hold if $H$ is replaced by $cH$, $\beta$ is replaced by $c\beta$, and $U$ is replaced by $cU$ for any positive constant $c$. Therefore, a scale normalization is needed to make identification possible. This will be done here by setting $\vert \beta _1 \vert =1$, where $\beta_1$ is the first component of $\beta$. Observe, also, that when $H$ and $F$ are nonparametric, (10.9) is a semiparametric single-index model. Therefore, identification of $\beta$ requires $X$ to have at least one component whose distribution conditional on the others is continuous and whose $\beta$ coefficient is non-zero. Assume without loss of generality that the components of $X$ are ordered so that the first satisfies this condition.

It can also be seen that (10.9) is unchanged if $H$ is replaced by $H+d$ and $U$ is replaced by $U+d$ for any positive or negative constant $d$. Therefore, a location normalization is also needed to achieve identification when $H$ and $F$ are nonparametric. Location normalization will be carried out here by assuming that $H(y_0 )=0$ for some finite $y_0$. With this location normalization, there is no centering assumption on $U$ and no intercept term in $X$.

Now consider the problem of estimating $H$, $\beta$, and $F$. Because (10.9) is a single-index model in this case, $\beta$ can be estimated using the methods described in Sect. 10.2.1. Let $b_n$ denote the estimator of $\beta$. One approach to estimating $H$ and $F$ is given by Horowitz (1996). To describe this approach, define $Z={\beta }'X$. Let $G(\cdot \vert z)$ denote the CDF of $Y$ conditional on $Z=z$. Set $G_y (y\vert z)=\partial G(y\vert z)/\partial y$ and $G_z (y\vert z)=\partial G(y\vert z)/\partial z$. Then it follows from (10.9) that ${H}'(y)=-G_y (y\vert z)/G_z (y\vert z)$ and that

$\displaystyle H(y)=-\int\limits_{y_0 }^y \left[G_y (v\vert z)/G_z (v\vert z)\right]{\text{d}} v$

(10.13)

for any $z$ such that the denominator of the integrand is non-zero. Now let $w(\cdot )$ be a scalar-valued, non-negative weight function with compact support $S_w$ such that $G_z (v\vert z)$ is bounded away from 0 for all $v\in [y_0 ,y]$ and $z\in S_w$. Also assume that

$\displaystyle \int\limits_{S_w } w(z){\text{d}} z =1\;.$

Then

$\displaystyle H(y)=-\int\limits_{y_0 }^y \int\limits_{S_w } w(z)\left[G_y (v\vert z)/G_z (v\vert z)\right]{\text{d}} z\,{\text{d}} v\;.$

(10.14)

Horowitz (1996) obtains an estimator of $H$ from (10.14) by replacing $G_y$ and $G_z$ by kernel estimators. Specifically, $G_y$ is replaced by a kernel estimator of the probability density function of $Y$ conditional on ${b}'_n X=z$, and $G_z$ is replaced by a kernel estimator of the derivative with respect to $z$ of the CDF of $Y$ conditional on ${b}'_n X=z$. Denote these estimators by $G_{ny}$ and $G_{nz}$. Then the estimator of $H$ is

$\displaystyle H_n (y)=-\int\limits_{y_0 }^y \int\limits_{S_w } w(z)\left[G_{ny} (v\vert z)/G_{nz} (v\vert z)\right]{\text{d}} z\,{\text{d}} v\;.$

(10.15)

Horowitz (1996) gives conditions under which $H_n$ is uniformly consistent for $H$ and $n^{1/2}(H_n -H)$ converges weakly to a mean-zero Gaussian process. Horowitz (1996) also shows how to estimate $F$, the CDF of $U$, and gives conditions under which $n^{1/2}(F_n -F)$ converges to a mean-zero Gaussian process, where $F_n$ is the estimator. Gørgens and Horowitz (1999) extend these results to a censored version of (10.9). Integration over $z$ in (10.14) and (10.15) accelerates the convergence of $H_n$. Kernel estimators converge in probability at rates slower than $n^{-1/2}$. Therefore, $G_{ny} (v\vert z)/G_{nz} (v\vert z)$ is not $n^{-1/2}$-consistent for $G_y (v\vert z)/G_z (v\vert z)$. However, integration over $z$ and $v$ in (10.15) creates an averaging effect that causes the integral and, therefore, $H_n$ to converge at the rate $n^{-1/2}$. This is the reason for basing the estimator on (10.14) instead of (10.13).
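The identity (10.14) can be verified numerically in a case where $G$ is known in closed form. The sketch below is a check of the identity, not an implementation of the kernel estimator (10.15). It assumes $H=\log$ and standard normal $U$, so that $G(y\vert z)=\Phi(\log y-z)$ and $y_0=1$ satisfies $H(y_0)=0$; the partial derivatives are approximated by central differences, and $w$ is taken to be uniform on $S_w=[-1,1]$.

```python
import math


def Phi(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def G(y, z):
    """CDF of Y given Z = z when H = log and U ~ N(0, 1) (assumed for the check)."""
    return Phi(math.log(y) - z)


def ratio(v, z, eps=1e-5):
    """Central-difference approximation to G_y(v | z) / G_z(v | z)."""
    G_y = (G(v + eps, z) - G(v - eps, z)) / (2.0 * eps)
    G_z = (G(v, z + eps) - G(v, z - eps)) / (2.0 * eps)
    return G_y / G_z


def H_from_identity(y, y0=1.0, z_lo=-1.0, z_hi=1.0, n_v=200, n_z=20):
    """Evaluate the right-hand side of (10.14) with w uniform on [z_lo, z_hi]."""
    dz = (z_hi - z_lo) / n_z
    z_mid = [z_lo + (k + 0.5) * dz for k in range(n_z)]
    w = 1.0 / (z_hi - z_lo)  # integrates to one over S_w

    def inner(v):
        # Midpoint rule for the integral over z.
        return sum(w * ratio(v, z) * dz for z in z_mid)

    # Trapezoid rule for the integral over v from y0 to y.
    dv = (y - y0) / n_v
    total = 0.5 * (inner(y0) + inner(y))
    total += sum(inner(y0 + k * dv) for k in range(1, n_v))
    return -total * dv


H_e = H_from_identity(math.e)
H_half = H_from_identity(0.5)
print(H_e, H_half)  # compare with log(e) = 1 and log(0.5)
```

In this example the ratio $G_y/G_z$ does not depend on $z$, so averaging over $z$ changes nothing in the population. Its benefit appears only once $G_y$ and $G_z$ are replaced by noisy kernel estimates, which is the point made in the paragraph above.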

Other estimators of $H$ when $H$ and $F$ are both nonparametric have been proposed by Ye and Duan (1997) and Chen (2002). Chen uses a rank-based approach that is in some ways simpler than that of Horowitz (1996) and may have better finite-sample performance. To describe this approach, define $d_{iy} =I(Y_i >y)$ and $d_{jy_0 } =I(Y_j >y_0 )$. Let $i\ne j$. Then $E(d_{iy} -d_{jy_0 } \vert X_i ,X_j )\ge 0$ whenever $Z_i -Z_j \ge H(y)$. This suggests that if $\beta$ were known, then $H(y)$ could be estimated by

$\displaystyle H_n (y)=\arg \max\limits_{\tau} \frac{1}{n(n-1)} \sum\limits_{i=1}^n \sum\limits_{\genfrac{}{}{0pt}{1}{j=1}{j\ne i}} ^n (d_{iy} -d_{jy_0 } )I(Z_i -Z_j \ge \tau )\; .$

Since $\beta$ is unknown, Chen (2002) proposes

$\displaystyle H_n (y)=\arg \max\limits_{\tau} \frac{1}{n(n-1)} \sum\limits_{i=1}^n \sum\limits_{\genfrac{}{}{0pt}{1}{j=1}{j\ne i}} ^n (d_{iy} -d_{jy_0 } )I({b}'_n X_i -{b}'_n X_j \ge \tau )\; .$

Chen (2002) gives conditions under which $H_n$ is uniformly consistent for $H$ and $n^{1/2}(H_n -H)$ converges to a mean-zero Gaussian process. Chen (2002) also shows how this method can be extended to a censored version of (10.9).
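The rank-based objective is a step function of $\tau$, so it can be maximized exactly by sorting the pairwise index differences and taking a running sum. The sketch below is one possible way to compute the maximizer, not Chen's (2002) own algorithm. It assumes a scalar index supplied by the user (here the true index, standing in for ${b}'_n X$) and simulated data with $H=\log$, $y_0=1$, and normal errors with standard deviation 0.5.

```python
import math
import random


def chen_H(y, y0, z, y_obs):
    """Maximize sum_{i != j} (d_iy - d_jy0) * I(z_i - z_j >= tau) over tau."""
    n = len(z)
    d_y = [1 if yi > y else 0 for yi in y_obs]
    d_y0 = [1 if yi > y0 else 0 for yi in y_obs]
    pairs = [(z[i] - z[j], d_y[i] - d_y0[j])
             for i in range(n) for j in range(n) if i != j]
    pairs.sort(key=lambda p: p[0], reverse=True)
    best_val, best_tau, running = float("-inf"), None, 0
    for k, (diff, c) in enumerate(pairs):
        running += c
        # The running sum equals the objective at tau = diff only after all
        # pairs tied at this difference have been added.
        last_in_group = (k + 1 == len(pairs)) or (pairs[k + 1][0] != diff)
        if last_in_group and running > best_val:
            best_val, best_tau = running, diff
    return best_tau


# Simulated data: log Y = Z + U, so H(1) = 0 and H(e) = 1 (assumed design).
rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(400)]
y_obs = [math.exp(zi + rng.gauss(0.0, 0.5)) for zi in z]

H_n = chen_H(math.e, 1.0, z, y_obs)
print(H_n)  # compare with H(e) = 1
```

The factor $1/[n(n-1)]$ is omitted because it does not affect the maximizer. The objective is constant between adjacent sorted differences, so the maximizer is an interval; the sketch reports its right endpoint.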
