# 6.2 Estimation

When estimating a SIM, we have to take into account that the functional form of the link function $g$ is unknown. Moreover, since the shape of $g$ determines the value of the regression function $E(Y \mid X)$ for a given index $v(X,\beta)$, the estimate of the index weights $\beta$ has to adjust to a specific estimate of the link function to yield a "correct" regression value. Thus, in SIM both the index and the link function have to be estimated, even though only the link function is of nonparametric character.

Recall that $X$ is the vector of explanatory variables, $\beta$ is a $d$-dimensional vector of unknown coefficients (or weights), and $g$ is an arbitrary smooth link function. Let
$$\varepsilon = Y - E(Y \mid X)$$
be the deviation of $Y$ from its conditional expectation w.r.t. $X$. Using $\varepsilon$ we can write down the single index model as
$$Y = g\{v(X,\beta)\} + \varepsilon. \qquad (6.4)$$

Our goal is to find efficient estimators for $\beta$ and $g$. As $\beta$ is inside the nonparametric link, the challenge is to find an appropriate estimator for $\beta$, in particular one that achieves the $\sqrt{n}$-rate of convergence. (Recall that the $\sqrt{n}$-rate is typically achieved by parametric estimators.)

Two essentially different approaches exist for this purpose:

• an iterative approximation of $\beta$ by semiparametric least squares (SLS) or pseudo maximum likelihood estimation (PMLE);
• a direct (non-iterative) estimation of $\beta$ through the average derivative of the regression function (ADE, WADE).
In both cases the estimation procedure can be summarized as: first estimate the index weights $\beta$, then estimate the link function $g$ by a univariate nonparametric regression of $Y$ on the fitted index values.

The final step is relevant for all SIM estimators considered in the following sections. More precisely: suppose that $\beta$ has already been estimated by $\widehat\beta$. Set
$$\widehat{Z}_i = v(X_i, \widehat\beta), \qquad i = 1, \ldots, n. \qquad (6.5)$$

Once we have created these "observations" $\widehat Z_i$ of the new explanatory variable, we have a standard univariate regression problem. For example, a possible and convenient choice for estimating $g$ would be the Nadaraya-Watson estimator introduced in Chapter 4.1:
$$\widehat g_h(z) = \frac{\sum_{i=1}^{n} K_h(z - \widehat Z_i)\, Y_i}{\sum_{i=1}^{n} K_h(z - \widehat Z_i)}. \qquad (6.6)$$

It can be shown that if $\widehat\beta$ is $\sqrt{n}$-consistent, then $\widehat g_h$ converges for a bandwidth $h \sim n^{-1/5}$ at the rate $n^{-2/5}$, i.e. like a univariate kernel regression estimator, with bias of order $h^2$ and variance of order $(nh)^{-1}$; see e.g. Powell et al. (1989).

As in the first part of this book, we concentrate on kernel-based methods here. The choice of the kernel $K$ and the bandwidth $h$ in (6.6) is independent of the choice of these parameters in the following sections. For SIM estimators that are not kernel based we refer to the bibliographic notes.
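To make the two-step idea concrete, here is a minimal sketch of the final smoothing step (6.5)-(6.6) in Python, assuming a linear index and a quartic kernel; the function name and the toy data are our own illustration, not from the text:

```python
import numpy as np

def nw_on_index(beta_hat, X, Y, h, grid):
    """Nadaraya-Watson estimate of the link g on `grid`, using the
    fitted index values Z_i = X_i' beta_hat, cf. (6.5)-(6.6)."""
    Z = X @ beta_hat                               # step (6.5): index values
    def K(u):                                      # quartic (biweight) kernel
        return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)
    W = K((grid[:, None] - Z[None, :]) / h)        # kernel weights
    return (W @ Y) / np.maximum(W.sum(axis=1), 1e-12)  # step (6.6)

# toy data: logistic link and a known beta (for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
beta = np.array([1.0, -0.5])
Y = 1 / (1 + np.exp(-(X @ beta))) + 0.1 * rng.normal(size=500)
grid = np.linspace(-2, 2, 5)
g_hat = nw_on_index(beta, X, Y, h=0.3, grid=grid)
```

With a $\sqrt{n}$-consistent $\widehat\beta$ plugged in for `beta`, the fitted curve behaves like a univariate kernel regression estimate, as stated above.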

## 6.2.1 Semiparametric Least Squares

As indicated in the introduction, we now concentrate on the estimation of $\beta$. The methods considered here under the headings semiparametric least squares (SLS) and pseudo maximum likelihood estimation (PMLE) have the following idea in common: establish an appropriate objective function that estimates $\beta$ with the parametric $\sqrt{n}$-rate. Inside the objective function we certainly use the conditional distribution of $Y$, or the link function $g$, or both of them. As these are unknown, they need to be replaced by nonparametric estimates. The objective function is then maximized (or minimized) with respect to the parameter $\beta$.

Why is this extension of least squares or maximum likelihood not a trivial one? The reason is that when $\beta$ changes, the link $g$ (respectively its nonparametric substitute) may change simultaneously. Thus, it is not clear whether the necessary iteration will converge, and even if it does, whether it converges to a unique optimum.

SLS and its weighted version (WSLS) have been introduced by Ichimura (1993). As SLS is just the special case of WSLS with all weights equal to one, we concentrate here on WSLS. An objective function of least squares type can be motivated by minimizing the variation in the data that cannot be explained by the fitted regression. This "left over" variation can be written as
$$E\left[\{Y - E(Y \mid v(X,\beta))\}^2\right],$$
with $v(\cdot,\beta)$ being the index function specified up to $\beta$. The previous expression leads us to a variant of the well-known LS criterion
$$\min_\beta\; E\left[\{Y - E(Y \mid v(X,\beta))\}^2\right], \qquad (6.7)$$
in which the inner conditional expectation $E\{Y \mid v(X,\beta)\}$ on the right hand side has to be replaced by a (consistent) estimate. The outer expectation in (6.7) can be replaced by an average.

We can account for possible heteroscedasticity by using proper weights. This motivates
$$\min_\beta\; E\left[w(X)\,\{Y - E(Y \mid v(X,\beta))\}^2\right] \qquad (6.8)$$
with $w(\cdot)$ an appropriate weight function. Next, we employ a nonparametric technique to substitute the (inner) unknown conditional expectation. As the index function is univariate, we could take any consistent univariate smoother. For simplicity of the presentation, we will use a Nadaraya-Watson type estimator here.

Define the WSLS estimator as
$$\widehat\beta = \arg\min_\beta\; \frac{1}{n}\sum_{i=1}^{n} w(X_i)\,\{Y_i - \widehat g_{-i}(v(X_i,\beta))\}^2\, I\{X_i \in \mathcal{X}\}, \qquad (6.9)$$
where $I\{X_i \in \mathcal{X}\}$ is a trimming factor and $\widehat g_{-i}$ a leave-one-out estimator of the link function, assuming the parameter $\beta$ were known. In more detail, $\widehat g_{-i}$ is a (weighted) Nadaraya-Watson estimator
$$\widehat g_{-i}(v) = \frac{\sum_{j \neq i} w(X_j)\, Y_j\, K_h\{v - v(X_j,\beta)\}\, I\{X_j \in \mathcal{X}_n\}}{\sum_{j \neq i} w(X_j)\, K_h\{v - v(X_j,\beta)\}\, I\{X_j \in \mathcal{X}_n\}} \qquad (6.10)$$
with $h$ denoting a bandwidth and $K_h$ a scaled (compact support) kernel. The trimming factor has been introduced to guarantee that the density of the index is bounded away from zero; the set $\mathcal{X}$ has to be chosen accordingly. The set $\mathcal{X}_n$ in the trimming factor of (6.10) is constructed in such a way that all boundary points of $\mathcal{X}$ are interior to $\mathcal{X}_n$:
$$\mathcal{X}_n = \{x : \|x - x'\| \le 2h \text{ for some } x' \in \mathcal{X}\}.$$

In practice, trimming can often be skipped, but it is helpful for establishing the asymptotic theory. The choice of the leave-one-out version of the Nadaraya-Watson estimator in (6.10) is made for the same reason. For the estimator given in (6.9), the following asymptotic results can be proved; for the details see Ichimura (1993).
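For intuition, the WSLS criterion (6.9) with the leave-one-out smoother (6.10) can be coded directly. The sketch below uses uniform weights, a quartic kernel, a crude stand-in for the trimming set, and a grid search over unit-norm directions; all of these are our own simplifications rather than Ichimura's implementation:

```python
import numpy as np

def quartic(u):
    """Quartic kernel with compact support [-1, 1]."""
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

def wsls_criterion(beta, X, Y, h):
    """Leave-one-out criterion (6.9)-(6.10) for a linear index,
    with uniform weights; `keep` crudely mimics the trimming factor."""
    Z = X @ beta
    K = quartic((Z[:, None] - Z[None, :]) / h)
    np.fill_diagonal(K, 0.0)                      # leave-one-out
    denom = K.sum(axis=1)
    g_loo = (K @ Y) / np.maximum(denom, 1e-12)
    keep = denom > 1e-6                           # drop isolated index values
    return np.mean((Y[keep] - g_loo[keep]) ** 2)

# toy data and a grid search over unit-norm directions
# (beta is identified only up to scale, so we fix ||beta|| = 1)
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
beta0 = np.array([0.8, 0.6])                      # true direction
Y = np.sin(X @ beta0) + 0.1 * rng.normal(size=400)
angles = np.linspace(0, np.pi, 181)
crits = [wsls_criterion(np.array([np.cos(a), np.sin(a)]), X, Y, h=0.4)
         for a in angles]
a_best = angles[int(np.argmin(crits))]
beta_hat = np.array([np.cos(a_best), np.sin(a_best)])
```

In practice one would replace the grid search by a numerical optimizer and add the weights $w(X_i)$ and a proper trimming set as in (6.9).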

THEOREM 6.1
Assume that $Y$ has a finite $k$th absolute moment ($k \ge 3$) and follows model (6.4). Suppose further that the bandwidth $h$ converges to 0 at a certain rate. Then, under regularity conditions on the link $g$, the error term $\varepsilon$, the regressors $X$ and the index function $v$, the WSLS estimator (6.9) fulfills
$$\sqrt{n}\,(\widehat\beta - \beta) \xrightarrow{L} N\left(0,\; V^{-1}\Sigma V^{-1}\right),$$
where, using the notation
$$\varrho(X) = \nabla_\beta v(X,\beta) - \frac{E\{w(X)\,\nabla_\beta v(X,\beta) \mid v(X,\beta)\}}{E\{w(X) \mid v(X,\beta)\}}$$
and $\sigma^2(X) = \operatorname{Var}(Y \mid X)$, the matrices $V$ and $\Sigma$ are defined by
$$V = E\left[w(X)\,\{g'(v(X,\beta))\}^2\,\varrho(X)\,\varrho(X)^\top\right], \qquad
\Sigma = E\left[w(X)^2\,\sigma^2(X)\,\{g'(v(X,\beta))\}^2\,\varrho(X)\,\varrho(X)^\top\right].$$
As we can see, the estimator is asymptotically unbiased and converges at the parametric $\sqrt{n}$-rate. Additionally, by choosing the weight
$$w(X) = \frac{1}{\sigma^2(X)} = \frac{1}{\operatorname{Var}(Y \mid X)},$$
the WSLS reaches the efficiency bound for the parameter estimates in SIM (Newey, 1990). This means that $\widehat\beta$ is an asymptotically unbiased parameter estimator with optimal rate and asymptotically efficient covariance matrix. Unfortunately, $\sigma^2(\cdot)$ is an unknown function as well. In practice, one therefore uses an appropriate pre-estimate of the variance function.

How do we estimate the (asymptotic) variance of the estimator? Not surprisingly, the expressions $V$ and $\Sigma$ are estimated consistently by their empirical analogs. Denote by $\widehat\varrho(X_i)$ the empirical counterpart of $\varrho(X_i)$, in which $\beta$ is replaced by $\widehat\beta$ and the conditional expectations by leave-one-out Nadaraya-Watson estimates. Ichimura (1993) proves that
$$\widehat V = \frac{1}{n}\sum_{i=1}^{n} w(X_i)\,\{\widehat g'_{-i}(\widehat Z_i)\}^2\,\widehat\varrho(X_i)\,\widehat\varrho(X_i)^\top, \qquad
\widehat\Sigma = \frac{1}{n}\sum_{i=1}^{n} w(X_i)^2\,\{Y_i - \widehat g_{-i}(\widehat Z_i)\}^2\,\{\widehat g'_{-i}(\widehat Z_i)\}^2\,\widehat\varrho(X_i)\,\widehat\varrho(X_i)^\top$$
are consistent estimators for $V$ and $\Sigma$ under appropriate bandwidth conditions.

## 6.2.2 Pseudo Likelihood Estimation

For the motivation of pseudo maximum likelihood estimation (PMLE) we rely on the ideas developed in the previous section. Let us first discuss why pseudo maximum likelihood leads to asymptotically unbiased $\sqrt{n}$-consistent estimators with minimum achievable variance: in fact, the computation of the PMLE reduces to a formal parametric MLE problem with as many parameters as observations. In this case (as we have also seen above), the inverse Fisher information turns out to be a consistent estimator for the covariance matrix of the PMLE. Gill (1989) and Gill & van der Vaart (1993) explain this as follows: a sensibly defined nonparametric MLE can be seen as an MLE in any parametric submodel which happens to include or pass through the point given by the PMLE. For smooth parametric submodels, the MLE solves the likelihood equations. Consequently, also in nonparametric problems the PMLE can be interpreted as the solution of the likelihood equations for every parametric submodel passing through it.

You may have realized that these properties coincide with those obtained for the WSLS above. Indeed, looking closer at the objective function in (6.7), we could re-interpret it as the result of a maximum likelihood consideration. We only have to set the weight $w(X)$ equal to the inverse of the (known) variance function (compare the discussion of optimal weighting above). We refer to the bibliographic notes for more details.

To finally introduce the PMLE, let us come back to the binary response model, i.e., we observe $Y$ only in $\{0, 1\}$. Recall that this means
$$E(Y \mid X) = P(Y = 1 \mid X) = g\{v(X,\beta)\}, \qquad (6.11)$$
where the index function $v(\cdot,\beta)$ is known up to the parameter $\beta$. We further assume that the distribution of the latent error depends on $X$ only through the index $v(X,\beta)$.

This is indeed an additional restriction, but it still allows for multiplicative heteroscedasticity as discussed around equation (6.3). Unfortunately, under heteroscedasticity, i.e. when the variance function of the error term depends on the index to be estimated, a change of $\beta$ changes the variance function and thus implicitly affects equation (6.11) through $g$. This has consequences for the PMLE, as the likelihood function is determined by (6.11) and the error distribution given the index. For this reason we recall the notation
$$\varepsilon = \sigma\{v(X,\beta)\}\, u,$$
where $\sigma(\cdot)$ is the variance function and $u$ an error term independent of $X$ and $v(X,\beta)$. In the case of homoscedasticity we would indeed have $\varepsilon = \sigma\, u$ with a constant $\sigma$. From (5.7) we know that
$$g\{v(X,\beta)\} = P(Y = 1 \mid X) = F_{\varepsilon \mid v}\{v(X,\beta)\},$$
where $F_{\varepsilon \mid v}$ is the conditional distribution function of the error term given the index. Since $Y$ is binary, the log-likelihood function for this model (cf. (5.17)) is given by
$$\ell(\beta) = \sum_{i=1}^{n}\Big( Y_i \log g\{v(X_i,\beta)\} + (1 - Y_i) \log\big[1 - g\{v(X_i,\beta)\}\big] \Big). \qquad (6.12)$$

The problem is now to obtain a substitute for the unknown function $g$. We see that
$$g(v) = P(Y = 1 \mid v) = \frac{P(Y = 1)\, f_{v \mid Y=1}(v)}{f_v(v)} \qquad (6.13)$$
with $f_v$ being the pdf of the index and $f_{v \mid Y=1}$ the conditional pdf of the index given $Y = 1$. Since $Y = 0$ holds if and only if $Y \neq 1$, this is equivalent to
$$g(v) = \frac{P(Y = 1)\, f_{v \mid Y=1}(v)}{P(Y = 1)\, f_{v \mid Y=1}(v) + P(Y = 0)\, f_{v \mid Y=0}(v)}.$$
In the last expression, we can estimate all terms nonparametrically. Instead of estimating $P(Y=1)$ and $f_{v \mid Y=1}$ separately, Klein & Spady (1993) propose to consider the products $P(Y = 1)\, f_{v \mid Y=1}(v)$ and $P(Y = 0)\, f_{v \mid Y=0}(v)$. One can estimate these by
$$\frac{1}{n}\sum_{i=1}^{n} Y_i\, K_h\{v - v(X_i,\beta)\} \qquad \text{and} \qquad \frac{1}{n}\sum_{i=1}^{n} (1 - Y_i)\, K_h\{v - v(X_i,\beta)\}, \qquad (6.14)$$
where $K_h$ denotes the scaled kernel as before. Hence, an estimate for $g$ in (6.12) is given by
$$\widehat g_h(v) = \frac{\sum_{i=1}^{n} Y_i\, K_h\{v - v(X_i,\beta)\}}{\sum_{i=1}^{n} K_h\{v - v(X_i,\beta)\}}.$$

To obtain the $\sqrt{n}$-rate for $\widehat\beta$, one uses either bias reduction via higher order kernels or an adaptive undersmoothing bandwidth $h$. Problems in the denominator, when the density estimate becomes small, can be avoided by adding small terms to both the denominator and the numerator. These additional terms have to vanish at a faster rate than that of the convergence of the density estimates.

We can now define the pseudo log-likelihood version of (6.12) as
$$\widehat\ell(\beta) = \frac{1}{2}\sum_{i=1}^{n} w(X_i)\Big( Y_i \log\big[\widehat g_{h,-i}\{v(X_i,\beta)\}\big]^2 + (1 - Y_i) \log\big[1 - \widehat g_{h,-i}\{v(X_i,\beta)\}\big]^2 \Big), \qquad (6.15)$$
where $\widehat g_{h,-i}$ denotes the leave-one-out version of the estimator above. The weight function $w(\cdot)$ can be introduced for numerical or technical reasons. Taking the squares inside the logarithms avoids numerical problems when higher order kernels are used (otherwise these terms could be negative). The estimator $\widehat\beta$ is found by maximizing (6.15).
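A compact sketch of this objective, assuming a linear index, a quartic kernel and a simple regularization of small denominators as described above (the variable names and the toy data are our own):

```python
import numpy as np

def quartic(u):
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

def ks_loglik(beta, X, Y, h, eps=1e-4):
    """Pseudo log-likelihood in the spirit of (6.14)-(6.15): a leave-one-out
    Nadaraya-Watson estimate of P(Y=1 | index), regularized by eps."""
    Z = X @ beta
    K = quartic((Z[:, None] - Z[None, :]) / h)
    np.fill_diagonal(K, 0.0)                      # leave-one-out
    p = (K @ Y + eps) / (K.sum(axis=1) + 2 * eps) # estimated g at the Z_i
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))

# toy binary-response data; maximize over unit-norm directions
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 2))
beta0 = np.array([0.8, 0.6])
Y = (rng.uniform(size=600) < 1 / (1 + np.exp(-2 * (X @ beta0)))).astype(float)
angles = np.linspace(0, np.pi, 181)
lls = [ks_loglik(np.array([np.cos(a), np.sin(a)]), X, Y, h=0.4)
       for a in angles]
a_best = angles[int(np.argmax(lls))]
beta_hat = np.array([np.cos(a_best), np.sin(a_best)])
```

Since the link is estimated nonparametrically, the direction is identified only up to sign and scale, which is why the search is restricted to half of the unit circle.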

Klein & Spady (1993) prove the following result on the asymptotic behavior of the PMLE $\widehat\beta$. More details about the conditions for consistency, the choice of the weight function and the appropriate adaptive bandwidth can be found in their article. We summarize the asymptotic distribution in the following theorem.

THEOREM 6.2
Under some regularity conditions it holds that
$$\sqrt{n}\,(\widehat\beta - \beta) \xrightarrow{L} N\left(0,\, \Sigma^{-1}\right),$$
where
$$\Sigma = E\left[\frac{\{g'(v(X,\beta))\}^2}{g\{v(X,\beta)\}\,\big[1 - g\{v(X,\beta)\}\big]}\; \varrho(X)\,\varrho(X)^\top\right]$$
with $\varrho(X) = \nabla_\beta v(X,\beta) - E\{\nabla_\beta v(X,\beta) \mid v(X,\beta)\}$.
It turns out that the Klein & Spady estimator reaches the efficiency bound for SIM estimators. An estimator for $\Sigma$ can be obtained by its empirical analog.

As the derivation shows, the Klein & Spady PMLE works only for binary response models. In contrast, the WSLS was given for an arbitrary distribution of $Y$. For that reason we now consider an estimator that generalizes the idea of Klein & Spady to arbitrary distributions of $Y$.

Typically, the pseudo log-likelihood is based on the density (if $Y$ is continuous) or on the probability mass function (if $Y$ is discrete). The main idea of the following estimator, first proposed by Delecroix et al. (2003), is that the function $\psi$ which defines the distribution of $Y$ given $X$ depends on $X$ only through the index function $v(X,\beta)$, i.e.,
$$\psi(y \mid X) = \psi\{y \mid v(X,\beta)\}.$$
In other words, the index function contains all relevant information about $Y$. The objective function to maximize is
$$E\big[\log \psi\{Y \mid v(X,\beta)\}\big].$$

As for SLS, we proxy this expectation by averaging and introduce a trimming function:
$$\widehat\beta = \arg\max_\beta\; \frac{1}{n}\sum_{i=1}^{n} \log \widehat\psi\{Y_i \mid v(X_i,\beta)\}\; I\{X_i \in \mathcal{X}\}, \qquad (6.16)$$
where $\mathcal{X}$ denotes a suitable subset of the support of $X$. Here, all we have to do is to estimate the conditional density or probability mass function $\psi(y \mid v)$. An estimator is given by
$$\widehat\psi(y \mid v) = \frac{\sum_{j=1}^{n} K_h(y - Y_j)\, K_h\{v - v(X_j,\beta)\}}{\sum_{j=1}^{n} K_h\{v - v(X_j,\beta)\}}, \qquad (6.17)$$
where possibly different bandwidths are used for the two kernels.

To reach the $\sqrt{n}$-rate, fourth order kernels and a bandwidth of an appropriate rate need to be used. Delecroix et al. (2003) present their result for the linear index function
$$v(X,\beta) = X^\top\beta.$$
Therefore the following theorem only considers the asymptotic results for that special, but most common, case.
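The objective (6.16) with the conditional density estimate (6.17) can be sketched as follows. For simplicity we use a second order quartic kernel (whereas the theory asks for a fourth order one), skip the trimming, and search over unit-norm directions; these are all our own simplifications:

```python
import numpy as np

def quartic(u):
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

def pseudo_loglik(beta, X, Y, h, hy, eps=1e-3):
    """Sample analog of (6.16) using the kernel estimate (6.17) of the
    conditional density of Y given the linear index."""
    Z = X @ beta
    Kz = quartic((Z[:, None] - Z[None, :]) / h)
    np.fill_diagonal(Kz, 0.0)                     # leave-one-out
    Ky = quartic((Y[:, None] - Y[None, :]) / hy) / hy
    psi = ((Kz * Ky).sum(axis=1) + eps) / (Kz.sum(axis=1) + eps)
    return np.mean(np.log(psi))

# toy data with a continuous response
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2))
beta0 = np.array([0.8, 0.6])
Y = X @ beta0 + 0.3 * rng.normal(size=400)
angles = np.linspace(0, np.pi, 181)
lls = [pseudo_loglik(np.array([np.cos(a), np.sin(a)]), X, Y, h=0.4, hy=0.3)
       for a in angles]
a_best = angles[int(np.argmax(lls))]
beta_hat = np.array([np.cos(a_best), np.sin(a_best)])
```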

THEOREM 6.3
Let $\widehat\beta$ be the vector that maximizes (6.16) under the use of (6.17). Then, under the above mentioned and further regularity conditions, it holds that
$$\sqrt{n}\,(\widehat\beta - \beta) \xrightarrow{L} N\left(0,\, V^{-1}\Sigma V^{-1}\right),$$
where, with $s(Y,X) = \nabla_\beta \log \psi(Y \mid X^\top\beta)$ denoting the score w.r.t. $\beta$,
$$V = E\left[\nabla^2_\beta \log \psi(Y \mid X^\top\beta)\right]$$
and
$$\Sigma = E\left[s(Y,X)\, s(Y,X)^\top\right].$$
Again, it can be shown that the variance of this estimator reaches the lower efficiency bound for SIM. This means the procedure also provides efficient estimators for $\beta$. Estimates for the matrices $V$ and $\Sigma$ can be found by their empirical analogs.

## 6.2.3 Weighted Average Derivative Estimation

We will now turn to a different type of estimator with two advantages: (a) we do not need any distributional assumption on the error term, and (b) the resulting estimator is direct, i.e. non-iterative. The basic idea is to identify $\beta$ up to scale as the (weighted) average derivative of the regression function; the studied estimator is therefore called average derivative estimator (ADE) or weighted average derivative estimator (WADE). The advantages of ADE or WADE estimators come at a cost, though, as they are inefficient. Furthermore, the average derivative method is only directly applicable to models with continuous explanatory variables.

At the end of this section we will discuss how to estimate the coefficients of discrete explanatory variables in SIM, a method that can be combined with the ADE/WADE method. For this reason we introduce the notation
$$X = (U^\top, Z^\top)^\top$$
for the regressors. Here, $U$ ($p$-dimensional) refers explicitly to the continuous variables and $Z$ ($q$-dimensional) to the discrete (or categorical) variables.

Let us first consider a model with a $p$-dimensional vector $U$ of continuous variables only, i.e.,
$$E(Y \mid U) = m(U) = g(U^\top\beta). \qquad (6.18)$$

Then, the vector of weighted average derivatives is given by
$$\delta = E\big[w(U)\,\nabla m(U)\big] = E\big[w(U)\, g'(U^\top\beta)\big]\,\beta, \qquad (6.19)$$
where $\nabla m = (\partial m/\partial u_1, \ldots, \partial m/\partial u_p)^\top$ is the vector of partial derivatives of $m$, $g'$ the derivative of $g$, and $w$ a weight function. (By $\partial/\partial u_j$ we denote the partial derivative w.r.t. the $j$th argument of the function.)

Looking at (6.19) shows that $\delta$ equals $\beta$ up to scale. Hence, if we find a way to estimate $\delta$, then we can also estimate $\beta$ up to scale. The approach studied in Powell et al. (1989) uses the density $f$ of $U$ as the weight function:
$$\delta = E\big[f(U)\,\nabla m(U)\big].$$
This estimator is sometimes referred to as density weighted ADE or DWADE. We will concentrate on this particular weight function; generalizations to other weight functions are possible.

For deriving the estimator, it is instructive to write (6.19) in more detail:
$$\delta = E\big[f(U)\,\nabla m(U)\big] = \int \nabla m(u)\, f^2(u)\, du.$$
Partial integration yields
$$\delta = -2\, E\big[m(U)\,\nabla f(U)\big] \qquad (6.20)$$
if we assume that $m(u)\, f^2(u) \to 0$ for $u$ approaching the boundary of its support. Noting that $m(U) = E(Y \mid U)$ and using the law of iterated expectations, we finally arrive at
$$\delta = -2\, E\big[Y\,\nabla f(U)\big]. \qquad (6.21)$$

We can now estimate $\delta$ by the sample analog of the right hand side of (6.21):
$$\widehat\delta_h = -\frac{2}{n}\sum_{i=1}^{n} Y_i\; \widehat{\nabla f}_{h,-i}(U_i), \qquad (6.22)$$
where we estimate $\nabla f$ by the leave-one-out analog
$$\widehat{\nabla f}_{h,-i}(u) = \frac{1}{n-1}\sum_{j \neq i} \nabla\mathcal{K}_h(u - U_j). \qquad (6.23)$$
Here, the components of $\nabla\mathcal{K}_h$ are the partial derivatives of the (scaled) multivariate kernel from Section 3.6, i.e.,
$$\nabla\mathcal{K}_h(u) = \frac{1}{h^{p+1}}\,\nabla\mathcal{K}\left(\frac{u}{h}\right).$$
Regarding the sampling distribution of the estimator Powell et al. (1989) have shown the following theorem.

THEOREM 6.4
Under regularity conditions we have
$$\sqrt{n}\,(\widehat\delta_h - \delta) \xrightarrow{L} N\left(0,\, \Sigma\right).$$

The covariance matrix is given by $\Sigma = 4\, E\big[r(U,Y)\, r(U,Y)^\top\big] - 4\,\delta\delta^\top$, where $r$ is given by $r(u,y) = f(u)\,\nabla m(u) - \{y - m(u)\}\,\nabla f(u)$.

Note that although $\widehat\delta_h$ is based on a multidimensional kernel density estimator, it achieves $\sqrt{n}$-convergence just as the SIM estimators considered previously, which were all based on univariate kernel smoothing.
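The estimator (6.22)-(6.23) is straightforward to code. The sketch below uses a multivariate Gaussian kernel (so that its gradient is available in closed form) and toy data with a linear link; these choices and all names are our own:

```python
import numpy as np

def dwade(X, Y, h):
    """Density-weighted ADE: delta_hat = -(2/n) sum_i Y_i grad_f_{h,-i}(X_i),
    cf. (6.22), with the leave-one-out kernel density derivative (6.23)."""
    n, d = X.shape
    D = (X[:, None, :] - X[None, :, :]) / h        # pairwise (x_i - x_j) / h
    phi = np.exp(-0.5 * (D**2).sum(-1)) / (2 * np.pi) ** (d / 2)
    gradK = -D * phi[:, :, None]                   # gradient of Gaussian kernel
    gradK[np.arange(n), np.arange(n), :] = 0.0     # leave-one-out
    grad_f = gradK.sum(axis=1) / ((n - 1) * h ** (d + 1))
    return -2.0 / n * (Y[:, None] * grad_f).sum(axis=0)

# toy data with a linear link, so delta is proportional to beta0
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 2))
beta0 = np.array([1.0, -0.5])
Y = X @ beta0 + 0.2 * rng.normal(size=1000)
delta_hat = dwade(X, Y, h=0.5)
direction = delta_hat / np.linalg.norm(delta_hat)
```

Since $\delta$ equals $\beta$ only up to scale, only the direction of `delta_hat` is informative about $\beta$.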

EXAMPLE 6.1
We cite here an example on unemployment after completion of an apprenticeship in West Germany which was first analyzed in Proença & Werwatz (1995). The data comprise individuals from the first nine waves of the German Socio-Economic Panel (GSOEP), see GSOEP (1991). The dependent variable takes on the value 1 if the individual is unemployed one year after completing the apprenticeship. Explanatory variables are gross monthly earnings as an apprentice (EARNINGS), city size (CITY SIZE), percentage of people apprenticed in a certain occupation divided by the percentage of people employed in this occupation in the entire company (DEGREE) and unemployment rate in the state where the individual lived (URATE).

|            | GLM (logit) | WADE   |        |        |        |        |
|------------|-------------|--------|--------|--------|--------|--------|
| constant   | -5630       | --     | --     | --     | --     | --     |
| EARNINGS   | -1.00       | -1.00  | -1.00  | -1.00  | -1.00  | -1.00  |
| CITY SIZE  | -0.72       | -0.23  | -0.47  | -0.66  | -0.81  | -0.91  |
| DEGREE     | -1.79       | -1.06  | -1.52  | -1.93  | -2.25  | -2.47  |
| URATE      | 363.03      | 169.63 | 245.75 | 319.48 | 384.46 | 483.31 |

Table 6.1: GLM (logit) coefficients and WADE coefficients for five different bandwidths.

Table 6.1 shows the results of a GLM fit (using the logistic link function) and the WADE coefficients (for different bandwidths). For easier comparison the coefficients are rescaled such that all coefficients of EARNINGS are equal to $-1$. To eliminate possible effects of correlation between the variables and to standardize the data, a Mahalanobis transformation was applied before computing the WADE. Note that, in particular for the middle bandwidths, the coefficients of the GLM and the WADE are very close. One may thus argue that the parametric logit model is not grossly misspecified.

Let us now turn to the problem of estimating the coefficients of discrete or categorical variables. By definition, derivatives can only be calculated with respect to continuous variables. Thus, the WADE method fails for discrete explanatory variables. Before presenting a more general solution, let us explain how the coefficient of one dichotomous variable is introduced into the model. We extend model (6.18) by an additional term:
$$E(Y \mid U, Z) = g(U^\top\beta + Z^\top\gamma), \qquad (6.24)$$
with $U$ the continuous and $Z$ the discrete part of the regressors. In the simplest case we suppose that the discrete part is one binary variable $Z$ and that $Y$ is binary as well. In this case, the model "splits" into two submodels:
$$E(Y \mid U, Z = 0) = g(U^\top\beta), \qquad E(Y \mid U, Z = 1) = g(U^\top\beta + \gamma).$$
There are in fact two models to be estimated, one for $Z = 0$ and one for $Z = 1$. Note that $\beta$ alone could be estimated from the first equation only.

Theoretically, the same value of the regression function can be attained either with $Z = 0$ at an index value of $t$ or with $Z = 1$ at an index value of $t - \gamma$. Thus, the horizontal difference between the two link curves is exactly $\gamma$, see the left panel of Figure 6.2.

In practice, finding these horizontal differences would be rather difficult. A common approach is instead based on the observation that the integrated difference between the two link functions also equals $\gamma$, see the right panel of Figure 6.2.

A very simple estimator is proposed in Korostelev & Müller (1995). Essentially, the coefficient of the binary explanatory variable can be estimated by
$$\widehat\gamma = \int \big\{\widehat g^{(1)}(t) - \widehat g^{(0)}(t)\big\}\, dt,$$
with
$$\widehat g^{(j)}(t) = \frac{\sum_{i} K_h\big(t - U_i^{(j)\top}\widehat\beta\big)\, Y_i^{(j)}}{\sum_{i} K_h\big(t - U_i^{(j)\top}\widehat\beta\big)}, \qquad j = 0, 1,$$
where the superscripts $(0)$ and $(1)$ denote the observations from the subsamples according to $Z = 0$ and $Z = 1$. The estimator is, in the simplest case of one binary variable, $\sqrt{n}$-consistent and can be improved for efficiency by a one-step estimator (Korostelev & Müller, 1995).
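Assuming the index values have already been estimated, the integral-difference idea can be sketched as follows. The Gaussian kernel, the rectangle-rule integration and all names are our own choices; for a CDF-type link, the integral of the difference of the two subsample link estimates over a sufficiently wide interval approximates the coefficient of the binary variable:

```python
import numpy as np

def nw(grid, Z, Y, h):
    """Nadaraya-Watson smoother with a Gaussian kernel."""
    W = np.exp(-0.5 * ((grid[:, None] - Z[None, :]) / h) ** 2)
    return (W @ Y) / np.maximum(W.sum(axis=1), 1e-12)

def gamma_integral_diff(index, Y, Zbin, grid, h):
    """Integrate the difference of the two subsample link estimates."""
    g0 = nw(grid, index[Zbin == 0], Y[Zbin == 0], h)
    g1 = nw(grid, index[Zbin == 1], Y[Zbin == 1], h)
    return np.sum(g1 - g0) * (grid[1] - grid[0])   # rectangle-rule integral

# toy binary-response data with true gamma = 1
rng = np.random.default_rng(5)
n = 6000
index = rng.uniform(-4, 4, size=n)                 # stands in for U' beta_hat
Zbin = rng.integers(0, 2, size=n)
p = 1 / (1 + np.exp(-(index + 1.0 * Zbin)))
Y = (rng.uniform(size=n) < p).astype(float)
gamma_hat = gamma_integral_diff(index, Y, Zbin, np.linspace(-4, 4, 401), h=0.3)
```

Truncating the integration range to where data are available introduces a small downward bias, since a little of the tail difference between the two curves is lost.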

Horowitz & Härdle (1996) extend this approach to a multivariate, multi-categorical $Z$ and an arbitrary distribution of $Y$. Recall the model (6.24),
$$E(Y \mid U, Z) = g(U^\top\beta + Z^\top\gamma).$$
Again, the approach for this model is based on a split of the whole sample into subsamples according to the categories of $Z$. However, this subsampling makes the estimator infeasible for more than one or two discrete variables. To compute integrated differences of the link functions according to the realizations of $Z$, we consider the truncated link function
$$g^*(t) = c_0\, I\{g(t) < c_0\} + g(t)\, I\{c_0 \le g(t) \le c_1\} + c_1\, I\{g(t) > c_1\}$$
for given constants $c_0 < c_1$. Denote now by $z^{(k)}$ a possible realization of $Z$; then the integrated link function conditional on $Z = z^{(k)}$ is
$$G^{(k)} = \int_{t_0}^{t_1} g^*\big(t + z^{(k)\top}\gamma\big)\, dt.$$
Now compare the integrated link functions for all $Z$-categories $z^{(k)}$ ($k = 2, \ldots, M$) to the first $Z$-category $z^{(1)}$. It holds that
$$G^{(k)} - G^{(1)} = (c_1 - c_0)\,\big(z^{(k)} - z^{(1)}\big)^\top\gamma,$$
hence with
$$\Delta^{(k)} = \frac{G^{(k)} - G^{(1)}}{c_1 - c_0}$$
we obtain $\Delta^{(k)} = (z^{(k)} - z^{(1)})^\top\gamma$. This finally yields the linear system of equations
$$\begin{pmatrix} \Delta^{(2)} \\ \vdots \\ \Delta^{(M)} \end{pmatrix} = \begin{pmatrix} (z^{(2)} - z^{(1)})^\top \\ \vdots \\ (z^{(M)} - z^{(1)})^\top \end{pmatrix}\gamma \qquad (6.25)$$

to determine the coefficients $\gamma$. The estimation of $\gamma$ is based on replacing the $G^{(k)}$ in (6.25) by
$$\widehat G^{(k)} = \int_{t_0}^{t_1} \widehat g^{(k)}(t)\, dt,$$
with $\widehat g^{(k)}$ a nonparametric estimate of the truncated link function in subsample $k$. This estimate is obtained by a univariate regression of $Y$ on the estimated "continuous" index values $U^\top\widehat\beta$ within subsample $k$. Horowitz & Härdle (1996) show that, using a $\sqrt{n}$-consistent estimate $\widehat\beta$ and a Nadaraya-Watson estimator, the estimated coefficient vector $\widehat\gamma$ is itself $\sqrt{n}$-consistent and has an asymptotic normal distribution.
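To illustrate the mechanics of (6.25), here is a sketch that estimates the integrated truncated links per category of $Z$ and solves the resulting linear system by least squares. The truncation constants, the kernel, the integration rule and all names are our own assumptions, not the implementation of Horowitz & Härdle:

```python
import numpy as np

C0, C1 = 0.1, 0.9                                  # truncation constants c0 < c1

def integrated_link(index, Y, grid, h):
    """Integral of the truncated Nadaraya-Watson link estimate over `grid`."""
    W = np.exp(-0.5 * ((grid[:, None] - index[None, :]) / h) ** 2)
    g = (W @ Y) / np.maximum(W.sum(axis=1), 1e-12)
    return np.sum(np.clip(g, C0, C1)) * (grid[1] - grid[0])

def gamma_from_categories(index, Y, Zcat, z_values, grid, h):
    """Solve (6.25): Delta_k = (z_k - z_1)' gamma, Delta_k from integrated links."""
    G = np.array([integrated_link(index[Zcat == k], Y[Zcat == k], grid, h)
                  for k in range(len(z_values))])
    Delta = (G[1:] - G[0]) / (C1 - C0)
    A = z_values[1:] - z_values[0]                 # rows (z_k - z_1)'
    return np.linalg.lstsq(A, Delta, rcond=None)[0]

# toy data: one discrete regressor with three categories, true gamma = 0.8
rng = np.random.default_rng(6)
n = 6000
index = rng.uniform(-5, 5, size=n)                 # stands in for U' beta_hat
Zcat = rng.integers(0, 3, size=n)
z_values = np.array([[0.0], [1.0], [2.0]])         # realizations z^(k)
p = 1 / (1 + np.exp(-(index + 0.8 * Zcat)))
Y = (rng.uniform(size=n) < p).astype(float)
gamma_hat = gamma_from_categories(index, Y, Zcat, z_values,
                                  np.linspace(-5, 5, 501), h=0.3)
```

The truncation makes the integrals insensitive to the tails, where the link estimates are noisy; the identity behind (6.25) requires the integration interval to be wide enough that the truncated link reaches both plateaus inside it for every category.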