10.4 Average derivative estimation

The primary motivation for studying the average derivative

\begin{displaymath}\delta = E[ m'(X)]\end{displaymath}

with

\begin{displaymath}m'(X) = \left({\partial m \over \partial x_1}, \ldots,
{\partial m \over \partial x_d}\right) (X), \end{displaymath}

comes from models where the mean response depends on $X$ only through a linear combination $\beta^Tx$. That is, similarly to projection pursuit regression,
\begin{displaymath}
m(x)= g(x^T\beta )
\end{displaymath} (10.4.9)

for some nonparametric function $ g.$

The average derivative $\delta$ is proportional to $\beta$ since

\begin{displaymath}\delta = E[ m'(X)] = E[ dg/d(X^T\beta) ]\, \beta.\end{displaymath}

Thus the average derivative vector $\delta$ determines $\beta$ up to scale. In this section a nonparametric estimator $\hat \delta$ of the average derivative is considered which achieves the rate $n^{-1/2}$ (typical for parametric problems). From this $\hat \delta$ the estimator $\hat m(x)=\hat g (x^T\hat \delta)$ of the multivariate regression function is constructed, which achieves the rate $n^{-4/5}$ (typical for one-dimensional smoothing problems). A weighted average derivative estimator has been introduced by Powell, Stock and Stoker (1989).

Assume that the function $g$ is normalized in such a way that $E[ d g / d(X^T\beta)] =1$; then $\delta=\beta$ and $g(x^T\delta)=E(Y\vert X=x)$. Average derivative estimation (ADE) thus yields a direct estimator for the weights $\beta$ in 10.4.9. (Note that, as in PPR, the model 10.4.9 is not identifiable unless such a normalization assumption is made.)

Let $f(x)$ denote the marginal density,

\begin{displaymath}f'=\partial f/\partial x\end{displaymath}

its vector of partial derivatives and

\begin{displaymath}l=-\partial \log f/\partial x=-f'/f\end{displaymath}

the negative log-density derivative. Integration by parts gives
\begin{displaymath}
\delta = E[ m'(X)] = E[ l Y].
\end{displaymath} (10.4.10)

Consequently, if $\hat f_h$ denotes a kernel estimator of $f(x)$ and $\hat l(x) =-\hat f'_h(x) /
\hat f_h(x)$, then $\delta$ can be estimated by the sample analogue

\begin{displaymath}\hat \delta^* = n^{-1}\sum_{i=1}^n \hat l (X_i)Y_i\ .\end{displaymath}

Since this estimator involves dividing by $\hat f_h$, which can come arbitrarily close to zero where observations are sparse, a more refined estimator $\hat \delta$ of $\delta$ is advisable in practice. For this reason the following trimmed estimator is proposed:
\begin{displaymath}
\hat \delta = n^{-1}\sum_{i=1}^n \hat l_h (X_i) Y_i\hat I_i,
\end{displaymath} (10.4.11)

with the indicator variables

\begin{displaymath}\hat I_i = I\{ \hat f_h(X_i) > b_n\}, \quad b_n \to 0,\end{displaymath}

and the density estimator

\begin{displaymath}\hat f_h(x)=n^{-1}\sum_{j=1}^n h^{-d} K\left(
{x -X_j \over h}\right)\cdotp\end{displaymath}

Note that here the kernel function $K$ is a function of $d$ arguments. Such a kernel can be constructed, for example, as a product of one-dimensional kernels; see Section 3.1. The main result of Härdle and Stoker (1989) is Theorem 10.4.1.
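
As an illustration, a minimal sketch of the estimator 10.4.11 might look as follows; it is written in Python, uses a product Gaussian kernel for $K$, and replaces the explicit cutoff sequence $b_n$ by trimming the observations with the smallest 5 percent of the density estimates, as recommended below. All names are illustrative.

\begin{verbatim}
import numpy as np

def ade(X, Y, h, alpha=0.05):
    """Average derivative estimator (10.4.11) with a product Gaussian
    kernel; the lower alpha-fraction of the density estimates is trimmed."""
    n, d = X.shape
    # standardized differences (X_i - X_j)/h, shape (n, n, d)
    U = (X[:, None, :] - X[None, :, :]) / h
    k = np.exp(-0.5 * U**2) / np.sqrt(2.0 * np.pi)   # coordinate-wise Gaussian kernel
    prod_k = k.prod(axis=2)                          # product kernel K((X_i - X_j)/h)
    f_hat = prod_k.sum(axis=1) / (n * h**d)          # density estimate f_h(X_i)
    # gradient of the product Gaussian kernel: dK/du_k = -u_k K(u)
    grad_K = -U * prod_k[:, :, None]
    f_grad = grad_K.sum(axis=1) / (n * h**(d + 1))   # f_h'(X_i)
    l_hat = -f_grad / f_hat[:, None]                 # l(X_i) = -f_h'/f_h
    keep = f_hat > np.quantile(f_hat, alpha)         # indicator I_i (5% trimming)
    return (l_hat[keep] * Y[keep, None]).sum(axis=0) / n
\end{verbatim}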

Theorem 10.4.1   Assume that, apart from further technical conditions, $f$ is $p$-times differentiable, the cutoff sequence ${b_n}$ converges ``slowly'' to zero and $nh^{2p-2} \to 0$. Then the average derivative estimator $\hat \delta$ has a limiting normal distribution,

\begin{displaymath}\sqrt n (\hat \delta - \delta) \ {\buildrel {\cal L}\over
\longrightarrow }\ N(0, \Sigma),\end{displaymath}

where $\Sigma$ is the covariance matrix of

\begin{displaymath}l(X)Y + [ m'(X) -l(X)m(X)]\ .\end{displaymath}

There are some remarkable features about this result. First, the condition on the bandwidth sequence excludes the optimal bandwidth sequence $h \sim n^{-1/(2p+d)}$; see Section 4.1. The bandwidth $h$ has to tend to zero faster than the optimal rate in order to keep the bias of $\hat \delta$ below the desired $n^{-1/2}$ rate. A similar observation has been made in the context of semiparametric models; see Section 8.1. Second, the covariance matrix is constructed from two terms, $l(X)Y$ and $m'(X)-l(X)m(X)$. If one knew the marginal density, then the first term $l(X)Y$ alone would determine the covariance; it is the estimation of $l(X)$ by $\hat l(X)$ that brings in the second term. Third, the bandwidth condition is of a qualitative nature, that is, it says that $h$ should tend to zero neither ``too fast'' nor ``too slow.'' A more refined analysis (Härdle, Hart, Marron and Tsybakov 1989) of second-order terms shows that for $d=1$ the MSE of $\hat \delta$ can be expanded as

\begin{displaymath}
MSE(\hat \delta) \sim n^{-1}+ n^{-2}h^{-3} + h^4.
\end{displaymath} (10.4.12)
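
Setting the derivative with respect to $h$ of the two $h$-dependent terms to zero gives

\begin{displaymath}
-3\,n^{-2}h^{-4}+4\,h^{3}=0
\quad\Longleftrightarrow\quad h^{7}={3\over 4}\,n^{-2}.
\end{displaymath}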

A bandwidth minimizing this expression would therefore be proportional to $n^{-2/7}$. Fourth, the determination of the cutoff sequence $b_n$ is somewhat complicated in practice; it is therefore recommended to simply cut off the observations with the lower 5 percent of the $\hat f_h(X_i)$.

Let me now come to the estimation of $g$ in 10.4.9. Assume that in a first step $\hat \delta$ has been estimated, yielding the one-dimensional projections $\hat Z_i=\hat\delta^TX_i$, $i=1, \ldots, n$. Let $\hat g_{h'}(z)$ denote a kernel estimator, with one-dimensional kernel $K^z$, of the regression of $Y$ on $\hat Z$, that is,

\begin{displaymath}
\hat g_{h'}(z) = n^{-1}\sum_{i=1}^n
K_{h'}^z(z-\hat Z_i)Y_i / n^{-1}\sum_{i=1}^n K_{h'}^z(z-\hat Z_i).
\end{displaymath} (10.4.13)
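
A corresponding sketch of 10.4.13, assuming a Gaussian kernel for $K^z$; the factors $n^{-1}$ and $1/h'$ of $K_{h'}^z$ cancel between numerator and denominator.

\begin{verbatim}
import numpy as np

def g_hat(z, Z_hat, Y, h):
    """Kernel regression (10.4.13) of Y on the projections Z_hat,
    evaluated at the point(s) z, with a Gaussian kernel and bandwidth h."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    U = (z[:, None] - np.asarray(Z_hat)[None, :]) / h
    K = np.exp(-0.5 * U**2)          # kernel weights; constant factors cancel
    return (K * np.asarray(Y)[None, :]).sum(axis=1) / K.sum(axis=1)
\end{verbatim}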

Suppose, for the moment, that $Z_i=\delta^TX_i$ instead of $\hat Z_i$ were used in 10.4.13. In this case it is well known (Section 4.2) that the resulting regression estimator is asymptotically normal and converges at the optimal pointwise rate $n^{-2/5}$. Theorem 10.4.2 states that there is no cost in using the estimated projections $\{ \hat Z_i\}$, that is, one achieves the dimension reduction considered by Stone (1985).

Theorem 10.4.2   If the bandwidth $h'\sim n^{-1/5}$ then

\begin{displaymath}n^{2/5} [ \hat g_{h'}(z)- g(z)] \end{displaymath}

has a limiting normal distribution with mean $B(z)$ and variance $V(z)$, where, with the density $f_z$ of $\delta^T X,$

\begin{eqnarray*}
B(z) &=& {1\over 2}[ g''(z)+ g'(z)f_z'(z)/f_z(z) ] \, d_{K^z}, \\
V(z) &=& \textrm{var}[ Y \vert \delta^T X=z ] \, c_{K^z}.
\end{eqnarray*}



More formally, the ADE procedure is described in

Algorithm 10.4.1

STEP 1.

Compute $\hat \delta$ by 10.4.11 with a cutoff of $\alpha = 5 \%$.

STEP 2.

Compute $\hat g$ by 10.4.13 from $(\hat \delta^T X_i, Y_i)$ with a one-dimensional cross-validated bandwidth.

STEP 3.

Compose both steps into the function

$\hat m(x) = \hat g( \hat \delta^T x)$.
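
In terms of the two sketches given above, the three steps might be composed as follows; the data matrix X, the response vector Y and a cross-validated bandwidth h_cv are assumed to be given.

\begin{verbatim}
delta_hat = ade(X, Y, h=0.9, alpha=0.05)      # STEP 1: ADE with a 5% cutoff
Z_hat = X @ delta_hat                         # projections Z_i = delta_hat' X_i
def m_hat(x):                                 # STEPS 2 and 3: m(x) = g(delta_hat' x)
    return g_hat(x @ delta_hat, Z_hat, Y, h_cv)
\end{verbatim}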

An application of this technique is given in Appendix 2 where I consider a desirable computing environment for high dimensional smoothing techniques. Simulations from the ADE algorithm for different nonparametric models in more than four dimensions can be found in Härdle and Stoker (1989). One model in this article is

\begin{displaymath}Y_i = \sin\left(\sum_{j=1}^4X_{ij}\right)+ 0.1\varepsilon_i,
\quad i=1, \ldots, n,\end{displaymath}

where $\varepsilon_i, X_{i1}, \ldots, X_{i4}$ are independent standard normally distributed random variables.
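
Data from this model can be generated and the sketched estimator applied along the following lines (seed and bandwidth are arbitrary illustrative choices).

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)                # arbitrary seed
n = 100
X = rng.standard_normal((n, 4))               # X_{i1},...,X_{i4} i.i.d. N(0,1)
Y = np.sin(X.sum(axis=1)) + 0.1 * rng.standard_normal(n)
delta_hat = ade(X, Y, h=0.9, alpha=0.05)      # cf. the h = 0.9 column of Table 10.1
\end{verbatim}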


Table 10.1: ADE estimation of the Sine model

\begin{tabular}{lcccc}
 & $h=0.9$ & $h=0.7$ & $h=1.5$ & known density \\
\hline
$\hat{\delta}_{1}$ & 0.1134 & 0.0428 & 0.1921 & 0.1329 \\
 & (0.0960) & (0.0772) & (0.1350) & (0.1228) \\
$\hat{\delta}_{2}$ & 0.1356 & 0.0449 & 0.1982 & 0.1340 \\
 & (0.1093) & (0.0640) & (0.1283) & (0.1192) \\
$\hat{\delta}_{3}$ & 0.1154 & 0.0529 & 0.1837 & 0.1330 \\
 & (0.1008) & (0.0841) & (0.1169) & (0.1145) \\
$\hat{\delta}_{4}$ & 0.1303 & 0.0591 & 0.2042 & 0.1324 \\
 & (0.0972) & (0.0957) & (0.1098) & (0.1251) \\
$b$ & 0.0117 & 0.0321 & 0.0017 & \\
\end{tabular}

Note: In brackets are standard deviations over the Monte Carlo simulations. $\scriptstyle n=100 $, $\scriptstyle \alpha = 0.05. $

The average derivative takes the form

\begin{displaymath}\delta=\delta_0(1,1,1,1)^T,\end{displaymath}

and some tedious algebra gives $\delta_0=0.135$ (see the short calculation below). Table 10.1 reports the results over 100 Monte Carlo simulations with a cutoff rule of $\alpha=0.05$. It is remarkable that even in the case of a known density (therefore, $l$ is known) the standard deviations (given in brackets) are of similar magnitude to those in the case of unknown $l$. This once again demonstrates that there is no cost (parametric rate!) in not knowing $l$. An actual computation with $n=200$ resulted in the values $\hat \delta=(0.230, 0.023, 0.214, 0.179)$. The correlation between $Z_i=\delta^TX_i$ and $\hat Z_i=\hat \delta^TX_i$ was 0.903. The estimated function $\hat g_{h'}(z)$ is shown in Figure 10.13 together with the points $\{ \hat Z_i, Y_i\}_{i=1}^n$. A kernel smooth based on the true projections $Z_i$ is depicted together with the smooth $\hat g_{h'}(z)$ in Figure 10.14. The estimated $\hat g_{h'}(z)$ is remarkably close to the true regression function, as Figure 10.15 suggests.
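
For the value of $\delta_0$ note that $m'(x)=\cos\left(\sum_{j=1}^4x_j\right)(1,1,1,1)^T$ and $\sum_{j=1}^4X_{ij}\sim N(0,4)$, so that the characteristic function of the normal distribution gives

\begin{displaymath}
\delta_0=E\left[\cos\left(\sum_{j=1}^4X_{ij}\right)\right]=e^{-2}\approx 0.135.
\end{displaymath}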

Figure 10.13: The estimated curve $\scriptstyle \hat g_{h'}(z)$ together with the projected data $\scriptstyle \{\hat Z_i, Y_i\}_{i=1}^n.$
\includegraphics[scale=0.15]{ANR10,13.ps}

Figure 10.14: Two kernel smooths based on $\scriptstyle \{Z_i,
Y_i\}_{i=1}^n, $ $\scriptstyle \{\hat Z_i, Y_i\}_{i=1}^n$, respectively. The thin line indicates the ADE smooth based on the estimated projections $\scriptstyle \hat Z_i = \hat \delta ^T X_i $. The thick line is the kernel smooth based on the true projections $\scriptstyle Z_i = \delta ^T X_i $.
\includegraphics[scale=0.15]{ANR10,14.ps}

Figure 10.15: The ADE smooth and the true curve. The thin line indicates the ADE smooth as in Figure 10.14 and Figure 10.13; the thick line is the true curve $\scriptstyle g(\delta ^T X_i)$.
\includegraphics[scale=0.15]{ANR10,15.ps}

Exercises

10.4.1 Prove formula 10.4.10.

10.4.2 Explain the bandwidth condition ``that $h$ has to tend to zero faster than the optimal rate'' from formula 10.4.12.

10.4.3 Assume a partial linear model as in Chapter 8. How can you estimate the parametric part by ADE?

10.4.4 Assume $X$ to be standard normal. What is $l$ in this case?

10.4.5 In the case of a pure linear model $Y = \beta^T X$, what is $\delta$?