3.2 Average Outer Product of Gradients and its Estimation

In this section, we give some properties of the e.d.r. directions. The e.d.r. directions coincide with the eigenvectors of the so-called Hessian matrix of the regression function. The outer product of gradients, which differs slightly from the usual Hessian matrix, simplifies the calculation and extends the ADE of Härdle and Stoker (1989) to the case of more than one direction. For ease of exposition, we always assume that the eigenvectors of a positive semi-definite matrix are arranged according to their corresponding eigenvalues in descending order.


3.2.1 The Simple Case

Consider the relation between $ y$ and $ X$ in model (3.4). The $ k$-indexing e.d.r. directions are as defined in (3.5). Let $ \tilde {\sl g}(X) = \textrm{E}(y\vert X) $ and let $ \bigtriangledown \tilde {\sl g}$ denote the gradient of the function $ \tilde {\sl g}$ with respect to its arguments. Under model (3.4), i.e. $ \tilde {\sl g}(X) = {\sl g}(B_0^{\top }X) $, we have $ \bigtriangledown\tilde {\sl g}(X) = B_0 \bigtriangledown {\sl g}(B_0^{\top }X) $. Therefore

$\displaystyle E [ \bigtriangledown \tilde {\sl g}(X) \bigtriangledown^{\top } \tilde {\sl g}(X) ] = B_0 H B_0^{\top },$

where $ H = E [ \bigtriangledown {\sl g}(B_0^{\top } X ) \bigtriangledown^{\top } {\sl g}(B_0^{\top } X )] $ is the average outer product of gradients of $ {\sl g}(\cdot) $. It is easy to see that the following lemma holds.

LEMMA 3.1   Suppose that $ \tilde {\sl g}(\cdot) $ is differentiable. If model (3.4) is true, then $ B_0
$ is in the space spanned by the first $ D$ eigenvectors of $ E[\bigtriangledown \tilde {\sl g}(X)
\bigtriangledown^{\top }\tilde {\sl g}(X)]$.

Lemma 3.1 provides a simple method to estimate $ B_0 $ through the eigenvectors of $ E[\bigtriangledown \tilde {\sl g}(X) \bigtriangledown^{\top }\tilde {\sl g}(X)]$. Härdle and Stoker (1989) noted this fact in passing but seemed to stop short of exploiting it. Instead, they proposed the so-called ADE method, which suffers from the disadvantages stated in Section 3.1. Li (1992) proposed the principal Hessian directions (pHd) method, which estimates the Hessian matrix of $ {\sl g}(\cdot) $. For a normally distributed design $ X$, the Hessian matrix can be estimated easily via Stein's Lemma. (Cook (1998) claimed that the result can be extended to a symmetric design $ X$.) In time series analysis, however, the assumption of a symmetric design $ X$ is frequently violated; see, for example, (3.28) and Figure 3.5 in the next section. We now propose a direct estimation method as follows. Suppose that $ \{(y_i, X_i ), i = 1,2,\cdots, n\} $ is a random sample. First, estimate the gradients $ \bigtriangledown \tilde {\sl g}(X_j ) $ by local polynomial smoothing. Specifically, we consider local $ r$-th order polynomial fitting in the form of the following minimization problem

$\displaystyle \min_{a_j, b_j, c_j } \sum_{i=1}^n \Big[ y_i - a_j - b_j^{\top }(X_i - X_j) - \sum_{1<k\le r} \sum_{i_1+\cdots+i_p=k}\Big(c_{i_1,i_2,\cdots,i_p} \{X_i-X_j\}_1^{i_1}\{X_i-X_j\}_2^{i_2}\cdots\{X_i-X_j\}_p^{i_p}\Big) \Big]^2 K_h(X_i - X_j),$ (3.11)

where $ \{X_i - X_j\}_k $ denotes the $ k$-th element of the vector $ X_i - X_j$. Here, $ K(x) $ is a kernel function, $ h$ is a bandwidth and $ K_h(\cdot) = K(\cdot/h)/h^p $. A special case is local linear fitting with $ r = 1 $. The minimizer $ \hat b_j = (\hat b_{j1}, \cdots, \hat b_{jp})^{\top } $ is the estimator of $ \bigtriangledown \tilde {\sl g}(X_j ) $. We therefore obtain the estimator of $ E \{\bigtriangledown\tilde {\sl g}(X) \bigtriangledown^{\top } \tilde {\sl g}(X)\} $ as
$\displaystyle \hat \Sigma = \frac{1}{n}\sum_{j=1}^n\hat b_j \hat b_j^{\top } .$      

Finally, we estimate the $ k$-indexing e.d.r. directions by the first $ k$ eigenvectors of $ \hat \Sigma $. We call this the outer product of gradients (OPG) estimation method. The difference between the OPG method and the ADE method is that the former uses the second moment of the derivative, whereas the latter uses only the first moment. Unlike the ADE, the OPG method still works even if $ E \bigtriangledown\tilde {\sl g}(X) = 0$. Moreover, the OPG method can handle multiple e.d.r. directions simultaneously, whereas the ADE can only handle the first e.d.r. direction (i.e. the single-index model). A schematic implementation is sketched below.
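To make the procedure concrete, here is a minimal Python sketch of the OPG estimator. The function name opg_directions, the Gaussian kernel and the restriction to local linear fitting ($ r = 1 $) are our own illustrative choices, not part of the text.

import numpy as np

def opg_directions(X, y, h, n_dirs):
    # OPG sketch: local linear fits (r = 1) with a Gaussian kernel.
    # X : (n, p) covariates, y : (n,) responses, h : bandwidth,
    # n_dirs : number of e.d.r. directions to return.
    n, p = X.shape
    Sigma = np.zeros((p, p))                              # accumulates b_j b_j^T / n
    for j in range(n):
        D = X - X[j]                                      # rows are X_i - X_j
        w = np.exp(-0.5 * np.sum((D / h) ** 2, axis=1))   # kernel weights K_h(X_i - X_j)
        design = np.hstack([np.ones((n, 1)), D])          # local linear design for (a_j, b_j)
        A = design.T @ (w[:, None] * design)
        c = design.T @ (w * y)
        theta = np.linalg.lstsq(A, c, rcond=None)[0]      # weighted least squares
        b_j = theta[1:]                                   # estimated gradient at X_j
        Sigma += np.outer(b_j, b_j) / n
    # e.d.r. directions: leading eigenvectors of Sigma_hat
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_dirs]]

For instance, opg_directions(X, y, h, n_dirs=2) would return two candidate e.d.r. directions; in practice the bandwidth h should be chosen along the lines discussed after Theorem 3.1.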

THEOREM 3.1   Let $ \hat \beta_1, \cdots, \hat \beta_p $ be the eigenvectors of $ \hat \Sigma $. Suppose that (C1)-(C6) (in Appendix 3.9) hold and model (3.4) is true. If $ nh^p/\log n
\to \infty $ and $ h \to 0 $, then for any $ k \le D $,
$\displaystyle \Vert (I-B_0B_0^{\top })\hat \beta_k \Vert = O_P (h^r + \delta_n^2 h^{-1}),$

where $ \delta_n = \{ \log n/(nh^p )\}^{1/2} $ and $ \Vert M \Vert $ denotes the Euclidean norm of $ M$. If, further, $ r > p-2 $ and $ h = O(n^{-\tau}) $ with $ \{ 2(p-2)\}^{-1} > \tau > (2r)^{-1} $, then
$\displaystyle \Vert (I-B_0B_0^{\top })\hat \beta_k\Vert = O_P(n^{-1/2}).$      

Similar results were obtained by Härdle and Stoker (1989) for the ADE method. Note that the optimal bandwidth for the estimation of the regression function (or its derivatives) in the sense of the mean integrated squared error (MISE) is $ h_{opt} \sim n^{-1/\{2(r+1) + p\}} $. The fastest convergence rate for the estimator of the directions cannot be achieved with a bandwidth of this order; it is achieved at $ h \sim n^{-1/(r+1+p)}$, which is smaller than $ h_{opt} $. This point has been noticed by many authors in other contexts; see, for example, Hall (1984), Weisberg and Welsh (1994), and Carroll et al. (1997).
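As a rough heuristic (our own sketch, ignoring the logarithmic factor), this bandwidth order can be recovered by balancing the two terms in the bound of Theorem 3.1:

$\displaystyle h^r \asymp \delta_n^2 h^{-1} = \frac{\log n}{n h^{p+1}} \quad\Longrightarrow\quad h^{r+1+p} \asymp \frac{\log n}{n} \quad\Longrightarrow\quad h \asymp \Big(\frac{\log n}{n}\Big)^{1/(r+1+p)} \sim n^{-1/(r+1+p)}.$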


3.2.2 The Varying-coefficient Model

Let $ Z = (1, z_1, \cdots, z_q)^{\top } $ and $ C(X) = (c_0(X),
c_1(X), \cdots, c_q(X))^{\top } $. Then model (3.7) can be written as $ y = Z^{\top } C(X) $. It is easy to see that

$\displaystyle C(x) = \{\textrm{E}(ZZ^{\top }\vert X=x)\}^{-1} \textrm{E}(Zy\vert X=x).$      

LEMMA 3.2   Suppose that $ c_0(x), \cdots, c_q(x) $ are differentiable. If model (3.8) is true, then $ B_0 $ is in the space spanned by the first $ D$ eigenvectors of $ E\{\bigtriangledown C(X) \bigtriangledown^{\top } C(X)\} $, where $ \bigtriangledown C(X) = (\bigtriangledown c_0(X), \cdots, \bigtriangledown c_q(X)). $

Similarly, we can estimate the gradient $ \bigtriangledown C(x) $ using a nonparametric method. For example, if we use the local linear smoothing method, we can estimate the gradients by solving the following minimization problem

$\displaystyle \min_{c_k, b_k: k = 0, 1, \cdots, q} \sum_{i=1}^n \left\{ y_i - \sum_{k=0}^q [ c_k(x) + b_k^{\top }(x) (X_i - x) ] z_k \right\}^2 K_h(X_i - x).$

Let $ \{ \hat c_k(x), \hat b_k(x): k = 0, \cdots, q\} $ be the solutions. Then, we get the estimate $ \widehat {\bigtriangledown
C}(X_j) = (\hat b_0(X_j),\hat b_1(X_j), \cdots, \hat b_q(X_j) )
$. Finally, $ B_0
$ can be estimated by the first $ D$ eigenvectors of $ n^{-1} \sum_{j=1}^n \widehat {\bigtriangledown
C}(X_j) \widehat {\bigtriangledown C}^{\top }(X_j) $.
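As an illustration, the following minimal Python sketch computes the local linear estimates $ \hat b_k(X_j) $ and the leading eigenvectors of $ n^{-1} \sum_{j=1}^n \widehat {\bigtriangledown C}(X_j) \widehat {\bigtriangledown C}^{\top }(X_j) $. The function name vc_opg_directions, the Gaussian kernel and the convention that the rows of Z hold the observed regressors $ (1, z_{i1}, \cdots, z_{iq}) $ are our own illustrative assumptions.

import numpy as np

def vc_opg_directions(X, Z, y, h, n_dirs):
    # Sketch of the gradient outer product estimator for the varying-coefficient model.
    # X : (n, p) covariates entering the coefficient functions,
    # Z : (n, q+1) regressors with a leading column of ones, y : (n,) responses,
    # h : bandwidth, n_dirs : number of e.d.r. directions to return.
    n, p = X.shape
    q1 = Z.shape[1]                                       # q + 1 coefficient functions
    Sigma = np.zeros((p, p))
    for j in range(n):
        D = X - X[j]                                      # rows are X_i - X_j
        w = np.exp(-0.5 * np.sum((D / h) ** 2, axis=1))   # kernel weights K_h(X_i - X_j)
        # local linear design: each regressor z_k multiplies (1, X_i - X_j)
        G = np.hstack([Z[:, [k]] * np.hstack([np.ones((n, 1)), D]) for k in range(q1)])
        A = G.T @ (w[:, None] * G)
        c = G.T @ (w * y)
        theta = np.linalg.lstsq(A, c, rcond=None)[0]      # stacked (c_k, b_k) estimates
        grads = theta.reshape(q1, p + 1)[:, 1:]           # row k is the gradient of c_k at X_j
        Sigma += grads.T @ grads / n                      # adds grad C(X_j) grad C(X_j)^T / n
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_dirs]]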