3.3 A Unified Estimation Method

As we have discussed above, the e.d.r. directions can be obtained from the relevant outer product of gradients. Further, the proposed OPG method can achieve root-$ n$ consistency. Unlike the SIR method, the OPG method does not need strong assumptions on the design $ X$ and can be used for more complicated models. However, its estimators still suffer from poor performance when a high-dimensional kernel is used in (3.12). Now, we discuss how to improve the OPG method.

Note that all the existing methods adopt two separate steps to estimate the directions: first estimate the regression function, and then estimate the directions based on the estimated regression function. See, for example, Hall (1984), Härdle and Stoker (1989), Carroll et al. (1997) and the OPG method above. It is therefore not surprising that the performance of the direction estimator suffers from the bias problem in nonparametric estimation. Härdle, Hall, and Ichimura (1993) noticed this point and estimated the bandwidth and the direction simultaneously in a single-index model by minimizing the sum of squares of the residuals. They further showed that the optimal bandwidth for the estimation of the regression function in the sense of MISE enables the estimator of the direction to achieve root-$ n$ consistency. Inspired by this, we propose to estimate the directions by minimizing the mean of the conditional variance simultaneously with respect to the regression function and the directions. As we shall see, results similar to those of Härdle, Hall, and Ichimura (1993) can be obtained and an improvement over the OPG method achieved.


3.3.1 The Simple Case

In this subsection, we investigate the relation between $ y$ and $ X$. The idea was proposed by Xia et al. (2002). Consider model (3.4). For any matrix $ B = (\beta_1, \cdots, \beta_d) $ with orthonormal columns, the conditional variance given $ B^{\top } X $ is

$\displaystyle \sigma_{B}^2(B^{\top } X) = E[\{ y- \textrm{E}(y\vert B^{\top } X)\}^2 \vert
B^{\top } X].\ $     (3.12)

It follows that
$\displaystyle E\left[ y- \textrm{E}(y\vert B^{\top } X)\right]^2 = E\sigma_{B}^2(B^{\top } X).$      

Therefore, minimizing (3.5) is equivalent to minimizing, with respect to $ B$,
$\displaystyle E\sigma^2_{B}(B^{\top }X)\quad \textrm{subject to}\ \ B^{\top }B = I.$     (3.13)

We shall call this the minimum average (conditional) variance (MAVE) estimation. Suppose $ \{ (X_i, y_i )\ i = 1, 2, \cdots, n\}
$ is a random sample from $ (X, y ) $. Let $ {\sl g}_B(v_1, \cdots,
v_d) = \textrm{E}(y\vert\beta_1^{\top }X = v_1, \cdots, \beta_d^{\top }X =
v_d) $. For any given $ X_0 $, a local linear expansion of $ \textrm{E}(y_i\vert B^{\top }X_i) $ at $ X_0 $ is
$\displaystyle \textrm{E}(y_i\vert B^{\top } X_i) \approx a + b B^{\top }(X_i - X_0) ,$      

where $ a = {\sl g}_{B}(B^{\top }X_0) $ and $ b = (b_1, \cdots, b_d) $ with
$\displaystyle b_k = \frac{\partial {\sl g}_B(v_1, \cdots, v_d)}{\partial v_k}\Big\vert_{v_1 = \beta_1^{\top }X_0, \cdots, v_d = \beta_d^{\top }X_0}, \quad k = 1, \cdots , d.$

The residuals are then
$\displaystyle y_i- {\sl g}_B(B^{\top }X_i) \approx y_i - \left[ a + b B^{\top }
(X_i - X_0) \right].$      

Following the idea of Nadaraya-Watson estimation, we can estimate $ \sigma_{B}^2 ( B^{\top } X_0) $ by exploiting the approximation
    $\displaystyle \sum_{i=1}^n \left[y_i - \textrm{E}(y_i\vert B^{\top }X_i) \right]^2 w_{i0} \approx \sum_{i=1}^n \left[y_i - \{ a + b B^{\top }(X_i - X_0) \} \right]^2 w_{i0} , \qquad$ (3.14)

where $ w_{i0}\ge 0 $ are weights with $ \sum_{i=1}^n w_{i0} = 1 $, typically concentrating on observations with $ B^{\top }X_i $ close to $ B^{\top }X_0$. The choice of the weights $ w_{i0}$ plays a key role in the different approaches to searching for the e.d.r. directions in this chapter; we shall discuss this issue in detail later. Usually, $ w_{i0} = K_h(B^{\top }(X_i - X_0))/\sum_{l=1}^n K_h(B^{\top }(X_l - X_0)) $. For ease of exposition, we use $ K(\cdot) $ to denote different kernel functions at different places and let $ K_h(\cdot) = h^{-d} K(\cdot/h) $, where $ d $ is the dimension of $ K(\cdot) $. Note that the estimators of $ a$ and $ b$ are just the minimizers of (3.14). Therefore, the estimator of $ \sigma_{B}^2$ at $ B^{\top }X_0$ is just the minimum value of (3.14), namely
$\displaystyle \hat \sigma_{B}^2(B^{\top } X_0) = \min_{a, b_1, b_2, \cdots, b_d} \sum_{i=1}^n \Big[y_i - \{ a + b B^{\top }(X_i - X_0)\} \Big]^2 w_{i0} . \qquad$ (3.15)

Note that the estimator $ \hat \sigma_B^2(B^{\top }x) $ is different from existing ones. See, for example, Fan and Yao (1998). For this estimator, the following holds.

LEMMA 3.3   Under assumptions (C1)-(C6) (in Appendix 3.9) and $ w_{i0} = K_h(B^{\top }(X_i - X_0))/ \sum_{l=1}^n K_h(B^{\top }(X_l - X_0)) $, we have
$\displaystyle \hat \sigma_{B}^2(B^{\top } X_0) - \sigma_{B}^2(B^{\top } X_0) = o_P(1).$

Based on (3.5), (3.13), and (3.15), we can estimate the e.d.r. directions by solving the following minimization problem.

    $\displaystyle \min_{B:\ B^{\top }B = I} \sum_{j=1}^n \hat \sigma_{B}^2(B^{\top } X_j) = \min_{\stackrel{B:\ B^{\top }B = I}{a_{j}, b_{j},\ j = 1, \cdots, n}} \sum_{j=1}^n \sum_{i=1}^n \Big[y_i - \{ a_{j} + b_{j}B^{\top }(X_i - X_j) \} \Big]^2 w_{ij}, \quad$ (3.16)

where $ b_j = (b_{j1},\cdots, b_{jd})$. The MAVE method or the minimization in (3.16) can be seen as a combination of nonparametric function estimation and direction estimation, which minimizes (3.16) simultaneously with respect to the directions and the nonparametric regression function. As we shall see, we benefit from this simultaneous minimization.

Note that the weights depend on $ B$. Therefore, implementing the minimization in (3.16) is non-trivial. The weight $ w_{i0}$ in (3.14) should reflect the distance between $ X_i $ and $ X_0 $, giving more weight to observations $ X_i $ close to $ X_0 $. Next, we give two choices of $ w_{i0}$.

(1) Multi-dimensional kernel. To simplify (3.16), a natural choice is $ w_{i0} = K_h(X_i - X_0)/ \sum_{l=1}^n K_h(X_l - X_0)$. If our primary interest is in dimension reduction, this multi-dimensional kernel will not slow down the convergence rate in the estimation of the e.d.r. directions. This was first observed by Härdle and Stoker (1989). See also Theorem 3.1. For such weights, the right-hand side of (3.15) no longer tends to $ \sigma_{B}^2 ( B^{\top } X_0) $. However, we have

LEMMA 3.4   Suppose $ y = \tilde {\sl g}(X) + \varepsilon $ with $ \textrm{E}(\varepsilon\vert
X) = 0\ a.s. $ Under assumptions (C1)-(C6) (in Appendix 3.9) and $ w_{i0} = K_h(X_i - X_0)/ \sum_{l=1}^n K_h(X_l
- X_0)$, we have
    $\displaystyle \min_{a, b_1, b_2, \cdots, b_d} \sum_{i=1}^n \Big[y_i - \{a + b B^{\top }(X_i - X_0)\} \Big]^2 w_{i0}$
    $\displaystyle \qquad \qquad \qquad = \hat \sigma^2(X_0) + h^2 \bigtriangledown^{\top }\tilde {\sl g}(X_0) (I_{p\times p} - BB^{\top }) \bigtriangledown\tilde {\sl g}(X_0) + o_P(h^2) ,$

where $ \hat \sigma^2(X_0) = n^{-1} \sum_{i=1}^n \varepsilon_i^2
w_{i0} $ does not depend on $ B$.

Note that $ BB^{\top } $ is a projection matrix. The bias term on the right hand side above is asymptotically non-negative. Therefore, by the law of large numbers and Lemma 3.4, the minimization problem (3.16) depends mainly on

    $\displaystyle E \{\bigtriangledown^{\top } \tilde {\sl g}(X) (I_{p\times p} - BB^{\top }) \bigtriangledown \tilde {\sl g}(X)\}$
    $\displaystyle = tr\left[(I_{p\times p} - BB^{\top }) E \{\bigtriangledown \tilde {\sl g}(X) \bigtriangledown^{\top }\tilde {\sl g}(X)\}\right]$
    $\displaystyle = tr \Big[E \{\bigtriangledown \tilde {\sl g}(X)\bigtriangledown^{\top }\tilde {\sl g}(X)\}\Big] - tr\Big[B^{\top } E \{\bigtriangledown \tilde {\sl g}(X)\bigtriangledown^{\top }\tilde {\sl g}(X)\}B \Big].$ (3.17)

Therefore, the $ B$ which minimizes the above expression is approximately given by the first $ d $ eigenvectors of $ E \{\bigtriangledown\tilde {\sl g}(X) \bigtriangledown^{\top } \tilde {\sl g}(X)\} $. By Lemma 3.4, we can still use (3.16) to find an estimator for $ B$ if we use the weight $ w_{ij} = K_h(X_i - X_j)/\sum_{l=1}^n K_h(X_l - X_j) $. To improve the convergence rate, we can further use $ r$-th ($ r\ge 1$) order local polynomial fitting as follows.
$\displaystyle \min_{\stackrel{B:\ B^{\top }B = I}{a_{j}, b_{j},\ j = 1,\cdots, n}} \sum_{j=1}^n\sum_{i=1}^n \Big[y_i - a_{j} - b_{j}B^{\top }(X_i - X_j)$
    $\displaystyle \qquad - \sum_{1<k\le r} \sum_{i_1+\cdots+i_p=k} c_{i_1,i_2,\cdots,i_p} \{X_i-X_j\}_1^{i_1}\{X_i-X_j\}_2^{i_2}\cdots\{X_i-X_j\}_p^{i_p} \Big]^2 w_{ij},$ (3.18)

where $ b_j = (b_{j1},\cdots, b_{jd})$ and we assume the summation over an empty set to be 0 in order to include the case of $ r = 1 $. Note that the minimization in (3.18) can be effected by alternating between $ (a_j, b_j) $ and $ B$. See the next section for details.
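The following is a minimal sketch (Python/NumPy) of such an alternating scheme for $ r = 1 $ with the multi-dimensional kernel weights $ w_{ij} = K_h(X_i - X_j)/\sum_{l=1}^n K_h(X_l - X_j) $ discussed above: with $ B$ fixed, each $ (a_j, b_j) $ is a weighted least-squares fit; with all $ (a_j, b_j) $ fixed, $ B$ is refreshed by weighted least squares in $ \textrm{vec}(B) $ followed by re-orthonormalization. It is an illustration under our own choices (Gaussian kernel, initial value, fixed number of iterations), not the exact algorithm of the next section.

```python
import numpy as np

def kernel_weights_matrix(X, h):
    """w_ij = K_h(X_i - X_j) / sum_l K_h(X_l - X_j), Gaussian product kernel."""
    D = X[:, None, :] - X[None, :, :]                 # D[i, j] = X_i - X_j
    K = np.exp(-0.5 * np.sum((D / h) ** 2, axis=2))
    return K / K.sum(axis=0, keepdims=True)           # each column j sums to one

def mave_r1(X, y, d, h, n_iter=20):
    """Illustrative alternating minimization of (3.16) with r = 1."""
    n, p = X.shape
    # start from the first d principal directions of X (any orthonormal start would do)
    B = np.linalg.svd(X - X.mean(0), full_matrices=False)[2][:d].T   # (p, d)
    W = kernel_weights_matrix(X, h)
    for _ in range(n_iter):
        # step 1: given B, local linear fits (a_j, b_j) at every X_j
        a = np.empty(n)
        b = np.empty((n, d))
        for j in range(n):
            U = (X - X[j]) @ B
            Z = np.hstack([np.ones((n, 1)), U])
            sw = np.sqrt(W[:, j])
            coef, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
            a[j], b[j] = coef[0], coef[1:]
        # step 2: given (a_j, b_j), refresh B by weighted least squares in vec(B)
        rows, rhs, wts = [], [], []
        for j in range(n):
            M = (X - X[j])[:, :, None] * b[j][None, None, :]   # (n, p, d) blocks
            rows.append(M.reshape(n, p * d))
            rhs.append(y - a[j])
            wts.append(W[:, j])
        G = np.vstack(rows)
        r = np.concatenate(rhs)
        sw = np.sqrt(np.concatenate(wts))
        vecB, *_ = np.linalg.lstsq(G * sw[:, None], r * sw, rcond=None)
        B, _ = np.linalg.qr(vecB.reshape(p, d))       # re-impose B'B = I
    return B
```

The QR step re-imposes the constraint $ B^{\top }B = I $ after each refresh; for simplicity the loop runs a fixed number of iterations rather than using a convergence criterion.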

Root-$ n$ consistency of the MAVE estimator of the e.d.r. directions can also be proved for sufficiently large order $ r$. Besides the difference between the MAVE method and the other methods stated at the beginning of this section, we need to address another difference between the multi-dimensional kernel MAVE method and the OPG method or the other existing estimation methods. The MAVE method uses the fact that the local fits share common e.d.r. directions as prior information. Therefore, it can be expected that the MAVE method outperforms the OPG method as well as the other existing methods. In order not to be distracted by the complexity of the expressions arising from high-order local polynomial fitting, we now focus on the case $ r = 1 $.

THEOREM 3.2   Suppose that (C1)-(C6) (in Appendix 3.9) hold and model (3.4) is true. Take $ r = 1 $. If $ nh^p/\log n
\to \infty $, $ h \to 0 $ and $ d \ge D $, then
$\displaystyle \Vert(I-\hat B\hat B^{\top })B_0\Vert = O_P(h^3 + h\delta_n+h^{-1}\delta_n^2).$      

Note that the convergence rate is $ O_P(h^3(\log n)^{1/2}) $ if we use the optimal bandwidth $ h_{opt} $ for the estimation of the regression function in the sense of MISE, in which case $ \delta_n = O_P(h^2(\log n)^{1/2}) $. This is faster than the rate for the other methods, which is $ O_P(h^2) $. Note that the convergence rate for the local linear estimator of the regression function is also $ O_P(h^2) $. As far as we know, if we use the optimal bandwidth $ h_{opt} $ for the nonparametric function estimation in the sense of MISE, then for all the non-MAVE methods the convergence rate for the estimators of the directions is the same as that for the estimators of the nonparametric functions. As a typical illustration, consider the ADE method and the single-index model $ y = {\sl g}_0(\beta_0^{\top }X) + \varepsilon $. The direction $ \beta_0$ can be estimated as

$\displaystyle \hat \beta_0 = \sum_{j=1}^n \hat b_j/\Vert\sum_{j=1}^n \hat b_j\Vert,$      

where $ \hat b_j $ is obtained by the minimization in (3.12). If we take $ r = 1 $ in (3.12), then we have
$\displaystyle \hat \beta_0 = \pm \beta_0 + \frac12 h^2 \{E{\sl g}_0'(\beta_0^{\top }X)\}^{-1}(I - \beta_0\beta_0^{\top }) E\{{\sl g}''_0(\beta_0^{\top }X) f^{-1}(X) \bigtriangledown f(X) \} + o_P(h^2),$ (3.19)

from which we can see that the convergence rate for $ \hat \beta_0$ is also $ O_P(h^2) $. In order to improve the convergence rate, a popular device is to undersmooth the regression function $ {\sl g}_0 $ by taking $ h = o(h_{opt})$. Although undersmoothing is asymptotically justified, it has two disadvantages in practice. (1) There is no general guidance on the selection of such a smaller bandwidth. (2) All our simulations show that the estimation errors of the direction estimators grow very quickly as the bandwidth decreases, much faster than they do as the bandwidth increases. See Figures 3.3, 3.4 and 3.5. Thus a small bandwidth can be quite dangerous.

To illustrate this, consider the model

$\displaystyle y = (\theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3)^2 + 0.5\varepsilon,$     (3.20)

where $ x_1, x_2, x_3$ and $ \varepsilon$ are i.i.d. $ N(0, 1)$ and $ (\theta_1, \theta_2, \theta_3) = (4, 1, 0)/\sqrt{17} $. We draw 500 independent samples of size $ n=100$. For different choices of the bandwidth, the average of the absolute errors $ \sum_{i=1}^3\vert \hat \theta_i -\theta_i\vert $ is computed for both the OPG method and the MAVE method. As a more complicated example, we also investigate the nonlinear autoregressive model
$\displaystyle y_t = \sin(\pi(\theta^{\top } X_t )) + 0.2 \varepsilon_t,$     (3.21)

with $ X_t = (y_{t-1}, \cdots, y_{t-5})^{\top } $, $ \varepsilon_t \sim N(0,1) $ and $ \theta = (1, 0, 1, 0,
1)/\sqrt{3} $. With sample size $ n=200$, we perform a comparison similar to that for model (3.20). In Figure 3.3, the solid lines are the errors of the MAVE method and the dashed lines are those of the OPG method. The MAVE method outperforms the OPG method across all bandwidths. The MAVE method also outperforms the PPR method, whose errors are shown by the dotted line in Figure 3.3 (a). [For model (3.21), the PPR error is 0.9211, which is much worse than our simulation results.] The asterisks refer to the errors when using the bandwidth chosen by the cross-validation method. This bandwidth appears to be close to the optimal bandwidth as measured by the errors of the estimated directions. This observation suggests the feasibility of using the cross-validation bandwidth for the MAVE method. It also supports our theoretical result that the optimal bandwidth for the estimation of the nonparametric function is also suitable for the MAVE estimation of the directions.
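As a rough illustration of how one replication of the experiment for model (3.20) might be generated and scored (relying on the mave_r1 sketch given after (3.18); the bandwidth value and the number of replications below are our arbitrary choices, not those of the reported study):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([4.0, 1.0, 0.0]) / np.sqrt(17.0)

def one_replication(n=100, h=0.5):
    X = rng.standard_normal((n, 3))
    y = (X @ theta) ** 2 + 0.5 * rng.standard_normal(n)
    b_hat = mave_r1(X, y, d=1, h=h)[:, 0]        # direction estimate (sketch above)
    b_hat *= np.sign(b_hat @ theta)              # resolve the sign ambiguity
    return np.sum(np.abs(b_hat - theta))         # error measure sum_i |theta_hat_i - theta_i|

errors = [one_replication() for _ in range(10)]  # 500 replications in the text; 10 here
print(np.mean(errors))
```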

Figure 3.3: (a) and (b) are the simulation results for model (3.20) and model (3.21) respectively. The dashed lines are the estimation errors of the OPG method and the solid lines are the errors of the MAVE method. The asterisks refer to the errors when using the bandwidth chosen by the cross-validation method.
\includegraphics[width=1.2\defpicwidth]{d_g1.ps}

(2) Inverse regression weight. If $ \{y_i \} $ and $ \{ X_i
\} $ have an approximate 1-1 correspondence, then we can use $ y$ instead of $ X$ to produce the weights. As an example, suppose $ \textrm{E}(y\vert X) = {\sl g}_0(\beta_0^{\top }X) $ and $ {\sl g}_0(\cdot) $ is invertible. Then we may choose

$\displaystyle w_{ij} = K_{h}(y_i - y_j)/\sum_{\ell = 1}^n K_h(y_\ell - y_j).$     (3.22)

For this weight function, the minimization in (3.15) becomes the minimization of
$\displaystyle \sum_{i=1}^n \Big[ y_i - \{a + b\beta_0^{\top }(X_i - X_0)\} \Big]^2 w_{i0} .$      

Following the idea of the MAVE method above, we should minimize
$\displaystyle n^{-1} \sum_{i=1}^n\sum_{j=1}^n
\Big[y_i - \{ a_{j} + b_{j}\beta^{\top }(X_i - X_j) \} \Big]^2
w_{ij} .$     (3.23)

We may also consider a `dual' of (3.23) and minimize
$\displaystyle n^{-1} \sum_{j=1}^n\sum_{i=1}^n
\Big[\beta^{\top } X_i - c_j - d_j(y_j-y_i) \Big]^2 w_{ij} .$     (3.24)

This may be considered an alternative derivation of the SIR method. Extension of (3.24) to more than one direction can be stated as follows. Suppose that the first $ k$ directions have been calculated and are denoted by $ \hat \beta_1, \cdots,
\hat \beta_k $ respectively. To obtain the $ (k+1)$th direction, we need to perform

    $\displaystyle \min_{\alpha_1, \cdots, \alpha_k, c_j, d_j, \beta}\sum_{j=1}^n \sum_{i=1}^n \Big\{\beta^{\top } X_i + \alpha_1\hat \beta_1^{\top } X_i + \cdots + \alpha_k\hat \beta_k^{\top } X_i - c_j - d_j (y_i - y_j)\Big\}^2 w_{ij}$
    $\displaystyle \textrm{subject to:} \qquad \beta^{\top } (\hat \beta_1, \cdots, \hat \beta_k) = 0 \ \textrm{and} \ \Vert\beta\Vert= 1.$ (3.25)

We call the method that minimizes (3.23) with $ w_{ij} $ as defined in (3.22) the inverse MAVE (iMAVE) method. Under conditions similar to those for the SIR method, root-$ n$ consistency of the estimators can be proved.
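A minimal sketch of the inverse regression weights (3.22) and of the iMAVE objective (3.23) for a single candidate direction ($ d = 1 $), again in Python/NumPy with a Gaussian kernel; the names are ours:

```python
import numpy as np

def inverse_weights(y, h):
    """w_ij = K_h(y_i - y_j) / sum_l K_h(y_l - y_j), Gaussian kernel in y."""
    D = y[:, None] - y[None, :]
    K = np.exp(-0.5 * (D / h) ** 2)
    return K / K.sum(axis=0, keepdims=True)

def imave_objective(beta, X, y, h):
    """Value of (3.23) for a single candidate direction beta (d = 1)."""
    n = len(y)
    W = inverse_weights(y, h)
    total = 0.0
    for j in range(n):
        u = (X - X[j]) @ beta                     # beta'(X_i - X_j)
        Z = np.column_stack([np.ones(n), u])
        sw = np.sqrt(W[:, j])
        coef, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
        total += np.sum(W[:, j] * (y - Z @ coef) ** 2)
    return total / n
```

Minimizing this objective over unit-length $ \beta$ (for instance with a general-purpose optimizer under the constraint $ \Vert\beta\Vert = 1 $) gives the first iMAVE direction; further directions can be obtained via (3.25).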

THEOREM 3.3   Suppose that (3.10) and assumptions (C1), (C2$ '$), (C3$ '$), (C4), (C5$ '$) and (C6) (in Appendix 3.9) hold. Then for any $ h \to 0 $ and $ nh^2/\log n \to \infty $
$\displaystyle \Vert (I - B_0 B_0^{\top }) \hat B\Vert = O_P(h^2 + n^{-1/2}).$      

If further $ nh^4 \to 0 $, then
$\displaystyle \Vert (I - B_0 B_0^{\top }) \hat B\Vert = O_P(n^{-1/2}).$      

The result is similar to that of Zhu and Fang (1996). However, in our simulations the method based on the minimization in (3.25) always outperforms the SIR method. To illustrate, we adopt the examples used in Li (1991),

    $\displaystyle y = x_1 (x_1 + x_2 + 1) + 0.5\varepsilon,$ (3.26)
    $\displaystyle y = x_1/(0.5 + (x_2 + 1.5)^2) + 0.5\varepsilon,$ (3.27)

where $ \varepsilon, x_1, x_2, \cdots, x_{10} $ are independent and standard normally distributed. The sample size is set to $ n=200$ and 400. Let $ \beta_1 = (1, 0, \cdots, 0)^{\top } $, $ \beta_2 = (0, 1, \cdots, 0)^{\top } $ and $ P = I - (\beta_1, \beta_2)(\beta_1, \beta_2) ^{\top } $. Then the estimation errors can be measured by $ \hat \beta_1^{\top } P \hat \beta_1 $ and $ \hat \beta_2^{\top } P \hat \beta_2 $. Figure 3.4 shows the means of the estimation errors defined above; they are labeled ``1" and ``2" for $ \beta_1 $ and $ \beta_2$ respectively. The iMAVE method outperforms the SIR method in our simulations. Furthermore, the MAVE method outperforms the iMAVE method. Zhu and Fang (1996) proposed a kernel-smoothed version of the SIR method; however, their method does not show significant improvement over the original SIR method.
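The error measure $ \hat \beta_k^{\top } P \hat \beta_k $ is simply the squared length of the component of $ \hat \beta_k $ lying outside the true e.d.r. space. A small sketch of the measure for these two models (the estimation step itself is omitted; beta_hat is a placeholder for an estimated direction):

```python
import numpy as np

p = 10
beta1 = np.eye(p)[:, 0]                  # (1, 0, ..., 0)'
beta2 = np.eye(p)[:, 1]                  # (0, 1, ..., 0)'
B0 = np.column_stack([beta1, beta2])
P = np.eye(p) - B0 @ B0.T                # projection onto the orthogonal complement

def direction_error(beta_hat):
    """beta_hat' P beta_hat: zero iff beta_hat lies in span(beta1, beta2)."""
    beta_hat = beta_hat / np.linalg.norm(beta_hat)
    return float(beta_hat @ P @ beta_hat)
```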

Figure: means of $ \hat \beta_1^{\top } P \hat \beta_1 $ [ $ \hat \beta_2^{\top } P \hat \beta_2 $] are labelled ``1'' [``2'']. Figures (a) and (b) [(c) and (d)] are for model (3.26) [(3.27)]. Figures (a) and (c) [(b) and (d)] are based on a sample size of 200 [400]. Dashed [smooth] lines are based on the MAVE [iMAVE] method. The wavy lines are based on the SIR method. The horizontal axes give the number of slices/bandwidth (in square brackets) for the SIR method/iMAVE method. For the MAVE method, the range of bandwidth extends from 2 to 7 for (a) and (c), 1.5 to 4 for (b) and (d).
\includegraphics[width=1.0\defpicwidth]{g_siry1.ps}

As noticed previously, the assumption of symmetry on the design $ X$ can be a handicap as far as applications of the SIR method are concerned. Interestingly, simulations show that the SIR method sometimes works for independent data even when the assumption of symmetry is violated. However, for time series data, we find that the SIR method often fails. As a typical illustration, consider the nonparametric time series model

$\displaystyle y_t = \{ \sqrt{5}(\theta_1 y_{t-1} + \theta_2 y_{t-2})/2 \}^{1/3} + \varepsilon_t ,$     (3.28)

where $ \varepsilon_t $ are i.i.d. standard normal and $ (\theta_1, \theta_2) = (2, 1)/\sqrt{5} $. A typical data set is generated with size $ n = 1000$ and is plotted in Figure 3.5 (a). Clearly, the assumption (3.10) is violated and the SIR method is inappropriate here.

Figure 3.5: (a): $ y_{t-1}$ plotted against $ y_{t-2}$ for a typical data set with $ n = 1000$. (b): the mean absolute estimation error of the SIR method plotted against the number of slices. (c): the mean estimation errors of the iMAVE method (solid line) and the MAVE method (dashed line) plotted against the bandwidth.
\includegraphics[width=1.2\defpicwidth]{d_g4.ps}

For sample size $ n = 500 $, we draw 200 samples from model (3.28). Using the SIR method with different numbers of slices, the mean of the estimation errors $ \vert\hat \theta_1 - \theta_1 \vert + \vert \hat \theta_2 - \theta_2 \vert $ is plotted in Figure 3.5 (b). The estimation is quite poor. However, the iMAVE estimation gives much better results, and the MAVE ($ r = 1 $) method is better still.
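A short sketch of how a series from model (3.28) might be generated for such an experiment; the burn-in length, the starting values and the use of the real cube root for negative arguments are our own choices:

```python
import numpy as np

def simulate_model_328(n, rng, burn_in=200):
    """y_t = {sqrt(5)(theta1*y_{t-1} + theta2*y_{t-2})/2}^(1/3) + eps_t."""
    theta1, theta2 = 2.0 / np.sqrt(5.0), 1.0 / np.sqrt(5.0)
    y = np.zeros(n + burn_in)
    eps = rng.standard_normal(n + burn_in)
    for t in range(2, n + burn_in):
        u = np.sqrt(5.0) * (theta1 * y[t - 1] + theta2 * y[t - 2]) / 2.0
        y[t] = np.cbrt(u) + eps[t]        # real cube root, valid for negative u
    return y[burn_in:]

rng = np.random.default_rng(1)
y = simulate_model_328(500, rng)
X = np.column_stack([y[1:-1], y[:-2]])    # regressors (y_{t-1}, y_{t-2})
resp = y[2:]                              # response y_t
```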

Now, we compare the MAVE method with the iMAVE method (or the SIR method). Besides the fact that the MAVE method is applicable to an asymmetric design $ X$, the MAVE method (with $ r = 1 $) has better performance than the iMAVE (or SIR) method for the above models and for all the other simulations we have done. We even tried the same models with higher dimensionality $ p$. All our simulation results show that the iMAVE method performs better than the SIR method and that the MAVE method performs better than both of them. Intuitively, we might expect the iMAVE method and the SIR method to benefit from the use of one-dimensional kernels, unlike the MAVE method, which uses a multi-dimensional kernel. However, if the regression function $ {\sl g}$ is symmetric about 0, then the SIR method and the iMAVE method usually fail to find the directions. Furthermore, any fluctuation in the regression function may reduce the efficiency of the estimation. (To overcome the effect of symmetry in the regression function, Li (1992) used a third-moment method to estimate the Hessian matrix. This method has, however, a larger variance in practice.) Another reason, in addition to Theorem 3.2, why the iMAVE method and the SIR method perform poorly may be the following. Note that for the MAVE method

$\displaystyle B_0-\hat B (\hat B^{\top } B_0) = (I - B_0B_0^{\top }) \sum_{i = 1}^n \bigtriangledown{\sl g}(B_0^{\top }X_i) \bigtriangledown f(X_i) \varepsilon_i \Big( E \{\bigtriangledown{\sl g}(B_0^{\top }X) \bigtriangledown^{\top }\!\!{\sl g}(B_0^{\top }X)\} \Big)^{-1} + O_P(h^3 + h\delta_n).$ (3.29)

Now consider the iMAVE method. Let $ S(y) = E\Big( \{X-
\textrm{E}(X\vert y)\}\{X- \textrm{E}(X\vert y)\}^{\top }\Big) $. The estimator $ \hat B $ consists of the eigenvectors of $ ES(y) + n^{-1} H_n + O_P(h^2), $ where
$\displaystyle H_n = n^{-1} \sum_{j=1}^n \Big[ \{S(y_j) - ES(y_j)\} + \{nf(y_j)\}^{-1} \sum_{i=1}^n \Big( K_{h,i}(y_j)\{ X_i - A(y_j)\}\{ X_i - A(y_j)\}^{\top } - EK_{h,i}(y_j)\{ X_i - A(y_j)\} \{ X_i - A(y_j)\}^{\top }\Big) \Big].$

Note that $ H_n $ is a $ p\times p $ matrix. The variance of the eigenvectors of $ ES(y) + n^{-1} H_n + O_P(h^2) $ is a summation of $ p^2 $ terms. See Zhu and Fang (1996). The variance is quite large for large $ p$. A theoretical comparison between the MAVE method and the iMAVE (or SIR) method is unavailable but we conjecture that the gain in using the univariate kernel will not be sufficient to compensate for the loss in other aspects such as the variance and the effect of fluctuations in the regression function.


3.3.2 The Varying-coefficient Model

Consider model (3.8). Note that $ B_0
$ is the solution of the following minimization problem

$\displaystyle \min_{B:\ B^{\top } B = I } E\Big[y- \sum_{\ell=0}^q z_\ell
{\sl g}_\ell(B^{\top } X) \Big]^2.\ $     (3.30)

Here and later, we take $ z_0 = 1 $ for ease of exposition. Consider the conditional variance of $ y$ given $ B^{\top } X $
$\displaystyle \sigma_{B}^2(B^{\top } X) = E\Big[ \{y- \sum_{\ell=0}^q z_\ell
{\sl g}_\ell(B^{\top } X) \}^2 \vert B^{\top } X \Big].\ $     (3.31)

It follows that
$\displaystyle E\left[E[ \{y- \sum_{\ell=0}^q z_\ell {\sl g}_\ell(B^{\top } X) \}^2 \vert B^{\top } X]\right] = E\sigma_{B}^2(B^{\top } X).$

Therefore, the minimization of (3.3) is equivalent to
$\displaystyle \min_{B:\ B^{\top }B = I }
E\sigma_{B}^2(B^{\top }X).$     (3.32)

Suppose $ \{ (X_i, Z_i, y_i )\ i = 1, 2, \cdots, n\} $ is a random sample from $ (X, Z, y) $. For any given $ X_0 $, a local linear expansion of $ {\sl g}_\ell (B^{\top }X_i) $ at $ X_0 $ is
$\displaystyle {\sl g}_\ell (B^{\top } X_i) \approx a_\ell + b_\ell B^{\top }(X_i -
X_0),$     (3.33)

where $ a_\ell = {\sl g}_\ell (B^{\top } X_0)
$ and $ b_\ell = (b_{\ell 1}, \cdots, b_{\ell d}) $ with
$\displaystyle b_{\ell k} = \frac{\partial {\sl g}_{\ell}(v_1, \cdots, v_d)}{\partial v_k}\Big\vert_{(v_1, \cdots, v_d)^{\top } = B^{\top }X_0}, \quad k = 1, \cdots , d;\ \ \ell = 1, \cdots, q .$

The residuals are then
$\displaystyle y_i- \sum_{\ell=0}^q z_\ell {\sl g}_\ell(B^{\top } X_i) \approx y_i - \sum_{\ell=0}^q \{ a_\ell z_\ell + b_\ell B^{\top }(X_i - X_0)z_\ell\} .$

Following the idea of Nadaraya-Watson estimation, we can estimate $ \sigma_{B}^2$ at $ B^{\top }X_0$ by
$\displaystyle \sum_{i=1}^n \Big[y_i - \textrm{E}(y_i\vert B^{\top }X_i, Z_i) \Big]^2 w_{i0} \approx \sum_{i=1}^n \Big[y_i - \sum_{\ell=0}^q \{ a_\ell z_\ell + b_\ell B^{\top }(X_i - X_0)z_\ell\} \Big]^2 w_{i0}. \quad$ (3.34)

As in the nonparametric case, the choice of $ w_{i0}$ is important. However, the model is now more complicated. Even if the $ {\sl g}_\ell $'s are monotonic functions, we cannot guarantee a 1-1 correspondence between $ y$ and $ X$. Therefore, a possible choice is the multi-dimensional kernel, i.e., $ w_{i0} = K_h(X_i - X_0) /\sum_{l = 1}^n K_h(X_l - X_0)$. To improve the accuracy, we can also use higher-order local polynomial smoothing since

$\displaystyle \sum_{i=1}^n \Big[y_i - \textrm{E}(y_i\vert B^{\top }X_i, Z_i) \Big]^2 w_{i0} \approx \sum_{i=1}^n \Big[y_i - \sum_{\ell=0}^q \{ a_\ell z_\ell + b_\ell B^{\top }(X_i - X_0)z_\ell\}$
$\displaystyle \qquad - \sum_{\ell=0}^q z_\ell \sum_{1 < k\le r} \sum_{i_1+\cdots+i_p=k} c_{\ell, i_1,i_2,\cdots,i_p}\{X_i-X_0\}_1^{i_1}\{X_i-X_0\}_2^{i_2}\cdots\{X_i-X_0\}_p^{i_p} \Big]^2 w_{i0}.$

Finally, we can estimate the directions by minimizing
    $\displaystyle \sum_{j=1}^n\sum_{i=1}^n \Big[y_i - \sum_{\ell=0}^q \{ a_\ell z_\ell + b_\ell B^{\top }(X_i - X_j)z_\ell\}$
    $\displaystyle \qquad - \sum_{\ell=0}^q z_\ell \sum_{1<k \le r} \sum_{i_1+\cdots+i_p=k} c_{\ell, i_1,i_2,\cdots,i_p}\{X_i-X_j\}_1^{i_1}\{X_i-X_j\}_2^{i_2}\cdots\{X_i-X_j\}_p^{i_p} \Big]^2 w_{ij}.$

Now, returning to the general model (3.2), suppose $ G(v_1,
\cdots, v_\gamma, Z, \theta) $ is differentiable. Let $ G_k'(v_1,
\cdots, v_\gamma, Z, \theta) =\partial G(v_1, \cdots, v_\gamma, Z,
\theta)/\partial v_k $, $ k = 1, \cdots, \gamma $. By (3.33), for $ B^{\top } X_i $ close to $ B^{\top }X_0$ we have

$\displaystyle G({\sl g}_1(B^{\top }X_i), \cdots, {\sl g}_\gamma(B^{\top }X_i), Z_i, \theta) \approx G({\sl g}_1(B^{\top }X_0), \cdots, {\sl g}_\gamma(B^{\top } X_0), Z_i, \theta)\qquad$
$\displaystyle \qquad + \sum_{k=1}^\gamma G_k'({\sl g}_1(B^{\top } X_0), \cdots, {\sl g}_\gamma(B^{\top } X_0), Z_i, \theta) \bigtriangledown^{\top } {\sl g}_k(B^{\top } X_0) B^{\top }(X_i - X_0).$

To estimate $ B$, we minimize
    $\displaystyle \sum_{j=1}^n \sum_{i=1}^n \Big\{ y_i - G(a_{1j}, \cdots, a_{\gamma j}, Z_i, \theta)$
    $\displaystyle \qquad - \sum_{k=1}^\gamma G_k'(a_{1j}, \cdots, a_{\gamma j}, Z_i, \theta) b_{kj}^{\top } B^{\top }(X_i - X_j)\Big\}^2 w_{ij}$

with respect to $ a_{1j}, \cdots, a_{\gamma j}$, $ b_{1j},\cdots,
b_{\gamma j},$ $ j = 1, \cdots, n $, $ \theta$ and $ B$.