3.1 Introduction


3.1.1 Real Data Sets

This chapter is motivated by our attempt to answer pertinent questions concerning a number of real data sets, some of which are listed below.

Example 3.1.1. Consider the relationship between the levels of pollutants and the weather on the one hand, and the total number ($ y_t $) of daily hospital admissions for circulatory and respiratory problems on the other. The covariates are the average levels of sulphur dioxide ($ x_{1t}$, unit $ \mu g\, m^{-3}$), nitrogen dioxide ($ x_{2t}$, unit $ \mu g\, m^{-3}$), respirable suspended particulates ($ x_{3t} $, unit $ \mu g\, m^{-3}$), ozone ($ x_{4t}$, unit $ \mu g\, m^{-3}$), temperature ($ x_{5t}$, unit $ ^{\circ}$C) and humidity ($ x_{6t}$, unit %). The data were collected daily in Hong Kong from January 1, 1994 to December 31, 1995 (courtesy of Professor T.S. Lau). The basic question is this: are the prevailing levels of the pollutants a cause for concern?

Figure 3.1: (a) Total number of daily hospital admissions for circulatory and respiratory problems. (b) Average level of sulphur dioxide. (c) Nitrogen dioxide. (d) Respirable suspended particulates. (e) Ozone. (f) Temperature. (g) Humidity.
\includegraphics[width=1.6\defpicwidth]{hkepdplot1.ps}

\includegraphics[width=1.6\defpicwidth]{hkepdplot2.ps}

\includegraphics[width=1.6\defpicwidth]{hkepdplot3.ps}

\includegraphics[width=1.6\defpicwidth]{hkepdplot4.ps}

The relationship between $ y_t $ and $ x_{1t}, x_{2t}, x_{3t},
x_{4t}, x_{5t}, x_{6t} $ is quite complicated. A naive approach may be to start with a simple linear regression model such as

$\displaystyle y_t = 255.45 - 0.55x_{1t} + 0.58x_{2t} + 0.18x_{3t} - 0.33x_{4t} - 0.12x_{5t} - 0.16x_{6t},$     (3.1)

with the standard errors of the intercept and of the six coefficients being 20.64, 0.18, 0.17, 0.13, 0.11, 0.46 and 0.23 respectively.


Note that the coefficients of $ x_{3t} $, $ x_{5t}$ and $ x_{6t}$ are not significantly different from 0 (at the 5% level of significance), and the negative coefficients of $ x_{1t} $ and $ x_{4t} $ are difficult to interpret. Refinements of the above model are, of course, possible within the linear framework, but they are unlikely to shed much light on the opening question because, as we shall see, the situation is quite complex.
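Model (3.1) is an ordinary least squares fit, with the bracketed numbers the classical standard errors of the estimates. The following sketch shows how both are computed; the arrays in the test of significance are synthetic stand-ins (the Hong Kong data set is not reproduced here).

```python
import numpy as np

def ols_with_se(X, y):
    """Ordinary least squares with classical standard errors.

    X is an (n, p) matrix of covariates (no intercept column);
    y is the (n,) response.  Returns the fitted coefficients
    (intercept first) and their standard errors.
    """
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])           # prepend an intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # least squares fit
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - p - 1)            # unbiased error variance
    cov = sigma2 * np.linalg.inv(Xd.T @ Xd)         # classical covariance matrix
    return beta, np.sqrt(np.diag(cov))

# with the six pollutant/weather covariates stacked as columns of X,
# a coefficient is significant at the 5% level roughly when
# |beta_j| > 1.96 * se_j.
```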

Example 3.1.2. We revisit the Mackenzie River Lynx data for 1821-1934. Following common practice in ecology and statistics, let $ y_t $ denote $ \log($number recorded as trapped in year 1820$ +t$), $ t=1, 2, \cdots, 114$. The series is shown in Figure 1.2. It is interesting that the relation between $ y_t $ and $ y_{t-1}$ appears quite linear, as shown in Figure 1.2(b), whereas the relation between $ y_{t} $ and $ y_{t-2}$ shows some nonlinearity. A number of time series models have been proposed for these data in the literature. Do they have points of contact with one another?

Figure: The Mackenzie River Lynx data for 1821-1934. (a) The series $ y_t = \log($number recorded as trapped in year 1820$ +t$). (b) Directed scatter diagram of $ y_t $ against $ y_{t-1}$. (c) Directed scatter diagram of $ y_t $ against $ y_{t-2}$.
\includegraphics[width=1.5\defpicwidth]{lynxplot1.ps}

\includegraphics[width=1.4\defpicwidth]{lynxplot2.ps}


3.1.2 Theoretical Consideration

Let $ (X, Z, y) $ be respectively $ {{\Bbb R}}^p $-valued, $ {{\Bbb R}}^q $-valued, and $ {{\Bbb R}} $-valued random variables. In the absence of any prior knowledge about the relation between $ y$ and $ (X, Z) $, a nonparametric regression model is usually adopted, i.e. $ y = {\sl g}(X, Z) + \varepsilon $, where $ \textrm{E}(\varepsilon \vert X, Z) = 0$ almost surely. More recently, there has been a tendency to use semiparametric models instead to fit the relation between $ y$ and $ (X, Z) $. There are three reasons for this. The first is to reduce the impact of the curse of dimensionality in nonparametric estimation. The second is that a parametric component allows us to build into the model an explicit expression, perhaps based on intuition or prior information, for part of the relation between the variables. The third is that, for one reason or another (e.g. the availability of background information), some restrictions have to be imposed on the relation. The latter two reasons also mean that we have some information about some of the explanatory variables but not about the others. A general semiparametric model can be written as

$\displaystyle y = G({\sl g}_{1}(B^{\top }X), \cdots, {\sl g}_\gamma(B^{\top }X), Z, \theta) + \varepsilon ,$     (3.2)

where $ G$ is a known function up to a parameter vector $ \theta \in {{\Bbb R}}^l $, $ {\sl g}_{1}(\cdot), \cdots,
{\sl g}_\gamma(\cdot) $ are unknown functions and $ \textrm{E}(
\varepsilon \vert X, Z) $ $ = 0$ almost surely. Covariates $ X$ and $ Z$ may have some common components. Matrix $ B$ is a $ p\times D
$ orthogonal matrix with $ D < p $. Note that model (3.2) still contains unknown multivariate functions $ {\sl g}_{1},\cdots,
{\sl g}_\gamma $. For model (3.2), parameter $ \theta$, matrix $ B$ and functions $ {\sl g}_{1}(\cdot), \cdots,
{\sl g}_\gamma(\cdot) $ are typically chosen to minimize
$\displaystyle E\Big[ y - G({\sl g}_1(B^{\top }X), \cdots, {\sl g}_\gamma(B^{\top }X),
Z, \theta) \Big]^2 .$     (3.3)

Note that the space spanned by the column vectors of $ B$ is uniquely defined under some mild conditions and is our focus of interest. For convenience, we shall refer to these column vectors as the efficient dimension reduction (e.d.r.) directions; they are unique up to orthogonal transformations. The above idea underlies many useful semiparametric models, of which the following are some examples.

(1) The following model has often been considered.

$\displaystyle y = {\sl g}(B_0^{\top }X) + \varepsilon,$     (3.4)

where $ B_0^{\top } B_0 = I_{D\times D}\ (D\le p) $ and $ \textrm{E}(\varepsilon\vert X) = 0$ almost surely. Here, both $ {\sl g}$ and $ B_0 $ are unknown. Li (1991) gave an interesting approach to the estimation of $ B_0 $, using model (3.4) to investigate the e.d.r. directions of $ \textrm{E}(y\vert X=x) $. Specifically, for any $ k$, the $ k$-indexing directions $ \beta_1,\cdots, \beta_k$ are defined such that $ B_k = (\beta_1,\cdots, \beta_k ) $ minimizes
$\displaystyle E[y- \textrm{E}(y\vert B_k^{\top } X)]^2.$     (3.5)

(2) A slightly more complicated model is the multi-index model proposed by Ichimura and Lee (1991), namely

$\displaystyle y = \theta_0^{\top } X + {\sl g}(B_0^{\top }X) + \varepsilon.$     (3.6)

The linear restriction of the first component is of some interest. See Xia, Tong, and Li (1999). Carroll et al. (1997) proposed a slightly simpler form. To fit model (3.6), we also need to estimate the e.d.r. directions or equivalently $ B_0
$, a $ p\times D
$ matrix. An important special case of (3.6) is the single-index model, in which $ \theta_0 = 0 $ and $ D = 1$.

(3) The varying-coefficient model proposed by Hastie and Tibshirani (1993),

$\displaystyle y = c_0(X) + c_1(X) z_1 + \cdots + c_q(X)z_q + \varepsilon,$     (3.7)

is a generalized linear regression with unknown varying coefficients $ c_0, \cdots, c_q $, which are functions of $ X$. Here, $ Z=(z_1,\cdots, z_q)^{\top }$. A similar model was proposed by Chen and Tsay (1993) in the time series context. The question is how to find the e.d.r. directions for $ X$ within the $ c_k $'s. More precisely, we seek the model
$\displaystyle y = {\sl g}_0(B_0^{\top }X) + {\sl g}_1(B_0^{\top }X) z_1 + \cdots +
{\sl g}_q(B_0^{\top }X)z_q+\varepsilon,$     (3.8)

where $ {\sl g}_1,\cdots, {\sl g}_q $ are unknown functions and the columns in $ B_0: p\times D$ are unknown directions. The case $ D = 1$ has an important application in nonlinear time series analysis: it can be used to select the (sometimes hidden) threshold variable (previously called the indicator variable) of a threshold autoregressive model (Tong; 1990). See also Xia and Li (1999).
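To make (3.8) concrete, suppose $ D = 1$ and, for illustration only, that the direction $ B_0 = \beta_0 $ is known. The coefficient functions can then be estimated at any index value $ u $ by kernel-weighted least squares of $ y$ on $ (1, z_1, \cdots, z_q) $. The sketch below uses simulated data; the link functions, bandwidth and kernel are hypothetical choices, not prescriptions.

```python
import numpy as np

def varying_coefs(u, X, Z, y, beta, h=0.3):
    """Estimate the coefficient functions g_0(u), ..., g_q(u) of
    model (3.8) at index value u, assuming the direction beta is
    known.  Kernel-weighted least squares of y on (1, z_1, ..., z_q);
    a sketch only, with a Gaussian kernel and fixed bandwidth h."""
    idx = X @ beta                                  # scalar index beta^T x_i
    w = np.exp(-0.5 * ((idx - u) / h) ** 2)         # kernel weights
    A = np.column_stack([np.ones(len(y)), Z])       # regressors (1, z_1, ..., z_q)
    WA = A * w[:, None]
    return np.linalg.solve(A.T @ WA, A.T @ (w * y))

# simulated example: g_0(u) = sin(u), g_1(u) = cos(u), q = 1
rng = np.random.default_rng(3)
beta = np.array([1.0, 1.0]) / np.sqrt(2.0)
X = rng.normal(size=(2000, 2))
Z = rng.normal(size=(2000, 1))
u_all = X @ beta
y = np.sin(u_all) + np.cos(u_all) * Z[:, 0] + 0.1 * rng.normal(size=2000)
g = varying_coefs(0.0, X, Z, y, beta)   # roughly (sin(0), cos(0)) = (0, 1)
```

In practice $ \beta_0 $ is of course unknown; estimating it is precisely the dimension reduction problem addressed in this chapter.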

The above discussion and examples highlight the importance of dimension reduction for semiparametric models. For some special cases of model (3.2), some dimension reduction methods have been introduced. Next, we give a brief review of these methods.

The projection pursuit regression (PPR) was proposed by Friedman and Stuetzle (1981). Huber (1985) gave a comprehensive discussion. The commonly used PPR aims to find univariate functions $ {\sl g}_1,
{\sl g}_2, \cdots, {\sl g}_D $ and directions $ \beta_1, \beta_2,$ $ \cdots, \beta_D$ which satisfy the following sequence of minimizations,

$\displaystyle \min_{\beta_1} E[ y - {\sl g}_1(\beta_1^{\top } X) ]^2,\ \cdots,\ \min_{\beta_D} E\Big\{ y - \sum_{j=1}^{D-1}{\sl g}_j(\beta_j^{\top } X) - {\sl g}_D(\beta_D^{\top } X)\Big\}^2.$     (3.9)

Actually, $ {\sl g}_1(\beta_1^{\top } X) = \textrm{E}(y\vert\beta_1^{\top } X) $ and $ {\sl g}_k(\beta_k^{\top } X) = E\Big[ \{y- \sum_{i=1}^{k-1} {\sl g}_i(\beta_i^{\top } X)\} \Big\vert \beta_k^{\top }X\Big],$ $ k = 2, \cdots, D.$ Because both the $ \beta_k $ and the $ {\sl g}_k $ are unknown, the implementation of the above minimizations is non-trivial. Compared with (3.4), the PPR assumes that $ {\sl g}(x) = \textrm{E}(y\vert X=x) $ depends additively on its e.d.r. directions. The primary focus of the PPR is more on the approximation of $ {\sl g}(x) $ by a sum of ridge functions $ {\sl g}_k (\cdot) $, namely $ {\sl g}(X) \approx \sum_{k=1}^D {\sl g}_k(\beta_k^{\top } X) $, than on looking for the e.d.r. directions. To illustrate, let $ y = (x_1+x_2)^2(1 + x_3) + \varepsilon, $ where $ x_1, x_2, x_3, \varepsilon $ are i.i.d. random variables with the common distribution $ N(0, 1)$. The e.d.r. directions are $ (1/\sqrt{2}, 1/\sqrt{2}, 0)^{\top }$ and $ (0, 0, 1)^{\top } $. However, the PPR cannot find these directions correctly because the components are not additive.
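The e.d.r. directions of this example can be checked numerically. Since $ \bigtriangledown {\sl g}(x) = \big(2(x_1+x_2)(1+x_3),\ 2(x_1+x_2)(1+x_3),\ (x_1+x_2)^2\big)^{\top } $, the matrix $ E[\bigtriangledown {\sl g}(X)\bigtriangledown {\sl g}(X)^{\top }] $ has rank 2 and its leading eigenvectors span the e.d.r. space, whereas the first moment $ E\bigtriangledown {\sl g}(X) = (0, 0, 2)^{\top } $ picks up only one direction. A Monte Carlo sketch using the known $ {\sl g} $ (not an estimate):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 3))       # x1, x2, x3 i.i.d. N(0, 1)
x1, x2, x3 = X.T

# analytic gradient of g(x) = (x1 + x2)^2 (1 + x3)
grad = np.stack([2 * (x1 + x2) * (1 + x3),
                 2 * (x1 + x2) * (1 + x3),
                 (x1 + x2) ** 2], axis=1)

mean_grad = grad.mean(axis=0)           # Monte Carlo E[grad g], close to (0, 0, 2)
M = grad.T @ grad / len(X)              # Monte Carlo E[grad g grad g^T]
eigval, eigvec = np.linalg.eigh(M)      # eigenvalues in ascending order

# the smallest eigenvalue is near 0; the eigenvectors of the two
# largest eigenvalues are close to (1, 1, 0)/sqrt(2) and (0, 0, 1)
```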

Another simple approach related to the estimation of the e.d.r. direction is the average derivative estimation (ADE) proposed by Härdle and Stoker (1989). Suppose that $ {\sl g}(x) = {\sl g}_1(\beta_1^{\top }x) $. Then $ \bigtriangledown {\sl g}(x) = {\sl g}_1'(\beta_1^{\top } x) \beta_1, $ where $ \bigtriangledown {\sl g}(\cdot) $ is the gradient of the unknown regression function $ {\sl g}(\cdot) $ with respect to its arguments. It follows that $ E\bigtriangledown {\sl g}(X) = \{ E{\sl g}_1'(\beta_1^{\top } X) \}\beta_1$; that is, the expectation of the gradient is proportional to $ \beta_1 $, the constant of proportionality being the scalar $ E{\sl g}_1'(\beta_1^{\top } X) $. We can therefore estimate $ \bigtriangledown {\sl g}(x) $ nonparametrically, and then obtain an estimate of $ \beta_1 $ from the direction of the estimate of $ E \bigtriangledown {\sl g}(X) $. An interesting result is that the estimator of $ \beta_1 $ can achieve root-$ n$ consistency even when a high-dimensional kernel smoothing method is used to estimate $ E \bigtriangledown {\sl g}(X) $. However, the ADE has several limitations: (i) to obtain the estimate of $ \beta_1 $, the condition $ E{\sl g}'_1(\beta_1^{\top }X) \neq 0 $ is needed, and this condition is violated when $ {\sl g}_1(\cdot) $ is an even function and $ X$ is symmetrically distributed; (ii) as far as we know, there is no successful extension to the case of more than one e.d.r. direction.

The sliced inverse regression (SIR) method proposed by Li (1991) is perhaps the most powerful method available to date for searching for the e.d.r. directions. However, to ensure that such an inverse regression can be taken, the SIR method imposes a strong probabilistic structure on $ X$. Specifically, the method requires that, for any constant vector $ b=(b_1, \cdots, b_p) $ and any $ B$, there are constants $ c_0 $ and $ c = (c_1, \cdots, c_D) $ such that

$\displaystyle \textrm{E}(bX\vert B^{\top } X) = c_0 + c B^{\top }X.$     (3.10)

In time series analysis, we typically set $ X = (y_{t-1}, \cdots, y_{t-p} )^{\top }$, where $ \{y_t\} $ is a time series. The above restriction on the probabilistic structure is then tantamount to assuming that $ \{y_t\} $ is time-reversible. However, it is well known that time-reversibility is the exception rather than the rule for time series. Another important problem in dimension reduction is the determination of the number of e.d.r. directions. Based on the SIR method, Li (1991) proposed a testing method. For reasons similar to the above, that method too is typically not relevant in time series analysis.

For the general model (3.2), the methods listed above may fail in one way or another. For instance, the SIR method fails with most nonlinear time series models, and the ADE fails with model (3.8) when $ X$ and $ Z$ have common variables. In this chapter, we shall propose a new method to estimate the e.d.r. directions for the general model (3.2). Our approach is inspired by the SIR method, the ADE method and the idea of local linear smoothers (see, for example, Fan and Gijbels (1996)). It is easy to implement and needs no strong assumptions on the probabilistic structure of $ X$; in particular, it can handle time series data. Our simulations show that the proposed method performs better than the existing ones. Based on the properties of our direction estimation method, we shall also propose a method to estimate the number of e.d.r. directions, which again requires no special assumptions on the design $ X$ and is applicable to many complicated models.

To explain the basic ideas of our approach, we shall refer mostly to models (3.4) and (3.8); extension to other models is not difficult. The rest of this chapter is organized as follows. Section 3.2 gives some properties of the e.d.r. directions and extends the ADE method to the average outer product of gradients estimation method. These properties are important for the implementation of our estimation procedure. Section 3.3 describes the minimum average (conditional) variance estimation procedure and gives some results; comparisons with the existing methods are also discussed. An algorithm is proposed in Section 3.5. To check the feasibility of our approach, we have conducted a substantial volume of simulations, typical ones of which are reported in Section 3.6. Section 3.7 gives some real applications of our method to both independently observed data sets and time series data sets.