18.3 Sliced Inverse Regression

Sliced inverse regression (SIR) is a dimension reduction method proposed by Duan and Li (1991). The idea is to find a smooth regression function that operates on only a few projections of the explanatory variables. Given a response variable $Y$ and a (random) vector $X \in \mathbb{R}^p$ of explanatory variables, SIR is based on the model:

\begin{displaymath}
Y \ = \ m(\beta_1^{\top}X,\dots,\beta_k^{\top}X,\varepsilon),
\end{displaymath} (18.10)

where $\beta_{1},\dots,\beta_{k}$ are unknown projection vectors, $k$ is unknown and assumed to be less than $p$, $m:\mathbb{R}^{k+1}\to \mathbb{R}$ is an unknown function, and $\varepsilon$ is the noise random variable with $\textrm{E}\left(\varepsilon\!\mid\!X\right)=0$.

Model (18.10) describes the situation where the response variable $Y$ depends on the $p$-dimensional variable $X$ only through a $k$-dimensional subspace. The unknown $\beta_{i}$'s, which span this space, are called effective dimension reduction directions (EDR-directions). Their span is called the effective dimension reduction space (EDR-space). The aim is to estimate basis vectors of this space; neither their length nor their direction can be identified, only the space in which they lie is identifiable.

SIR tries to find this $k$-dimensional subspace of $\mathbb{R}^p$ which, under the model (18.10), carries the essential information of the regression between $X$ and $Y$. SIR also focuses on small $k$, so that nonparametric methods can be applied for the estimation of $m$. For high dimension $p$, a direct application of nonparametric smoothing to $X$ is generally not possible due to the sparseness of the observations. This fact is well known as the curse of dimensionality, see Huber (1985).

The name of SIR comes from computing the inverse regression (IR) curve. That means instead of looking for $\textrm{E}\left(Y\!\mid\!X=x\right)$, we investigate $\textrm{E}\left(X\!\mid\!Y=y\right)$, a curve in $\mathbb{R}^p$ consisting of $p$ one-dimensional regressions. What is the connection between the IR and the SIR model (18.10)? The answer is given in the following theorem from Li (1991).

THEOREM 18.1   Given the model (18.10) and the assumption
\begin{displaymath}
\forall b \in \mathbb{R}^p :\ \textrm{E}\left(b^{\top}X \!\mid\!\beta_1^{\top} X = \beta_1^{\top} x,\dots,\beta_k^{\top} X = \beta_k^{\top} x\right) \ = \ c_0 + \sum_{i=1}^k c_i\beta_i^{\top} x ,
\end{displaymath} (18.11)

the centered IR curve $\textrm{E}(X\!\mid\!Y=y) - \textrm{E}(X)$ lies in the linear subspace spanned by the vectors $\Sigma\beta_i$, $i=1,\dots,k$, where $\Sigma=\Cov(X)$.

Assumption (18.11) is equivalent to the fact that $X$ has an elliptically symmetric distribution, see Cook and Weisberg (1991). Hall and Li (1993) have shown that assumption (18.11) only needs to hold for the EDR-directions.
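As an illustration, the following display sketches why (18.11) holds in the special case of a normal $X$ (an assumption made here only for illustration; the matrix $B=(\beta_1,\dots,\beta_k)$ is introduced for this check and $B^{\top}\Sigma B$ is assumed to be invertible). For $X\sim N(\mu,\Sigma)$,

\begin{displaymath}
\textrm{E}\left(b^{\top}X \!\mid\! B^{\top}X = B^{\top}x\right) \ = \ b^{\top}\mu + b^{\top}\Sigma B \left(B^{\top}\Sigma B\right)^{-1} B^{\top}(x-\mu),
\end{displaymath}

which is indeed of the required form $c_0 + \sum_{i=1}^k c_i\beta_i^{\top}x$.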

It is easy to see that for the standardized variable $Z=\Sigma^{-1/2}\{X-\textrm{E}(X)\}$ the IR curve $m_1(y)= \textrm{E}(Z\!\mid\!Y=y)$ lies in $\textrm{span}(\eta_{1},\dots,\eta_{k})$, where $\eta_{i}=\Sigma^{1/2}\beta_{i}$. This means that the conditional expectation $m_1(y)$ is moving in $\textrm{span}(\eta_{1},\dots,\eta_{k})$ depending on $y$. With $b$ orthogonal to $\textrm{span}(\eta_{1},\dots,\eta_{k})$, it follows that

\begin{displaymath}b^{\top}m_1(y) \ = \ 0, \end{displaymath}

and further that

\begin{displaymath}m_1(y)m_1(y)^{\top} b \ = \
\Cov\{m_1(y)\}b \ = \ 0. \end{displaymath}

As a consequence, $\Cov\{\textrm{E}(Z\!\mid\!y)\}$ is degenerate in each direction orthogonal to all EDR-directions $\eta_{i}$ of $Z$. This suggests the following algorithm.

First, estimate $\Cov\{m_1(y)\}$ and then compute its eigenvalue/eigenvector decomposition. In practice, the estimated covariance matrix will typically have full rank because of random variability, estimation errors and numerical imprecision. Therefore, we investigate the eigenvalues of the estimate and ignore eigenvectors with small eigenvalues. The remaining eigenvectors $\hat\eta_{i}$ are estimates for the EDR-directions $\eta_{i}$ of $Z$. We can rescale them to estimates $\hat\beta_{i}$ for the EDR-directions of $X$ by multiplying by $\hat\Sigma^{-1/2}$, but then they are not necessarily orthogonal. SIR is strongly related to PCA: if each observation is put into its own slice, so that $\widehat{\Cov}\{m_1(y)\}$ equals $\widehat{\Cov}(Z)$, SIR coincides with PCA of the standardized data. Obviously, in this case any information about $y$ is ignored.


The SIR Algorithm

The algorithm to estimate the EDR-directions via SIR is as follows (a code sketch of these steps is given after Remark 18.1):
  1. Standardize $x$:

    \begin{displaymath}
z_i \ = \ \hat\Sigma^{-1/2} (x_i - \bar x) .
\end{displaymath}

  2. Divide the range of $y_{i}$ into $S$ nonoverlapping intervals (slices) $H_s$, $s=1,\dots, S$. $n_s$ denotes the number of observations within slice $H_s$, and ${\boldsymbol{I}}_{H_s}$ the indicator function for this slice:

    \begin{displaymath}
n_s \ = \ \sum_{i=1}^n {\boldsymbol{I}}_{H_s}(y_i).
\end{displaymath}

  3. Compute the mean of the $z_{i}$ within each slice. These slice means form a crude estimate $\widehat{m}_1$ of the inverse regression curve $m_1$:

    \begin{displaymath}
\bar{z}_s \ = \ \frac{1}{n_s} \sum_{i=1}^n z_i \
{\boldsymbol{I}}_{H_s}(y_i) .
\end{displaymath}

  4. Calculate the estimate for $\Cov\{m_1(y)\}$:

    \begin{displaymath}
\widehat V \ = \ n^{-1} \sum_{s=1}^S n_s \bar z_s \bar{z}_s^{\top} .
\end{displaymath}

  5. Identify the eigenvalues $\hat \lambda_i$ and eigenvectors $\hat
\eta_i$ of $\widehat V$.
  6. Transform the standardized EDR-directions $\hat
\eta_i$ back to the original scale. Now the estimates for the EDR-directions are given by

    \begin{displaymath}
\hat \beta_i \ = \ \hat \Sigma^{-1/2} \hat \eta_i.
\end{displaymath}

REMARK 18.1   The number of eigenvalues unequal to zero depends on the number of slices. The rank of $\widehat V$ cannot be greater than $S-1$ (since the $z_{i}$ sum up to zero, so do the weighted slice means $n_s\bar z_s$). This is a problem for categorical response variables, especially for a binary response, where only one direction can be found.
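The steps above translate almost directly into code. The following is a minimal sketch in Python/NumPy, not the original XploRe quantlet: the function name sir, the use of equally sized slices built from the order of $y$, and the particular eigendecomposition routine are choices made here for illustration.

\begin{verbatim}
import numpy as np

def sir(x, y, n_slices=10):
    """Minimal sketch of the SIR algorithm (steps 1-6)."""
    n, p = x.shape
    # Step 1: standardize x with the inverse square root of the sample covariance
    evals, evecs = np.linalg.eigh(np.cov(x, rowvar=False))
    sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    z = (x - x.mean(axis=0)) @ sigma_inv_sqrt
    # Step 2: slice the observations by the order of y (equal-count slices)
    slices = np.array_split(np.argsort(y), n_slices)
    # Steps 3-4: slice means of z and weighted covariance matrix of these means
    v_hat = np.zeros((p, p))
    for idx in slices:
        z_bar = z[idx].mean(axis=0)          # crude estimate of m_1 in the slice
        v_hat += len(idx) * np.outer(z_bar, z_bar) / n
    # Step 5: eigenvalues/eigenvectors of V-hat, sorted in decreasing order
    lam, eta = np.linalg.eigh(v_hat)
    lam, eta = lam[::-1], eta[:, ::-1]
    # Step 6: rescale the standardized directions back to the original scale
    beta = sigma_inv_sqrt @ eta
    return lam, beta
\end{verbatim}

The columns of the returned matrix are the estimated EDR-directions $\hat\beta_{i}$, ordered by the size of the corresponding eigenvalues.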


SIR II

In the previous section we learned that it is interesting to consider the IR curve, that is, $\textrm{E}(X\!\mid\!y)$. In some situations, however, SIR does not find the EDR-directions. We overcome this difficulty by considering the conditional covariance $\Cov(X\!\mid\!y)$ instead of the IR curve. An example where the EDR-directions are not found via the IR curve is given below.

EXAMPLE 18.2   Suppose that $(X_{1},X_{2})^{\top} \sim N(0, {\data{I}}_{2}) $ and $Y=X_{1}^2$. Then $E(X_{2}\!\mid\!y) = 0$ because of independence and $E(X_{1} \!\mid\!y) = 0$ because of symmetry. Hence, the EDR-direction $\beta=(1,0)^{\top}$ is not found when the IR curve $\textrm{E}(X \!\mid\!y)=0$ is considered.

The conditional variance

\begin{displaymath}\Var(X_{1} \!\mid\!Y=y) = \textrm{E}(X_{1}^2\!\mid\!Y=y) = y, \end{displaymath}

offers an alternative way to find $\beta$. It is a function of $y$ while $\Var(X_{2}\!\mid\!y)$ is a constant.
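A small simulation makes this visible (a sketch only; the sample size, the seed and the five equal-count slices are arbitrary choices): the slice means of $X_{1}$ stay close to zero, while the slice variances increase with $y$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(1000)
y = x1 ** 2                     # the model of Example 18.2

# slice by y and compare slice means and slice variances of x1
for idx in np.array_split(np.argsort(y), 5):
    print(f"mean(x1 | slice) = {x1[idx].mean():+.2f},  "
          f"var(x1 | slice) = {x1[idx].var():.2f}")
\end{verbatim}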

The idea of SIR II is to consider the conditional covariances. The principle of SIR II is the same as before: investigation of the inverse regression, here via the conditional covariance instead of the conditional expectation. Unfortunately, the theory of SIR II is more complicated. The assumption of elliptical symmetry of $X$ has to be strengthened, i.e., to normality of $X$.

Given this assumption, one can show that the directions $b$ in which $\Cov(Z\!\mid\!Y=y) -\textrm{E}\{\Cov(Z\!\mid\!Y=y)\}$ deviates most strongly from zero (over all $y$) are the most interesting for the EDR-space. An appropriate measure for the overall mean deviation is, according to Li (1992),

    $\displaystyle \textrm{E}\left(\left\vert\left\vert\left[\Cov(Z\!\mid\!Y=y) - \textrm{E}\{\Cov(Z\!\mid\!Y=y)\}\right] b\right\vert\right\vert^2\right)$ (18.12)
    $\displaystyle \qquad = \ b^{\top}\,\textrm{E}\left(\left[\Cov(Z\!\mid\!y) - \textrm{E}\{\Cov(Z\!\mid\!y)\}\right]^2\right) b .$ (18.13)

Equipped with this distance, we again conduct an eigensystem decomposition, this time of the matrix $\textrm{E}\left(\left[\Cov(Z\!\mid\!y) - \textrm{E}\{\Cov(Z\!\mid\!y)\}\right]^2\right)$. Then we take the rescaled eigenvectors with the largest eigenvalues as estimates for the unknown EDR-directions.


The SIR II Algorithm

The algorithm of SIR II is very similar to that of SIR; it differs in only two steps. Instead of merely computing the mean, the covariance of each slice has to be computed. The estimate for the expectation (18.12) is calculated from all slice covariances. Finally, decomposition and rescaling are conducted as before; a code sketch follows the steps.

  1. Do steps 1 to 3 of the SIR algorithm.
  2. Compute the slice covariance matrix $\widehat V_{s}$:

    \begin{displaymath}
\widehat V_{s} \ = \ \frac{1}{n_{s}-1} \left\{ \sum_{i=1}^n {\boldsymbol{I}}_{H_{s}}(y_{i})\, z_{i}z_{i}^{\top}
- n_{s}\bar z_{s}\bar z_{s}^{\top} \right\} . \end{displaymath}

  3. Calculate the mean over all slice covariances:

    \begin{displaymath}\bar V = \frac{1}{n} \sum_{s=1}^S n_{s} \widehat V_{s}. \end{displaymath}

  4. Compute an estimate for (18.12):

    \begin{displaymath}
\widehat V \ = \ \frac{1}{n} \sum_{s=1}^S n_{s}\left(\widehat V_{s} - \bar V\right)^2
\ = \ \frac{1}{n} \sum_{s=1}^S n_{s}\widehat V_{s}^2 - \bar V^2 . \end{displaymath}

  5. Identify the eigenvectors and eigenvalues of $\widehat V$ and scale back the eigenvectors. This gives estimates for the SIR II EDR-directions:

    \begin{displaymath}\hat\beta_{i} = \hat\Sigma^{-1/2}\hat\eta_{i}. \end{displaymath}
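As for SIR, the steps can be sketched in a few lines of Python/NumPy (again a sketch, not the original quantlet; the function name sir2 and the equal-count slicing are choices made here). It reuses the standardization and slicing of the SIR sketch and replaces the mean-based matrix by the estimate of (18.12).

\begin{verbatim}
import numpy as np

def sir2(x, y, n_slices=10):
    """Minimal sketch of the SIR II algorithm."""
    n, p = x.shape
    # steps 1-3 of SIR: standardize and slice
    evals, evecs = np.linalg.eigh(np.cov(x, rowvar=False))
    sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    z = (x - x.mean(axis=0)) @ sigma_inv_sqrt
    slices = np.array_split(np.argsort(y), n_slices)
    # slice covariance matrices V_s and their weighted mean V-bar
    v_s = [np.cov(z[idx], rowvar=False) for idx in slices]
    w = np.array([len(idx) for idx in slices]) / n
    v_bar = sum(wi * vi for wi, vi in zip(w, v_s))
    # estimate of (18.12): weighted mean of the squared deviations (V_s - V-bar)^2
    v_hat = sum(wi * (vi - v_bar) @ (vi - v_bar) for wi, vi in zip(w, v_s))
    # eigendecomposition and rescaling back to the original scale
    lam, eta = np.linalg.eigh(v_hat)
    lam, eta = lam[::-1], eta[:, ::-1]
    return lam, sigma_inv_sqrt @ eta
\end{verbatim}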

EXAMPLE 18.3   The result of SIR is visualized in four plots in Figure 18.6: the two left plots show the response variable versus the first and the second estimated direction, respectively. The upper right plot is a three-dimensional plot of the first two directions and the response. The last picture shows $\hat\Psi_{k}$, the ratio of the sum of the first $k$ eigenvalues to the sum of all eigenvalues, similar to principal component analysis.

The data are generated according to the following model:

\begin{displaymath}y_{i} = \beta_{1}^{\top}x_{i} + (\beta_{1}^{\top}x_{i})^3 +
4 \left(\beta_{2}^{\top}x_{i}\right)^2 + \varepsilon_{i}, \end{displaymath}

where the $x_{i}$'s follow a three-dimensional normal distribution with zero mean and covariance equal to the identity matrix, $\beta_{2}=(1,-1,-1)^{\top}$, and $\beta_{1}=(1,1,1)^{\top}$. $\varepsilon_{i}$ is standard normally distributed and $n=300$. Corresponding to model (18.10), $m(u,v,\varepsilon) = u+u^3+4v^2+\varepsilon$. The situation is depicted in Figure 18.4 and Figure 18.5.

Figure 18.4: Plot of the true response versus the true indices. The monotonic and the convex shapes can be clearly seen. MVAsirdata.xpl
\includegraphics[width=1\defpicwidth]{MVAsirdata1.ps}

Figure 18.5: Plot of the true response versus the true indices. The monotonic and the convex shapes can be clearly seen. MVAsirdata.xpl
\includegraphics[width=1\defpicwidth]{MVAsirdata2.ps}

Both algorithms were conducted using the slicing method with $20$ elements in each slice. The goal was to find $\beta_{1}$ and $\beta _{2}$ with SIR. The data are designed such that SIR can detect $\beta_{1}$ because of the monotonic shape of $\{\beta_{1}^{\top}x_{i} + (\beta_{1}^{\top}x_{i})^3\}$, while SIR II will search for $\beta _{2}$, since in this direction the conditional variance varies with $y$.
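Such a run can be reproduced with the two sketches above (a sketch under the stated model; the seed and the helper cosine are arbitrary choices, and with $n=300$ and 20 observations per slice the number of slices is 15):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n = 300
beta1 = np.array([1.0, 1.0, 1.0])
beta2 = np.array([1.0, -1.0, -1.0])
x = rng.standard_normal((n, 3))
u, v = x @ beta1, x @ beta2
y = u + u ** 3 + 4 * v ** 2 + rng.standard_normal(n)

lam1, b1 = sir(x, y, n_slices=15)    # SIR: should pick up beta1
lam2, b2 = sir2(x, y, n_slices=15)   # SIR II: should pick up beta2

def cosine(a, b):
    """Absolute normalized inner product between two directions."""
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("SIR:    cos(b1_hat, beta1) =", round(cosine(b1[:, 0], beta1), 3))
print("SIR II: cos(b1_hat, beta2) =", round(cosine(b2[:, 0], beta2), 3))
\end{verbatim}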


Table 18.3: SIR: EDR-directions for simulated data.

\begin{tabular}{rrr}
$\hat\beta_{1}$ & $\hat\beta_{2}$ & $\hat\beta_{3}$ \\
\hline
$0.578$ & $-0.723$ & $-0.266$ \\
$0.586$ & $0.201$ & $0.809$ \\
$0.568$ & $0.661$ & $-0.524$ \\
\end{tabular}


Figure 18.6: SIR: The left plots show the response versus the estimated EDR-directions. The upper right plot is a three-dimensional plot of the first two directions and the response. The lower right plot shows the eigenvalues $\hat \lambda_i$ ($\ast$) and the cumulative sum ($\circ$). MVAsirdata.xpl
\includegraphics[width=1\defpicwidth]{MVAsirdata3.ps}

If we normalize the eigenvalues for the EDR-directions in Table 18.3 such that they sum up to one, the resulting vector is $(0.852, 0.086, 0.062)$. As can be seen in the upper left plot of Figure 18.6, there is a functional relationship found between the first index $\hat\beta_{1}^{\top}x$ and the response. Actually, $\beta_{1}$ and $\hat\beta_{1}$ are nearly parallel, that is, the normalized inner product $\hat\beta_{1}^{\top}\beta_{1}/ \{\vert\vert\hat\beta_{1}\vert\vert\vert\vert\beta_{1}\vert\vert\} = 0.9894$ is very close to one.

The second direction, along $\beta _{2}$, is only vaguely recovered: SIR does not provide it clearly, because it is ``blind'' with respect to changes of the conditional variance, as the small second eigenvalue indicates.

For SIR II, the normalized eigenvalues are $(0.706, 0.185, 0.108)$, that is, about 71% of the variance is explained by the first EDR-direction (Table 18.4). Here, the normalized inner product of $\beta _{2}$ and $\hat\beta_{1}$ is $0.9992$; the estimator $\hat\beta_1$ in fact estimates the $\beta_2$ of the simulated model. In this case, SIR II has found the direction where the second moment varies with respect to $\beta_{2}^{\top}x$.

Figure 18.7: SIR II mainly sees the direction $\beta _{2}$. The left plots show the response versus the estimated EDR-directions. The upper right plot is a three-dimensional plot of the first two directions and the response. The lower right plot shows the eigenvalues $\hat \lambda_i$ ($\ast$) and the cumulative sum ($\circ$). MVAsir2data.xpl
\includegraphics[width=1\defpicwidth]{MVAsir2data.ps}


Table 18.4: SIR II: EDR-directions for simulated data.

\begin{tabular}{rrr}
$\hat\beta_{1}$ & $\hat\beta_{2}$ & $\hat\beta_{3}$ \\
\hline
$0.821$ & $0.180$ & $0.446$ \\
$-0.442$ & $-0.826$ & $0.370$ \\
$-0.361$ & $-0.534$ & $0.815$ \\
\end{tabular}


In summary, SIR has found the direction which shows a strong relation regarding the conditional expectation between $\beta_{1}^{\top}x$ and $y$, and SIR II has found the direction where the conditional variance is varying, namely, $\beta_{2}^{\top}x$.

The behavior of the two SIR algorithms is as expected. In addition, we have seen that it is worthwhile to apply both versions of SIR. It is possible to combine SIR and SIR II (Schott, 1994; Li, 1991; Cook and Weisberg, 1991) directly, or to investigate higher conditional moments. For the latter it seems to be difficult to obtain theoretical results. For further details on SIR see Kötter (1996).

Summary
$\ast$
SIR serves as a dimension reduction tool for regression problems.
$\ast$
Inverse regression avoids the curse of dimensionality.
$\ast$
The dimension reduction can be conducted without estimation of the regression function $y=m(x)$.
$\ast$
SIR searches for the effective dimension reduction (EDR) directions by computing the inverse regression (IR) curve.
$\ast$
SIR II bases the estimation of the EDR-directions on the conditional variance of the inverse regression.
$\ast$
SIR might miss EDR directions that are found by SIR II.