6.2 Linear Reduction of High-dimensional Data

The $ p$-dimensional data can be reduced to $ q$-dimensional data using $ q$ linear combinations of the $ p$ variables. The linear combinations can be regarded as a linear projection. Most reduction methods involve the discovery of linear combinations of variables under a given criterion. Principal component analysis (PCA) and projection pursuit are typical methods of this type. These methods are described in the following subsections.

6.2.1 Principal Component Analysis

Suppose that we have $ n$ observations of $ p$ variables, $ \{{\boldsymbol{x}}_i; i=1,2,\ldots,n\}$ (referred to as $ X$ hereafter). PCA is conducted for the purpose of constructing linear combinations of the variables so that their variances are large under certain conditions. A linear combination of the variables is denoted by $ \{{\boldsymbol{a}}^{\top}{\boldsymbol{x}}_i; i=1,2,\ldots,n\}$ (simply, $ {\boldsymbol{a}}^{\top}{\boldsymbol{X}}$), where $ {\boldsymbol{a}}=(a_1,a_2,\ldots,a_p)^{\top}$.

Then, the sample variance of $ {\boldsymbol{a}}^{\top}{\boldsymbol{X}}$ can be represented by

$\displaystyle V\left({\boldsymbol{a}}^{\top}{\boldsymbol{X}}\right)={\boldsymbol{a}}^{\top}\hat{\Sigma}{\boldsymbol{a}}{},$    

where $ \hat{\Sigma}=V({\boldsymbol{X}})$. $ {\boldsymbol{a}}^{\top}\hat{\Sigma}{\boldsymbol{a}}$ is regarded as a $ p$ variable function of $ (a_1,a_2,\ldots,a_p)$: $ \phi(a_1,a_2,\ldots,a_p)= {\boldsymbol{a}}^{\top}\hat{\Sigma}{\boldsymbol{a}}$. To consider the optimization problem for $ \phi$, $ {\boldsymbol{a}}$ is constrained to $ {\boldsymbol{a}}^{\top}{\boldsymbol{a}}=1$. This problem is solved using Lagrange multipliers. The following Lagrange function is defined as

$\displaystyle L\left(a_1,a_2,\ldots,a_p\right)$ $\displaystyle = \phi\left(a_1,a_2,\ldots,a_p\right)-\lambda_1 \left(\sum_{i=1}^p a_i^2-1\right)$    
  $\displaystyle = {\boldsymbol{a}}^{\top}\hat{\Sigma}{\boldsymbol{a}}-\lambda_1\left({\boldsymbol{a}}^{\top}{\boldsymbol{a}}-1\right){},$    

where $ \lambda_1$ is the Lagrange multiplier. $ L$ is partially differentiated with respect to $ {\boldsymbol{a}}=(a_1,a_2,\ldots,a_p)^{\top}$ and $ \lambda_1$, and the derivatives are equated to zero. We therefore obtain the simultaneous equations:

$\displaystyle \left\{ \begin{array}{l} 2\hat{\Sigma}{\boldsymbol{a}}-2\lambda_1{\boldsymbol{a}}={\boldsymbol{0}}\\[1.5mm] {\boldsymbol{a}}^{\top}{\boldsymbol{a}}-1=0{}. \end{array} \right.$    

This is an eigenvalue problem; the stationary points for $ {\boldsymbol{a}}=(a_1,a_2,\ldots,a_p)^{\top}$ are the unit eigenvectors of $ \hat{\Sigma}$. Let $ {\boldsymbol{a}}$ be a unit eigenvector and let $ \lambda $ be the corresponding eigenvalue. We then have

$\displaystyle \phi\left(a_1,a_2,\ldots,a_p\right)=V\left({\boldsymbol{a}}^{\top}{\boldsymbol{X}}\right)={\boldsymbol{a}}^{\top}\hat{\Sigma}{\boldsymbol{a}}=\lambda{\boldsymbol{a}}^{\top}{\boldsymbol{a}}=\lambda{}.$    

Hence $ \phi$ is maximized by the unit eigenvector corresponding to the largest eigenvalue $ \lambda_1$; this eigenvector is denoted as $ {\boldsymbol{a}}_1$. Then $ \{{\boldsymbol{a}}_1^{\top}{\boldsymbol{x}}_i; i=1,2,\ldots,n\}$ are referred to as the first principal components. The first principal components are one-dimensional data that are the projection of the original data with the maximum variance. If all of the information in the data could be represented by the first principal components, further calculation would be unnecessary. However, the first principal components usually exhibit the ''size factor'' only, so we would like to obtain another projection, namely the second principal components $ {\boldsymbol{a}}_2^{\top}{\boldsymbol{x}}_i$.

The second principal components explain the maximum variance under the constraint that they are uncorrelated with the first principal components. In other words, the second principal components $ {\boldsymbol{a}}_2^{\top}{\boldsymbol{X}}$ have the maximum variance under the constraints $ {\boldsymbol{a}}_1^{\top}{\boldsymbol{a}}_2=0$ and $ {\boldsymbol{a}}_2^{\top}{\boldsymbol{a}}_2=1$. The second principal components can also be derived with Lagrange multipliers:

$\displaystyle L(a_1,a_2,\ldots,a_p,\lambda,\lambda_2) ={\boldsymbol{a}}^{\top}\hat{\Sigma}{\boldsymbol{a}}-\lambda{\boldsymbol{a}}_1^{\top}{\boldsymbol{a}}-\lambda_2({\boldsymbol{a}}^{\top}{\boldsymbol{a}}-1){}.$    

$ L$ is partially differentiated with respect to $ {\boldsymbol{a}}=(a_1,a_2,\ldots,a_p)^{\top}$, $ \lambda $ and $ \lambda_2$, and the derivatives are equated to zero. The simultaneous equations below are obtained:

$\displaystyle \left\{ \begin{array}{l} 2\hat{\Sigma}{\boldsymbol{a}}-\lambda{\boldsymbol{a}}_1-2\lambda_2{\boldsymbol{a}}={\boldsymbol{0}}\\[1.5mm] {\boldsymbol{a}}_1^{\top}{\boldsymbol{a}}_2=0\\[1.5mm] {\boldsymbol{a}}_2^{\top}{\boldsymbol{a}}_2-1=0{}. \end{array} \right.$    

We obtain $ \lambda=0$, and $ \lambda_2$ is another eigenvalue (not equal to $ \lambda_1$). Since the variance of $ {\boldsymbol{a}}_2^{\top}{\boldsymbol{X}}$ is $ \lambda_2$, $ {\boldsymbol{a}}_2$ must be the unit eigenvector corresponding to the second largest eigenvalue of $ \hat{\Sigma}$. $ \{{\boldsymbol{a}}_2^{\top}{\boldsymbol{x}}_i; i=1,2,\ldots,n\}$ are referred to as the second principal components. The third principal components, fourth principal components, $ \ldots$, and the $ p$-th principal components can be derived in the same manner.
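As a minimal illustration of the derivation above, the following sketch (in Python with NumPy; the function name pca and the argument q are illustrative, not taken from the text) computes the principal components by an eigendecomposition of the sample covariance matrix, with the eigenvectors ordered by decreasing eigenvalue:

import numpy as np

def pca(X, q):
    """Project an n x p data matrix X onto its first q principal components
    via an eigendecomposition of the sample covariance matrix."""
    Sigma_hat = np.cov(X, rowvar=False)            # p x p sample covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]              # reorder: largest eigenvalue first
    eigvals = eigvals[order]
    A = eigvecs[:, order[:q]]                      # columns a_1, ..., a_q
    scores = (X - X.mean(axis=0)) @ A              # k-th column: k-th principal components
    return scores, eigvals, A

The sample variance of the $ k$-th column of scores equals the $ k$-th largest eigenvalue, in agreement with the result $ \phi=\lambda$ above.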

6.2.1.1 Proportion and Accumulated Proportion

The first principal components through the $ p$-th principal components were defined in the discussion above. As previously mentioned, the variance of the $ k$-th principal components is $ \lambda_k$. The sum of the variances of the $ p$ variables is $ \sum_{j=1}^p\hat{\sigma}_{jj}=trace(\hat{\Sigma})$, where $ \hat{\Sigma}=(\hat{\sigma}_{ij})$. It is well known that $ trace(\hat{\Sigma})=\sum_{j=1}^p\lambda_j$; the sum of the variances coincides with the sum of the eigenvalues. The proportion of the $ k$-th principal components is defined as the proportion of the variance of the $ k$-th principal components to the entire variance:

$\displaystyle \frac{\lambda_k}{\sum_{j=1}^p\lambda_j}{}.$    

The first principal components through the $ k$-th principal components are generally used together. The proportion of the total variance explained by these principal components is represented by the accumulated proportion:

$\displaystyle \frac{\sum_{j=1}^k\lambda_j}{\sum_{j=1}^p\lambda_j}{}.$    

We have explained PCA as an eigenvalue problem of the covariance matrix. However, the results of this method are affected by the units of measurement or scale transformations of the variables. Thus, another approach is to employ the correlation matrix rather than the covariance matrix. This approach is invariant under changes in the units of the variables, but does not take the variances of the variables into account.
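A short sketch of the proportion and accumulated proportion, again in Python/NumPy (the function name and the use_correlation flag are illustrative); using the correlation matrix amounts to standardizing each variable before the eigendecomposition:

import numpy as np

def variance_proportions(X, use_correlation=False):
    """Proportion and accumulated proportion of variance explained by the
    principal components of an n x p data matrix X."""
    if use_correlation:
        X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # scale-invariant variant
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending order
    proportion = eigvals / eigvals.sum()         # lambda_k / sum_j lambda_j
    return proportion, np.cumsum(proportion)     # accumulated proportion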

6.2.2 Projection Pursuit

PCA searches for a lower dimensional space that captures the majority of the variation within the data and discovers linear structures in the data. This method, however, is ineffective in analyzing nonlinear structures, i.e. curves, surfaces or clusters. In 1974, [2] proposed projection pursuit to search for a linear projection onto a lower dimensional space that robustly reveals structures in the data. Since then, many researchers have developed new methods for projection pursuit and evaluated them (e.g. [6,1,4,7,17,10]). The fundamental idea behind projection pursuit is to search for a linear projection of the data onto a lower dimensional space in which their distribution is ''interesting''; ''interesting'' is defined as being ''far from the normal distribution'', i.e. the normal distribution is assumed to be the most uninteresting. The degree of being ''far from the normal distribution'' is measured by a projection index, the details of which are described later.

6.2.2.1 Algorithm

Given a projection index, projection pursuit can be carried out by optimizing that index. The fundamental algorithm of $ k$-dimensional projection pursuit is given below; a minimal code sketch follows the list.

  1. Sphering $ {\boldsymbol{x}}$: $ {\boldsymbol{z}}_i=\hat{\Sigma}^{-1/2}_{xx}({\boldsymbol{x}}_i-\hat{{\boldsymbol{x}}})$ $ (i=1,2,\ldots,n)$, where $ \hat{\Sigma}_{xx}$ is the sample covariance matrix and $ \hat{{\boldsymbol{x}}}$ is the sample mean of  $ {\boldsymbol{x}}$.
  2. Initialize the projection direction: $ \alpha = ({\boldsymbol{\alpha}}_1, {\boldsymbol{\alpha}}_2, \ldots, {\boldsymbol{\alpha}}_k)$.
  3. Search the direction $ \alpha $ that maximizes the projection index.
  4. Project the data onto the lower dimensional space and display or analyze them.
  5. Change the initial direction and repeat Steps 3 and 4, if necessary.
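The following sketch outlines the one-dimensional ($ k=1$) case of this algorithm in Python/NumPy. A crude random search stands in for Step 3; in practice a proper numerical optimizer would be used. The projection index is passed in as a function, and the names (sphere, projection_pursuit_1d, n_restarts, n_iter) are illustrative assumptions:

import numpy as np

def sphere(X):
    """Step 1: sphering, z_i = Sigma_hat^{-1/2} (x_i - x_bar)."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return Xc @ (vecs @ np.diag(vals ** -0.5) @ vecs.T)

def projection_pursuit_1d(X, index, n_restarts=10, n_iter=200, seed=None):
    """One-dimensional projection pursuit: crude random search for the
    direction alpha that maximizes the user-supplied projection index."""
    rng = np.random.default_rng(seed)
    Z = sphere(X)
    best_alpha, best_value = None, -np.inf
    for _ in range(n_restarts):                    # Steps 2 and 5: (re)initialize
        alpha = rng.normal(size=X.shape[1])
        alpha /= np.linalg.norm(alpha)
        value = index(Z @ alpha)
        for _ in range(n_iter):                    # Step 3: search for a better direction
            cand = alpha + 0.1 * rng.normal(size=alpha.shape)
            cand /= np.linalg.norm(cand)
            cand_value = index(Z @ cand)
            if cand_value > value:
                alpha, value = cand, cand_value
        if value > best_value:
            best_alpha, best_value = alpha, value
    return best_alpha, Z @ best_alpha              # Step 4: project the data

Any of the one-dimensional indexes described below (Friedman's index, the moment index, Hall's index) can be passed as the index argument.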

6.2.2.2 Projection Indexes

The goal of projection pursuit is to find a projection that reveals interesting structures in the data. There are various standards for interestingness, and it is a very difficult notion to define. Thus, the normal distribution is regarded as uninteresting, and interestingness is defined as the degree to which a distribution is ''far from the normal distribution.''

Projection indexes quantify this degree. There are many definitions of projection indexes. Projection pursuit searches for projections based on the projection index; methods of projection pursuit are characterized by their projection indexes.

Here we will present several projection indexes. It is assumed that $ {\boldsymbol{Z}}=({\boldsymbol{z}}_1, \ldots, {\boldsymbol{z}}_n)$ is the result of sphering  $ {\boldsymbol{X}}$; the mean vector is a zero vector and the covariance matrix is an identity matrix.

6.2.2.2.1 Friedman's Index.

[1] proposed the following projection index:

$\displaystyle I = \frac{1}{2}\sum_{j=1}^J(2j+1)\left[ \frac{1}{n}\sum_{i=1}^nP_j\left(2\Phi\left({\boldsymbol{\alpha}}^{\top} {\boldsymbol{Z}}_i\right)-1\right) \right]^2{},$    

where $ P_j( \cdot )$ are the Legendre polynomials of order $ j$, $ \Phi(\cdot )$ is the cumulative distribution function of the standard normal distribution, and $ J$ is a user-defined constant, i.e. the degree of approximation.
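A direct transcription of this one-dimensional index into Python (using NumPy and SciPy for the Legendre polynomials and the normal distribution function; the function name and the default J=6 are illustrative):

import numpy as np
from scipy.special import eval_legendre
from scipy.stats import norm

def friedman_index(projected, J=6):
    """Friedman's one-dimensional index evaluated on the projected, sphered
    data alpha^T z_i; J is the order of the Legendre approximation."""
    r = 2.0 * norm.cdf(projected) - 1.0            # 2 * Phi(alpha^T z_i) - 1
    return 0.5 * sum((2 * j + 1) * np.mean(eval_legendre(j, r)) ** 2
                     for j in range(1, J + 1))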

In the case of two-dimensional projection pursuit, the index is represented by

$\displaystyle I$ $\displaystyle = \sum_{j=1}^J(2j+1)E^2[P_j(R_1)]/4$    
  $\displaystyle \quad{} + \sum_{k=1}^J(2k+1)E^2\left[P_k(R_2)\right]/4$    
  $\displaystyle \quad{} + \sum_{j=1}^J\sum_{k=1}^{J-j}(2j+1)(2k+1)E^2\left[P_j(R_1)P_k(R_2)\right]/4{},$    

where

$\displaystyle X_1={\boldsymbol{\alpha}}_1^{\top} {\boldsymbol{Z}}{},\quad X_2={\boldsymbol{\alpha}}_2^{\top} {\boldsymbol{Z}}$    
$\displaystyle R_1=2\Phi\left(X_1\right)-1{}, \quad R_2=2\Phi\left(X_2\right)-1{}.$    

6.2.2.2.2 Moment Index.

The third and higher cumulants of the normal distribution vanish. The cumulants are sometimes used to test normality, so they can be used as projection indexes. [8] proposed a one-dimensional projection index named the ''moment index,'' based on the third cumulant $ k_3 = \mu_3$ and the fourth cumulant $ k_4 = \mu_4-3$:

$\displaystyle I = k_3^2 + \frac{1}{4} k_4^2{}.$    
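For sphered data (zero mean, unit variance), the third and fourth cumulants reduce to the third sample moment and the fourth sample moment minus three, so the one-dimensional index can be sketched as follows (Python/NumPy; the function name is illustrative):

import numpy as np

def moment_index(projected):
    """One-dimensional moment index I = k3^2 + k4^2 / 4 for a sphered,
    projected sample alpha^T z_i."""
    k3 = np.mean(projected ** 3)          # third cumulant (zero mean, unit variance)
    k4 = np.mean(projected ** 4) - 3.0    # fourth cumulant
    return k3 ** 2 + 0.25 * k4 ** 2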

For two-dimensional projection pursuit, the moment index can be defined as

$\displaystyle I = \left(k_{30}^2 + 3k_{21}^2 + 3k_{12}^2 + k_{03}^2\right) +
 \frac{1}{4}\left(k_{40}^2 +4k_{31}^2 +6k_{22}^2 +4k_{13}^2 +k_{04}^2\right){}.$    

6.2.2.2.3 Hall's Index.

[4] proposed the following projection index:

$\displaystyle I = \left[\theta_0({\boldsymbol{\alpha}})-2^{-1/2}\pi^{-1/4}\right]^2+\sum_{j=1}^J\theta_j^2({\boldsymbol{\alpha}}){},$    

where

$\displaystyle \theta_j({\boldsymbol{\alpha}}) = n^{-1}\sum_{i=1}^nP_j\left({\boldsymbol{\alpha}}^{\top}{\boldsymbol{Z}}_i\right) \phi\left({\boldsymbol{\alpha}}^{\top}{\boldsymbol{Z}}_i\right){},$    
$\displaystyle P_j(z) = \frac{\sqrt{2}}{\sqrt{j!}} \pi^{1/4}H_j\left(2^{1/2}z\right){},$    

$ \phi(z)$ is the standard normal density function and $ H_j(z)$ are the Hermite polynomials of degree $ j$. $ J$ is a user-defined constant. Hall's index is much more robust to outliers than Friedman's index.
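A sketch of this one-dimensional index, transcribing the formulas above into Python (NumPy/SciPy; eval_hermite gives the physicists' Hermite polynomials $ H_j$, and the function name and default J=6 are illustrative assumptions):

import numpy as np
from scipy.special import eval_hermite, factorial
from scipy.stats import norm

def hall_index(projected, J=6):
    """Hall's one-dimensional index on the projected, sphered data alpha^T z_i."""
    def P(j, z):
        # P_j(z) = sqrt(2)/sqrt(j!) * pi^{1/4} * H_j(sqrt(2) * z), as stated above
        return np.sqrt(2.0 / factorial(j)) * np.pi ** 0.25 * eval_hermite(j, np.sqrt(2.0) * z)
    phi = norm.pdf(projected)                      # standard normal density
    theta = [np.mean(P(j, projected) * phi) for j in range(J + 1)]
    return (theta[0] - 2.0 ** -0.5 * np.pi ** -0.25) ** 2 + sum(t ** 2 for t in theta[1:])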

6.2.2.3 Relative Projection Pursuit

The main objective of ordinary projection pursuit is the discovery of non-normal structures in a dataset. Non-normality is evaluated using the degree of difference between the distribution of the projected dataset and the normal distribution.

There are times when it is desired that special structures be discovered using criteria other than non-normality. For example, if the purpose of the analysis is to investigate a feature of a subset of the entire dataset, the projection direction should be searched for so that the projected distribution of the subset is far from the distribution of the entire dataset. In sliced inverse regression (see the final subsection of this chapter), the dataset is divided into several subsets based on the values of the response variable, and the effective dimension-reduction direction is searched for using projection pursuit. In this application of projection pursuit, projections for which the distributions of the projected subsets are far from that of the entire dataset are required. [16] proposed relative projection pursuit for these purposes. Relative projection pursuit finds an interesting low-dimensional space in which the distribution of the target dataset differs from that of a reference dataset predefined by the user.

