9.9 More Examples

EXAMPLE 9.8   Let us now apply PCA to the standardized bank data set (Table B.2). Figure 9.13 shows some PC plots of the bank data set. The genuine and counterfeit bank notes are marked by ``o'' and ``+'', respectively.

Figure 9.13: Principal components of the standardized bank data. MVAnpcabank.xpl
\includegraphics[width=1\defpicwidth]{pcastd1a.ps}

The vector of eigenvalues of $\data{R}$ is

\begin{displaymath}\ell =\left( 2.946, 1.278, 0.869, 0.450, 0.269, 0.189 \right)^{\top}.\end{displaymath}

The eigenvectors $g_j$ are given by the columns of the matrix

\begin{displaymath}
\data{G}=\left( \begin{array}{rrrrrr}
-0.007 &-0.815 & 0.01...
...-0.274 &-0.114 &-0.392 & 0.340 &-0.632\\
\end{array} \right).\end{displaymath}

Each original variable has the same weight in the analysis and the results are independent of the scale of each variable.
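The scale independence stated above follows from the fact that a correlation does not change when a variable is expressed in different units. The following Python sketch illustrates this with a small made-up data set (not the bank data); the helper `correlation` and all values are assumptions introduced for illustration only.

```python
import math

def correlation(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements in millimetres (illustrative values only).
length_mm = [214.8, 214.6, 215.1, 214.9, 215.3]
height_mm = [130.1, 129.7, 130.4, 129.9, 130.6]

# The same lengths expressed in centimetres.
length_cm = [v / 10.0 for v in length_mm]

r_mm = correlation(length_mm, height_mm)
r_cm = correlation(length_cm, height_mm)
print(r_mm, r_cm)  # the two values agree up to floating-point error
```

Since every entry of the correlation matrix is unaffected by such a change of units, so are its eigenvalues and eigenvectors.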

Figure 9.14: The correlations of the original variables with the PCs. MVAnpcabanki.xpl
\includegraphics[width=1\defpicwidth]{banki.ps}


Table 9.7: Eigenvalues and proportions of explained variance.
\begin{tabular}{rrr}
$\ell_j$ & proportion of variance & cumulated proportion (in \%)\\
\hline
2.946 & 0.491 & 49.1\\
1.278 & 0.213 & 70.4\\
0.869 & 0.145 & 84.9\\
0.450 & 0.075 & 92.4\\
0.269 & 0.045 & 96.9\\
0.189 & 0.032 & 100.0\\
\end{tabular}
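The proportions of explained variance are obtained by dividing each eigenvalue by the sum of all eigenvalues and then accumulating. A minimal Python sketch, using the eigenvalue vector $\ell$ quoted above (the variable names are illustrative assumptions):

```python
# Eigenvalues of R for the standardized bank data, as quoted in the text.
eigenvalues = [2.946, 1.278, 0.869, 0.450, 0.269, 0.189]

# For standardized data the eigenvalues sum to the number of variables
# (here 6), up to rounding of the printed values.
total = sum(eigenvalues)

proportions = [l / total for l in eigenvalues]

# Running sum of the proportions.
cumulated = []
running = 0.0
for p in proportions:
    running += p
    cumulated.append(running)

for l, p, c in zip(eigenvalues, proportions, cumulated):
    print(f"{l:6.3f}  {p:6.3f}  {100 * c:6.1f}")
```

The printed values should agree with the table up to rounding.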



Table 9.8: Correlations with PCs.
\begin{tabular}{lrrr}
 & $r_{X_iZ_1}$ & $r_{X_iZ_2}$ & $r^2_{X_iZ_1} + r^2_{X_iZ_2}$\\
\hline
$X_1$: length & $-$0.012 & $-$0.922 & 0.85\\
$X_2$: left height & 0.803 & $-$0.387 & 0.79\\
$X_3$: right height & 0.835 & $-$0.285 & 0.78\\
$X_4$: lower & 0.698 & 0.301 & 0.58\\
$X_5$: upper & 0.631 & 0.104 & 0.41\\
$X_6$: diagonal & $-$0.847 & $-$0.310 & 0.81\\
\end{tabular}


The proportions of explained variance are given in Table 9.7. The first two PCs account for about 70% of the total variance, so a representation in two dimensions should be sufficient. The correlations leading to Figure 9.14 are given in Table 9.8. The picture is different from the one obtained in Section 9.3 (see Table 9.2). Here, the first factor is mainly a left and right height vs. diagonal factor and the second one is a length factor (with negative weight). Take another look at Figure 9.13, where the individual bank notes are displayed. In the upper left graph it can be seen that the genuine bank notes are for the most part in the south-eastern portion of the graph, featuring a larger diagonal, smaller height ($Z_1 <0$) and also a larger length ($Z_2<0$). Note also that Figure 9.14 gives an idea of the correlation structure of the original data matrix.
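The correlations between variables and PCs are tied to the eigen-decomposition: in a normalized PCA, the correlation between $X_i$ and $Z_j$ equals $g_{ij}\sqrt{\ell_j}$. The Python sketch below checks this for $X_1$ (length), using only the first two eigenvalues and the two visible entries of the first row of $\data{G}$ quoted above; the variable names are illustrative assumptions.

```python
import math

# First two eigenvalues of R and the first two entries of the first row of G
# (variable X1, length), as quoted in the text.
ell = [2.946, 1.278]
g_length = [-0.007, -0.815]

# In normalized PCA, corr(X_i, Z_j) = g_ij * sqrt(ell_j).
r_length = [g * math.sqrt(l) for g, l in zip(g_length, ell)]

# Share of the variance of X1 captured by the first two PCs.
quality = sum(r * r for r in r_length)

print([round(r, 3) for r in r_length], round(quality, 2))
```

The results agree with the first row of the correlation table up to the rounding of the printed eigenvector entries.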

EXAMPLE 9.9   Consider the data of 79 U.S. companies given in Table B.5. The data is first standardized by subtracting the mean and dividing by the standard deviation. Note that the data set contains six variables: assets $(X_1)$, sales $(X_2)$, market value $(X_3)$, profits $(X_4)$, cash flow $(X_5)$, number of employees $(X_6)$.
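The standardization step can be sketched in Python as follows. The figures are hypothetical and are not taken from Table B.5, and the helper `standardize` is an assumption made for illustration; the text does not say whether the divisor $n$ or $n-1$ is used for the standard deviation.

```python
import math

def standardize(column):
    """Subtract the mean and divide by the standard deviation."""
    n = len(column)
    mean = sum(column) / n
    # Divisor n is an assumption here; using n - 1 rescales every column
    # by the same factor and leaves the correlation matrix unchanged.
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]

# Hypothetical asset figures (illustrative values only).
assets = [120.0, 450.0, 80.0, 3100.0, 950.0]

z = standardize(assets)
z_mean = sum(z) / len(z)
z_var = sum(v * v for v in z) / len(z)
print(z_mean, z_var)  # mean close to 0, variance close to 1
```

After this step each variable has mean zero and unit variance, so the covariance matrix of the transformed data is the correlation matrix of the original data.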

Calculating the corresponding vector of eigenvalues gives

\begin{displaymath}
\ell =\left( 5.039, 0.517, 0.359, 0.050, 0.029, 0.007 \right)^{\top}
\end{displaymath}

and the matrix of eigenvectors is

\begin{displaymath}
\data{G} = \left(\begin{array}{rrrrrr}
0.340 &-0.849 &-0.339...
...0.010 & 0.726 & 0.548 & 0.098 & 0.065\\
\end{array} \right).
\end{displaymath}

Using this information, the first two principal components are plotted in Figure 9.15. The different sectors are marked by the following symbols:

H ... Hi Tech and Communication
E ... Energy
F ... Finance
M ... Manufacturing
R ... Retail
$\star$ ... all other sectors.

Figure 9.15: Principal components of the U.S. company data. MVAnpcausco.xpl
\includegraphics[width=1\defpicwidth]{pcaus1neu.ps}

Figure 9.16: Principal components of the U.S. company data (without IBM and General Electric). MVAnpcausco2.xpl
\includegraphics[width=1\defpicwidth]{pcaus2neu.ps}

The two outliers on the right-hand side of the graph are IBM and General Electric (GE), which differ from the other companies by their high market values. As can be seen in the first column of ${\data{G}}$, market value has the largest weight in the first PC, adding to the isolation of these two companies. If IBM and GE are excluded from the data set, a completely different picture emerges, as shown in Figure 9.16. In this case the vector of eigenvalues becomes

\begin{displaymath}\ell =\left( 3.191, 1.535, 0.791, 0.292, 0.149, 0.041 \right)^{\top},\end{displaymath}

and the corresponding matrix of eigenvectors is

\begin{displaymath}\data{G} = \left(\begin{array}{rrrrrr}
0.263 &-0.408 &-0.800 ...
...0.277 & 0.558 & 0.021 & 0.575 & 0.313\\
\end{array} \right) . \end{displaymath}


Table 9.9: Eigenvalues and proportions of explained variance.
\begin{tabular}{rrr}
$\ell_j$ & proportion of variance & cumulated proportion\\
\hline
3.191 & 0.532 & 0.532\\
1.535 & 0.256 & 0.788\\
0.791 & 0.132 & 0.920\\
0.292 & 0.049 & 0.968\\
0.149 & 0.025 & 0.993\\
0.041 & 0.007 & 1.000\\
\end{tabular}


The percentage of variation explained by each component is given in Table 9.9. The first two components explain almost 79% of the variance. The interpretation of the factors (the axes of Figure 9.16) is given in the table of correlations (Table 9.10). The first two columns of this table are plotted in Figure 9.17.
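The effect of removing the two outliers can be quantified by comparing the share of variance captured by the leading PCs in both analyses. The Python sketch below uses the two eigenvalue vectors quoted in this example; the function name `cumulated_share` is a hypothetical helper.

```python
def cumulated_share(eigenvalues, q):
    """Proportion of total variance explained by the first q PCs."""
    return sum(eigenvalues[:q]) / sum(eigenvalues)

# Eigenvalues quoted in the text for the full data set and for the
# data set without IBM and General Electric.
all_companies = [5.039, 0.517, 0.359, 0.050, 0.029, 0.007]
without_ibm_ge = [3.191, 1.535, 0.791, 0.292, 0.149, 0.041]

for q in (1, 2):
    print(q,
          round(cumulated_share(all_companies, q), 3),
          round(cumulated_share(without_ibm_ge, q), 3))
```

With all companies, the first PC alone carries more than 80% of the variance; without the two outliers, the variance is spread more evenly and the second PC becomes informative.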


Table 9.10: Correlations with PCs.
\begin{tabular}{lrrr}
 & $r_{X_iZ_1}$ & $r_{X_iZ_2}$ & $r^2_{X_iZ_1} + r^2_{X_iZ_2}$\\
\hline
$X_1$: assets & 0.47 & $-$0.510 & 0.48\\
$X_2$: sales & 0.78 & $-$0.500 & 0.87\\
$X_3$: market value & 0.89 & $-$0.003 & 0.80\\
$X_4$: profits & 0.59 & 0.770 & 0.95\\
$X_5$: cash flow & 0.79 & 0.560 & 0.94\\
$X_6$: employees & 0.76 & $-$0.340 & 0.70\\
\end{tabular}


Figure 9.17: The correlation of the original variables with the PCs. MVAnpcausco2i.xpl
\includegraphics[width=1\defpicwidth]{npcausco2i.ps}

From Figure 9.17 (and Table 9.10) it appears that the first factor is a ``size effect'': it is positively correlated with all the variables describing the size of the activity of the companies. It is also a measure of the economic strength of the firms. The second factor describes the ``shape'' of the companies (a ``profit-cash flow'' vs. ``assets-sales'' factor), which is more difficult to interpret from an economic point of view.
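The ``size'' and ``shape'' reading can be checked directly against the signs of the correlations. The Python sketch below uses the values of the correlation table for the data without IBM and GE; the cut-off `threshold` for a negligible correlation is an arbitrary assumption made for illustration.

```python
# Correlations with Z1 and Z2 as listed in the correlation table above.
correlations = {
    "assets": (0.47, -0.510),
    "sales": (0.78, -0.500),
    "market value": (0.89, -0.003),
    "profits": (0.59, 0.770),
    "cash flow": (0.79, 0.560),
    "employees": (0.76, -0.340),
}

# "Size effect": every variable correlates positively with Z1.
size_effect = all(r1 > 0 for r1, _ in correlations.values())

# "Shape": split the variables by the sign of their correlation with Z2,
# ignoring negligible correlations (the threshold is an arbitrary choice).
threshold = 0.1
positive = [name for name, (_, r2) in correlations.items() if r2 > threshold]
negative = [name for name, (_, r2) in correlations.items() if r2 < -threshold]

print(size_effect, positive, negative)
```

Profits and cash flow fall on the positive side of the second axis, while assets, sales and employees fall on the negative side; market value is essentially unrelated to it.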

EXAMPLE 9.10   Volle (1985) analyzes data on 28 individuals (Table B.14). For each individual, the time spent (in hours) on 10 different activities has been recorded over 100 days, as well as informative statistics such as the individual's sex, country of residence, professional activity and marital status. The results of an NPCA are given below.

Figure 9.18: Representation of the individuals. MVAnpcatime.xpl
\includegraphics[width=1\defpicwidth]{neuesbild1.ps}

Figure 9.19: Representation of the variables. MVAnpcatime.xpl
\includegraphics[width=1\defpicwidth]{npcatime2.ps}


Table 9.11: Eigenvalues of correlation matrix for the time budget data.
\begin{tabular}{rrr}
$\ell_j$ & proportion of variance & cumulated proportion\\
\hline
4.59 & 0.459 & 0.460\\
2.12 & 0.212 & 0.670\\
1.32 & 0.132 & 0.800\\
1.20 & 0.120 & 0.920\\
0.47 & 0.047 & 0.970\\
0.20 & 0.020 & 0.990\\
0.05 & 0.005 & 0.990\\
0.04 & 0.004 & 0.999\\
0.02 & 0.002 & 1.000\\
0.00 & 0.000 & 1.000\\
\end{tabular}


The eigenvalues of the correlation matrix are given in Table 9.11. Note that the last eigenvalue is exactly zero since the correlation matrix is singular (the sum of all the variables is always equal to $2400 =24\times 100$). The results for the first four PCs are given in Tables 9.12 and 9.13.
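The reason for the zero eigenvalue can be made explicit. If the variables always add up to a constant, the centered variables add up to zero, so the covariance matrix $\data{S}$ satisfies $\data{S}1_p = 0$; the correlation matrix then annihilates the vector of standard deviations. The Python sketch below checks this on a small hypothetical time budget with three activities and 24 hours per row (the data and names are assumptions, not the values of Table B.14).

```python
import math

# Hypothetical time budgets over 3 activities; every row sums to 24 hours.
data = [
    [8.0, 9.0, 7.0],
    [6.0, 10.0, 8.0],
    [9.5, 7.5, 7.0],
    [7.0, 8.0, 9.0],
    [5.0, 11.0, 8.0],
]
n, p = len(data), len(data[0])

means = [sum(row[j] for row in data) / n for j in range(p)]
centered = [[row[j] - means[j] for j in range(p)] for row in data]

# Covariance matrix S and standard deviations s_j.
S = [[sum(r[i] * r[j] for r in centered) / n for j in range(p)]
     for i in range(p)]
s = [math.sqrt(S[j][j]) for j in range(p)]

# Correlation matrix R = D^{-1} S D^{-1} with D = diag(s_1, ..., s_p).
R = [[S[i][j] / (s[i] * s[j]) for j in range(p)] for i in range(p)]

# Every centered row sums to zero, so S 1 = 0 and hence R s = 0:
# s is an eigenvector of R with eigenvalue 0.
R_times_s = [sum(R[i][j] * s[j] for j in range(p)) for i in range(p)]
print(R_times_s)  # all entries vanish up to floating-point error
```

The same argument applies to the ten activities of the time budget data, whose total is fixed at 2400 hours.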


Table 9.12: Correlation of variables with PCs.
\begin{tabular}{lrrrr}
 & $r_{X_iZ_1}$ & $r_{X_iZ_2}$ & $r_{X_iZ_3}$ & $r_{X_iZ_4}$\\
\hline
$X_1$: prof & 0.9772 & $-$0.1210 & $-$0.0846 & 0.0669\\
$X_2$: tran & 0.9798 & 0.0581 & $-$0.0084 & 0.4555\\
$X_3$: hous & $-$0.8999 & 0.0227 & 0.3624 & 0.2142\\
$X_4$: kids & $-$0.8721 & 0.1786 & 0.0837 & 0.2944\\
$X_5$: shop & $-$0.5636 & 0.7606 & $-$0.0046 & $-$0.1210\\
$X_6$: pers & $-$0.0795 & 0.8181 & $-$0.3022 & $-$0.0636\\
$X_7$: eati & $-$0.5883 & $-$0.6694 & $-$0.4263 & 0.0141\\
$X_8$: slee & $-$0.6442 & $-$0.5693 & $-$0.1908 & $-$0.3125\\
$X_9$: tele & $-$0.0994 & 0.1931 & $-$0.9300 & 0.1512\\
$X_{10}$: leis & $-$0.0922 & 0.1103 & 0.0302 & $-$0.9574\\
\end{tabular}



Table 9.13: PCs for time budget data.
\begin{tabular}{lrrrr}
 & $Z_1$ & $Z_2$ & $Z_3$ & $Z_4$\\
\hline
maus & 0.0633 & 0.0245 & $-$0.0668 & 0.0205\\
waus & 0.0061 & 0.0791 & $-$0.0236 & 0.0156\\
wnus & $-$0.1448 & 0.0813 & $-$0.0379 & $-$0.0186\\
mmus & 0.0635 & 0.0105 & $-$0.0673 & 0.0262\\
wmus & $-$0.0934 & 0.0816 & $-$0.0285 & 0.0038\\
msus & 0.0537 & 0.0676 & $-$0.0487 & $-$0.0279\\
wsus & 0.0166 & 0.1016 & $-$0.0463 & $-$0.0053\\
mawe & 0.0420 & $-$0.0846 & $-$0.0399 & $-$0.0016\\
wawe & $-$0.0111 & $-$0.0534 & $-$0.0097 & 0.0337\\
wnwe & $-$0.1544 & $-$0.0583 & $-$0.0318 & $-$0.0051\\
mmwe & 0.0402 & $-$0.0880 & $-$0.0459 & 0.0054\\
wmwe & $-$0.1118 & $-$0.0710 & $-$0.0210 & 0.0262\\
mswe & 0.0489 & $-$0.0919 & $-$0.0188 & $-$0.0365\\
wswe & $-$0.0393 & $-$0.0591 & $-$0.0194 & $-$0.0534\\
mayo & 0.0772 & $-$0.0086 & 0.0253 & $-$0.0085\\
wayo & 0.0359 & 0.0064 & 0.0577 & 0.0762\\
wnyo & $-$0.1263 & $-$0.0135 & 0.0584 & $-$0.0189\\
mmyo & 0.0793 & $-$0.0076 & 0.0173 & $-$0.0039\\
wmyo & $-$0.0550 & $-$0.0077 & 0.0579 & 0.0416\\
msyo & 0.0763 & 0.0207 & 0.0575 & $-$0.0778\\
wsyo & 0.0120 & 0.0149 & 0.0532 & $-$0.0366\\
maes & 0.0767 & $-$0.0025 & 0.0047 & 0.0115\\
waes & 0.0353 & 0.0209 & 0.0488 & 0.0729\\
wnes & $-$0.1399 & 0.0016 & 0.0240 & $-$0.0348\\
mmes & 0.0742 & $-$0.0061 & $-$0.0152 & 0.0283\\
wmes & $-$0.0175 & 0.0073 & 0.0429 & 0.0719\\
mses & 0.0903 & 0.0052 & 0.0379 & $-$0.0701\\
wses & 0.0020 & 0.0287 & 0.0358 & $-$0.0346\\
\end{tabular}


From these tables (and Figures 9.18 and 9.19), it appears that the professional and household activities are strongly contrasted in the first factor. Indeed, on the horizontal axis of Figure 9.18 it can be seen that all the active men are on the right and all the inactive women are on the left. Active women and/or single women are in between. The second factor contrasts meal/sleeping ($X_7$, $X_8$) vs. toilet/shopping ($X_6$, $X_5$); note the high correlation between meal and sleeping. Along the vertical axis of Figure 9.18 we see near the bottom of the graph the people from Western European countries, who spend more time on meals and sleeping than people from the U.S. (who can be found close to the top of the graph). The other categories are in between.

In Figure 9.19 the variables television and other leisure activities hardly play any role (see Table 9.12). The variable television appears in $Z_3$ (negatively correlated). Table 9.13 shows that this factor contrasts people from Eastern countries and Yugoslavia with men living in the U.S. The variable other leisure activities essentially determines the factor $Z_4$. It merely distinguishes between men and women in Eastern countries and in Yugoslavia. These last two factors are orthogonal to the preceding axes and, of course, their contribution to the total variation is less important.