13.3 Example: Eye-Hair


13.3.1 Description of Data

The data set given in Table 13.1 is a contingency table of hair colors (4 categories) and eye colors (4 categories) for 592 women (Lebart, L., Morineau, A., and Piron, M.; 1995).


Table 13.1: Contingency table for eye-hair color data.
EYE \ HAIR COLOR   black   brown    red   blond   total
dark brown            68     119     26       7     220
light brown           15      54     14      10      93
green                  5      29     14      16      64
blue                  20      84     17      94     215
total                108     286     71     127     592
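As a quick consistency check on Table 13.1, the margins can be recomputed from the cell counts. The sketch below uses plain Python rather than XploRe; the variable names are illustrative and not part of the original example.

```python
# Cell counts from Table 13.1: rows are eye colors, columns are hair colors.
eye = ["dark brown", "light brown", "green", "blue"]
hair = ["black", "brown", "red", "blond"]
N = [
    [68, 119, 26, 7],
    [15, 54, 14, 10],
    [5, 29, 14, 16],
    [20, 84, 17, 94],
]

# Row margins, column margins and the grand total.
row_totals = [sum(row) for row in N]
col_totals = [sum(col) for col in zip(*N)]
n = sum(row_totals)

for label, total in zip(eye, row_totals):
    print(f"{label:12s} {total:4d}")
print("column totals:", dict(zip(hair, col_totals)))
print("grand total:", n)
```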



13.3.2 Calling the Quantlet

The following XploRe code shows how to run correspondence analysis using the quantlet corresp.

  library("stats")
  corresp("e.dat","null","null","EYE-HAIR","eltxt.dat",
                            "ectxt.dat","null","null","null")
XAGcorre01.xpl

In this example, we use the active data file e.dat, which contains the eye-hair contingency table given in Table 13.1:

  68 119 26 7
  15 54 14 10
  5  29 14 16
  20 84 17 94
Row labels are given in the file eltxt.dat :
  dark-brown
  light-brown
  green
  blue

Column labels are in the file ectxt.dat :

 BLACK
 BROWN
 RED
 BLOND


13.3.3 Documentation of Results

The output of CA from XAGcorre01.xpl is shown in the output window. In this example we obtain three factors in total, since a 4 x 4 table yields at most min(4, 4) - 1 = 3 nontrivial factors: three eigenvalues and three coordinates for each row (column) item.


13.3.4 Eigenvalues

The eigenvalues $ g_{1},g_{2},\dots,g_{u}$ give the part of the total variation recovered by the first, second, ..., $ u$-th factor. They help in choosing the number of factors (or axes, in the geometrical representation) to retain.
  [1,] EIGENVALUES AND PERCENTAGES
  Contents of seig
 
  [1,]   0.2088   89.3727   89.3727
  [2,]   0.0222    9.5149   98.8876
  [3,]   0.0026    1.1124  100.0000
We see that the first factor alone explains nearly $ 90\%$ of the total variation in this contingency table. Multiplied by $ n = 592$, the total variation gives the Pearson chi-square statistic, $ (0.2088 + 0.0222 + 0.0026) \times 592 = 138.3$.
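The eigenvalues and percentages can be reproduced outside XploRe. The following is a minimal sketch in Python with NumPy, assuming the standard formulation of correspondence analysis as a singular value decomposition of the standardized residuals; it is not the code of the corresp quantlet, and all names are illustrative.

```python
import numpy as np

# Counts from Table 13.1 (rows: eye colors, columns: hair colors).
N = np.array([[68, 119, 26, 7],
              [15, 54, 14, 10],
              [5, 29, 14, 16],
              [20, 84, 17, 94]], dtype=float)

n = N.sum()
P = N / n               # relative frequencies
r = P.sum(axis=1)       # row weights
c = P.sum(axis=0)       # column weights

# Standardized residuals with respect to the independence model.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Squared singular values are the eigenvalues; the fourth one is zero
# up to rounding, so only min(4, 4) - 1 = 3 are kept.
sing = np.linalg.svd(S, compute_uv=False)
eig = sing[:3] ** 2
pct = 100 * eig / eig.sum()
cum = np.cumsum(pct)
chi2 = n * eig.sum()    # Pearson chi-square statistic

for g, p, q in zip(eig, pct, cum):
    print(f"{g:8.4f} {p:9.4f} {q:9.4f}")
print("chi-square:", round(chi2, 1))
```

Working with the singular values of the residual matrix avoids forming and diagonalizing a cross-product matrix explicitly, which is a common design choice for small tables like this one.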


13.3.5 Contributions

13.3.5.1 Global Contributions of Rows (Resp. Columns)

From the formula of Pearson's chi-square (here divided by $ n$), the total variation can be decomposed additively across row (resp. column) items. This yields the global row (resp. column) contributions to the total variation. In the geometrical representation of row (resp. column) profiles in a $ u$-dimensional Euclidean space, with the marginal row (resp. column) profile taken as the origin, the global contribution of a row (resp. column) is equal to its squared distance to the origin times its relative weight (say $ {n_{i\bullet}}/{n}$ for row $ i$). The squared distance itself is useful to see how much a row item deviates from what is expected under independence.

  [1,] "Row relative weights and distances to the origin"

  Contents of spdai

  [1,]   0.3716    0.0206
  [2,]   0.1571    0.0119
  [3,]   0.1081    0.0159
  [4,]   0.3632    0.0228

  [1,] Column relative weights and distances to the origin

  Contents of spdaj

  [1,]   0.1824    0.0227
  [2,]   0.4831    0.0066
  [3,]   0.1199    0.0146
  [4,]   0.2145    0.0345
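The relative weights and distances can be sketched in Python with NumPy as follows, assuming the usual chi-square distance between a profile and the marginal profile. The names are illustrative. The distances printed by XploRe appear to be on a different scale than the raw chi-square distance: they look consistent with the distance divided by the square root of $ n$, which is treated here as an assumption and labeled as such in the code.

```python
import numpy as np

N = np.array([[68, 119, 26, 7],
              [15, 54, 14, 10],
              [5, 29, 14, 16],
              [20, 84, 17, 94]], dtype=float)

n = N.sum()
P = N / n
r = P.sum(axis=1)                 # row relative weights
c = P.sum(axis=0)                 # column relative weights

row_profiles = P / r[:, None]     # each row sums to 1
col_profiles = P / c[None, :]     # each column sums to 1

# Squared chi-square distances of the profiles to the marginal profile.
d2_rows = (((row_profiles - c) ** 2) / c).sum(axis=1)
d2_cols = (((col_profiles - r[:, None]) ** 2) / r[:, None]).sum(axis=0)

# Global contributions: weight times squared distance.
global_rows = r * d2_rows
global_cols = c * d2_cols

# Assumption: rescaling that seems to match the printed XploRe values.
scaled_rows = np.sqrt(d2_rows / n)
scaled_cols = np.sqrt(d2_cols / n)

print(np.column_stack([r, scaled_rows]).round(4))
print(np.column_stack([c, scaled_cols]).round(4))
print("total variation:", round(global_rows.sum(), 4))
```

Both sets of global contributions sum to the same total variation, which is the sum of the eigenvalues.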

13.3.5.2 Contributions of Rows Or Columns to a Factor

It is interesting to know how much each row (or column) contributes to the variation pertaining to a given factor. These specific contributions help to interpret the factor in terms of contrasts between row (or column) items. They are usually given as percentages of the total variation of the factor (i.e. of the corresponding eigenvalue).

  [1,] Coordinates of the columns

  Contents of scoordj
 
  [1,]   -0.0207   -0.0088    0.0023
  [2,]   -0.0061    0.0013   -0.0020
  [3,]   -0.0053    0.0131    0.0034
  [4,]    0.0343   -0.0029    0.0007

  [1,] Contributions of the columns 
 
  Contents of scontrj
 
  [1,]   22.2463   37.8774   21.6330
  [2,]    5.0860    2.3194   44.2838
  [3,]    0.9637   55.1305   31.9125
  [4,]   71.7039    4.6727    2.1706
The coordinates on the first axis show that blond hair (4th column item) is opposed to all the other hair colors, in particular to black hair (1st column item). The first factor can essentially be explained by a strong contrast between blond and black hair in terms of eye color (respective contributions of 71.7% and 22.2%).

The second axis, whose share of the total variation (9.5%) is almost ten times smaller than that of the first axis (89.4%), is mainly constructed by red hair (55.1%) as opposed to black hair (37.9%). The third factor makes a negligible contribution to the total variation (1.1%).

  [1,] Coordinates of the rows

  Contents of scoordi

  [1,]   -0.0202   -0.0036    0.0009
  [2,]   -0.0087    0.0069   -0.0041
  [3,]    0.0066    0.0139    0.0036
  [4,]    0.0225   -0.0034   -0.0002

  [1,] Contributions of the rows
 
  Contents of scontri

  [1,]   43.1157   13.0425    6.6796
  [2,]    3.4010   19.8040   61.0856
  [3,]    1.3549   55.9095   31.9248
  [4,]   52.1284   11.2440    0.3100
For the row items, the first axis is constructed almost solely by the eye colors dark brown (1st row item) and blue (4th row item), with contributions of 43.1% and 52.1%, respectively. The coordinates show that they are opposed in terms of hair profile. The second axis is mainly due to green eye color (3rd row item), with a contribution of 55.9%.
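The contributions of rows and columns to each factor can be recomputed as weight times squared coordinate, divided by the eigenvalue. The sketch below uses Python with NumPy and the standard SVD formulation of correspondence analysis; it is an illustration under that assumption, not the quantlet's own code. The signs of the axes are arbitrary in an SVD, so only the contributions and the relative signs of coordinates are meaningful.

```python
import numpy as np

N = np.array([[68, 119, 26, 7],
              [15, 54, 14, 10],
              [5, 29, 14, 16],
              [20, 84, 17, 94]], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Keep the three nontrivial axes.
U, sing, Vt = np.linalg.svd(S, full_matrices=False)
U, sing, V = U[:, :3], sing[:3], Vt[:3].T
eig = sing ** 2

# Principal coordinates of rows (F) and columns (G).
F = (U * sing) / np.sqrt(r)[:, None]
G = (V * sing) / np.sqrt(c)[:, None]

# Contributions to each factor, as percentages of its eigenvalue.
contr_rows = 100 * r[:, None] * F ** 2 / eig
contr_cols = 100 * c[:, None] * G ** 2 / eig

print(contr_cols.round(4))
print(contr_rows.round(4))
```

Each column of `contr_rows` and `contr_cols` sums to 100, so the entries can be read directly as shares of the corresponding axis.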

13.3.5.3 Squared Correlations

The global contribution of a given row (resp. column) may itself be decomposed additively across the $ u$ factors into terms called squared correlations (by analogy with PCA) when expressed as fractions of that global contribution. Squared correlations are useful to determine how well the variation of each row (resp. column) is recovered on a factor or on a restricted number of factors (or axes in a geometrical representation). This helps to guard against illusory proximities of points (row or column profiles) in mappings.
  [1,] Squared correlations of the rows

  Contents of scorri

  [1,]    0.9670    0.0311    0.0019
  [2,]    0.5424    0.3363    0.1213
  [3,]    0.1759    0.7726    0.0516
  [4,]    0.9775    0.0224    0.0001

  [1,] Squared correlations of the columns

  Contents of scorrj
 
  [1,]    0.8380    0.1519    0.0101
  [2,]    0.8644    0.0420    0.0937
  [3,]    0.1333    0.8118    0.0549
  [4,]    0.9927    0.0069    0.0004
From these correlations it can be inferred, for instance, that blond hair color is represented almost exclusively by factor 1 (squared correlation 0.9927).
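Squared correlations can be sketched as the squared coordinate on an axis divided by the squared distance of the item to the origin. The Python/NumPy code below assumes the standard SVD formulation of correspondence analysis and uses illustrative names; it is not taken from the quantlet.

```python
import numpy as np

N = np.array([[68, 119, 26, 7],
              [15, 54, 14, 10],
              [5, 29, 14, 16],
              [20, 84, 17, 94]], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, sing, Vt = np.linalg.svd(S, full_matrices=False)
U, sing, V = U[:, :3], sing[:3], Vt[:3].T

F = (U * sing) / np.sqrt(r)[:, None]   # row principal coordinates
G = (V * sing) / np.sqrt(c)[:, None]   # column principal coordinates

# Squared distance to the origin is the sum of squared coordinates
# over all three nontrivial axes.
cos2_rows = F ** 2 / (F ** 2).sum(axis=1, keepdims=True)
cos2_cols = G ** 2 / (G ** 2).sum(axis=1, keepdims=True)

print(cos2_rows.round(4))
print(cos2_cols.round(4))
```

Each row of the two result matrices sums to one, which is why an item with a high value on the first axis necessarily has low values on the others.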


13.3.6 Biplots

A simultaneous representation of row and column items in the same mapping has some interesting interpretational aspects. When row $ i$ and column $ j$, say, are represented by points in the same (resp. opposite) direction with respect to the origin, it means that $ n_{ij}$ is above (resp. below) the value expected under independence, provided that the sum of their squared correlations on the first two factors is sufficiently high for each of them.
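The comparison with independence can be checked directly on the counts of Table 13.1. This plain-Python sketch (illustrative names, no XploRe involved) computes the expected counts $ n_{i\bullet} n_{\bullet j}/n$ and the ratio of observed to expected counts for two cells.

```python
# Counts from Table 13.1 (rows: eye colors, columns: hair colors).
eye = ["dark brown", "light brown", "green", "blue"]
hair = ["black", "brown", "red", "blond"]
N = [
    [68, 119, 26, 7],
    [15, 54, 14, 10],
    [5, 29, 14, 16],
    [20, 84, 17, 94],
]

row_totals = [sum(row) for row in N]
col_totals = [sum(col) for col in zip(*N)]
n = sum(row_totals)

# Expected counts under independence and observed/expected ratios.
expected = [[ri * cj / n for cj in col_totals] for ri in row_totals]
ratio = [[N[i][j] / expected[i][j] for j in range(4)] for i in range(4)]

for i, j in [(3, 3), (0, 3)]:
    print(f"{eye[i]} eyes, {hair[j]} hair: observed {N[i][j]}, "
          f"expected {expected[i][j]:.1f}, ratio {ratio[i][j]:.2f}")
```

A ratio above one corresponds to a pair of points lying in the same direction from the origin, a ratio below one to points in opposite directions, subject to the quality condition stated above.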

13.3.6.1 Asking for Graph

Results of the analysis can be visualized in different graphs:
\includegraphics[scale=0.8]{graph}
We can visualize the configuration of the items on any two axes. The importance of an axis is proportional to the variation it explains, which is measured by its eigenvalue. If $ u > 5$, then only the first five axes are available to choose from.
\includegraphics[scale=0.8]{axes}
We can select different items to display in graphs:
\includegraphics[scale=0.8]{visual}
The requested graph (XAGcorre01.xpl) is shown in Figure 13.1.

Figure 13.1: CA for the eye-hair example.
\includegraphics[scale=0.6]{eye}

The graph of the first two coordinates shows the suggestive features of a simultaneous representation of row and column items in the same mapping. It allows us to interpret the proximities or distances between items of the same set together with their associations with items of the other set.

13.3.6.2 Supplementary Items

It is possible to project additional rows or columns onto the various factors without having these elements enter the construction of the factors, as opposed to the so-called active items. This may be useful for various reasons: to get some exogenous explanations of features revealed in the data, to set aside a row or column item that is too influential (in particular an item with low frequencies), to see the positions of several items forming a natural group, etc.
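Projecting a supplementary row can be sketched with the usual transition formula: the row's profile is multiplied by the standard coordinates of the columns. The Python/NumPy code below assumes the standard SVD formulation; `project_row` is a hypothetical helper, not a function of XploRe. To avoid inventing data, the supplementary row is the pooled group "brown eyes", obtained by adding the dark brown and light brown rows of Table 13.1.

```python
import numpy as np

N = np.array([[68, 119, 26, 7],
              [15, 54, 14, 10],
              [5, 29, 14, 16],
              [20, 84, 17, 94]], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, sing, Vt = np.linalg.svd(S, full_matrices=False)
U, sing, V = U[:, :3], sing[:3], Vt[:3].T

F = (U * sing) / np.sqrt(r)[:, None]    # active row coordinates
Gamma = V / np.sqrt(c)[:, None]         # column standard coordinates


def project_row(counts):
    """Project a row of counts onto the factors (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    profile = counts / counts.sum()
    return profile @ Gamma


# Pooled group: dark brown + light brown eyes.
brown_eyes = N[0] + N[1]
sup = project_row(brown_eyes)
print(sup.round(4))
```

Two properties make this a useful check: projecting an active row this way returns its own coordinates, and a pooled row lands at the weighted average of the rows it combines.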


13.3.7 Brief Remark

Why is the position of the hair color blond more extreme than that of the eye color blue on the first, dominant axis? Because blond hair is much more strongly characterized by blue eyes than blue eyes are by blond hair: as can be seen from the data, 74% of blond people have blue eyes, while only 44% of people with blue eyes have blond hair.
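The two percentages follow directly from the blue/blond cell of Table 13.1 and its margins, as this short plain-Python check illustrates (names are illustrative).

```python
# Counts taken from Table 13.1.
blue_and_blond = 94     # blue eyes and blond hair
blond_total = 127       # column total for blond hair
blue_total = 215        # row total for blue eyes

# Share of blond people with blue eyes, and of blue-eyed people
# with blond hair.
blue_given_blond = blue_and_blond / blond_total
blond_given_blue = blue_and_blond / blue_total

print(f"blue eyes among blond: {100 * blue_given_blond:.1f}%")
print(f"blond among blue eyes: {100 * blond_given_blue:.1f}%")
```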