13.1 Motivation

The aim of correspondence analysis is to develop simple indices that show relations between the rows and columns of a contingency table. Contingency tables are useful for describing the association between two variables in very general situations. The two variables can be qualitative (nominal), in which case they are also referred to as categorical variables. Each row and each column in the table represents one category of the corresponding variable. The entry $x_{ij}$ in the table ${\data{X}}$ (with dimension $(n\times p)$) is the number of observations in a sample which simultaneously fall in the $i$-th row category and the $j$-th column category, for $i=1,\ldots ,n$ and $j=1,\ldots , p$. Sometimes a ``category'' of a nominal variable is also called a ``modality'' of the variable.

The variables of interest can also be discrete quantitative variables, such as the number of family members or the number of accidents an insurance company had to cover during one year. Here, each possible value of the variable defines a row or a column category. Continuous variables may be taken into account by defining the categories in terms of intervals or classes of values which the variable can take on. Contingency tables can therefore be used in many situations, which makes correspondence analysis a useful tool in many applications.

The graphical relationships between the rows and the columns of the table ${\data{X}}$ that result from correspondence analysis are based on the idea of representing all the row and column categories as points and interpreting the relative positions of these points in terms of the weights corresponding to the column and the row. This is achieved by deriving a system of simple indices providing the coordinates of each row and each column. These row and column coordinates are simultaneously represented in the same graph. It is then easy to see which column categories are more important in the row categories of the table (and the other way around).

As was already alluded to, the construction of the indices is based on an idea similar to that of PCA. In PCA, the total variance is partitioned into independent contributions stemming from the principal components. Correspondence analysis, on the other hand, decomposes a measure of association, typically the total $\chi^2$ value used in testing independence, rather than the total variance.
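
For reference, the $\chi^2$ value mentioned above is usually taken to be the Pearson statistic for testing independence in a two-way table. The display below is a sketch in notation assumed here rather than taken from this section: $t$ denotes the statistic, $x_{i\bullet}$ and $x_{\bullet j}$ the row and column totals, $x_{\bullet\bullet}$ the grand total, and $E_{ij}$ the count expected under independence.

\begin{displaymath}
t = \sum^n_{i=1}\sum^p_{j=1}\frac{(x_{ij}-E_{ij})^2}{E_{ij}}\ , \qquad
E_{ij} = \frac{x_{i\bullet}\, x_{\bullet j}}{x_{\bullet\bullet}}\ .
\end{displaymath}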

EXAMPLE 13.1   The French ``baccalauréat'' frequencies have been classified into regions and different baccalauréat categories, see Appendix, Table B.8. Altogether $n=202100$ baccalauréats were observed. The joint frequency of the region Ile-de-France and the modality Philosophy, for example, is 9724. That is, 9724 baccalauréats were in Ile-de-France and the category Philosophy.

The question is whether certain regions prefer certain baccalauréat types. If we consider, for instance, the region Lorraine, we have the following percentages:

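\begin{displaymath}
\begin{array}{|c|c|c|c|c|c|c|c|}
\hline
\textrm{A} & \textrm{B} & \textrm{C} & \textrm{D} & \textrm{E} & \textrm{F} & \textrm{G} & \textrm{H}\\
\hline
\ldots & \ldots & \ldots & 19.6 & 3.4 & 14.5 & 18.9 & 0.2\\
\hline
\end{array}
\end{displaymath}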

The total percentages of the different modalities of the variable baccalauréat are as follows:

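\begin{displaymath}
\begin{array}{|c|c|c|c|c|c|c|c|}
\hline
\textrm{A} & \textrm{B} & \textrm{C} & \textrm{D} & \textrm{E} & \textrm{F} & \textrm{G} & \textrm{H}\\
\hline
\ldots & \ldots & \ldots & 22.8 & 2.6 & 9.7 & 15.2 & 0.2\\
\hline
\end{array}
\end{displaymath}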

One might argue that the region Lorraine seems to prefer the modalities E, F, G and dislike the specializations A, B, C, D relative to the overall frequency of baccalauréat type.
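
The comparison behind this argument can be made explicit by dividing each regional percentage by the corresponding overall percentage. The sketch below does this only for the five percentages that are legible in each of the two tables above. It assumes these belong to the last five modalities, labelled here D to H; that labelling and the variable names are assumptions made for illustration.

```python
# Percentages legible in the two tables above (assumed to be modalities D..H).
lorraine = {"D": 19.6, "E": 3.4, "F": 14.5, "G": 18.9, "H": 0.2}
overall = {"D": 22.8, "E": 2.6, "F": 9.7, "G": 15.2, "H": 0.2}

# Ratio > 1: modality overrepresented in Lorraine; ratio < 1: underrepresented.
ratio = {m: lorraine[m] / overall[m] for m in lorraine}

for modality, value in ratio.items():
    print(f"{modality}: {value:.2f}")
# D: 0.86
# E: 1.31
# F: 1.49
# G: 1.24
# H: 1.00
```

Ratios above one for E, F and G and below one for D agree with the reading given in the text.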

In correspondence analysis we try to develop an index for the regions so that this over- or underrepresentation can be measured by a single number. Simultaneously we try to weight the regions so that we can see in which region certain baccalauréat types are preferred.

EXAMPLE 13.2   Consider $n$ types of companies and $p$ locations of these companies. Is there a certain type of company that prefers a certain location? Or is there a location index that corresponds to a certain type of company?

Assume that $n=3$, $p=3$, and that the frequencies are as follows:

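\begin{eqnarray*}
\data{X}&=&\left (\begin{array}{ccc} 4&0&2\\
0&1&1\\ 1&1&4\end{array}\right )
\begin{array}{l}
\leftarrow\textrm{\scriptsize type 1}\\
\leftarrow\textrm{\scriptsize type 2 (Energy)}\\
\leftarrow\textrm{\scriptsize type 3 (HiTech)}
\end{array}\\
&&\quad\begin{array}{ccc}
\uparrow&\uparrow&\uparrow\\
\textrm{\scriptsize Frankfurt}&\textrm{\scriptsize location 2}&\textrm{\scriptsize Munich}
\end{array}
\end{eqnarray*}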



The frequencies imply that four type 3 companies (HiTech) are in location 3 (Munich), and so on. Suppose there is a (company) weight vector $r=\left ( r_1, \ldots, r_n \right )^{\top}$ such that a location index $s_j$ could be defined as
\begin{displaymath}
s_j = c \sum ^n_{i=1}r_i\frac{x_{ij} }{x_{\bullet j}}\ , \eqno{(13.1)}
\end{displaymath}

where $x_{\bullet j}=\sum^n_{i=1}x_{ij}$ is the number of companies in location $j$ and $c$ is a constant. For example, $s_{1}$ would give the average weighted frequency (weighted by $r$) of companies in location 1 (Frankfurt).
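
As a small numerical illustration of (13.1), the sketch below evaluates the location index for the frequency table of this example. The weight vector `r = (1, 0, -1)`, the choice `c = 1` and the function name `location_index` are hypothetical and chosen only to keep the arithmetic transparent.

```python
from fractions import Fraction

# Frequency table of Example 13.2 (rows: company types, columns: locations).
X = [[4, 0, 2],
     [0, 1, 1],
     [1, 1, 4]]

r = [1, 0, -1]  # hypothetical company weights, not taken from the text


def location_index(X, r, c=1):
    """Evaluate s_j = c * sum_i r_i * x_ij / x_{.j} as in (13.1)."""
    n, p = len(X), len(X[0])
    col_sums = [sum(X[i][j] for i in range(n)) for j in range(p)]
    return [c * sum(Fraction(r[i] * X[i][j], col_sums[j]) for i in range(n))
            for j in range(p)]


s = location_index(X, r)
print([str(v) for v in s])  # ['3/5', '-1/2', '-2/7']
```

With these weights, location 1 (Frankfurt) receives a positive index because most of its companies are of type 1, which carries the positive weight.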

Given a location weight vector $s^*=\left (s^*_1, \ldots, s^*_p\right )^{\top}$, we can define a company index in the same way as

\begin{displaymath}
r^*_i = c^* \sum ^p_{j=1}s^*_j\frac{x_{ij} }{x_{i \bullet}}\ , \eqno{(13.2)}
\end{displaymath}

where $c^*$ is a constant and $x_{i\bullet} = \sum^p_{j=1}x_{ij}$ is the sum of the $i$-th row of $\data{X}$, i.e., the number of type $i$ companies. Thus $r_2^*$, for example, would give the average weighted frequency (by $s^*$) of energy companies.
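
The company index (13.2) can be evaluated in the same way. The sketch below uses a hypothetical location weight vector `s_star = (1, 0, -1)` with `c_star = 1`, and the function name `company_index` is likewise an assumption. It also checks that a constant weight vector is reproduced unchanged when $c^*=1$, since the conditional frequencies $x_{ij}/x_{i\bullet}$ sum to one within each row.

```python
from fractions import Fraction

# Frequency table of Example 13.2 (rows: company types, columns: locations).
X = [[4, 0, 2],
     [0, 1, 1],
     [1, 1, 4]]


def company_index(X, s_star, c_star=1):
    """Evaluate r*_i = c* * sum_j s*_j * x_ij / x_{i.} as in (13.2)."""
    return [c_star * sum(Fraction(s_star[j] * row[j], sum(row))
                         for j in range(len(row)))
            for row in X]


r_star = company_index(X, [1, 0, -1])   # hypothetical location weights
constant = company_index(X, [1, 1, 1])  # constant weights

print([str(v) for v in r_star])    # ['1/3', '-1/2', '-1/2']
print([str(v) for v in constant])  # ['1', '1', '1']
```

The second result shows that constant weights always solve (13.1) and (13.2) trivially, so an informative pair of weight vectors has to be non-constant.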

If (13.1) and (13.2) can be solved simultaneously for a ``row weight'' vector $r= (r_1, \ldots, r_n)^{\top}$ and a ``column weight'' vector $s= (s_1, \ldots, s_p)^{\top}$, we may represent each row category by $r_i, \; i=1, \ldots, n$ and each column category by $s_j, \; j=1, \ldots, p$ in a one-dimensional graph. If in this graph $r_i$ and $s_j$ are in close proximity (and far from the origin), this would indicate that the $i$-th row category has an important conditional frequency $x_{ij}/x_{\bullet j}$ in (13.1) and that the $j$-th column category has an important conditional frequency $x_{ij}/x_{i \bullet}$ in (13.2). This would indicate a positive association between the $i$-th row and the $j$-th column. A similar line of argument could be used if $r_i$ were very far away from $s_j$ (and far from the origin). This would indicate a small conditional frequency contribution, or a negative association between the $i$-th row and the $j$-th column.
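
One simple way to look for such a simultaneous solution numerically is to alternate between (13.1) and (13.2), a scheme often called reciprocal averaging. The sketch below applies it to the table of Example 13.2. It is an illustration under stated assumptions, not necessarily the construction used later in this chapter: the starting vector is arbitrary, and the row weights are recentered and rescaled in every step (using the row proportions $x_{i\bullet}/x_{\bullet\bullet}$ as masses) to avoid the trivial constant solution. All function and variable names are hypothetical.

```python
# Frequency table of Example 13.2 (rows: company types, columns: locations).
X = [[4, 0, 2],
     [0, 1, 1],
     [1, 1, 4]]

n, p = len(X), len(X[0])
row_sums = [sum(row) for row in X]
col_sums = [sum(X[i][j] for i in range(n)) for j in range(p)]
total = sum(row_sums)
row_mass = [x / total for x in row_sums]  # assumed weighting for centering


def column_scores(r):
    """Right-hand side of (13.1) with c = 1."""
    return [sum(r[i] * X[i][j] for i in range(n)) / col_sums[j]
            for j in range(p)]


def row_scores(s):
    """Right-hand side of (13.2) with c* = 1."""
    return [sum(s[j] * X[i][j] for j in range(p)) / row_sums[i]
            for i in range(n)]


def standardize(r):
    """Center r and scale it to unit weighted variance; also return the scale."""
    mean = sum(m * v for m, v in zip(row_mass, r))
    centered = [v - mean for v in r]
    scale = sum(m * v * v for m, v in zip(row_mass, centered)) ** 0.5
    return [v / scale for v in centered], scale


r, _ = standardize([1.0, 2.0, 3.0])  # arbitrary non-constant start
for _ in range(100):
    s = column_scores(r)
    r, shrink = standardize(row_scores(s))

# With c = c* = shrink ** -0.5 both equations hold at the same time.
c = shrink ** -0.5
s = [c * v for v in column_scores(r)]       # (13.1)
r_check = [c * v for v in row_scores(s)]    # (13.2), should reproduce r

print("shrink factor:", round(shrink, 4))
print("row weights r:", [round(v, 4) for v in r])
print("column weights s:", [round(v, 4) for v in s])
```

The factor `shrink` measures how much one pass through (13.1) and (13.2) with $c=c^*=1$ contracts the standardized weights. Choosing $c=c^*$ as its inverse square root compensates for this contraction, so that the resulting $r$ and $s$ satisfy both equations simultaneously and can be plotted on one axis.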

Summary
$\ast$
The aim of correspondence analysis is to develop simple indices that show relations among qualitative variables in a contingency table.
$\ast$
The joint representation of the indices reveals relations among the variables.