The aim of correspondence analysis is to develop simple
indices that show relations between the row and column categories of a contingency
table. Contingency tables are very useful to describe the association between
two variables in very general situations. The two variables can be
qualitative (nominal), in which case they are also referred to as
categorical variables. Each row and each column in the table represents
one category of the corresponding variable. The entry $x_{ij}$ in the table
$\mathcal{X}$ (with dimension $(n \times p)$)
is the number of observations in a sample which
simultaneously fall in the $i$-th row category and the $j$-th column category,
for $i = 1, \ldots, n$ and $j = 1, \ldots, p$. Sometimes a ``category'' of a nominal
variable is also called a ``modality'' of the variable.
The variables of interest can also be discrete quantitative variables, such as the number of family members or the number of accidents an insurance company had to cover during one year. Here, each possible value of the variable defines a row or a column category. Continuous variables may be taken into account by defining the categories in terms of intervals or classes of values which the variable can take on. Thus contingency tables can be used in many situations, which makes correspondence analysis a useful tool in many applications.
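As a minimal sketch of how such a table can be built, the following Python code cross-tabulates a nominal variable with a continuous variable that has first been cut into classes. The sample values, the variable names `region` and `income`, and the class boundaries in `edges` are all hypothetical and serve only to illustrate the construction.

```python
import numpy as np

# Hypothetical sample: one nominal variable and one continuous variable.
region = np.array(["north", "south", "south", "north",
                   "east", "east", "south", "north"])
income = np.array([18.0, 42.5, 31.0, 55.0, 27.5, 61.0, 39.0, 22.0])

# Turn the continuous variable into classes of values.
# Assumed class boundaries: below 30, from 30 to below 50, 50 and above.
edges = np.array([30.0, 50.0])
income_class = np.digitize(income, edges)  # class index 0, 1 or 2

# Map each row category to an integer index (labels come back sorted).
row_labels, row_idx = np.unique(region, return_inverse=True)

# Count the observations falling into each (row, column) cell.
n, p = len(row_labels), len(edges) + 1
table = np.zeros((n, p), dtype=int)
np.add.at(table, (row_idx, income_class), 1)

print(row_labels)
print(table)
```

Each cell of `table` counts the observations that fall simultaneously into one row category and one column category, which is exactly the kind of input correspondence analysis works with.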
The graphical displays that result from correspondence analysis
represent all the row and column categories of the table as points,
and the relative positions of these points are interpreted
in terms of the weights
corresponding to the columns and the rows.
This is achieved by deriving a system of simple indices providing the
coordinates of each row and each column.
These row and column coordinates are simultaneously represented
in the same graph. It is then easy to see which column categories are
more important in the row categories of the table (and the other way around).
As was already alluded to, the construction of the indices is
based on an idea similar to that of PCA.
In PCA, the total variance is partitioned into
independent contributions stemming from the principal components.
Correspondence analysis, on the other hand,
decomposes a measure of association,
typically the total $\chi^2$ value used in testing independence,
rather than decomposing the total variance.
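The decomposition can be illustrated numerically. The sketch below computes the total $\chi^2$ statistic of a small contingency table and then splits it into contributions given by the squared singular values of the matrix of standardized residuals. The table `X` is made up for illustration, and using the singular value decomposition here is one common way to obtain such a decomposition, assumed rather than taken from the text above.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table (illustrative counts only).
X = np.array([[20.0, 5.0, 5.0],
              [6.0, 15.0, 9.0],
              [4.0, 10.0, 26.0]])

total = X.sum()
# Expected counts under independence: (row total * column total) / total.
expected = np.outer(X.sum(axis=1), X.sum(axis=0)) / total

# Total chi-square statistic used in testing independence.
chi2 = ((X - expected) ** 2 / expected).sum()

# Standardized residuals; their squared singular values sum to chi2.
C = (X - expected) / np.sqrt(expected)
sv = np.linalg.svd(C, compute_uv=False)
contributions = sv ** 2

print("chi-square:", chi2)
print("contributions:", contributions)
print("shares:", contributions / chi2)
```

The last contribution is zero up to rounding, because the standardized residuals satisfy one linear constraint per table, so at most $\min(n, p) - 1$ components carry any of the total association.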
The question is whether certain regions prefer certain
baccalauréat types.
If we consider, for instance, the region Lorraine,
we have the following
percentages:
The total percentages of the different modalities of the variable
baccalauréat are as follows:
One might argue that the region Lorraine seems to prefer the modalities E, F, G and dislike the specializations A, B, C, D relative to the overall frequencies of the baccalauréat types.
In correspondence analysis we try to develop an index for the regions so that this over- or underrepresentation can be measured by a single number. Simultaneously we try to weight the regions so that we can see in which regions certain baccalauréat types are preferred.
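The comparison of one row with the overall percentages can be sketched as follows. The counts in `X` are hypothetical and are not the baccalauréat data; the code only shows how row percentages, overall percentages and their ratio are obtained, with a ratio above one indicating overrepresentation of a column category in that row.

```python
import numpy as np

# Hypothetical contingency table: rows play the role of regions,
# columns the role of baccalauréat types (illustrative counts only).
X = np.array([[20.0, 5.0, 5.0],
              [6.0, 15.0, 9.0],
              [4.0, 10.0, 26.0]])

# Percentages within each row (each row sums to one).
row_profiles = X / X.sum(axis=1, keepdims=True)

# Overall percentages of the column categories.
overall = X.sum(axis=0) / X.sum()

# Ratio > 1: category overrepresented in that row; ratio < 1: underrepresented.
ratio = row_profiles / overall

print(np.round(row_profiles, 3))
print(np.round(overall, 3))
print(np.round(ratio, 2))
```

Such a table of ratios still has one number per cell; the indices developed in correspondence analysis aim to condense this information into one number per row and one per column.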
Assume that ,
, and that the frequencies are as follows:
Given a location weight vector
,
we can define a company index in the same way as
If (13.1) and (13.2) can be solved simultaneously
for a ``row weight'' vector $r$ and a
``column weight'' vector $s$, we may
represent each row category by its weight $r_i$ and each
column category by its weight $s_j$
in a one-dimensional graph.
If in this graph $r_i$ and $s_j$
are in close proximity (far from the origin), this would indicate that
the $i$-th row category has an
important conditional frequency
in (13.1)
and that the $j$-th column category
has an important conditional frequency
in (13.2). This would indicate a positive association between
the $i$-th row and the $j$-th column. A similar line of
argument could be used if $r_i$
was very far away from $s_j$ (and far
from the origin). This would indicate a small conditional frequency
contribution, or a negative association between the $i$-th row
and the $j$-th column.
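A simultaneous solution of this kind can be sketched numerically under an explicit assumption about the form of the two relations: that each column weight is proportional to the average of the row weights, weighted by the conditional frequencies within that column, and each row weight is proportional to the corresponding average of the column weights within that row. Under this assumed form, a singular value decomposition of the table scaled by its row and column totals yields such a pair of weight vectors. The table `X` and all variable names below are hypothetical.

```python
import numpy as np

# Hypothetical contingency table (illustrative counts only).
X = np.array([[20.0, 5.0, 5.0],
              [6.0, 15.0, 9.0],
              [4.0, 10.0, 26.0]])

a = X.sum(axis=1)  # row totals
b = X.sum(axis=0)  # column totals

# Scale each cell by the square roots of its row and column totals.
Z = X / np.sqrt(np.outer(a, b))
U, sv, Vt = np.linalg.svd(Z)

# The leading singular value equals 1 and gives constant weights
# (a trivial solution), so the second component is used instead.
k = 1
r = U[:, k] / np.sqrt(a)   # row weights
s = Vt[k, :] / np.sqrt(b)  # column weights
c = 1.0 / sv[k]            # proportionality constant (the same in both relations here)

# Assumed form of the two relations: weighted averages of the other set of weights.
s_from_r = c * (X.T @ r) / b
r_from_s = c * (X @ s) / a

print("row weights:   ", np.round(r, 4))
print("column weights:", np.round(s, 4))
```

Plotting `r` and `s` on one axis gives the one-dimensional graph described above. The overall sign of the pair is arbitrary, so only the relative positions of the row and column points are meaningful.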