In practice we have to estimate the covariance matrices $\Sigma_{XX}$, $\Sigma_{XY}$ and $\Sigma_{YY}$.
Let us apply the canonical correlation analysis to the car
marks data (see Table B.7).
In the context of this data set one is interested in relating price
variables with variables such as sportiness, safety, etc.
In particular, we would like to investigate
the relation between the two variables
non-depreciation of value and
price of the car and all other variables.
EXAMPLE 14.1
We perform the canonical correlation analysis on the data matrices $\mathcal{X}$ and $\mathcal{Y}$ that correspond to the sets of variables {Price, Value Stability} and {Economy, Service, Design, Sporty car, Safety, Easy handling}, respectively.
From the estimated covariance matrix $S$ of all eight variables we obtain the blocks $S_{XX}$, $S_{XY}$ and $S_{YY}$.
It is interesting to see that value stability and price have
a negative covariance.
This makes sense since highly priced vehicles tend to lose their
market value at a faster pace than medium priced vehicles.
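Now we estimate $\mathcal{K} = \Sigma_{XX}^{-1/2}\Sigma_{XY}\Sigma_{YY}^{-1/2}$ by
$$\hat{\mathcal{K}} = S_{XX}^{-1/2} S_{XY} S_{YY}^{-1/2}$$
and perform a singular value decomposition of $\hat{\mathcal{K}}$:
$$\hat{\mathcal{K}} = \mathcal{G}\mathcal{L}\mathcal{D}^\top = (g_1, g_2)\,\operatorname{diag}(\ell_1^{1/2}, \ell_2^{1/2})\,(d_1, d_2)^\top,$$
where the $\ell_i$'s are the eigenvalues of $\hat{\mathcal{K}}\hat{\mathcal{K}}^\top$ and $\hat{\mathcal{K}}^\top\hat{\mathcal{K}}$ with $\operatorname{rank}(\hat{\mathcal{K}}) = 2$, and $g_i$ and $d_i$ are the eigenvectors of $\hat{\mathcal{K}}\hat{\mathcal{K}}^\top$ and $\hat{\mathcal{K}}^\top\hat{\mathcal{K}}$, respectively.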
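The estimation steps above can be sketched in a few lines of Python with NumPy. This is a minimal illustration and not the original MVAcancarm quantlet: the data are synthetic (two $X$-variables and six $Y$-variables driven by one common factor), and the helper names `inv_sqrt` and `cca` are hypothetical. The sketch forms the inverse square roots through an eigendecomposition, takes the SVD of $S_{XX}^{-1/2} S_{XY} S_{YY}^{-1/2}$, and maps the singular vectors back to coefficient vectors.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def cca(X, Y):
    """Canonical correlations and coefficient vectors via the SVD of K-hat."""
    p = X.shape[1]
    S = np.cov(np.hstack([X, Y]), rowvar=False)      # joint empirical covariance
    Sxx, Sxy, Syy = S[:p, :p], S[:p, p:], S[p:, p:]
    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    K = Sxx_is @ Sxy @ Syy_is                        # K-hat
    G, sv, Dt = np.linalg.svd(K, full_matrices=False)
    A = Sxx_is @ G                                   # columns a_i = Sxx^{-1/2} g_i
    B = Syy_is @ Dt.T                                # columns b_i = Syy^{-1/2} d_i
    return sv, A, B                                  # sv_i = l_i^{1/2}

# Synthetic data (an assumption of this sketch, not the car marks data)
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 1))
X = latent @ np.array([[1.0, -0.5]]) + 0.5 * rng.normal(size=(n, 2))
Y = latent @ rng.normal(size=(1, 6)) + rng.normal(size=(n, 6))

r, A, B = cca(X, Y)
eta1 = X @ A[:, 0]      # first canonical variable of the X set
phi1 = Y @ B[:, 0]      # first canonical variable of the Y set
print("canonical correlations:", np.round(r, 3))
print("corr(eta1, phi1):", round(np.corrcoef(eta1, phi1)[0, 1], 3))
```

The sample correlation between `eta1` and `phi1` reproduces the first singular value, which is a convenient sanity check. Using the divisor $n-1$ (the `np.cov` default) instead of $n$ does not change the canonical correlations, since the scaling cancels in $\hat{\mathcal{K}}$.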
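The canonical correlation coefficients are $r_i = \ell_i^{1/2}$, $i = 1, 2$.
The high correlation of the first two canonical variables can be seen in Figure
14.1. The first canonical variables are $\hat\eta_1 = \hat a_1^\top x$ and $\hat\varphi_1 = \hat b_1^\top y$, with $\hat a_1 = S_{XX}^{-1/2} g_1$ and $\hat b_1 = S_{YY}^{-1/2} d_1$.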
Note that the variables $y_1$ (economy), $y_2$ (service) and $y_6$ (easy handling) have positive coefficients on $\hat\varphi_1$.
The variables $y_3$ (design), $y_4$ (sporty car) and $y_5$ (safety) have a negative influence on $\hat\varphi_1$.
The canonical variable $\hat\eta_1$ may be interpreted
as a price and value index.
The canonical variable $\hat\varphi_1$ is mainly formed from the qualitative
variables economy, service and
handling with negative weights on design, safety and sportiness.
These variables may therefore be interpreted as an appreciation
of the value of the car.
The sportiness has a negative effect on the price and value index, as do
the design and the safety features.
Figure 14.1: The first canonical variables for the car marks data. MVAcancarm.xpl
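The hypothesis that the two sets of variables $X$ and $Y$ are uncorrelated may be tested (under normality assumptions)
with Wilks' likelihood ratio statistic (Gibbins; 1985):
$$T^{2/n} = \left| \mathcal{I} - S_{YY}^{-1} S_{YX} S_{XX}^{-1} S_{XY} \right| = \prod_{i=1}^{k} (1 - \ell_i).$$
This statistic unfortunately has a rather complicated distribution.
Bartlett (1939) provides an approximation for large $n$:
$$-\{n - (p+q+3)/2\} \log \prod_{i=1}^{k} (1 - \ell_i) \sim \chi^2_{pq}. \qquad (14.14)$$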
A test of the hypothesis that only $s$ of the canonical correlation coefficients
are non-zero may be based (asymptotically) on the statistic
$$-\{n - (p+q+3)/2\} \log \prod_{i=s+1}^{k} (1 - \ell_i) \sim \chi^2_{(p-s)(q-s)}. \qquad (14.15)$$
EXAMPLE 14.2
Consider Example 14.1 again. There are $n$ persons that have
rated the cars according to different categories with $p = 2$ and $q = 6$. The canonical correlation coefficients were found to be $r_1$ and $r_2$.
Bartlett's statistic (14.14) is therefore
$$-\{n - (2+6+3)/2\} \log\{(1 - r_1^2)(1 - r_2^2)\} \sim \chi^2_{12},$$
which is highly significant (the 99% quantile of
the $\chi^2_{12}$ is 26.23).
The hypothesis of no correlation between the variables $X$ and $Y$ is therefore rejected.
Let us now test whether the second canonical correlation coefficient
is different from zero. We use Bartlett's statistic (14.15)
with $s = 1$ and obtain
$$-\{n - (2+6+3)/2\} \log(1 - r_2^2) \sim \chi^2_{5},$$
which is again highly significant
with the $\chi^2_5$ distribution.
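Both test statistics are simple to evaluate once the canonical correlation coefficients are known. The following sketch uses only the Python standard library; the function name `bartlett_statistic` is hypothetical. The dimensions $p = 2$ and $q = 6$ follow the example, while the sample size and the two coefficients are illustrative inputs assumed for this sketch.

```python
import math

def bartlett_statistic(r, n, p, q, s=0):
    """Bartlett's approximation, (14.14) for s = 0 and (14.15) for s > 0.

    r : canonical correlation coefficients in decreasing order
    Returns the statistic and its chi-square degrees of freedom (p - s)(q - s).
    """
    log_prod = sum(math.log(1.0 - ri ** 2) for ri in r[s:])
    stat = -(n - (p + q + 3) / 2.0) * log_prod
    return stat, (p - s) * (q - s)

# p and q as in the example; n and r are illustrative assumptions
n, p, q = 40, 2, 6
r = [0.98, 0.89]

stat_all, df_all = bartlett_statistic(r, n, p, q)        # no correlation at all
stat_2, df_2 = bartlett_statistic(r, n, p, q, s=1)       # only one non-zero coefficient
print(f"(14.14): {stat_all:.2f} on {df_all} degrees of freedom")
print(f"(14.15): {stat_2:.2f} on {df_2} degrees of freedom")
```

With these inputs the first statistic is far above the 99% quantile 26.23 of the $\chi^2_{12}$ distribution, and the second one has 5 degrees of freedom, in line with the two tests described above.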
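The canonical correlation technique may also be applied to qualitative data.
Consider for example the contingency table $\mathcal{N}$ of the French
baccalauréat data. The dataset is given in Table B.8
in Appendix B.8. The CCA cannot be applied directly
to this contingency table since the table does not correspond to the
usual data matrix structure. We may wish, however, to explain the relationship
between the $r$ row and $c$ column categories.
It is possible to represent the data in a $(n \times (r+c))$ data matrix $\mathcal{X} = (\mathcal{X}_1, \mathcal{X}_2)$, where $n$ is the total number of
frequencies in the contingency table $\mathcal{N}$ and $\mathcal{X}_1$ and $\mathcal{X}_2$ are matrices of zero-one dummy variables. More precisely,
let
$$x_{1,ki} = \begin{cases} 1 & \text{if the $k$-th individual belongs to the $i$-th row category,} \\ 0 & \text{otherwise,} \end{cases}$$
and
$$x_{2,kj} = \begin{cases} 1 & \text{if the $k$-th individual belongs to the $j$-th column category,} \\ 0 & \text{otherwise,} \end{cases}$$
where the indices range from $k = 1, \dots, n$, $i = 1, \dots, r$ and $j = 1, \dots, c$.
Denote the cell frequencies by $n_{ij}$ so that $\mathcal{N} = (n_{ij})$ and
note that
$$x_{1(i)}^\top x_{2(j)} = n_{ij},$$
where $x_{1(i)}$ ($x_{2(j)}$) denotes the $i$-th ($j$-th) column of $\mathcal{X}_1$ ($\mathcal{X}_2$).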
EXAMPLE 14.3
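Consider an example contingency table $\mathcal{N}$ together with its dummy matrices $\mathcal{X}_1$ and $\mathcal{X}_2$ and the data matrix $\mathcal{X} = (\mathcal{X}_1, \mathcal{X}_2)$.
The element $n_{12}$ of $\mathcal{N}$ may be obtained by multiplying
the first column of $\mathcal{X}_1$ with the second column of $\mathcal{X}_2$ to yield
$$x_{1(1)}^\top x_{2(2)} = n_{12}.$$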
The purpose is to find the canonical variables $\eta = a^\top x_1$ and $\varphi = b^\top x_2$ that are maximally correlated. Note, however, that $x_1$ has only
one non-zero component and therefore an "individual" may be directly
associated with its canonical variables or score $(a_i, b_j)$. There will
be $n_{ij}$ points at each $(a_i, b_j)$ and the correlation represented
by these points may serve as a measure of dependence between the
rows and columns of $\mathcal{N}$.
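The dummy-variable coding of a contingency table can be made concrete with a small sketch. The $2 \times 2$ table below is hypothetical and the helper name `dummy_matrices` is an assumption of this illustration; the code expands the table into one row per individual and checks that the cross product of the two indicator matrices returns the cell frequencies.

```python
import numpy as np

def dummy_matrices(N):
    """Expand an (r x c) contingency table into indicator matrices X1 (n x r), X2 (n x c)."""
    N = np.asarray(N, dtype=int)
    r, c = N.shape
    rows, cols = np.indices((r, c))
    row_idx = np.repeat(rows.ravel(), N.ravel())   # row category of each individual
    col_idx = np.repeat(cols.ravel(), N.ravel())   # column category of each individual
    X1 = np.eye(r, dtype=int)[row_idx]             # one 1 per row, in the row-category column
    X2 = np.eye(c, dtype=int)[col_idx]             # one 1 per row, in the column-category column
    return X1, X2

N = np.array([[3, 2],
              [1, 4]])                             # hypothetical table, n = 10
X1, X2 = dummy_matrices(N)
X = np.hstack([X1, X2])                            # data matrix (X1, X2), shape n x (r + c)

print(X.shape)
print(X1.T @ X2)                                   # recovers the cell frequencies n_ij
print(X1[:, 0] @ X2[:, 1])                         # first column of X1 times second column of X2
```

Each row of `X1` and of `X2` contains exactly one non-zero entry, which is why an individual in cell $(i, j)$ can be identified with the pair of coefficients belonging to that row and column category.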
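Let $\mathcal{Z} = (\mathcal{X}, \mathcal{Y})$ denote a data matrix constructed from
a contingency table $\mathcal{N}$. Similar to Chapter 12 define
$$c = (x_{1\bullet}, \dots, x_{r\bullet})^\top, \quad x_{i\bullet} = \sum_{j} n_{ij}, \qquad d = (x_{\bullet 1}, \dots, x_{\bullet c})^\top, \quad x_{\bullet j} = \sum_{i} n_{ij},$$
and define $\mathcal{C} = \operatorname{diag}(c)$ and $\mathcal{D} = \operatorname{diag}(d)$.
Suppose that $x_{i\bullet} > 0$ and $x_{\bullet j} > 0$ for all $i$ and $j$. It is not hard to see that
$$\mathcal{Z}^\top \mathcal{H} \mathcal{Z} = n \begin{pmatrix} S_{XX} & S_{XY} \\ S_{YX} & S_{YY} \end{pmatrix} = \begin{pmatrix} \mathcal{C} - n^{-1} c c^\top & \mathcal{N} - \hat{\mathcal{N}} \\ \mathcal{N}^\top - \hat{\mathcal{N}}^\top & \mathcal{D} - n^{-1} d d^\top \end{pmatrix},$$
where $\mathcal{H} = \mathcal{I}_n - n^{-1} 1_n 1_n^\top$ is the centering matrix and $\hat{\mathcal{N}} = c d^\top / n$ is the estimated value of $\mathcal{N}$
under the assumption of independence of the row and column categories.
Note that
$$n S_{XX} 1_r = \mathcal{C} 1_r - n^{-1} c\, c^\top 1_r = c - c = 0$$
and therefore $S_{XX}^{-1}$ does not exist.
The same is true for $S_{YY}^{-1}$. One way out of this difficulty
is to drop one column from both $\mathcal{X}$ and $\mathcal{Y}$, say the first
column.
Let $\bar{c}$ and $\bar{d}$ denote the vectors obtained by deleting the first
component of $c$ and $d$.
Define $\bar{\mathcal{C}}$, $\bar{\mathcal{D}}$ and $\bar{S}_{XX}$, $\bar{S}_{YY}$, $\bar{S}_{XY}$
accordingly and obtain
$$(n \bar{S}_{XX})^{-1} = \bar{\mathcal{C}}^{-1} + x_{1\bullet}^{-1} 1_{r-1} 1_{r-1}^\top, \qquad (n \bar{S}_{YY})^{-1} = \bar{\mathcal{D}}^{-1} + x_{\bullet 1}^{-1} 1_{c-1} 1_{c-1}^\top,$$
so that (14.3) exists. The score associated with an
individual contained in the first row (column) category of $\mathcal{N}$ is 0.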
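A short numerical check may help to see why one category has to be dropped. The sketch below works directly with the centered cross-product blocks of the dummy variables for a hypothetical $3 \times 3$ table (the table and the helper name `inv_sqrt` are assumptions of this illustration). It verifies that the full block for the row dummies is singular, that the reduced block has the closed-form inverse obtained from the Sherman-Morrison formula, and then computes the canonical correlations from the reduced blocks.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

N = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 25.0]])        # hypothetical contingency table
n = N.sum()
c, d = N.sum(axis=1), N.sum(axis=0)      # row and column totals
N_hat = np.outer(c, d) / n               # expected frequencies under independence

# Centered cross-product blocks of the dummy variables
XX = np.diag(c) - np.outer(c, c) / n
YY = np.diag(d) - np.outer(d, d) / n
XY = N - N_hat

# The full block annihilates the vector of ones, so it has no inverse
singular = np.allclose(XX @ np.ones(len(c)), 0.0)

# Drop the first row and the first column category
XXb, YYb, XYb = XX[1:, 1:], YY[1:, 1:], XY[1:, 1:]
closed_form = np.diag(1.0 / c[1:]) + np.ones((len(c) - 1, len(c) - 1)) / c[0]
inverse_ok = np.allclose(np.linalg.inv(XXb), closed_form)

# Canonical correlations from the reduced blocks (any common scaling cancels)
K = inv_sqrt(XXb) @ XYb @ inv_sqrt(YYb)
rho = np.linalg.svd(K, compute_uv=False)

print("full block singular:", singular)
print("closed-form inverse matches:", inverse_ok)
print("canonical correlations:", np.round(rho, 4))
```

Which category is dropped is a matter of convention: it fixes the score of that category at 0 but leaves the canonical correlations unchanged.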
The technique described here for purely qualitative data may also be used
when the data is a mixture of qualitative and quantitative characteristics.
One has to "blow up" the data matrix by dummy zero-one values for the
qualitative data variables.
Summary

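- In practice we estimate $\Sigma_{XX}$, $\Sigma_{XY}$, $\Sigma_{YY}$ by the empirical covariances and use them to compute estimates $\ell_i$, $g_i$, $d_i$ for $\lambda_i$, $\gamma_i$, $\delta_i$ from the SVD of $\hat{\mathcal{K}} = S_{XX}^{-1/2} S_{XY} S_{YY}^{-1/2}$.

- The signs of the coefficients of the canonical variables tell us
the direction of the influence of these variables.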