In practice we have to estimate the covariance matrices $\Sigma_{XX}$, $\Sigma_{XY}$
and $\Sigma_{YY}$.
Let us apply the canonical correlation analysis to the car
marks data (see Table B.7).
In the context of this data set one is interested in relating price
variables with variables such as sportiness, safety, etc.
In particular, we would like to investigate
the relation between the two variables
non-depreciation of value and
price of the car and all other variables.
EXAMPLE 14.1
We perform the canonical correlation analysis on the data matrices $\mathcal{X}$ and $\mathcal{Y}$
that correspond to the sets of variables
{Price, Value Stability} and
{Economy, Service, Design, Sporty car, Safety, Easy handling},
respectively.
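The estimated covariance matrix $\mathcal{S}$ of $(\mathcal{X}, \mathcal{Y})$ has the block structure
$$\mathcal{S} = \begin{pmatrix} \mathcal{S}_{XX} & \mathcal{S}_{XY} \\ \mathcal{S}_{YX} & \mathcal{S}_{YY} \end{pmatrix},$$
where $\mathcal{S}_{XX}$ is the $(2 \times 2)$ block for price and value stability, $\mathcal{S}_{YY}$ is the $(6 \times 6)$ block for the remaining variables, and $\mathcal{S}_{XY} = \mathcal{S}_{YX}^{\top}$ is the $(2 \times 6)$ block of cross-covariances.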
It is interesting to see that value stability and price have
a negative covariance.
This makes sense since highly priced vehicles tend to lose their
market value at a faster pace than medium priced vehicles.
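Now we estimate
$\mathcal{K} = \Sigma_{XX}^{-1/2}\Sigma_{XY}\Sigma_{YY}^{-1/2}$
by
$\hat{\mathcal{K}} = \mathcal{S}_{XX}^{-1/2}\mathcal{S}_{XY}\mathcal{S}_{YY}^{-1/2}$
and perform a singular value decomposition of $\hat{\mathcal{K}}$:
$$\hat{\mathcal{K}} = (g_1, g_2)\,\operatorname{diag}(\ell_1^{1/2}, \ell_2^{1/2})\,(d_1, d_2)^{\top},$$
where the $\ell_i$'s are the non-zero eigenvalues of
$\hat{\mathcal{K}}\hat{\mathcal{K}}^{\top}$ and $\hat{\mathcal{K}}^{\top}\hat{\mathcal{K}}$
with $\operatorname{rank}(\hat{\mathcal{K}}) = 2$, and $g_i$ and $d_i$ are
the eigenvectors of $\hat{\mathcal{K}}\hat{\mathcal{K}}^{\top}$ and
$\hat{\mathcal{K}}^{\top}\hat{\mathcal{K}}$, respectively.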
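The canonical correlation coefficients are $r_1 = \ell_1^{1/2}$ and $r_2 = \ell_2^{1/2}$.
The high correlation of the first two canonical variables can be seen in Figure
14.1. The first canonical variables are
$\hat\eta_1 = \hat a_1^{\top} x$ and $\hat\varphi_1 = \hat b_1^{\top} y$, with
$\hat a_1 = \mathcal{S}_{XX}^{-1/2} g_1$ and $\hat b_1 = \mathcal{S}_{YY}^{-1/2} d_1$.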
Note that the variables $y_1$ (economy), $y_2$ (service) and
$y_6$ (easy handling) have positive coefficients on $\hat\varphi_1$.
The variables $y_3$ (design), $y_4$ (sporty car) and $y_5$ (safety) have
a negative influence on $\hat\varphi_1$.
The canonical variable $\hat\eta_1$ may be interpreted
as a price and value index.
The canonical variable $\hat\varphi_1$ is mainly formed from the qualitative
variables economy, service and
handling with negative weights on design, safety and sportiness.
These variables may therefore be interpreted as an appreciation
of the value of the car.
The sportiness has a negative effect on the price and value index, as do
the design and the safety features.
Figure 14.1:
The first canonical variables for the car marks data.
MVAcancarm.xpl
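The estimation steps of Example 14.1 can be sketched numerically as follows. This is a minimal NumPy sketch under stated assumptions, not the MVAcancarm quantlet: the helper names `inv_sqrt` and `cca` are hypothetical, and the data are simulated for illustration rather than taken from the car marks table.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root S^{-1/2} of a positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def cca(X, Y):
    """Canonical correlations r and coefficient matrices A, B (columns a_i, b_i)."""
    p = X.shape[1]
    # Empirical covariance of (X, Y). np.cov divides by n - 1; this rescales
    # a_i and b_i by a constant but leaves the canonical correlations unchanged.
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    Sxx, Sxy, Syy = S[:p, :p], S[:p, p:], S[p:, p:]
    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    K = Sxx_is @ Sxy @ Syy_is                        # estimate of K
    G, r, Dt = np.linalg.svd(K, full_matrices=False)
    A = Sxx_is @ G                                   # a_i = Sxx^{-1/2} g_i
    B = Syy_is @ Dt.T                                # b_i = Syy^{-1/2} d_i
    return r, A, B

# Simulated data (an assumption for illustration): one shared latent factor.
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 1))
X = np.hstack([latent + 0.3 * rng.normal(size=(n, 1)), rng.normal(size=(n, 1))])
Y = np.hstack([latent + 0.3 * rng.normal(size=(n, 1)), rng.normal(size=(n, 2))])

r, A, B = cca(X, Y)
eta1 = (X - X.mean(axis=0)) @ A[:, 0]   # first canonical variable of X
phi1 = (Y - Y.mean(axis=0)) @ B[:, 0]   # first canonical variable of Y
```

The sample correlation between `eta1` and `phi1` coincides with the largest singular value `r[0]`, which is the property that a plot such as Figure 14.1 visualises.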
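The hypothesis that the two sets of variables $X$ and $Y$
are uncorrelated may be tested (under normality assumptions)
with Wilks' likelihood ratio statistic (Gibbins; 1985):
$$T^{2/n} = \left|\mathcal{I} - \mathcal{S}_{YY}^{-1}\mathcal{S}_{YX}\mathcal{S}_{XX}^{-1}\mathcal{S}_{XY}\right| = \prod_{i=1}^{k}(1 - \ell_i),$$
where $\ell_i = r_i^2$ are the squared canonical correlation coefficients and $k$ is their number.
This statistic unfortunately has a rather complicated distribution.
Bartlett (1939) provides an approximation for large $n$:
$$-\{n - (p+q+3)/2\}\,\log\prod_{i=1}^{k}(1 - \ell_i) \sim \chi^2_{pq}. \tag{14.14}$$
A test of the hypothesis
that only $s$ of the canonical correlation coefficients
are non-zero may be based (asymptotically) on the statistic
$$-\{n - (p+q+3)/2\}\,\log\prod_{i=s+1}^{k}(1 - \ell_i) \sim \chi^2_{(p-s)(q-s)}. \tag{14.15}$$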
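Bartlett's approximation is easy to evaluate once the canonical correlation coefficients are available. The following minimal sketch uses only the Python standard library. The function name `bartlett_statistic` and the input values are hypothetical illustrations and are not the values of the car marks example.

```python
import math

def bartlett_statistic(r, n, p, q, s=0):
    """Bartlett's statistic for the hypothesis that only the first s
    canonical correlations are non-zero (s=0 tests no correlation at all).

    r : canonical correlation coefficients, sorted in decreasing order
    Returns the statistic and its asymptotic chi-square degrees of freedom.
    """
    factor = n - (p + q + 3) / 2
    # log of the product of (1 - l_i) with l_i = r_i^2, for i = s+1, ..., k
    log_prod = sum(math.log(1 - ri ** 2) for ri in r[s:])
    return -factor * log_prod, (p - s) * (q - s)

# Hypothetical inputs, chosen only for illustration.
r = [0.9, 0.5]
stat0, df0 = bartlett_statistic(r, n=50, p=2, q=3)        # all correlations zero
stat1, df1 = bartlett_statistic(r, n=50, p=2, q=3, s=1)   # only the first non-zero
```

Each statistic is then compared with a quantile of the $\chi^2$ distribution with the returned degrees of freedom, for instance from a table.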
EXAMPLE 14.2
Consider Example 14.1 again. There are $n$ persons that have
rated the cars according to different categories with $p = 2$
and $q = 6$. Denote the canonical correlation coefficients found there by $r_1$
and $r_2$.
Bartlett's statistic (14.14) is therefore
$$-\{n - (2+6+3)/2\}\,\log\{(1 - r_1^2)(1 - r_2^2)\} \sim \chi^2_{12},$$
which is highly significant (the 99% quantile of
the $\chi^2_{12}$ is 26.22).
The hypothesis of no correlation between the variables $X$
and $Y$ is therefore rejected.
Let us now test whether the second canonical correlation coefficient
is different from zero. We use Bartlett's statistic (14.15)
with $s = 1$ and obtain
$$-\{n - (2+6+3)/2\}\,\log(1 - r_2^2) \sim \chi^2_{5},$$
which is again highly significant
with the $\chi^2_5$ distribution.
The canonical correlation technique may also be applied to qualitative data.
Consider for example the contingency table of the French
baccalauréat data. The dataset is given in Table B.8
in Appendix B.8. The CCA cannot be applied directly
to this contingency table since the table does not correspond to the
usual data matrix structure. We may wish, however, to explain the relationship
between the row and column categories.
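It is possible to represent the data in an $(n \times (I+J))$ data
matrix $\mathcal{Z} = (\mathcal{X}, \mathcal{Y})$,
where $n$ is the total number of
frequencies in the contingency table $\mathcal{N}$ and $\mathcal{X}$ and
$\mathcal{Y}$ are matrices of zero-one dummy variables. More precisely,
let
$$x_{ki} = \begin{cases} 1 & \text{if the $k$-th individual belongs to the $i$-th row category,} \\ 0 & \text{otherwise,} \end{cases}$$
and
$$y_{kj} = \begin{cases} 1 & \text{if the $k$-th individual belongs to the $j$-th column category,} \\ 0 & \text{otherwise,} \end{cases}$$
where the indices range from $k = 1, \ldots, n$, $i = 1, \ldots, I$ and $j = 1, \ldots, J$.
Denote the cell frequencies by $n_{ij}$ so that $\mathcal{N} = (n_{ij})$
and note that
$$x_{(i)}^{\top} y_{(j)} = n_{ij},$$
where $x_{(i)}$ ($y_{(j)}$) denotes the $i$-th ($j$-th) column of
$\mathcal{X}$ ($\mathcal{Y}$).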
EXAMPLE 14.3
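Consider a contingency table $\mathcal{N} = (n_{ij})$.
The matrix $\mathcal{X}$ then contains, for each row category $i$, as many rows equal to the $i$-th unit vector of $\mathbb{R}^I$ as there are individuals in that category;
the matrix $\mathcal{Y}$ is built in the same way from the column categories and the unit vectors of $\mathbb{R}^J$,
and the data matrix is $\mathcal{Z} = (\mathcal{X}, \mathcal{Y})$.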
The element $n_{12}$ of $\mathcal{N}$
may be obtained by multiplying
the first column of $\mathcal{X}$
with the second column of $\mathcal{Y}$
to yield $x_{(1)}^{\top} y_{(2)} = n_{12}$.
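The construction of the dummy matrices can be checked numerically. The sketch below is a minimal illustration in NumPy. The function name `dummy_matrices` and the contingency table are hypothetical, and the table is not the one of Example 14.3.

```python
import numpy as np

def dummy_matrices(N):
    """Expand a contingency table N (I x J) into zero-one matrices X (n x I)
    and Y (n x J), one row per individual."""
    N = np.asarray(N, dtype=int)
    I, J = N.shape
    rows, cols = np.indices((I, J))
    # Repeat each (row category, column category) pair n_ij times.
    row_cat = np.repeat(rows.ravel(), N.ravel())
    col_cat = np.repeat(cols.ravel(), N.ravel())
    X = np.eye(I, dtype=int)[row_cat]
    Y = np.eye(J, dtype=int)[col_cat]
    return X, Y

# Hypothetical 2 x 3 contingency table with n = 9 individuals.
N = np.array([[2, 1, 0],
              [1, 3, 2]])
X, Y = dummy_matrices(N)
Z = np.hstack([X, Y])          # the n x (I + J) data matrix

n12 = X[:, 0] @ Y[:, 1]        # first column of X times second column of Y
```

Here `X.T @ Y` reproduces the whole table at once, and `n12` recovers the single cell $n_{12}$.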
The purpose is to find the canonical variables
$\eta = a^{\top}x$ and $\varphi = b^{\top}y$
that are maximally correlated. Note, however, that $x$ has only
one non-zero component and therefore an "individual" may be directly
associated with its canonical variables or score $(a_i, b_j)$. There will
be $n_{ij}$ points at each $(a_i, b_j)$ and the correlation represented
by these points may serve as a measure of dependence between the
rows and columns of $\mathcal{N}$.
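Let $\mathcal{Z} = (\mathcal{X}, \mathcal{Y})$
denote a data matrix constructed from
a contingency table $\mathcal{N}$. Similar to Chapter 12 define
$$c = \mathcal{N} 1_J, \quad c_i = n_{i\bullet} = \sum_{j=1}^{J} n_{ij}, \qquad d = \mathcal{N}^{\top} 1_I, \quad d_j = n_{\bullet j} = \sum_{i=1}^{I} n_{ij},$$
and define $\mathcal{C} = \operatorname{diag}(c)$ and $\mathcal{D} = \operatorname{diag}(d)$.
Suppose that $n_{i\bullet} > 0$ and $n_{\bullet j} > 0$
for all $i$ and $j$. It is not hard to see that
$$n\mathcal{S} = \mathcal{Z}^{\top}\mathcal{H}\mathcal{Z} = \mathcal{Z}^{\top}\mathcal{Z} - n\bar z \bar z^{\top}
= \begin{pmatrix} n\mathcal{S}_{XX} & n\mathcal{S}_{XY} \\ n\mathcal{S}_{YX} & n\mathcal{S}_{YY} \end{pmatrix}
= \begin{pmatrix} \mathcal{C} - n^{-1}cc^{\top} & \mathcal{N} - \hat{\mathcal{N}} \\ \mathcal{N}^{\top} - \hat{\mathcal{N}}^{\top} & \mathcal{D} - n^{-1}dd^{\top} \end{pmatrix},$$
where $\mathcal{H} = \mathcal{I}_n - n^{-1}1_n 1_n^{\top}$ is the centering matrix and
$\hat{\mathcal{N}} = cd^{\top}/n$
is the estimated value of $\mathcal{N}$
under the assumption of independence of the row and column categories.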
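Note that
$$n\mathcal{S}_{XX} 1_I = \mathcal{C} 1_I - n^{-1} c c^{\top} 1_I = c - c\,(n^{-1} n) = 0$$
and therefore $\mathcal{S}_{XX}^{-1}$ does not exist.
The same is true for $\mathcal{S}_{YY}^{-1}$. One way out of this difficulty
is to drop one column from both $\mathcal{X}$ and $\mathcal{Y}$, say the first
column.
Let $\bar c$ and $\bar d$ denote the vectors obtained by deleting the first
component of $c$ and $d$.
Define $\bar{\mathcal{C}}$, $\bar{\mathcal{D}}$
and $\bar{\mathcal{S}}_{XX}$, $\bar{\mathcal{S}}_{YY}$, $\bar{\mathcal{S}}_{XY}$
accordingly and obtain
$$(n\bar{\mathcal{S}}_{XX})^{-1} = \bar{\mathcal{C}}^{-1} + n_{1\bullet}^{-1}\, 1_{I-1} 1_{I-1}^{\top}, \qquad
(n\bar{\mathcal{S}}_{YY})^{-1} = \bar{\mathcal{D}}^{-1} + n_{\bullet 1}^{-1}\, 1_{J-1} 1_{J-1}^{\top},$$
so that (14.3) exists. The score associated with an
individual contained in the first row (column) category of $\mathcal{N}$ is 0.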
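Both the singularity of the full covariance block and the closed-form inverse after dropping the first category can be verified numerically. The following is a minimal NumPy sketch. The contingency table is hypothetical and all variable names are chosen for illustration. It assumes the scaling $n\mathcal{S}_{XX} = \mathcal{C} - n^{-1}cc^{\top}$, where $c$ holds the row totals of the table and $\mathcal{C} = \operatorname{diag}(c)$.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table with positive row and column totals.
N = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 2.0],
              [4.0, 0.0, 1.0]])
n = N.sum()
c, d = N.sum(axis=1), N.sum(axis=0)      # row totals and column totals

nSxx = np.diag(c) - np.outer(c, c) / n   # n * S_XX
nSxy = N - np.outer(c, d) / n            # n * S_XY = N - N_hat

# n * S_XX maps the vector of ones to zero, hence it is singular.
resid = nSxx @ np.ones(len(c))

# Drop the first row category and compare with the closed-form inverse.
c_bar = c[1:]
m = len(c_bar)
nSxx_bar = np.diag(c_bar) - np.outer(c_bar, c_bar) / n
closed_form = np.diag(1.0 / c_bar) + np.ones((m, m)) / c[0]
numerical = np.linalg.inv(nSxx_bar)
```

The closed form is an instance of the Sherman-Morrison formula applied to the rank-one update of the diagonal matrix of reduced row totals. The analogous check for the column categories uses `d` in place of `c`.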
The technique described here for purely qualitative data may also be used
when the data is a mixture of qualitative and quantitative characteristics.
One has to "blow up" the data matrix by dummy zero-one values for the
qualitative data variables.
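A data matrix with mixed variables can be blown up along the following lines. This is a minimal NumPy sketch: the function name `blow_up`, the variable names and the tiny data set are all hypothetical. Dropping the first dummy column follows the remedy described above for the singular covariance matrix.

```python
import numpy as np

def blow_up(quantitative, labels, drop_first=True):
    """Append zero-one dummy columns for a qualitative variable to the
    quantitative part of a data matrix."""
    labels = np.asarray(labels)
    categories, codes = np.unique(labels, return_inverse=True)
    dummies = np.eye(len(categories))[codes]
    if drop_first:
        # Omit the first category so the covariance matrix stays invertible.
        dummies = dummies[:, 1:]
    return np.column_stack([quantitative, dummies]), categories

# Hypothetical mixed data: one quantitative and one qualitative variable.
price = np.array([1.5, 2.0, 3.2, 2.7])
fuel = ["diesel", "petrol", "petrol", "electric"]

Z_mix, categories = blow_up(price, fuel)
```

The resulting matrix `Z_mix` has one quantitative column followed by one dummy column for each category except the first in sorted order, and can be passed to a canonical correlation routine like any other data matrix.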
Summary
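- In practice we estimate $\Sigma_{XX}$, $\Sigma_{XY}$, $\Sigma_{YY}$
  by the empirical covariances and use them to compute
  estimates $\ell_i$, $g_i$, $d_i$ for $\lambda_i$, $\gamma_i$, $\delta_i$
  from the SVD of
  $\hat{\mathcal{K}} = \mathcal{S}_{XX}^{-1/2}\mathcal{S}_{XY}\mathcal{S}_{YY}^{-1/2}$.
- The signs of the coefficients of the canonical variables tell us
  the direction of the influence of these variables.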