14.2 Canonical Correlation in Practice

In practice we have to estimate the covariance matrices $\Sigma _{XX}$, $\Sigma_{XY}$ and $\Sigma_{YY}$. Let us apply the canonical correlation analysis to the car marks data (see Table B.7). In the context of this data set one is interested in relating price variables with variables such as sportiness, safety, etc. In particular, we would like to investigate the relation between the two variables non-depreciation of value and price of the car and all other variables.

EXAMPLE 14.1   We perform the canonical correlation analysis on the data matrices $\data{X}$ and $\data{Y}$ that correspond to the sets of variables $\{$Price, Value Stability$\}$ and $\{$Economy, Service, Design, Sporty car, Safety, Easy handling$\}$, respectively. The estimated covariance matrix $\data{S}$ of all eight variables is partitioned into the blocks

\begin{displaymath}
\data{S}=
\left(
\begin{array}{cc}
\data{S}_{XX}&\data{S}_{XY}\\
\data{S}_{YX}&\data{S}_{YY}
\end{array}
\right),
\end{displaymath}

where $\data{S}_{XX}$ is the $(2\times 2)$ covariance matrix of Price and Value Stability and $\data{S}_{YY}$ is the $(6\times 6)$ covariance matrix of the remaining variables. The estimated variance of Price is $1.41$ and its covariance with Value Stability is $-1.11$.



It is interesting to see that value stability and price have a negative covariance. This makes sense since highly priced vehicles tend to lose their market value at a faster pace than medium-priced vehicles.

Now we estimate $\data{K} = \Sigma ^{-1/2}_{XX}\,\Sigma_{XY}\,
\Sigma ^{-1/2}_{YY}$ by

\begin{displaymath}\widehat{\data{K}} = \data{S}^{-1/2}_{XX}\;\data{S}_{XY}\;
\data{S}^{-1/2}_{YY}\end{displaymath}

and perform a singular value decomposition of $\widehat{\data{K}}$:

\begin{displaymath}\widehat{\data{K}}=\data{G} \data{L}\data{D}^{\top}
= (g_{1},g_{2})\mathop{\hbox{diag}}(\ell^{1/2}_1,\ell^{1/2}_2)\,(d_{1},d_{2})^{\top}\end{displaymath}

where the $\ell _i$'s are the eigenvalues of $\widehat{\data{K}}\widehat{\data{K}}^{\top}$ and $\widehat{\data{K}}^{\top}\widehat{\data{K}}$ with $\mathop{\rm {rank}}(\widehat{\data{K}})=2$, and $g_{i}$ and $d_{i}$ are the eigenvectors of $\widehat{\data{K}}\widehat{\data{K}}^{\top}$ and $\widehat{\data{K}}^{\top}\widehat{\data{K}}$, respectively. The canonical correlation coefficients are

\begin{displaymath}r_1=\ell^{1/2}_1=0.98,\quad r_2=\ell^{1/2}_2=0.89.\end{displaymath}
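These computations are easy to reproduce numerically. The following Python sketch (using NumPy on a synthetic data set that merely stands in for the car marks data, so the numbers it produces are illustrative, not those of Example 14.1) estimates $\widehat{\data{K}}$ and reads the canonical correlation coefficients off its singular values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the car marks data: n observations,
# p = 2 "price" variables in X and q = 6 "quality" variables in Y.
n, p, q = 40, 2, 6
Z = rng.standard_normal((n, p + q)) @ rng.standard_normal((p + q, p + q))
X, Y = Z[:, :p], Z[:, p:]

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

S = np.cov(Z, rowvar=False)                    # empirical covariance of (X, Y)
S_XX, S_XY, S_YY = S[:p, :p], S[:p, p:], S[p:, p:]

K_hat = inv_sqrt(S_XX) @ S_XY @ inv_sqrt(S_YY)
G, L, Dt = np.linalg.svd(K_hat)                # K_hat = G diag(L) D^T

r = L                                          # canonical correlation coefficients
a1 = inv_sqrt(S_XX) @ G[:, 0]                  # first canonical vectors
b1 = inv_sqrt(S_YY) @ Dt[0, :]
eta1, phi1 = X @ a1, Y @ b1                    # first canonical variables

# corr(eta1, phi1) equals the largest singular value r[0]
print(r, np.corrcoef(eta1, phi1)[0, 1])
```

The empirical correlation of the projected variates $\widehat\eta_1$ and $\widehat\varphi_1$ reproduces the first singular value, which is a useful sanity check on the implementation.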

The high correlation between the first pair of canonical variables can be seen in Figure 14.1. The first canonical variables are

\begin{eqnarray*}
\widehat\eta_1&=&\widehat a_{1}^{\top}x = 1.602\;x_1+1.686\;x_2\\
\widehat\varphi_1&=&\widehat b_{1}^{\top}y = 0.568\;y_1+0.544\;y_2-0.012\;y_3
-0.096\;y_4-0.014\;y_5+0.915\;y_6.
\end{eqnarray*}



Note that the variables $y_1$ (economy), $y_2$ (service) and $y_6$ (easy handling) have positive coefficients on $\widehat\varphi_1$. The variables $y_3$ (design), $y_4$ (sporty car) and $y_5$ (safety) have a negative influence on $\widehat\varphi_1$.

The canonical variable $\eta_1$ may be interpreted as a price and value index. The canonical variable $\varphi_1$ is mainly formed from the qualitative variables economy, service and easy handling, with negative weights on design, safety and sportiness. It may therefore be interpreted as an appreciation of the value of the car. Sportiness has a negative effect on the price and value index, as do the design and the safety features.

Figure 14.1: The first canonical variables for the car marks data.   MVAcancarm.xpl
\includegraphics[width=1\defpicwidth]{cancarm2.ps}

Testing the canonical correlation coefficients

The hypothesis that the two sets of variables $\data{X}$ and $\data{Y}$ are uncorrelated may be tested (under normality assumptions) with Wilks' likelihood ratio statistic (Gibbins, 1985):

\begin{displaymath}
T^{2/n}=\left\vert \data{I} - S_{YY}^{-1}S_{YX}S_{XX}^{-1}S_{XY}\right\vert
=
\prod_{i=1}^k(1-l_i).
\end{displaymath}

This statistic unfortunately has a rather complicated distribution. Bartlett (1939) provides an approximation for large $n$:
\begin{displaymath}
-\{n-(p+q+3)/2\}\log\prod_{i=1}^k(1-l_i)\sim \chi^2_{pq}.
\end{displaymath} (14.14)

A test of the hypothesis that only $s$ of the canonical correlation coefficients are non-zero may be based (asymptotically) on the statistic

\begin{displaymath}
-\{n-(p+q+3)/2\}\log\prod_{i=s+1}^k(1-l_i)\sim \chi^2_{(p-s)(q-s)}.
\end{displaymath} (14.15)

EXAMPLE 14.2   Consider Example 14.1 again. There are $n=40$ persons who have rated the cars according to different categories with $p=2$ and $q=6$. The canonical correlation coefficients were found to be $r_1=0.98$ and $r_2=0.89$. Bartlett's statistic (14.14) is therefore

\begin{displaymath}
-\{40-(2+6+3)/2\}\log\{(1-0.98^2)(1-0.89^2)\}=165.59\sim\chi_{12}^2
\end{displaymath}

which is highly significant (the 99% quantile of the $ \chi_{12}^2$ is 26.23). The hypothesis of no correlation between the variables $\data{X}$ and $\data{Y}$ is therefore rejected.

Let us now test whether the second canonical correlation coefficient is different from zero. We use Bartlett's statistic (14.15) with $s=1$ and obtain

\begin{displaymath}
-\{40-(2+6+3)/2\}\log\{(1-0.89^2)\}=54.19\sim\chi_5^2
\end{displaymath}

which is again highly significant with the $\chi_5^2$ distribution.
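Both statistics are easy to check by hand or in a few lines of Python (standard library only; the values of $n$, $p$, $q$, $r_1$, $r_2$ are those of the example):

```python
from math import log

n, p, q = 40, 2, 6
r1, r2 = 0.98, 0.89                      # estimated canonical correlations
factor = n - (p + q + 3) / 2             # Bartlett's correction factor, here 34.5

# (14.14): test of no correlation at all, chi^2 with pq = 12 d.f.
T_all = -factor * (log(1 - r1**2) + log(1 - r2**2))

# (14.15) with s = 1: test that the second coefficient is zero,
# chi^2 with (p-1)(q-1) = 5 d.f.
T_second = -factor * log(1 - r2**2)

print(round(T_all, 2), round(T_second, 2))   # 165.59 54.19
```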

Canonical Correlation Analysis with qualitative data

The canonical correlation technique may also be applied to qualitative data. Consider for example the contingency table $\data{N}$ of the French baccalauréat data, given in Table B.8 in Appendix B.8. The CCA cannot be applied directly to this contingency table since the table does not correspond to the usual data matrix structure. We may wish, however, to explain the relationship between the $r$ row and $c$ column categories. It is possible to represent the data in a $(n\times (r+c))$ data matrix $\data{Z}=(\data{X},\data{Y})$ where $n$ is the total number of frequencies in the contingency table $\data{N}$ and $\data{X}$ and $\data{Y}$ are matrices of zero-one dummy variables. More precisely, let

\begin{displaymath}
x_{ki}=\left\{
\begin{array}{ll}
1&\quad \textrm{if the $k$-th individual belongs to the $i$-th row category}\\
0&\quad \textrm{otherwise}
\end{array} \right.
\end{displaymath}

and

\begin{displaymath}
y_{kj}=\left\{
\begin{array}{ll}
1&\quad \textrm{if the $k$-th individual belongs to the $j$-th column category}\\
0&\quad \textrm{otherwise}
\end{array} \right.
\end{displaymath}

where the indices range from $k=1,\dots,n$, $i=1,\dots,r$ and $j=1,\dots,c$. Denote the cell frequencies by $n_{ij}$ so that $\data{N}=(n_{ij})$ and note that

\begin{displaymath}
x_{(i)}^{\top} y_{(j)}=n_{ij},
\end{displaymath}

where $x_{(i)}$ ($y_{(j)}$) denotes the $i$-th ($j$-th) column of $\data{X}$ ($\data{Y}$).

EXAMPLE 14.3   Consider the following example where

\begin{displaymath}
\data{N}=
\left(
\begin{array}{cc}
3&2\\
1&4
\end{array}\right).
\end{displaymath}

The matrix $\data{X}$ is therefore

\begin{displaymath}
\data{X}=
\left(
\begin{array}{cc}
1&0\\
1&0\\
1&0\\
1&0\\
1&0\\
0&1\\
0&1\\
0&1\\
0&1\\
0&1\\
\end{array}\right),
\end{displaymath}

the matrix $\data{Y}$ is

\begin{displaymath}
\data{Y}=
\left(
\begin{array}{cc}
1&0\\
1&0\\
1&0\\
0&1\\
0&1\\
1&0\\
0&1\\
0&1\\
0&1\\
0&1\\
\end{array}\right)
\end{displaymath}

and the data matrix $\data{Z}$ is


\begin{displaymath}
\data{Z}= (\data{X},\data{Y})=
\left(
\begin{array}{cccc}
1&...
...&1&0&1\\
0&1&0&1\\
0&1&0&1\\
0&1&0&1\\
\end{array}\right).
\end{displaymath}

The element $n_{12}$ of $\data{N}$ may be obtained by multiplying the first column of $\data{X}$ with the second column of $\data{Y}$ to yield

\begin{displaymath}
x_{(1)}^{\top}y_{(2)}=2.
\end{displaymath}
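The construction of $\data{X}$ and $\data{Y}$ from $\data{N}$ can be sketched in a few lines of Python (illustrative code, not part of the original example):

```python
import numpy as np

N = np.array([[3, 2],
              [1, 4]])                   # contingency table of Example 14.3
n = int(N.sum())                         # total number of individuals, 10

rows, cols = [], []
for i in range(N.shape[0]):
    for j in range(N.shape[1]):
        rows += [i] * int(N[i, j])       # n_ij individuals fall in cell (i, j)
        cols += [j] * int(N[i, j])

X = np.eye(N.shape[0], dtype=int)[rows]  # zero-one dummy matrix for rows
Y = np.eye(N.shape[1], dtype=int)[cols]  # zero-one dummy matrix for columns

print(X.T @ Y)                           # recovers N, e.g. x_(1)^T y_(2) = 2
```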

The purpose is to find the canonical variables $\eta=a^{\top}x$ and $\varphi=b^{\top}y$ that are maximally correlated. Note, however, that $x$ has only one non-zero component and therefore an ``individual'' may be directly associated with its canonical variables or score $(a_i,b_j)$. There will be $n_{ij}$ points at each $(a_i,b_j)$ and the correlation represented by these points may serve as a measure of dependence between the rows and columns of $\data{N}$.

Let $\data{Z}=(\data{X},\data{Y})$ denote a data matrix constructed from a contingency table $\data{N}$. Similar to Chapter 12, define the vectors $c=(x_{1\bullet},\ldots,x_{r\bullet})^{\top}$ and $d=(x_{\bullet 1},\ldots,x_{\bullet c})^{\top}$ of row and column totals, where

\begin{displaymath}
x_{i\bullet}=\sum_{j=1}^c n_{ij}
\qquad\textrm{and}\qquad
x_{\bullet j}=\sum_{i=1}^r n_{ij},
\end{displaymath}

and define $\data{C}=\mathop{\hbox{diag}}(c)$ and $\data{D}=\mathop{\hbox{diag}}(d)$. Suppose that $x_{i\bullet}>0$ and $x_{\bullet j}>0$ for all $i$ and $j$. It is not hard to see that

\begin{displaymath}
n\data{S}=\left(\frac{n}{n-1}\right)\data{Z}^{\top}\data{H}\data{Z}
=\left(
\begin{array}{cc}
n\data{S}_{XX}&n\data{S}_{XY}\\
n\data{S}_{YX}&n\data{S}_{YY}
\end{array}\right)
\end{displaymath}


\begin{displaymath}
=
\left(\frac{n}{n-1}\right)
\left(
\begin{array}{cc}
\data{C}-n^{-1}cc^{\top}&\data{N}-\widehat{\data{N}}\\
\data{N}^{\top}-\widehat{\data{N}}^{\top}&\data{D}-n^{-1}dd^{\top}
\end{array}\right)
\end{displaymath}

where $\widehat{\data{N}}= cd^{\top}/n$ is the estimated value of $\data{N}$ under the assumption of independence of the row and column categories.

Note that

\begin{displaymath}
(n-1)S_{XX}1_r = \data{C}1_r -n^{-1}cc^{\top}1_r
= c-c(n^{-1}c^{\top}1_r) = c-c(n^{-1}n)=0
\end{displaymath}

and therefore $S_{XX}^{-1}$ does not exist. The same is true for $S_{YY}^{-1}$. One way out of this difficulty is to drop one column from both $\data{X}$ and $\data{Y}$, say the first column. Let $\bar{c}$ and $\bar{d}$ denote the vectors obtained by deleting the first component of $c$ and $d$.

Define $\bar{\data{C}}$, $\bar{\data{D}}$ and $\bar S_{XX}$, $\bar S_{YY}$, $\bar S_{XY}$ accordingly and obtain

\begin{displaymath}
\{(n-1) \bar S_{XX}\}^{-1}=\bar{\data{C}}^{-1}+n_{1 \bullet}^{-1}1_{r-1} 1_{r-1}^{\top}
\end{displaymath}


\begin{displaymath}
\{(n-1) \bar S_{YY}\}^{-1}=\bar{\data{D}}^{-1}+n_{\bullet 1}^{-1}1_{c-1} 1_{c-1}^{\top}
\end{displaymath}

so that (14.3) exists. The score associated with an individual contained in the first row (column) category of $\data{N}$ is 0.
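The singularity and its remedy can be verified numerically. A minimal sketch (with a hypothetical $2\times 3$ contingency table, writing $\data{M}$ for $\data{D}-n^{-1}dd^{\top}$, i.e. the $S_{YY}$ block up to its scalar prefactor):

```python
import numpy as np

# A hypothetical contingency table (not from the text), r = 2 rows, c = 3 columns
N = np.array([[3, 2, 1],
              [1, 4, 2]])
n = N.sum()
d = N.sum(axis=0)                       # column totals, the vector d

# M = D - n^{-1} d d^T annihilates the vector of ones, hence is singular
M = np.diag(d) - np.outer(d, d) / n
assert np.allclose(M @ np.ones(len(d)), 0)

# Dropping the first column restores invertibility; the inverse has the
# closed form  D-bar^{-1} + n_{.1}^{-1} 1 1^T  (a Sherman-Morrison identity)
d_bar = d[1:]
M_bar = np.diag(d_bar) - np.outer(d_bar, d_bar) / n
inv_closed = np.diag(1.0 / d_bar) + np.ones((2, 2)) / d[0]
assert np.allclose(M_bar @ inv_closed, np.eye(2))
```

The same check with row totals $c$ in place of $d$ verifies the formula for $\bar S_{XX}$.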

The technique described here for purely qualitative data may also be used when the data is a mixture of qualitative and quantitative characteristics. One has to ``blow up'' the data matrix by dummy zero-one values for the qualitative data variables.
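For instance (a minimal, hypothetical sketch: one quantitative variable and one qualitative variable with three categories):

```python
import numpy as np

price = np.array([21.5, 18.0, 25.3, 19.9])    # quantitative variable
segment = np.array([0, 2, 1, 0])              # qualitative variable, categories 0..2

dummies = np.eye(3, dtype=int)[segment]       # "blow up" into zero-one dummies
Z = np.column_stack((price, dummies))         # mixed data matrix

print(Z)
```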

Summary
$\ast$
In practice we estimate $\Sigma _{XX}$, $\Sigma_{XY}$, $\Sigma_{YY}$ by the empirical covariances and use them to compute estimates $\ell_{i}$, $g_{i}$, $d_{i}$ for $\lambda_{i}$, $\gamma_{i}$, $\delta_{i}$ from the SVD of $\widehat{\data{K}}=\data{S}_{XX}^{-1/2}
\data{S}_{XY}\data{S}_{YY}^{-1/2}$.
$\ast$
The signs of the coefficients of the canonical variables tell us the direction of the influence of these variables.