3.2 Correlation

The correlation between two variables $X$ and $Y$ is defined from the covariance as follows:
\begin{displaymath}
\rho _{XY} = \frac{\mathop{\mathit{Cov}}(X,Y) }{\sqrt {\mathop{\mathit{Var}}(X)\mathop{\mathit{Var}}(Y)}}\cdotp
\end{displaymath} (3.7)

The advantage of the correlation is that it is independent of the scale, i.e., changing the variables' scale of measurement does not change the value of the correlation. Therefore, the correlation is more useful as a measure of association between two random variables than the covariance. The empirical version of $\rho _{XY}$ is as follows:
\begin{displaymath}
r_{XY} = \frac{s_{XY} }{\sqrt {s_{XX}s_{YY}} }\cdotp
\end{displaymath} (3.8)
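As a quick numerical illustration of (3.8), the following Python sketch computes $r_{XY}$ from the empirical covariances. The data vectors are made up purely for illustration; note that the divisor used for the empirical moments cancels in (3.8), so the choice of $1/n$ or $1/(n-1)$ does not affect the result.

\begin{verbatim}
import numpy as np

# hypothetical observations of X and Y (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# empirical covariances s_XY, s_XX, s_YY (divisor n; it cancels in (3.8))
s_xy = np.mean((x - x.mean()) * (y - y.mean()))
s_xx = np.mean((x - x.mean()) ** 2)
s_yy = np.mean((y - y.mean()) ** 2)

r_xy = s_xy / np.sqrt(s_xx * s_yy)      # formula (3.8)
print(r_xy, np.corrcoef(x, y)[0, 1])    # both give the same value
\end{verbatim}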

The correlation is in absolute value always less than or equal to 1. It is zero if and only if the covariance is zero. For $p$-dimensional vectors $(X_1,\ldots ,X_p)^{\top}$ we have the theoretical correlation matrix

\begin{displaymath}
{\data{P}} = \left (
{\begin{array}{ccc}
\rho_{X_1X_1} & \ldots & \rho_{X_1X_p}\\
\vdots & & \vdots\\
\rho_{X_pX_1} & \ldots & \rho_{X_{p}X_{p}}
\end{array}} \right ) ,
\end{displaymath}

and its empirical version, the empirical correlation matrix which can be calculated from the observations,

\begin{displaymath}
{\data{R}}
= \left (
{\begin{array}{ccc}
r_{X_1X_1} & \ldots & r_{X_1X_p}\\
\vdots & & \vdots\\
r_{X_pX_1} & \ldots & r_{X_{p}X_{p}}
\end{array}} \right ) .
\end{displaymath}
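The empirical correlation matrix can be obtained directly from a data matrix. The following Python sketch uses a small hypothetical $(n\times p)$ data matrix for illustration:

\begin{verbatim}
import numpy as np

# hypothetical (n x p) data matrix: n = 6 observations, p = 3 variables
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.5, 0.7],
              [3.0, 3.2, 0.4],
              [4.0, 3.8, 0.9],
              [5.0, 5.1, 0.6],
              [6.0, 5.9, 1.0]])

R = np.corrcoef(X, rowvar=False)   # empirical correlation matrix (p x p)
print(np.round(R, 2))              # symmetric, with ones on the diagonal
\end{verbatim}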

EXAMPLE 3.3   We obtain the following correlation matrix for the genuine bank notes:
\begin{displaymath}
{\data{R}}_g=\left (
{\begin{array}{rrrrrr}
1.00 & 0.41 & 0.41 & \cdots & \cdots & 0.03\\
\vdots & & & & & \vdots\\
0.03 & -0.25 & -0.14 & -0.00 & -0.25 & 1.00
\end{array}} \right )\ ,
\end{displaymath} (3.9)

and for the counterfeit bank notes:
\begin{displaymath}
{\data{R}}_f=\left (
{\begin{array}{rrrrrr}
1.00 & 0.35 & 0.24 & \cdots & \cdots & 0.06\\
\vdots & & & & & \vdots\\
0.06 & -0.03 & 0.20 & 0.37 & -0.06 & 1.00
\end{array}} \right)\ .
\end{displaymath} (3.10)

As noted before for $\mathop{\mathit{Cov}}(X_4,X_5)$, the correlation between $X_4$ (distance of the frame to the lower border) and $X_5$ (distance of the frame to the upper border) is negative. This is natural, since the covariance and correlation always have the same sign (see also Exercise 3.17).

Why is the correlation an interesting statistic to study? It is related to independence of random variables, which we shall define more formally later on. For the moment we may think of independence as the fact that one variable has no influence on another.

THEOREM 3.1   If $X$ and $Y$ are independent, then $\rho(X,Y)=\Cov(X,Y)=0.$

In general, the converse is not true, as the following example shows.

EXAMPLE 3.4   Consider a standard normally-distributed random variable $X$ and a random variable $Y=X^2$, which is surely not independent of $X$. Here we have

\begin{displaymath}\Cov (X,Y) = E(XY) - E(X)E(Y) = E(X^3) = 0 \end{displaymath}

(because $E(X)=0$ and $E(X^2)=1$). Therefore $\rho (X,Y)=0$, as well. This example also shows that correlations and covariances measure only linear dependence. The quadratic dependence of $Y=X^2$ on $X$ is not reflected by these measures of dependence.
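A small Monte Carlo check of this example can be run with the following Python sketch (the sample size and seed are arbitrary):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # X ~ N(0,1)
y = x ** 2                         # Y = X^2, clearly dependent on X

# empirical correlation is close to zero although Y is a function of X
print(np.corrcoef(x, y)[0, 1])     # approximately 0
\end{verbatim}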

REMARK 3.1   For two normal random variables, the converse of Theorem 3.1 is true: zero covariance for two normally-distributed random variables implies independence. This will be shown later in Corollary 5.2.

Theorem 3.1 enables us to check for independence between the components of a bivariate normal random variable. That is, we can use the correlation and test whether it is zero. The distribution of $r_{XY}$ for an arbitrary $(X,Y)$ is unfortunately complicated. The distribution of $r_{XY}$ will be more accessible if $(X,Y)$ are jointly normal (see Chapter 5). If we transform the correlation by Fisher's $Z$-transformation,

\begin{displaymath}
W = \frac{1 }{2 }\log\left (\frac{1+r_{XY} }
{1-r_{XY} }\right ),
\end{displaymath} (3.11)

we obtain a variable that has a more accessible distribution. Under the hypothesis that $\rho = 0$, $W$ has an asymptotic normal distribution. Approximations of the expectation and variance of $W$ are given by the following:
\begin{displaymath}
\begin{array}{rcl}
E(W)&\approx& \frac{1}{2 }\log \left (\frac{1+\rho_{XY}}{1-\rho_{XY}}\right ),\\[2mm]
\mathop{\mathit{Var}}(W)&\approx& \frac{1}{(n-3)}\cdotp
\end{array} \end{displaymath} (3.12)

The distribution is given in Theorem 3.2.

THEOREM 3.2  
\begin{displaymath}
Z = \frac{W-E(W) }{\sqrt {\mathop{\mathit{Var}}(W)} } \stackrel{\cal L}{\longrightarrow}
N(0,1).
\end{displaymath} (3.13)

The symbol `` $\stackrel{\cal L}{\longrightarrow}$'' denotes convergence in distribution, which will be explained in more detail in Chapter 4.

Theorem 3.2 allows us to test hypotheses about the correlation. We can fix the level of significance $\alpha$ (the probability of rejecting a true hypothesis) and reject the hypothesis if the statistic $Z$, computed from the difference between the transformed hypothetical and observed correlations, exceeds the corresponding critical value of the normal distribution in absolute value. The following example illustrates the procedure.

EXAMPLE 3.5   Let us study the correlation between mileage ($X_2$) and weight ($X_8$) for the car data set (B.3) where $n=74$. We have $r_{X_2X_8}=-0.823$. Our conclusion from the boxplot in Figure 1.3 (``Japanese cars generally have better mileage than the others'') needs to be revised. From Figure 3.3 and $r_{X_{2}X_{8}}$, we can see that mileage is highly correlated with weight, and that the Japanese cars in the sample are in fact all lighter than the others!

If we want to know whether $\rho_{X_{2}X_{8}}$ is significantly different from $\rho _0=0$, we apply Fisher's $Z$-transform (3.11). This gives us

\begin{displaymath}w = \frac{1}{2} \log \left (\frac{1+r_{X_{2}X_{8}}}
{1-r_{X_{2}X_{8}}}\right ) = -1.166 \quad\textrm{ and } \quad
z = \frac{-1.166-0}{\sqrt {\frac{1 }{71 }} }= -9.825,\end{displaymath}

i.e., a highly significant value, so we reject the hypothesis that $\rho = 0$ (the 2.5% and 97.5% quantiles of the normal distribution are $-1.96$ and $1.96$, respectively). If we want to test the hypothesis that, say, $\rho _0=-0.75$, we obtain:

\begin{displaymath}z = \frac{-1.166-(-0.973)}{\sqrt {\frac{1}{71}} }= -1.627.\end{displaymath}

This value is not significant at the $\alpha = 0.05$ level, since $z$ lies between the critical values of the normal distribution at the 5% significance level (i.e., $-1.96 < z < 1.96$).

Figure 3.3: Mileage ($X_{2}$) vs. weight ($X_{8}$) of U.S. (star), European (plus signs) and Japanese (circle) cars.  MVAscacar.xpl
\includegraphics[width=1\defpicwidth]{scacar.ps}
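The calculations of Example 3.5 can be reproduced with a short Python sketch; fisher_z_test is a hypothetical helper (not taken from any particular library) implementing (3.11)--(3.13):

\begin{verbatim}
import numpy as np
from scipy.stats import norm

def fisher_z_test(r, n, rho0=0.0, alpha=0.05):
    """Test H0: rho = rho0 via Fisher's Z-transform (3.11)-(3.13)."""
    w  = 0.5 * np.log((1 + r) / (1 - r))          # observed W, (3.11)
    w0 = 0.5 * np.log((1 + rho0) / (1 - rho0))    # E(W) under H0, (3.12)
    z  = (w - w0) * np.sqrt(n - 3)                # Var(W) ~ 1/(n-3)
    crit = norm.ppf(1 - alpha / 2)
    return z, abs(z) >= crit

# car data of Example 3.5: r = -0.823, n = 74
print(fisher_z_test(-0.823, 74, rho0=0.0))    # z ~ -9.83, reject H0
print(fisher_z_test(-0.823, 74, rho0=-0.75))  # z ~ -1.63, do not reject
\end{verbatim}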

EXAMPLE 3.6   Let us consider again the pullovers data set from Example 3.2. Consider the correlation between the presence of the sales assistants ($X_{4}$) and the number of pullovers sold ($X_{1}$) (see Figure 3.4). Here we compute the correlation as

\begin{displaymath}r_{X_{1}X_{4}}=0.633.\end{displaymath}

Figure 3.4: Hours of sales assistants ($X_4$) vs. sales ($X_1$) of pullovers.  MVAscapull2.xpl
\includegraphics[width=1\defpicwidth]{scapull2.ps}

The $Z$-transform of this value is

\begin{displaymath}
w =\frac{1 }{2 }\log_e\left(
\frac{1+r_{X_{1}X_{4}}}{1-r_{X_{1}X_{4}}}\right)= 0.746.
\end{displaymath} (3.14)

The sample size is $n=10$, so for the hypothesis $\rho_{X_{1}X_{4}}=0$, the statistic to consider is:
\begin{displaymath}
z = \sqrt {7}(0.746-0) = 1.974
\end{displaymath} (3.15)

which is just statistically significant at the $5\%$ level (i.e., 1.974 is just a little larger than 1.96).

REMARK 3.2   The normalizing and variance stabilizing properties of $W$ are asymptotic. In addition, the use of $W$ in small samples (for $n\leq25$) is improved by Hotelling's transform (Hotelling, 1953):

\begin{displaymath}W^*=W-\frac{3W+\tanh(W)}{4(n-1)} \quad\textrm{ with } \quad
\mathop{\mathit{Var}}(W^*)=\frac{1}{n-1}.\end{displaymath}

The transformed variable $W^*$ is asymptotically normally distributed.

EXAMPLE 3.7   Applying the preceding remark to Example 3.6, we obtain $w^*=0.6663$ and $\sqrt{10-1}\, w^*=1.9989$. This value is significant at the $5\%$ level.
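A Python sketch of this small-sample correction, using the numbers of Examples 3.6 and 3.7 (hotelling_w is a hypothetical helper, not a library function):

\begin{verbatim}
import numpy as np

def hotelling_w(r, n):
    """Hotelling's small-sample correction of Fisher's Z-transform."""
    w = np.arctanh(r)                  # W = 0.5*log((1+r)/(1-r))
    w_star = w - (3 * w + np.tanh(w)) / (4 * (n - 1))
    z = np.sqrt(n - 1) * w_star        # Var(W*) = 1/(n-1)
    return w_star, z

# pullover data of Examples 3.6 and 3.7: r = 0.633, n = 10
print(hotelling_w(0.633, 10))   # w* ~ 0.666, z ~ 2.00 (cf. 0.6663, 1.9989)
\end{verbatim}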

REMARK 3.3   Note that Fisher's $Z$-transform is the inverse of the hyperbolic tangent function: $W=\tanh^{-1}(r_{XY})$; equivalently $r_{XY}= \tanh(W) = \frac{e^{2W}-1}{e^{2W}+1}$.

REMARK 3.4   Under the assumptions of normality of $X$ and $Y$, we may test their independence ($\rho_{XY}=0$) using the exact $t$-distribution of the statistic

\begin{displaymath}
T=r_{XY}\sqrt{\frac{n-2}{1-r^2_{XY}}}\stackrel{\rho_{XY}=0}{\sim} t_{n-2}.
\end{displaymath}

Setting the probability of a Type I error to $\alpha$, we reject the null hypothesis $\rho_{XY}=0$ if $\vert T\vert\geq t_{1-\alpha/2;n-2}$.
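A Python sketch of this exact test; corr_t_test is a hypothetical helper, applied here to the numbers of Example 3.6 ($r=0.633$, $n=10$) for illustration:

\begin{verbatim}
import numpy as np
from scipy.stats import t

def corr_t_test(r, n, alpha=0.05):
    """Exact t-test of H0: rho = 0 under bivariate normality."""
    T = r * np.sqrt((n - 2) / (1 - r ** 2))
    crit = t.ppf(1 - alpha / 2, df=n - 2)
    p_value = 2 * (1 - t.cdf(abs(T), df=n - 2))
    return T, p_value, abs(T) >= crit

# pullover data of Example 3.6: r = 0.633, n = 10
print(corr_t_test(0.633, 10))   # T ~ 2.31, p ~ 0.049, reject H0 at 5% level
\end{verbatim}

For the pullover data this exact test and the Fisher $Z$-test of Example 3.6 thus lead to the same borderline rejection at the $5\%$ level.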

Summary
$\ast$
The correlation is a standardized measure of dependence.
$\ast$
The absolute value of the correlation is always less than or equal to one.
$\ast$
Correlation measures only linear dependence.
$\ast$
There are nonlinear dependencies that have zero correlation.
$\ast$
Zero correlation does not imply independence.
$\ast$
Independence implies zero correlation.
$\ast$
Negative correlation corresponds to downward-sloping scatterplots.
$\ast$
Positive correlation corresponds to upward-sloping scatterplots.
$\ast$
Fisher's Z-transform helps us in testing hypotheses on correlation.
$\ast$
For small samples, Fisher's Z-transform can be improved by the transformation $W^*=W-\frac{3W+\tanh(W)}{4(n-1)}$.