5.6 The Pearson Correlation Coefficient


twpearson()
    illustrates the Pearson correlation coefficient

This quantlet illustrates how dependence is reflected in the formula for the estimated Pearson correlation coefficient, and why it is necessary to ``normalize'' the data (subtract the means and divide by the standard deviations). More precisely, suppose we have a collection of points $ (x_i,y_i)$ from a bivariate standard normal distribution:

$\displaystyle f(x,y) = \frac{1}{2 \pi \sqrt{1 - \rho^2}}\,
\exp\left\{{-\,\frac{1}{2(1 - \rho ^2)}(x^2 - 2 \rho xy + y^2)}\right\}\,.$
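For readers who want to reproduce the simulation outside of XploRe, here is a minimal sketch in Python with NumPy (the quantlet itself is written in XploRe; the variable names, the random seed, and the values n = 100 and rho = 0 are illustrative assumptions matching the quantlet's defaults described below):

  # Draw n points (x_i, y_i) from a bivariate standard normal
  # distribution with correlation parameter rho.
  import numpy as np

  rng = np.random.default_rng(seed=0)
  n, rho = 100, 0.0                  # illustrative default specifications
  cov = np.array([[1.0, rho],
                  [rho, 1.0]])       # covariance matrix; equals the
                                     # correlation matrix here, since
                                     # both variances are 1
  x, y = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n).T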

The parameter $ \rho$ is the correlation coefficient and is usually unknown. When that is the case, we must use an estimator for $ \rho$. A common choice is the Pearson correlation coefficient, $ \widehat{\rho}$, given by the following formula:

$\displaystyle \widehat{\rho} =
\frac{\frac{1}{n-1}\sum\limits^n_{i=1} (x_i - \bar{x})(y_i - \bar{y})}
{\sqrt{S_{xx}\,S_{yy}}}\,,
\quad \textrm{where} \quad
S_{xx} = \frac{\sum\limits_{i=1}^n (x_i - \bar{x})^2}{n-1},
\ S_{yy} = \frac{\sum\limits_{i=1}^n (y_i - \bar{y})^2}{n-1}\,.$
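As a further illustration (again in Python/NumPy, not the quantlet's own XploRe code), the formula above can be computed directly; the helper name pearson is an assumption made for this sketch:

  import numpy as np

  def pearson(x, y):
      # Pearson correlation coefficient, following the formula above.
      n = len(x)
      xbar, ybar = x.mean(), y.mean()
      s_xy = np.sum((x - xbar) * (y - ybar)) / (n - 1)   # sample covariance
      s_xx = np.sum((x - xbar) ** 2) / (n - 1)           # sample variance of x
      s_yy = np.sum((y - ybar) ** 2) / (n - 1)           # sample variance of y
      return s_xy / np.sqrt(s_xx * s_yy)

The result should agree, up to floating-point error, with NumPy's built-in estimator np.corrcoef(x, y)[0, 1].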

In this quantlet, the user can see why the above formula is preferable to two simpler formulas. To activate the quantlet, the user should type the following:

  twpearson()
After this, the user should see the following window:

[Figure: the input window with the simulation specifications]

These are the specifications for a simulation of the bivariate standard normal distribution described above. The correlation in the above window is the correlation coefficient $ \rho$, which can sometimes be used as a measure of dependence -- that is, two variables that are independent of each other will have a correlation of 0, while two variables that are perfectly linearly dependent will have a correlation of 1 or -1, depending on whether an increase in one variable is accompanied by an increase or a decrease in the other, respectively. After the user enters his/her specifications and clicks on OK, the following window should appear (this is for the default values of 100 data points and correlation 0):


[Figure: the Display window with the scatter plot of the simulated data]

[Figure: the Read Value window for shifting and rescaling the variables]

The upper half of the Display window is a scatter plot of the simulated data. On the computer screen, the black squares are the original data points $ (x_i,y_i)$, and the colored circles are the transformed data points $ (x_i^\prime, y_i^\prime)$.

At the beginning, as in the above display, the transformed data coincide with the original data (since no transformation has been applied yet), and only the outermost points are shown as colored circles around the black squares. Imagining a grid centered at the mean values $ \bar{x}^\prime$ and $ \bar{y}^\prime$ of the transformed data, the circles are red in the first and third quadrants and blue in the second and fourth quadrants, with the outermost points drawn larger than the others. The lower half of this window shows three formulas, which are explained below, as well as the sample means and standard deviations of the $ x$ and $ y$ variables. The shifts and scales of the $ x$ and $ y$ variables begin with the default values of 0 and 1, respectively.

As mentioned above, the Pearson correlation coefficient is an estimate of the true correlation coefficient $ \rho$, which is supplied here by the user -- for real data (i.e. not a simulation), the true correlation coefficient is unknown and has to be estimated.

The three formulas at the bottom of the Display window are three possible estimators of this correlation coefficient. Since these are estimates from a simulation, the numerical values of these estimators (i.e. the estimates) will most likely differ from the value entered in the first window, but they should not be too far off. In our example, the actual correlation is 0, while our estimates are -0.0029, 0.0023, and 0.0022, depending on which formula we use.

Intuitively, if the two variables are independent of each other and the data are centered around the origin (as they are in the diagram), we would expect the data points to be spread roughly evenly among the four quadrants. If there is dependence, we would expect the data to lie mainly in the first and third quadrants (for positive dependence) or in the second and fourth quadrants (for negative dependence). This can be captured by adding up the products of the two coordinates of each point:

$\displaystyle \sum^n_{i=1} x_i y_i\,.$

That is, if the data lie mainly in the first and third quadrants, the above sum will be a large positive number (positive correlation), and if the data lie mainly in the second and fourth quadrants, it will be a large negative number (negative correlation). If the two variables are independent, there will be about as many data points in the first and third quadrants as in the second and fourth, the products will largely cancel each other out, and the sum will stay relatively close to zero, as expected. The precise meanings of ``large'' and ``close to zero'' depend on various properties of the data and cannot be pinned down exactly -- for the moment, it is best to leave them vague.
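To make this concrete, the following sketch (in the same Python/NumPy setting as above; the values 0.9, -0.9, and 0 for rho are illustrative) computes the raw product sum for simulated data whose population means are already zero:

  import numpy as np

  rng = np.random.default_rng(seed=0)

  def raw_sum(rho, n=100):
      # Sum of the products x_i * y_i for one simulated sample.
      cov = [[1.0, rho], [rho, 1.0]]
      x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
      return np.sum(x * y)

  # Large positive for positive dependence, large negative for
  # negative dependence, near zero for independent variables.
  print(raw_sum(0.9), raw_sum(-0.9), raw_sum(0.0))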

However, this reasoning assumes that the data are centered at the origin. This is usually not the case, so we transform the data by subtracting the mean $ x$-value from the $ x$-coordinate of each data point, and doing the same for the $ y$-coordinate. The transformed data are then centered at the origin, and the above formula is modified to the following:

$\displaystyle \sum\limits^n_{i=1} (x_i - \bar{x})(y_i -\bar{y}) \quad
\textrm{where} \quad
\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \
\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\,.$
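A quick check (a Python/NumPy sketch under the same assumptions as before; the shift amounts 5 and -3 and the factor 12 are arbitrary choices) shows that this centered sum no longer depends on where the data are located, though it still depends on their scale:

  import numpy as np

  def centered_sum(x, y):
      # Sum of the products of the mean-centered coordinates.
      return np.sum((x - x.mean()) * (y - y.mean()))

  rng = np.random.default_rng(seed=0)
  x, y = rng.multivariate_normal([0.0, 0.0],
                                 [[1.0, 0.5], [0.5, 1.0]], size=100).T

  # Shifting the data leaves the centered sum unchanged ...
  print(np.isclose(centered_sum(x, y), centered_sum(x + 5.0, y - 3.0)))
  # ... but rescaling changes it (e.g. feet -> inches), as discussed
  # next; the ratio is 12, up to floating-point error:
  print(centered_sum(12.0 * x, y) / centered_sum(x, y))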

This formula is still not very reliable, however, since it depends on the scale of the variables. For example, if the $ x$-variable is a length, we will get different results if we measure $ x$ in inches than if we measure it in feet. We can correct this by replacing the sum with the sample covariance (the sum divided by $ n-1$) and dividing it by the square root of the product of the sample variances:

$\displaystyle \frac{\frac{1}{n-1}\sum\limits^n_{i=1} (x_i - \bar{x})(y_i -\bar{y})}
{\sqrt{S_{xx}\,S_{yy}}}\,,
\quad \textrm{where} \quad
S_{xx} = \frac{\sum\limits^n_{i=1} (x_i - \bar{x})^2}{n-1},
\ S_{yy} = \frac{\sum\limits^n_{i=1} (y_i - \bar{y})^2}{n-1}\,.$

The above formula is now ``normalized'' in the sense that we have subtracted the means from the data and divided by the standard deviations. Thus, we have a unit-free measure of dependence, which lies between -1 and 1.

The user can verify for himself/herself that this last formula gives the same answer even as the data are shifted and rescaled. In the Read Value window above, the user can indicate how he/she would like to shift and/or rescale the $ x$ or $ y$ variables. In doing so, the user will see that the first two sums (in the bottom half of the Display window) change, while the third always remains the same. Thus, the third formula -- the one used for the Pearson correlation coefficient -- is the superior estimator.
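The same experiment can be mimicked in the Python/NumPy setting used above (the shift of 2 and the positive scale factor of 10 are arbitrary choices standing in for the Read Value inputs):

  import numpy as np

  rng = np.random.default_rng(seed=0)
  x, y = rng.multivariate_normal([0.0, 0.0],
                                 [[1.0, 0.5], [0.5, 1.0]], size=100).T

  def raw_sum(x, y):        # first formula: sum of products
      return np.sum(x * y)

  def centered_sum(x, y):   # second formula: centered sum of products
      return np.sum((x - x.mean()) * (y - y.mean()))

  def pearson(x, y):        # third formula: the Pearson estimator
      # (the 1/(n-1) factors cancel between numerator and denominator)
      return centered_sum(x, y) / np.sqrt(
          np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

  xs, ys = x + 2.0, 10.0 * y    # shift x, rescale y
  for f in (raw_sum, centered_sum, pearson):
      print(f.__name__, f(x, y), f(xs, ys))
  # The first two sums change under the transformation;
  # only the Pearson estimator remains the same.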