This quantlet illustrates how dependence is reflected in the
formula for the estimated Pearson correlation coefficient,
and why it is necessary to ``normalize'' the data (subtract the means
and divide by the standard deviations). More precisely, suppose
we have a collection of points $(X_1, Y_1), \ldots, (X_n, Y_n)$ from a bivariate standard
normal distribution with correlation $\rho$:
\[ \begin{pmatrix} X_i \\ Y_i \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), \qquad i = 1, \ldots, n. \]
The Pearson correlation coefficient estimates $\rho$ by
\[ r_{XY} = \frac{1}{n} \sum_{i=1}^{n} \frac{(X_i - \bar{X})(Y_i - \bar{Y})}{s_X \, s_Y}, \]
where $\bar{X}$, $\bar{Y}$ are the sample means and $s_X$, $s_Y$ the sample standard deviations.
In this quantlet, the user can see why the above formula is preferable to a couple of simpler formulas. To activate this quantlet, the user should type in the following:
twpearson()
After this, the user should see the following window:
These are specifications for a simulation of the bivariate standard
normal distribution described above. The correlation in
the above window is the correlation coefficient $\rho$, which can
sometimes be used as a measure of dependence: two
variables that are independent of each other will have a correlation
of 0, while two variables that are totally dependent on each other
will have a correlation of 1 or -1, depending on whether an increase in
one variable is accompanied by an increase or a decrease in the other,
respectively.
respectively. After the user enters his/her specifications and
clicks on OK, the following window should appear (this is for
the default values of 100 data points, 0 correlation):
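For readers who would like to reproduce such a simulation outside the quantlet, the following is a minimal Python sketch (not the quantlet's own code; the seed and variable names are our own choices) matching the defaults above of 100 data points and correlation 0:

import numpy as np

# Simulation matching the defaults above: n = 100 points, correlation rho = 0.
n, rho = 100, 0.0
mean = [0.0, 0.0]
cov = [[1.0, rho],
       [rho, 1.0]]  # unit variances, so a "standard" bivariate normal

rng = np.random.default_rng(seed=1)  # arbitrary seed for reproducibility
x, y = rng.multivariate_normal(mean, cov, size=n).T

print("sample means:", x.mean(), y.mean())    # both close to 0
print("sample std devs:", x.std(), y.std())   # both close to 1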
The upper half of the Display window is a scatter plot
of the simulated data. On the computer screen, the black squares
are the original data points $(X_i, Y_i)$, and the colored circles
will become the transformed (shifted and/or rescaled) data points.
At the beginning, as in the above display, the transformed data are
the same as the original data (since no transformation has been
indicated yet), and only the outermost points are shown as colored
circles around the black squares. Imagining a grid on the computer
screen formed around the central $X$ and $Y$ values (i.e. the sample
means $\bar{X}$ and $\bar{Y}$), the circles will be red in the first
and third quadrants and blue in the second and fourth quadrants, with
the outermost points drawn larger than the others. In the lower half
of this window, there are three formulas, which are explained below,
as well as the sample means and standard deviations of the $X$ and
$Y$ variables.
The shifts and scales of the $X$ and $Y$ variables begin with
the default values of 0 and 1, respectively.
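As a rough illustration of the quadrant coloring described above (this is only a sketch of the idea in Python, not the quantlet's own plotting code), the following classifies each simulated point relative to the sample means:

import numpy as np

rng = np.random.default_rng(seed=1)
x, y = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], size=100).T

# The grid is centered at the sample means (xbar, ybar), not at the origin.
xbar, ybar = x.mean(), y.mean()

# Centered products are positive in quadrants I and III (red circles)
# and negative in quadrants II and IV (blue circles).
in_I_or_III = (x - xbar) * (y - ybar) > 0
print(in_I_or_III.sum(), "points in quadrants I/III,",
      (~in_I_or_III).sum(), "points in quadrants II/IV")

With zero correlation the two counts should be roughly equal; with strong positive correlation most points would fall in quadrants I and III.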
As mentioned above, the Pearson correlation coefficient is an
estimate of the true correlation coefficient $\rho$, here given
by the user. For real data (i.e. not a simulation), we would not
know the true correlation coefficient, and would have to estimate
it.
The three formulas at the bottom of the Display window are three possible estimators of this correlation coefficient. Since these are estimates from a simulation, the numerical values of these estimators (i.e. the estimates) will most likely differ from the actual value entered in the first window, but they should not be too far off. In our example, the actual correlation is 0, but our estimates are -0.0029, 0.0023, and 0.0022, depending on which formula we use.
Intuitively, if the two variables are independent of each other and the data are centered around the origin (as they are in the diagram), we would expect the data points to be spread relatively evenly among the four quadrants. If there is dependence, we would expect the data to lie mainly in the first and third quadrants (for positive dependence), or in the second and fourth quadrants (for negative dependence). This can be expressed by adding up the products of the two coordinates of each point:
\[ \frac{1}{n} \sum_{i=1}^{n} X_i Y_i. \]
However, this is under the assumption that the data are centered
at the origin. This is usually not the case, so we transform
the data by subtracting the mean $\bar{X}$ from the $X$-value
of each data point, and do the same for the $Y$-values.
Thus, the transformed data are now centered at the origin, and
the above formula is modified to the following:
\[ \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}). \]
Finally, this centered sum still depends on the units of the two variables, so we also divide by the sample standard deviations $s_X$ and $s_Y$, which gives the Pearson correlation coefficient from the introduction:
\[ r_{XY} = \frac{1}{n} \sum_{i=1}^{n} \frac{(X_i - \bar{X})(Y_i - \bar{Y})}{s_X \, s_Y}. \]
This formula is now ``normalized'' in the sense that we have subtracted the means from the data and divided by the standard deviations. Thus, we have a unitless measure of dependence, which lies between -1 and 1.
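To make the three formulas concrete, here is a small Python sketch that computes all three sums for a simulated sample; the 1/n convention and the function name three_estimators are our own choices for this illustration, not necessarily the quantlet's exact display conventions:

import numpy as np

def three_estimators(x, y):
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sx, sy = x.std(), y.std()                        # standard deviations (1/n convention)
    raw = np.sum(x * y) / n                          # first formula: raw products
    centered = np.sum((x - xbar) * (y - ybar)) / n   # second formula: means subtracted
    pearson = centered / (sx * sy)                   # third formula: fully normalized
    return raw, centered, pearson

rng = np.random.default_rng(seed=1)
x, y = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], size=100).T
print(three_estimators(x, y))   # all three close to the true correlation of 0

Since the simulated data are already approximately standardized (means near 0, standard deviations near 1), all three estimates come out close to one another, just as with the three values quoted above.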
The user can prove to himself/herself that this last formula
will give the same answer, even as the data are shifted and
rescaled. In the Read Value window above, the user can
indicate how he/she would like to shift and/or rescale the
$X$ or $Y$ variables. Through this, the user will see that
the first two sums (in the bottom half of the Display
window) change, while the third always remains the same.
Thus, the third formula (the one used for the Pearson
correlation coefficient) is the superior estimator.
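The same check can be made numerically. In the sketch below (again Python, not the quantlet's own code; the shift of 10, the scale of 3, and the correlation of 0.5 are arbitrary example values chosen so the differences are easy to see), only the $X$-variable is shifted and rescaled, and the first two sums change while the third does not:

import numpy as np

def pearson_r(x, y):
    # third formula: subtract the means and divide by the standard deviations
    return np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())

rng = np.random.default_rng(seed=1)
x, y = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=100).T

x_new = 10 + 3 * x   # shift by 10, rescale by 3

print(np.mean(x * y), np.mean(x_new * y))                   # first sum: changes
print(np.mean((x - x.mean()) * (y - y.mean())),
      np.mean((x_new - x_new.mean()) * (y - y.mean())))     # second sum: changes
print(pearson_r(x, y), pearson_r(x_new, y))                 # third sum: unchanged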