3.3 Summary Statistics
This section focuses on the representation of basic summary statistics
(means, covariances and correlations) in matrix notation, since we often
apply linear transformations to data. The matrix notation allows us to
derive the corresponding characteristics of the transformed variables
immediately. The Mahalanobis transformation is a prominent example of such
a linear transformation.
Assume that we have observed $n$ realizations of a $p$-dimensional random
variable; we have a data matrix $\data{X}\,(n\times p)$:
\begin{displaymath}
\data{X}= \left(\begin{array}{ccc}
x_{11} & \cdots & x_{1p}\\
\vdots & & \vdots\\
x_{n1} & \cdots & x_{np}
\end{array}\right). \qquad (3.16)
\end{displaymath}
The rows $x_i=(x_{i1},\ldots,x_{ip})^{\top}\in\mathbb{R}^p$ denote the
$i$-th observation of a $p$-dimensional random variable
$X\in\mathbb{R}^p$.
The statistics that were briefly introduced in Sections 3.1
and 3.2 can be rewritten in matrix form as follows.
The ``center of gravity'' of the $n$ observations in $\mathbb{R}^p$ is
given by the vector $\overline x$ of the means $\overline x_j$ of the $p$
variables:
\begin{displaymath}
\overline x=\left(\begin{array}{c}
\overline x_1\\ \vdots\\ \overline x_p
\end{array}\right)=n^{-1}\data{X}^{\top}\undertilde 1_{n}. \qquad (3.17)
\end{displaymath}
The dispersion of the $n$ observations can be characterized by the
covariance matrix of the $p$ variables. The empirical covariances
$s_{X_iX_j}$ defined in (3.2) and (3.3)
are the elements of the following matrix:
\begin{displaymath}
\data{S}=n^{-1}\data{X}^{\top}\data{X}-\overline x\ \overline x^{\top}
=n^{-1}(\data{X}^{\top}\data{X}-n^{-1}\data{X}^{\top}\undertilde 1_{n}
\undertilde 1_{n}^{\top}\data{X}). \qquad (3.18)
\end{displaymath}
Note that this matrix is equivalently defined by
\begin{displaymath}
\data{S}=\frac{1}{n}\sum_{i=1}^n (x_i-\overline x)(x_i-\overline x)^{\top}.
\end{displaymath}
The covariance formula (3.18) can be rewritten as
\begin{displaymath}
\data{S}=n^{-1}\data{X}^{\top}\data{H}\data{X}
\end{displaymath}
with the centering matrix
\begin{displaymath}
\data{H} = \data{I}_{n} -n^{-1}\undertilde 1_{n}\undertilde 1_{n}^{\top}. \qquad (3.19)
\end{displaymath}
Note that the centering matrix is symmetric and idempotent. Indeed,
\begin{displaymath}
\data{H}^2=(\data{I}_{n}-n^{-1}\undertilde 1_{n}\undertilde 1_{n}^{\top})
(\data{I}_{n}-n^{-1}\undertilde 1_{n}\undertilde 1_{n}^{\top})
=\data{I}_{n}-2n^{-1}\undertilde 1_{n}\undertilde 1_{n}^{\top}
+n^{-2}\undertilde 1_{n}(\undertilde 1_{n}^{\top}\undertilde 1_{n})
\undertilde 1_{n}^{\top}
=\data{I}_{n}-n^{-1}\undertilde 1_{n}\undertilde 1_{n}^{\top}=\data{H},
\end{displaymath}
since $\undertilde 1_{n}^{\top}\undertilde 1_{n}=n$.
As a consequence, $\data{S}$ is positive semidefinite, i.e.
\begin{displaymath}
\data{S}\ge 0. \qquad (3.20)
\end{displaymath}
Indeed, for all $a\in\mathbb{R}^p$,
\begin{displaymath}
a^{\top}\data{S}a=n^{-1}a^{\top}\data{X}^{\top}\data{H}\data{X}a
=n^{-1}(\data{H}\data{X}a)^{\top}(\data{H}\data{X}a)=n^{-1}y^{\top}y\ge 0
\end{displaymath}
for $y=\data{H}\data{X}a$, using the symmetry and idempotency of
$\data{H}$. It is well known from the one-dimensional case that
$n^{-1}\sum_{i=1}^n(x_i-\overline x)^2$ as an estimate of the
variance exhibits a bias of the order $n^{-1}$ (Breiman, 1973).
In the multidimensional case, $\data{S}_u=\frac{n}{n-1}\data{S}$
is an unbiased estimate of the true covariance. (This will be shown in Example
4.15.)
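These matrix formulas translate directly into code. The following short
NumPy sketch checks (3.17)--(3.20) on a simulated data matrix; the data
and all variable names are illustrative only.
\begin{verbatim}
import numpy as np

np.random.seed(0)
n, p = 10, 4                               # sample size and dimension
X = np.random.randn(n, p)                  # data matrix, rows = observations

ones = np.ones(n)
xbar = X.T @ ones / n                      # mean vector, (3.17)
H = np.eye(n) - np.outer(ones, ones) / n   # centering matrix, (3.19)

S1 = X.T @ X / n - np.outer(xbar, xbar)    # covariance via (3.18)
S2 = X.T @ H @ X / n                       # covariance via the centering matrix

assert np.allclose(H @ H, H)               # H is idempotent
assert np.allclose(S1, S2)                 # the two formulas agree
assert np.linalg.eigvalsh(S2).min() >= -1e-12   # S >= 0, cf. (3.20)
\end{verbatim}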
The sample correlation coefficient between the $i$-th and $j$-th variables
is $r_{X_iX_j}$, see (3.8). If $\data{D}=\textrm{diag}(s_{X_iX_i})$,
then the correlation matrix is
\begin{displaymath}
\data{R} = \data{D}^{-1/2}\data{S}\data{D}^{-1/2}, \qquad (3.21)
\end{displaymath}
where $\data{D}^{-1/2}$ is a diagonal matrix with elements
$(s_{X_iX_i})^{-1/2}$ on its main diagonal.
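Continuing the sketch above, (3.21) amounts to rescaling $\data{S}$ by the
inverse standard deviations:
\begin{verbatim}
# Correlation matrix (3.21): rescale S by the inverse standard deviations.
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(S2)))
R = D_inv_sqrt @ S2 @ D_inv_sqrt
assert np.allclose(np.diag(R), 1.0)        # unit diagonal by construction
\end{verbatim}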
EXAMPLE 3.8
The empirical covariances are calculated for the pullover data set.
The vector of the means of the four variables in the dataset is
\begin{displaymath}
\overline x=(172.7,\ 104.6,\ 104.0,\ 93.8)^{\top}.
\end{displaymath}
The sample covariance matrix is
\begin{displaymath}
\data{S}=\left(\begin{array}{rrrr}
1037.2 & -80.0 & 1430.7 & 271.4\\
-80.0 & 219.8 & 92.1 & -91.6\\
1430.7 & 92.1 & 2624.0 & 210.3\\
271.4 & -91.6 & 210.3 & 177.4
\end{array}\right).
\end{displaymath}
The unbiased estimate of the variance ($n=10$) is equal to
\begin{displaymath}
\data{S}_u=\frac{10}{9}\,\data{S}=\left(\begin{array}{rrrr}
1152.5 & -88.9 & 1589.7 & 301.6\\
-88.9 & 244.3 & 102.3 & -101.8\\
1589.7 & 102.3 & 2915.6 & 233.7\\
301.6 & -101.8 & 233.7 & 197.1
\end{array}\right).
\end{displaymath}
The sample correlation matrix is
\begin{displaymath}
\data{R}=\left(\begin{array}{rrrr}
1 & -0.17 & 0.87 & 0.63\\
-0.17 & 1 & 0.12 & -0.46\\
0.87 & 0.12 & 1 & 0.31\\
0.63 & -0.46 & 0.31 & 1
\end{array}\right).
\end{displaymath}
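As a cross-check, a self-contained sketch (using the covariance values
printed above; illustrative only) recovers $\data{R}$ from $\data{S}$
via (3.21):
\begin{verbatim}
import numpy as np

# Covariance matrix of the pullover data as printed in Example 3.8.
S = np.array([[1037.2,  -80.0, 1430.7,  271.4],
              [ -80.0,  219.8,   92.1,  -91.6],
              [1430.7,   92.1, 2624.0,  210.3],
              [ 271.4,  -91.6,  210.3,  177.4]])

D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(S)))
print(np.round(D_inv_sqrt @ S @ D_inv_sqrt, 2))   # reproduces R up to rounding
\end{verbatim}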
Linear Transformation
In many practical applications we need to study linear transformations of the
original data. This motivates the question of how to calculate
summary statistics after such linear transformations.
Let $\data{A}$ be a ($q\times p$) matrix and consider the transformed
data matrix
\begin{displaymath}
\data{Y} = \data{X}\data{A}^{\top} = (y_{1}, \ldots, y_{n})^{\top}. \qquad (3.22)
\end{displaymath}
The row $y_i=(y_{i1},\ldots,y_{iq})^{\top}\in\mathbb{R}^q$ can be viewed
as the $i$-th observation of a $q$-dimensional random variable
$Y=\data{A}X$. In fact we have $y_i=\data{A}x_i$.
We immediately obtain the mean and the empirical covariance of the
variables (columns) forming the data matrix $\data{Y}$:
\begin{displaymath}
\overline y=n^{-1}\data{Y}^{\top}\undertilde 1_{n}
=n^{-1}\data{A}\data{X}^{\top}\undertilde 1_{n}
=\data{A}\overline x, \qquad (3.23)
\end{displaymath}
\begin{displaymath}
\data{S}_{\data{Y}}=n^{-1}\data{Y}^{\top}\data{H}\data{Y}
=n^{-1}\data{A}\data{X}^{\top}\data{H}\data{X}\data{A}^{\top}
=\data{A}\data{S}_{\data{X}}\data{A}^{\top}. \qquad (3.24)
\end{displaymath}
Note that if the linear transformation is
nonhomogeneous, i.e., $y_i=\data{A}x_i+b$ where $b$ is ($q\times 1$),
only (3.23) changes: $\overline y=\data{A}\overline x+b$.
The formulas (3.23) and (3.24) are useful in the
particular case of $q=1$, i.e., $y=\data{X}a \Leftrightarrow
y_i=a^{\top}x_i;\ i=1,\ldots,n$:
\begin{displaymath}
\overline y=a^{\top}\overline x, \qquad
\data{S}_y=a^{\top}\data{S}_{\data{X}}\,a.
\end{displaymath}
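A short NumPy sketch verifying (3.23) and (3.24) on simulated data (the
matrices are illustrative only):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 10, 4, 2
X = rng.normal(size=(n, p))            # illustrative data matrix
A = rng.normal(size=(q, p))            # an arbitrary (q x p) transformation

H = np.eye(n) - np.ones((n, n)) / n
xbar = X.mean(axis=0)
S_X = X.T @ H @ X / n

Y = X @ A.T                            # transformed data, (3.22)
ybar = Y.mean(axis=0)
S_Y = Y.T @ H @ Y / n

assert np.allclose(ybar, A @ xbar)         # (3.23)
assert np.allclose(S_Y, A @ S_X @ A.T)     # (3.24)
\end{verbatim}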
EXAMPLE 3.9
Suppose that $\data{X}$ is the pullover data set. The manager wants to
compute his mean expenses for advertisement ($X_3$) and sales assistants
($X_4$). Suppose that a sales assistant charges an hourly wage of 10 EUR.
Then the shop manager calculates the expenses $y$ as $y=X_3+10X_4$.
Formula (3.22) says that this is equivalent to defining the matrix
$\data{A}$ (here a row vector $a^{\top}$) as:
\begin{displaymath}
\data{A}=a^{\top}=(0\ \ 0\ \ 1\ \ 10).
\end{displaymath}
Using formulas (3.23) and (3.24), it is now computationally
very easy to obtain the sample mean $\overline y$ and the sample variance
$\data{S}_y$ of the overall expenses:
\begin{displaymath}
\overline y=a^{\top}\overline x
=(0\ \ 0\ \ 1\ \ 10)\,(172.7,\ 104.6,\ 104.0,\ 93.8)^{\top}=1042.0\
\textrm{EUR},
\end{displaymath}
\begin{displaymath}
\data{S}_y=a^{\top}\data{S}_u\,a
=2915.6+2\cdot 10\cdot 233.7+10^2\cdot 197.1=27299.6.
\end{displaymath}
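The same computation in NumPy, using the mean vector and unbiased
covariance matrix from Example 3.8 (a sketch; values as printed above):
\begin{verbatim}
import numpy as np

a = np.array([0, 0, 1, 10])                      # y = X3 + 10*X4
xbar = np.array([172.7, 104.6, 104.0, 93.8])     # means, Example 3.8
S_u = np.array([[1152.5,  -88.9, 1589.7,  301.6],
                [ -88.9,  244.3,  102.3, -101.8],
                [1589.7,  102.3, 2915.6,  233.7],
                [ 301.6, -101.8,  233.7,  197.1]])

print(a @ xbar)      # 1042.0  (mean overall expenses)
print(a @ S_u @ a)   # 27299.6 (variance of the overall expenses)
\end{verbatim}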
Mahalanobis Transformation
A special case of this linear transformation is
\begin{displaymath}
z_i=\data{S}^{-1/2}(x_i-\overline x), \quad i=1,\ldots ,n. \qquad (3.25)
\end{displaymath}
Note that for the transformed data matrix
$\data{Z}=(z_1,\ldots,z_n)^{\top}$,
\begin{displaymath}
\data{S}_{\data{Z}}=n^{-1}\data{Z}^{\top}\data{H}\data{Z}=\data{I}_{p}. \qquad (3.26)
\end{displaymath}
So the Mahalanobis transformation eliminates the correlation between the
variables and standardizes the variance of each variable.
If we apply (3.24) using $\data{A}=\data{S}^{-1/2}$, we obtain
the identity covariance matrix as indicated in (3.26).
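A sketch of the Mahalanobis transformation on simulated data: the
symmetric root $\data{S}^{-1/2}$ is computed here from the spectral
decomposition of $\data{S}$ (one standard choice of square root), and
the transformed data have identity covariance as in (3.26).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated data

xbar = X.mean(axis=0)
Xc = X - xbar                          # centered data, equal to H X
S = Xc.T @ Xc / n

# Symmetric root S^{-1/2} from the spectral decomposition of S.
w, V = np.linalg.eigh(S)
S_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

Z = Xc @ S_inv_sqrt                    # Mahalanobis transformation, (3.25)
assert np.allclose(Z.T @ Z / n, np.eye(p))   # identity covariance, (3.26)
\end{verbatim}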
Summary
$\ast$ The center of gravity of a data matrix is given by its mean vector
$\overline x=n^{-1}\data{X}^{\top}\undertilde 1_{n}$.
$\ast$ The dispersion of the observations in a data matrix is given by the
empirical covariance matrix $\data{S}=n^{-1}\data{X}^{\top}\data{H}\data{X}$.
$\ast$ The empirical correlation matrix is given by
$\data{R}=\data{D}^{-1/2}\data{S}\data{D}^{-1/2}$.
$\ast$ A linear transformation $\data{Y}=\data{X}\data{A}^{\top}$ of a data
matrix $\data{X}$ has mean $\data{A}\overline x$ and empirical covariance
$\data{A}\data{S}_{\data{X}}\data{A}^{\top}$.
$\ast$ The Mahalanobis transformation is a linear transformation
$z_i=\data{S}^{-1/2}(x_i-\overline x)$ which gives a standardized,
uncorrelated data matrix $\data{Z}$.