4.2 Moments and Characteristic Functions


Moments--Expectation and Covariance Matrix

If $X$ is a random vector with density $f(x)$, then the expectation of $X$ is

\begin{displaymath}
EX = \left ( \begin{array}{c} EX_1\\ \vdots\\ EX_p \end{array} \right )
= \left ( \begin{array}{c} \int x_1 f(x)\,dx\\ \vdots\\ \int x_p f(x)\,dx \end{array} \right )
= \mu.
\end{displaymath} (4.10)

Accordingly, the expectation of a matrix of random elements has to be understood component by component. The operation of forming expectations is linear:
\begin{displaymath}
E\left (\alpha X+\beta Y \right ) = \alpha EX +\beta EY.
\end{displaymath} (4.11)

If $A(q \times p)$ is a matrix of real numbers, we have:
\begin{displaymath}
E(AX) = A EX.
\end{displaymath} (4.12)

When $X$ and $Y$ are independent,
\begin{displaymath}
E(XY^{\top}) = EX EY^{\top}.
\end{displaymath} (4.13)

The matrix
\begin{displaymath}
\Var(X) = \Sigma =E(X-\mu )(X-\mu )^{\top}
\end{displaymath} (4.14)

is the (theoretical) covariance matrix. We write for a vector $X$ with mean vector $\mu$ and covariance matrix $\Sigma$,
\begin{displaymath}
X\sim (\mu ,\Sigma ).
\end{displaymath} (4.15)

The $ (p \times q) $ matrix
\begin{displaymath}
\Sigma_{XY} = \Cov(X,Y)=E(X-\mu )(Y-\nu )^{\top}
\end{displaymath} (4.16)

is the covariance matrix of $X\sim (\mu ,\Sigma_{XX})$ and $Y\sim
(\nu,\Sigma_{YY})$. Note that $ \Sigma_{XY} = \Sigma^{\top}_{YX}$ and that $Z= \left( {X \atop Y} \right)$ has covariance matrix $\Sigma_{ZZ} = \left( \begin{array}{cc} \Sigma_{XX} & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_{YY} \end{array} \right)$. From
\begin{displaymath}
\Cov(X,Y) = E(XY^{\top}) - \mu\nu^{\top}=E(XY^{\top}) - EX EY^{\top}
\end{displaymath} (4.17)

it follows that $\Cov(X,Y)=0$ in the case where $X$ and $Y$ are independent. We often say that $\mu = E(X)$ is the first order moment of $X$ and that $E(XX^{\top})$ provides the second order moments of $X$:
\begin{displaymath}
E(XX^{\top}) = \{ E(X_iX_j) \}, \textrm{ for } i=1,\ldots,p \textrm{ and } j=1,\ldots,p.
\end{displaymath} (4.18)


Properties of the Covariance Matrix $\Sigma=\Var(X)$


    $\displaystyle \Sigma=(\sigma_{X_{i}X_{j}}), \quad
\sigma_{X_{i}X_{j}} = \mathop{\mathit{Cov}}(X_i,X_j), \quad \sigma_{X_{i}X_{i}} = \mathop{\mathit{Var}}(X_i)$ (4.19)
    $\displaystyle \Sigma = E(XX^{\top}) - \mu \mu^{\top}$ (4.20)
    $\displaystyle \Sigma \ge 0$ (4.21)

Properties of Variances and Covariances


    $\displaystyle \mathop{\mathit{Var}}(a^{\top}X) = a^{\top}\!\Var(X)a = \sum_{i,j} a_ia_j \sigma_{X_{i}X_{j}}$ (4.22)
    $\displaystyle \Var(\data{A}X + b) = \data{A} \Var(X) \data{A}^{\top}$ (4.23)
    $\displaystyle \Cov(X + Y,Z) = \Cov(X,Z) + \Cov(Y,Z)$ (4.24)
    $\displaystyle \Var(X + Y) = \Var(X) + \Cov(X,Y) + \Cov(Y,X) + \Var(Y)$ (4.25)
    $\displaystyle \Cov(\data{A}X,\data{B}Y) = \data{A} \Cov(X,Y) \data{B}^{\top}.$ (4.26)
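These rules are easy to check by simulation. The following is a minimal sketch (assuming Python with NumPy and an arbitrary, hypothetical choice of $\mu$, $\Sigma$, $a$, $\data{A}$ and $b$) that compares the empirical variances of transformed samples with the right-hand sides of (4.22) and (4.23); with a large sample the two agree up to Monte Carlo error.

\begin{verbatim}
import numpy as np

# Monte Carlo check of (4.22) and (4.23) for an arbitrary choice of
# mu, Sigma, a, A and b (illustrative values only).
rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[ 1.0, 0.3, -0.2],
                  [ 0.3, 2.0,  0.5],
                  [-0.2, 0.5,  1.5]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are draws of X

a = np.array([0.5, -1.0, 2.0])
A = np.array([[1.0, 0.0,  1.0],
              [0.0, 2.0, -1.0]])
b = np.array([3.0, -1.0])

# (4.22): Var(a'X) = a' Sigma a
print(np.var(X @ a), a @ Sigma @ a)

# (4.23): Var(AX + b) = A Sigma A'
print(np.cov(X @ A.T + b, rowvar=False))
print(A @ Sigma @ A.T)
\end{verbatim}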

Let us compute these quantities for a specific joint density.

EXAMPLE 4.5   Consider the pdf of Example 4.1. The mean vector $\mu ={\mu _1\choose\mu _2}$ is

\begin{eqnarray*}
\mu_1 & = & \int\!\!\int x_1f(x_1,x_2)\,dx_1\,dx_2 =
\int ^1_0\int ^1_0 x_1\left(\frac{1}{2}\,x_1+\frac{3}{2}\,x_2\right)dx_1\,dx_2 =
\frac{1}{6}+\frac{3}{8}=\frac{4+9}{24}=\frac{13}{24}\,,\\
\mu_2 & = & \int\!\!\int x_2f(x_1,x_2)\,dx_1\,dx_2 =
\int ^1_0\int ^1_0 x_2\left(\frac{1}{2}\,x_1+\frac{3}{2}\,x_2\right)dx_1\,dx_2 =
\frac{1 }{8 }+\frac{1 }{2 }=\frac{1+4 }{8 }=\frac{5}{8 }\ \cdotp
\end{eqnarray*}



The elements of the covariance matrix are

\begin{eqnarray*}
\sigma_{X_{1}X_{1}} & = & EX^2_1 - \mu^2_1 \quad \textrm{ with }\quad
EX^2_1 = \int ^1_0\int ^1_0 x^2_1\left(\frac{1}{2}\,x_1+\frac{3}{2}\,x_2\right)dx_1\,dx_2
= \frac{1}{8}+\frac{1}{4}=\frac{3}{8}\,,\\
\sigma_{X_{2}X_{2}} & = & EX^2_2 - \mu^2_2 \quad \textrm{ with }\quad
EX^2_2 = \int ^1_0\int ^1_0 x^2_2\left(\frac{1}{2}\,x_1+\frac{3}{2}\,x_2\right)dx_1\,dx_2
= \frac{1}{12}+\frac{3}{8}=\frac{11}{24}\,,\\
\sigma_{X_{1}X_{2}} & = & E(X_1X_2) - \mu_1\mu_2 \quad \textrm{ with }\quad
E(X_1X_2) = \int ^1_0\int ^1_0 x_1x_2\left(\frac{1}{2}\,x_1+\frac{3}{2}\,x_2\right)dx_1\,dx_2
= \frac{1}{12}+ \frac{3}{4}
\left[ \frac{x^3_2}{3} \right]^1_0 = \frac{1}{3}.
\end{eqnarray*}



Hence the covariance matrix is

\begin{displaymath}\Sigma = \left( \begin{array}{rr} 0.0815 & -0.0052 \\
-0.0052 & 0.0677 \end{array} \right). \end{displaymath}
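The integrations above can also be carried out symbolically. The following sketch (assuming Python with SymPy, and taking the pdf of Example 4.1 to be $f(x_1,x_2)=\frac{1}{2}x_1+\frac{3}{2}x_2$ on the unit square, as used in the computations above) reproduces $\mu$ and $\Sigma$.

\begin{verbatim}
import sympy as sp

# Symbolic computation of the mean vector and covariance matrix of
# Example 4.5, assuming f(x1,x2) = x1/2 + 3*x2/2 on the unit square.
x1, x2 = sp.symbols('x1 x2')
f = sp.Rational(1, 2)*x1 + sp.Rational(3, 2)*x2

def E(g):
    return sp.integrate(sp.integrate(g*f, (x1, 0, 1)), (x2, 0, 1))

mu1, mu2 = E(x1), E(x2)
Sigma = sp.Matrix([[E(x1**2) - mu1**2,   E(x1*x2) - mu1*mu2],
                   [E(x1*x2) - mu1*mu2,  E(x2**2) - mu2**2]])
print(mu1, mu2)    # 13/24, 5/8
print(Sigma)       # [[47/576, -1/192], [-1/192, 13/192]]
\end{verbatim}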


Conditional Expectations

The conditional expectations are

\begin{displaymath}
E(X_2\mid x_1) = \int x_2f(x_2\mid x_1)\;dx_2
\quad\textrm{ and }\quad
E(X_1\mid x_2) = \int x_1f(x_1\mid x_2)\;dx_1.
\end{displaymath} (4.27)

$E(X_2\vert x_1)$ represents the location parameter of the conditional pdf of $X_2$ given that $X_1=x_1$. In the same way, we can define $\Var(X_2\vert X_1=x_1)$ as a measure of the dispersion of $X_2$ given that $X_1=x_1$. We have from (4.20) that

\begin{displaymath}\Var(X_2\vert X_1=x_1) = E(X_2\: X_2^{\top}\vert X_1=x_1) - E(X_2\vert X_1=x_1) \,
E(X_2^{\top}\vert X_1=x_1). \end{displaymath}

Using the conditional covariance matrix, the conditional correlations may be defined as:

\begin{displaymath}\rho_{X_{2}X_{3}\vert X_1=x_1} =
\frac{\Cov (X_{2}, X_{3}\vert X_1=x_1)}
{\sqrt{\Var (X_{2}\vert X_1=x_1)\, \Var (X_{3}\vert X_1=x_1)}}. \end{displaymath}

These conditional correlations are known as partial correlations between $X_{2}$ and $
X_{3}$, conditioned on $X_1$ being equal to $x_1$.

EXAMPLE 4.6   Consider the following pdf

\begin{displaymath}f(x_1,x_2,x_3)=\frac{2}{3}(x_1+x_2+x_3)\textrm{ where } 0< x_1,x_2,x_3< 1.\end{displaymath}

Note that the pdf is symmetric in $x_1,x_2$ and $x_3$ which facilitates the computations. For instance,

\begin{displaymath}
\begin{array}{lclcl}
f(x_1,x_2)&=&\frac{2}{3}\left(x_1+x_2+\frac{1}{2}\right), & & 0< x_1,x_2 < 1,\\ [2mm]
f(x_1)&=&\frac{2}{3}\left(x_1+1\right), & & 0< x_1 < 1,
\end{array}\end{displaymath}

and the other marginals are similar. We also have

\begin{eqnarray*}
f(x_1,x_2\vert x_3)=\frac{x_1+x_2+x_3}{x_3+1}, & & 0< x_1,x_2< 1,\\
f(x_1\vert x_3)=\frac{x_1+x_3+\frac{1}{2}}{x_3+1}, & & 0< x_1< 1.
\end{eqnarray*}



It is easy to compute the following moments:

$ E(X_i)=\frac{5}{9};\quad E(X_i^2)=\frac{7}{18};\quad E(X_iX_j)=\frac{11}{36}\ \ \left(i\neq j\right);$\\ [3mm]
$ E(X_1\vert X_3=x_3)=E(X_2\vert X_3=x_3)=\frac{1}{12}\left(\frac{6x_3+7}{x_3+1}\right);\quad
E(X_1^2\vert X_3=x_3)=E(X_2^2\vert X_3=x_3)=\frac{1}{12}\left(\frac{4x_3+5}{x_3+1}\right);$\\ [3mm]
$ E(X_1X_2\vert X_3=x_3)=\frac{1}{12}\left(\frac{3x_3+4}{x_3+1}\right).$

Note that the conditional means of $X_1$ and of $X_2$, given $X_3=x_3$, are not linear in $x_3$. From these moments we obtain:

\begin{displaymath}\Sigma =\left(\begin{array}{rrr}
\frac{13}{162}&-\frac{1}{324}&-\frac{1}{324}\\ [2mm]
-\frac{1}{324}&\frac{13}{162}&-\frac{1}{324}\\ [2mm]
-\frac{1}{324}&-\frac{1}{324}&\frac{13}{162}
\end{array}\right),\quad \textrm{in particular}\ \rho_{X_1X_2}=-\frac{1}{26} \approx -0.0385.
\end{displaymath}

The conditional covariance matrix of $X_1$ and $X_2$, given $X_3=x_3$ is

\begin{displaymath}\Var\left({X_1 \choose X_2}\mid X_3=x_3\right)=
\left(\begin{array}{cc}
\frac{12x_3^2+24x_3+11}{144(x_3+1)^2} & \frac{-1}{144(x_3+1)^2}\\ [3mm]
\frac{-1}{144(x_3+1)^2} & \frac{12x_3^2+24x_3+11}{144(x_3+1)^2}
\end{array}\right).
\end{displaymath}

In particular, the partial correlation between $X_1$ and $X_2$, given that $X_3$ is fixed at $x_3$, is given by $\rho _{X_1X_2\vert X_3=x_3}=-\frac{1}{12x_3^2+24x_3+11}$ which ranges from $-0.0909$ to $-0.0213$ when $x_3$ goes from 0 to 1. Therefore, in this example, the partial correlation may be larger or smaller than the simple correlation, depending on the value of the condition $X_3=x_3$.
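The conditional moments of this example can also be checked symbolically. The following sketch (assuming Python with SymPy) computes $f(x_1,x_2\vert x_3)$, the conditional variance and covariance, and the partial correlation $\rho_{X_1X_2\vert X_3=x_3}$, and evaluates the latter at the endpoints $x_3=0$ and $x_3=1$.

\begin{verbatim}
import sympy as sp

# Conditional moments of Example 4.6 via symbolic integration.
x1, x2, x3 = sp.symbols('x1 x2 x3', positive=True)
f123 = sp.Rational(2, 3)*(x1 + x2 + x3)
f3 = sp.integrate(f123, (x1, 0, 1), (x2, 0, 1))    # marginal of X3
fcond = sp.simplify(f123 / f3)                     # f(x1, x2 | x3)

def Ec(g):                                         # conditional expectation
    return sp.simplify(sp.integrate(g*fcond, (x1, 0, 1), (x2, 0, 1)))

v11 = sp.simplify(Ec(x1**2) - Ec(x1)**2)           # Var(X1 | X3 = x3)
c12 = sp.simplify(Ec(x1*x2) - Ec(x1)*Ec(x2))       # Cov(X1, X2 | X3 = x3)
rho = sp.simplify(c12 / v11)                       # Var(X2|x3) = Var(X1|x3) by symmetry
print(rho)                                         # equals -1/(12*x3**2 + 24*x3 + 11)
print(rho.subs(x3, 0), rho.subs(x3, 1))            # -1/11, -1/47
\end{verbatim}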

EXAMPLE 4.7   Consider the following joint pdf

\begin{displaymath}f(x_1,x_2,x_3)= 2x_2(x_1+x_3);\ 0< x_1,x_2,x_3 < 1.\end{displaymath}

Note the symmetry of $x_1$ and $x_3$ in the pdf and that $X_2$ is independent of $(X_1,X_3)$. It immediately follows that


\begin{displaymath}f(x_1,x_3)=(x_1+x_3) \qquad 0< x_1,x_3 < 1\end{displaymath}

\begin{eqnarray*}
f(x_1)&=&x_1+\frac{1}{2};\\
f(x_2)&=&2x_2;\\
f(x_3)&=&x_3+\frac{1}{2}.
\end{eqnarray*}



Simple computations lead to

\begin{displaymath}E(X)=\left(\begin{array}{c}
\frac{7}{12}\\ [3mm]
\frac{2}{3}\\ [3mm]
\frac{7}{12}
\end{array}\right)
\quad\textrm{and}\quad
\Var(X)=\left(\begin{array}{rrr}
\frac{11}{144} & 0 & -\frac{1}{144}\\ [2mm]
0 & \frac{1}{18} & 0\\ [2mm]
-\frac{1}{144} & 0 & \frac{11}{144}
\end{array}\right).\end{displaymath}
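These moments can be verified symbolically as well; the sketch below (assuming Python with SymPy) integrates the pdf $f(x_1,x_2,x_3)=2x_2(x_1+x_3)$ over the unit cube to obtain $E(X)$ and $\Var(X)$.

\begin{verbatim}
import sympy as sp

# Mean vector and covariance matrix of Example 4.7 by symbolic integration
# of f(x1,x2,x3) = 2*x2*(x1 + x3) over the unit cube.
x = sp.symbols('x1 x2 x3', positive=True)
f = 2*x[1]*(x[0] + x[2])

def E(g):
    return sp.integrate(g*f, (x[0], 0, 1), (x[1], 0, 1), (x[2], 0, 1))

mu = sp.Matrix([E(xi) for xi in x])
Sigma = sp.Matrix(3, 3, lambda i, j: E(x[i]*x[j]) - mu[i]*mu[j])
print(mu.T)     # [7/12, 2/3, 7/12]
print(Sigma)    # diagonal 11/144, 1/18, 11/144 and Cov(X1,X3) = -1/144
\end{verbatim}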

Let us analyze the conditional distribution of $(X_1,X_2)$ given $X_3=x_3$. We have

\begin{eqnarray*}
f(x_1,x_2\vert x_3)= \frac{4(x_1+x_3)x_2}{2x_3+1}, &\ & 0 < x_1,x_2 < 1,\\
f(x_1\vert x_3)= \frac{2(x_1+x_3)}{2x_3+1}, &\ & 0 < x_1 < 1,\\
f(x_2\vert x_3)= f(x_2)= 2x_2, &\ & 0 < x_2 < 1,
\end{eqnarray*}



so that again $X_1$ and $X_2$ are independent conditional on $X_3=x_3$. In this case

\begin{eqnarray*}
E\left({X_1 \choose X_2}\left\vert X_3=x_3\right.\right) & = &
\left(\begin{array}{c}
\frac{3x_3+2}{3(2x_3+1)}\\ [3mm]
\frac{2}{3}
\end{array}\right)
\quad\textrm{and}\\ [3mm]
\Var\left({X_1 \choose X_2}\left\vert X_3=x_3\right.\right) & = &
\left(\begin{array}{cc}
\frac{1}{18}\left(\frac{6x_3^2+6x_3+1}{(2x_3+1)^2}\right) & 0\\ [3mm]
0 & \frac{1}{18}
\end{array}\right).
\end{eqnarray*}



Properties of Conditional Expectations

Since $E(X_2\vert X_1=x_1)$ is a function of $x_1$, say $h(x_1)$, we can define the random variable $h(X_1) = E(X_2\vert X_1)$. The same can be done when defining the random variable $\Var(X_2\vert X_1)$. These two random variables share some interesting properties:

$\displaystyle E(X_2)$ $\textstyle =$ $\displaystyle E\{E(X_2\vert X_1)\}$ (4.28)
$\displaystyle \Var(X_2)$ $\textstyle =$ $\displaystyle E\{\Var(X_2\vert X_1)\} + \Var\{E(X_2\vert X_1)\}.$ (4.29)

EXAMPLE 4.8   Consider the following pdf

\begin{displaymath}f(x_1,x_2)=2e^{-\frac{x_2}{x_1}} ;\ 0< x_1 < 1, x_2 >0.\end{displaymath}

It is easy to show that

\begin{displaymath}f(x_1)=2x_1\ \textrm{ for }\ 0<x_1<1 ;\quad E(X_1)=\frac{2}{3}\ \textrm{ and }\ \Var(X_1)=\frac{1}{18}\end{displaymath}


\begin{displaymath}f(x_2\vert x_1)=\frac{1}{x_1}e^{-\frac{x_2}{x_1}}\ \textrm{ for }\ x_2>0;\quad
E(X_2\vert X_1)=X_1\ \textrm{ and }\
\Var(X_2\vert X_1)=X_1^2.\end{displaymath}

Without explicitly computing $f(x_2)$, we can obtain:

\begin{eqnarray*}
E(X_2) &= &E\left(E(X_2\vert X_1)\right) = E(X_1) = \frac{2}{3}\,,\\
\Var(X_2) &= &E\left(\Var(X_2\vert X_1)\right)+\Var\left(E(X_2\vert X_1)\right)
=E(X_1^2)+\Var(X_1)=\frac{1}{2}+\frac{1}{18}=\frac{10}{18}.
\end{eqnarray*}
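Formulas (4.28) and (4.29) can also be illustrated by simulation: draw $X_1$ from the marginal density $2x_1$ on $(0,1)$ (i.e. $X_1=\sqrt{U}$ with $U$ uniform) and, given $X_1$, draw $X_2$ from an exponential distribution with mean $X_1$. The following sketch (assuming Python with NumPy) does this and compares the empirical mean and variance of $X_2$ with $\frac{2}{3}$ and $\frac{10}{18}$.

\begin{verbatim}
import numpy as np

# Monte Carlo check of (4.28) and (4.29) for Example 4.8:
# X1 has density 2*x1 on (0,1), so X1 = sqrt(U) with U uniform;
# given X1, X2 is exponential with mean X1.
rng = np.random.default_rng(1)
n = 1_000_000
X1 = np.sqrt(rng.uniform(size=n))
X2 = rng.exponential(scale=X1)      # conditional draw given X1

print(X2.mean())    # close to E(X1) = 2/3
print(X2.var())     # close to E(X1^2) + Var(X1) = 10/18
\end{verbatim}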



The conditional expectation $E(X_2\vert X_1)$ viewed as a function $h(X_1)$ of $X_1$ (known as the regression function of $X_2$ on $X_1$), can be interpreted as a conditional approximation of $X_2$ by a function of $X_1$. The error term of the approximation is then given by:

\begin{displaymath}U = X_2 - E(X_2\vert X_1). \end{displaymath}

THEOREM 4.3   Let $X_1\in \mathbb{R}^k$ and $X_2\in
\mathbb{R}^{p-k}$ and $U = X_2 - E(X_2\vert X_1)$. Then we have:
(1)
$E(U) = 0$
(2)
$E(X_2\vert X_1)$ is the best approximation of $X_2$ by a function $h(X_1)$ of $X_1$, where $h:\; \mathbb{R}^k \longrightarrow \mathbb{R}^{p-k}$ and ``best'' means minimum mean squared error (MSE), where

\begin{displaymath}MSE(h) = E[\{X_2 - h(X_1)\}^{\top} \, \{X_2 - h(X_1)\}].\end{displaymath}
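Part (2) of the theorem can be illustrated with the setting of Example 4.8, where $E(X_2\vert X_1)=X_1$: among several candidate functions $h$, the conditional mean should attain the smallest empirical MSE. The following sketch (assuming Python with NumPy; the alternative functions are arbitrary choices for illustration) makes this comparison.

\begin{verbatim}
import numpy as np

# Illustration of Theorem 4.3 (2) with the mechanism of Example 4.8:
# the conditional mean h(X1) = X1 should give the smallest MSE among
# the candidate functions tried here.
rng = np.random.default_rng(2)
n = 1_000_000
X1 = np.sqrt(rng.uniform(size=n))
X2 = rng.exponential(scale=X1)

def mse(h):
    return np.mean((X2 - h(X1))**2)

print(mse(lambda x: x))             # conditional mean E(X2 | X1) = X1
print(mse(lambda x: 2/3 + 0*x))     # the unconditional mean
print(mse(lambda x: 0.5 + 0.5*x))   # an arbitrary linear alternative
\end{verbatim}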


Characteristic Functions

The characteristic function (cf) of a random vector $X \in \mathbb{R}^p$ (equivalently, of its density $f(x)$) is defined as

\begin{displaymath}\varphi_X(t) = E(e^{{\bf i}t^{\top}X})
= \int e^{{\bf i}t^{\top}x}f(x)\;dx,
\quad t \in \mathbb{R}^p, \end{displaymath}

where ${\bf i}$ is the complex unit: ${\bf i}^2 = -1$. The cf has the following properties:
\begin{displaymath}
\varphi_X(0) = 1\ \textrm{ and }\ \vert\varphi_X(t)\vert \le 1.
\end{displaymath} (4.30)

If $\varphi_X$ is absolutely integrable, i.e., the integral $\int \vert\varphi_X(t)\vert\, dt$ exists and is finite, then
\begin{displaymath}
f(x) = \frac{1}{(2\pi)^p} \int^\infty_{-\infty}e^{-{\bf i}t^{\top}x}
\varphi_X(t)\;dt.
\end{displaymath} (4.31)

If $X = (X_1,X_2,\ldots,X_{p})^{\top}$, then for $t = (t_{1},t_{2},\ldots,t_{p})^{\top} $
\begin{displaymath}
\varphi_{X_1}(t_1) = \varphi_X(t_1,0,\ldots,0),\quad\ldots\quad,
\varphi_{X_p}(t_p) = \varphi_X(0,\ldots,0,t_{p}).\
\end{displaymath} (4.32)

If $X_1,\ldots,X_p$ are independent random variables, then for $t = (t_{1},t_{2},\ldots,t_{p})^{\top} $
\begin{displaymath}
\varphi_X(t) = \varphi_{X_1}(t_1)\cdotp\ldots\cdotp\varphi_{X_p}(t_p).
\end{displaymath} (4.33)

If $X_1,\ldots,X_p$ are independent random variables, then for $t\in\mathbb{R}$
\begin{displaymath}
\varphi_{X_{1}+\ldots+X_{p}}(t) = \varphi_{X_1}(t)\cdotp\ldots\cdotp\varphi_{X_p}(t).
\end{displaymath} (4.34)

The characteristic function can recover all the cross-product moments of any order: $\forall j_k \geq 0, k=1,\ldots,p$ and for $t = (t_1,\ldots,t_p)^{\top}$ we have
\begin{displaymath}
E \left( X_1^{j_1}\cdotp\ldots\cdotp X_p^{j_p} \right) =
\frac{1}{{\bf i}^{\,j_1+\ldots+j_p}}
\left[ \frac{\partial^{\,j_1+\ldots+j_p} \varphi_X(t)}
{\partial t_1^{j_1} \ldots \partial t_p^{j_p} } \right]_{t=0}.
\end{displaymath} (4.35)

EXAMPLE 4.9   The cf of the density in Example 4.5 is given by

\begin{eqnarray*}
\varphi_X(t) & = & \int^1_0 \int^1_0 e^{{\bf i}t^{\top}x}f(x)\,dx
= \int^1_0 \int^1_0 e^{{\bf i}(t_1x_1+t_2x_2)}
\left(\frac{1}{2}\,x_1+\frac{3}{2}\,x_2\right)dx_1\,dx_2\\
& = & -\,{\bf i}\;\frac{t_2\left\{e^{{\bf i}t_1}(1-{\bf i}t_1)-1\right\}\left(e^{{\bf i}t_2}-1\right)
+3\,t_1\left(e^{{\bf i}t_1}-1\right)\left\{e^{{\bf i}t_2}(1-{\bf i}t_2)-1\right\}}
{2\,{t_1}^2\,{t_2}^2}\,,
\end{eqnarray*}

using $\int^1_0 e^{{\bf i}tx}dx=(e^{{\bf i}t}-1)/({\bf i}t)$ and
$\int^1_0 x\,e^{{\bf i}tx}dx=\{e^{{\bf i}t}(1-{\bf i}t)-1\}/t^2$.



EXAMPLE 4.10   Suppose $X\in\mathbb{R}^1$ follows the density of the standard normal distribution

\begin{displaymath}f_{X}(x) = \frac{1}{\sqrt{2\pi}} \exp \left(-\frac{x^2}{2}\right) \end{displaymath}

(see Section 4.4) then the cf can be computed via

\begin{eqnarray*}
\varphi_{X}(t) & = & \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty}
e^{{\bf i}tx}\exp\left(-\frac{x^2}{2}\right)dx\\
& = & \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty}
\exp\left\{-\frac{x^2-2{\bf i}tx}{2}\right\}dx\\
& = & \exp\left(-\frac{t^2}{2}\right)\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}
\exp\left\{-\frac{(x-{\bf i}t)^2}{2}\right\}\,dx \\
& = & \exp\left(-\frac{t^2}{2}\right),
\end{eqnarray*}



since ${\bf i}^2 = -1$ and $\int \frac{1}{\sqrt{2\pi}} \exp \left\{-\frac
{(x-{\bf i}t)^2}{2}\right\}\,dx=1$.

A variety of distributional characteristics can be computed from $\varphi_X(t)$. The standard normal distribution has a very simple cf, as was seen in Example 4.10. Deviations from normality can be measured by deviations of the cf (or of characteristics derived from it) from the normal cf. In Table 4.1 we give an overview of the cf's for a variety of distributions.


Table 4.1: Characteristic functions for some common distributions.

\begin{displaymath}
\begin{array}{lll}
 & \textrm{pdf} & \textrm{cf}\\ [1mm]
\textrm{Uniform} & f(x)={\boldsymbol{I}}(x\in[a,b])/(b-a)
 & \varphi_X(t)=\left(e^{{\bf i}bt}-e^{{\bf i}at}\right)/\{(b-a){\bf i}t\}\\ [1mm]
N_1(\mu,\sigma^2) & f(x)=(2\pi\sigma^2)^{-1/2}\exp\{-(x-\mu)^2/(2\sigma^2)\}
 & \varphi_X(t)=e^{{\bf i}\mu t-\sigma^2t^2/2}\\ [1mm]
\chi^2(n) & f(x)={\boldsymbol{I}}(x> 0)\, x^{n/2-1}e^{-x/2}/\{\Gamma(n/2)2^{n/2}\}
 & \varphi_X(t)=(1-2{\bf i}t)^{-n/2}\\ [1mm]
N_p(\mu,\Sigma) & f(x)=\vert 2\pi\Sigma\vert^{-1/2}\exp\{-(x-\mu)^{\top}\Sigma^{-1}(x-\mu)/2 \}
 & \varphi_X(t)=e^{{\bf i}t^{\top}\mu-t^{\top}\Sigma t/2}
\end{array}
\end{displaymath}


THEOREM 4.4 (Cramer-Wold)   The distribution of $X \in \mathbb{R}^p$ is completely determined by the set of all (one-dimensional) distributions of $t^{\top}X$ where $t\in \mathbb{R}^p$.

This theorem says that we can determine the distribution of $X$ in $\mathbb{R}^p$ by specifying all of the one-dimensional distributions of the linear combinations

\begin{displaymath}\sum^p_{j=1} t_jX_j = t^{\top}X,
\quad t = (t_{1},t_{2},\ldots,t_{p})^{\top}. \end{displaymath}


Cumulant Functions

Moments $m_k=\int x^k f(x)\, dx$ often help in describing distributional characteristics. The normal distribution in $d=1$ dimension, for example, is completely characterized by its first two moment parameters $\mu=m_1$ and $\sigma^2=m_2-m_1^2$. Another helpful class of parameters is the class of cumulants or semi-invariants of a distribution. In order to simplify notation we concentrate here on the one-dimensional ($d=1$) case.

For a given random variable $X$ with density $f$ and finite moments of order $k$ the characteristic function $\varphi_X(t)=E(e^{{\bf i}tX})$ has the derivatives

\begin{displaymath}
\frac{1}{{\bf i}^{\,j}}
\left[ \frac{\partial^j \log\{\varphi_X(t)\}}{\partial t^j }\right]_{t=0} = \kappa_j,
\qquad j=1,\dots,k.
\end{displaymath}

The values $\kappa_j$ are called cumulants or semi-invariants since $\kappa_j$ does not change (for $j>1$) under a shift transformation $X\mapsto X+a$. The cumulants are natural parameters for dimension reduction methods, in particular the Projection Pursuit method (see Section 18.2).

The relationship between the first $k$ moments $m_1,\dots,m_k$ and the cumulants is given by

\begin{displaymath}
\kappa_k=(-1)^{k-1}\left\vert
\begin{array}{ccccc}
m_1 & 1 & 0 & \cdots & 0\\
m_2 & m_1 & 1 & \cdots & 0\\
m_3 & m_2 & {2\choose 1}m_1 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
m_k & m_{k-1} & {k-1\choose 1}m_{k-2} & \cdots & {k-1\choose k-2}m_1\\
\end{array}\right\vert.
\end{displaymath} (4.36)

EXAMPLE 4.11   Suppose that $k=1$, then formula (4.36) above yields

\begin{displaymath}
\kappa_1=m_1.
\end{displaymath}

For $k=2$ we obtain

\begin{displaymath}
\kappa_2 = - \left\vert\begin{array}{cc}
m_1 & 1 \\
m_2 & {1\choose 0} m_1 \\
\end{array}\right\vert=m_2-m_1^2.
\end{displaymath}

For $k=3$ we have to calculate

\begin{displaymath}
\kappa_3 =
\left\vert\begin{array}{ccc}
m_1 & 1 & 0\\
m_2 & m_1 & 1\\
m_3&m_2&2m_1\\
\end{array}\right\vert.
\end{displaymath}

Calculating the determinant we have:
$\displaystyle \kappa_3$ $\textstyle =$ $\displaystyle m_1
\left\vert\begin{array}{cc}
m_1 & 1 \\
m_2 & 2m_1\\
\end{array}\right\vert
-m_2
\left\vert\begin{array}{cc}
1 & 0 \\
m_2 & 2m_1\\
\end{array}\right\vert
+m_3
\left\vert\begin{array}{cc}
1 & 0 \\
m_1 & 1\\
\end{array}\right\vert$  
  $\textstyle =$ $\displaystyle m_1(2m_1^2-m_2)-m_2(2m_1)+m_3$  
  $\textstyle =$ $\displaystyle m_3 -3m_1m_2+2m_1^3.$ (4.37)

Similarly one calculates
\begin{displaymath}
\kappa_4=m_4-4m_3m_1-3m_2^2+12m_2m_1^2-6m_1^4.
\end{displaymath} (4.38)
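The cumulants can equivalently be obtained as Taylor coefficients of $\log\{\varphi_X(t)\}$. The following sketch (assuming Python with SymPy) expands $\log\{\varphi_X(t)\}$ for a cf with generic moments $m_1,\dots,m_4$ and reproduces formulas (4.37) and (4.38).

\begin{verbatim}
import sympy as sp

# Cumulants as Taylor coefficients of log(phi_X(t)), written in terms of
# generic moments m1,...,m4, to reproduce (4.37) and (4.38).
t = sp.symbols('t')
m1, m2, m3, m4 = sp.symbols('m1 m2 m3 m4')

# cf with prescribed moments: phi(t) = 1 + sum_j m_j (i t)^j / j!, up to order 4
phi = 1 + sum(mj*(sp.I*t)**j/sp.factorial(j)
              for j, mj in enumerate([m1, m2, m3, m4], start=1))
logphi = sp.expand(sp.series(sp.log(phi), t, 0, 5).removeO())

for j in range(1, 5):
    kappa = sp.expand(logphi.coeff(t, j)*sp.factorial(j)/sp.I**j)
    print(j, kappa)
# j = 3 gives m3 - 3*m1*m2 + 2*m1**3, j = 4 gives (4.38)
\end{verbatim}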

The same type of process can be used to express the moments in terms of the cumulants:

$\displaystyle m_1$ $\textstyle =$ $\displaystyle \kappa_1$  
$\displaystyle m_2$ $\textstyle =$ $\displaystyle \kappa_2+\kappa_1^2$  
$\displaystyle m_3$ $\textstyle =$ $\displaystyle \kappa_3 + 3\kappa_2\kappa_1 + \kappa_1^3$  
$\displaystyle m_4$ $\textstyle =$ $\displaystyle \kappa_4 + 4\kappa_3\kappa_1 +3\kappa_2^2+6\kappa_2\kappa_1^2
+\kappa_1^4.$ (4.39)

A very simple relationship can be observed between the semi-invariants and the central moments $\mu_k=E(X-\mu)^k$, where $\mu=m_1$ as defined before. In fact, $\kappa_2=\mu_2$, $\kappa_3=\mu_3$ and $\kappa_4=\mu_4-3\mu_2^2$.

Skewness $\gamma_3$ and kurtosis $\gamma_4$ are defined as:

$\displaystyle \gamma_3$ $\textstyle =$ $\displaystyle E(X-\mu)^3/\sigma^3$  
$\displaystyle \gamma_4$ $\textstyle =$ $\displaystyle E(X-\mu)^4/\sigma^4.$ (4.40)

The skewness and kurtosis determine the shape of one-dimensional distributions. The skewness of a normal distribution is 0 and the kurtosis equals 3. The relation of these parameters to the cumulants is given by:

$\displaystyle \gamma_3$ $\textstyle =$ $\displaystyle \frac{\kappa_3}{\kappa_2^{3/2}}$ (4.41)
$\displaystyle \gamma_4$ $\textstyle =$ $\displaystyle \frac{\kappa_4}{\kappa_2^2}+3.$ (4.42)

These relations will be used later in Section 18.2 on Projection Pursuit to determine deviations from normality.
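In practice the $\kappa_j$ are replaced by their empirical counterparts. The following sketch (assuming Python with NumPy) estimates the raw moments of a simulated standard normal sample, forms $\kappa_2,\kappa_3,\kappa_4$ via (4.37) and (4.38), and obtains skewness and kurtosis close to the theoretical values $0$ and $3$.

\begin{verbatim}
import numpy as np

# Empirical skewness and kurtosis via the cumulant relations,
# for a standard normal sample (expected values roughly 0 and 3).
rng = np.random.default_rng(3)
x = rng.standard_normal(1_000_000)

m = [np.mean(x**j) for j in range(1, 5)]    # raw moments m1,...,m4
k2 = m[1] - m[0]**2
k3 = m[2] - 3*m[0]*m[1] + 2*m[0]**3
k4 = m[3] - 4*m[2]*m[0] - 3*m[1]**2 + 12*m[1]*m[0]**2 - 6*m[0]**4

print(k3/k2**1.5)       # skewness, close to 0
print(k4/k2**2 + 3)     # kurtosis, close to 3
\end{verbatim}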

Summary
$\ast$
The expectation of a random vector $X$ is $\mu=\int xf(x)\;dx$, the covariance matrix $\Sigma=\Var(X)=E(X-\mu)(X-\mu)^{\top}$. We denote $X\sim(\mu,\Sigma)$.
$\ast$
Expectations are linear, i.e., $E(\alpha X+\beta Y)
=\alpha E X+\beta E Y$. If $X$ and $Y$ are independent, then $E (XY^{\top})=E X EY^{\top}$.
$\ast$
The covariance between two random vectors $X$ and $Y$ is $\Sigma_{XY}=
\Cov(X,Y)= E(X-E X)(Y-E Y)^{\top}=E(XY^{\top})-EX EY^{\top}$. If $X$ and $Y$ are independent, then $\Cov(X,Y)=0$.
$\ast$
The characteristic function (cf) of a random vector $X$ is $\varphi_{X}(t) = E (e^{\textrm{\bf i}t^{\top}X})$.
$\ast$
The distribution of a $p$-dimensional random variable $X$ is completely determined by all one-dimensional distributions of $t^{\top}X$ where $t\in \mathbb{R}^p$ (Theorem of Cramer-Wold).
$\ast$
The conditional expectation $E(X_2\vert X_1)$ is the MSE best approximation of $X_2$ by a function of $X_1$.