6.2 The Cramer-Rao Lower Bound

As pointed out above, an important question in estimation theory is whether an estimator $\widehat \theta$ has certain desired properties, in particular, whether it converges to the unknown parameter $\theta$ it is supposed to estimate. One typical property we want for an estimator is unbiasedness, meaning that on average the estimator hits its target: $E(\widehat \theta)=\theta$. We have seen for instance (see Example 6.2) that $\bar x$ is an unbiased estimator of $\mu$ and ${\cal{S}}$ is a biased estimator of $\Sigma$ in finite samples. If we restrict ourselves to unbiased estimation, then the natural question is whether the estimator has some optimality property in terms of its sampling variance. Since we focus on unbiasedness, we look for an estimator with the smallest possible variance.

In this context, the Cramer-Rao lower bound will give the minimal achievable variance for any unbiased estimator. This result is valid under very general regularity conditions (discussed below). One of the most important applications of the Cramer-Rao lower bound is that it provides the asymptotic optimality property of maximum likelihood estimators. The Cramer-Rao theorem involves the score function and its properties which will be derived first.

The score function $s(\data{X};\theta )$ is the derivative of the log likelihood function w.r.t. $\theta \in \mathbb{R}^k$

\begin{displaymath}
s(\data{X};\theta )=\frac{\partial }{\partial \theta }\ell (\data{X};\theta )=\frac{1}{L(\data{X};\theta )}\frac{\partial }{\partial \theta }L(\data{X};\theta ).
\end{displaymath} (6.9)

The covariance matrix $\data{F}_{n}=\Var\{s(\data{X};\theta)\}$ is called the Fisher information matrix. In what follows, we will give some interesting properties of score functions.

THEOREM 6.1   If $s=s(\data{X};\theta )$ is the score function and if $\hat\theta = t =t(\data{X},\theta)$ is any function of $\data{X}$ and $\theta$, then under regularity conditions
\begin{displaymath}
E(st^{\top})=\frac{\partial }{\partial \theta } E(t^{\top})-E\left (\frac{\partial t^{\top}}{\partial \theta }\right )\cdotp
\end{displaymath} (6.10)

The proof is left as an exercise (see Exercise 6.9). The regularity conditions required for this theorem are rather technical and ensure that the expressions (expectations and derivatives) appearing in (6.10) are well defined. In particular, the support of the density $f(x;\theta)$ should not depend on $\theta$. The next corollary is a direct consequence.

COROLLARY 6.1   If $s=s(\data{X};\theta )$ is the score function, and $\hat\theta= t=t(\data{X})$ is any unbiased estimator of $\theta$ (i.e., $E(t) = \theta$), then
\begin{displaymath}
E(st^{\top})=\Cov(s,t)=\data{I}_{k}.
\end{displaymath} (6.11)

Note that the score function has mean zero (see Exercise 6.10).

\begin{displaymath}
E\{s(\data{X};\theta )\} = 0.
\end{displaymath} (6.12)

Hence, $E(ss^{\top}) = \Var(s) =\data{F}_n$ and by setting $s=t$ in Theorem 6.1 it follows that

\begin{displaymath}\data{F}_n = -E \left \{ \frac{\partial^2}{\partial \theta \partial \theta^{\top}}\ell (\data{X};\theta) \right \}. \end{displaymath}
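As a small numerical check of (6.12) and of the two expressions for the Fisher information, the following sketch evaluates them exactly for a single Bernoulli observation with success probability $\theta$ (so $k=1$ and $n=1$). The Bernoulli model, the value $\theta=0.3$ and all function names are assumptions chosen for illustration; they do not appear in the text above.

```python
def bernoulli_score(x, theta):
    """Score of one Bernoulli(theta) observation x in {0, 1}:
    derivative of x*log(theta) + (1-x)*log(1-theta) w.r.t. theta."""
    return x / theta - (1 - x) / (1 - theta)


def bernoulli_neg_hessian(x, theta):
    """Negative second derivative of the same log-likelihood."""
    return x / theta**2 + (1 - x) / (1 - theta) ** 2


def expect(g, theta):
    """Exact expectation of g(x, theta) when x ~ Bernoulli(theta)."""
    return (1 - theta) * g(0, theta) + theta * g(1, theta)


theta = 0.3  # hypothetical parameter value
mean_score = expect(bernoulli_score, theta)
var_score = expect(lambda x, th: bernoulli_score(x, th) ** 2, theta) - mean_score**2
neg_hessian = expect(bernoulli_neg_hessian, theta)
closed_form = 1.0 / (theta * (1.0 - theta))

print(mean_score, var_score, neg_hessian, closed_form)
```

Under these assumptions `mean_score` is zero up to rounding, and `var_score`, `neg_hessian` and `closed_form` agree up to floating point error, which is what (6.12) and the identity for $\data{F}_n$ predict in this scalar case.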

REMARK 6.1   If $x_1,\ldots,x_n$ are i.i.d., $\data{F}_n = n\data{F}_1$ where $\data{F}_1$ is the Fisher information matrix for sample size $n=1$.

EXAMPLE 6.4   Consider an i.i.d. sample $\{x_i\}_{i=1}^n$ from $N_p(\theta ,\data{I})$. In this case the parameter $\theta$ is the mean $\mu$. It follows from (6.3) that:

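\begin{eqnarray*}
s(\data{X};\theta ) &=& \frac{\partial}{\partial \theta}\,\ell (\data{X};\theta ) = -\frac{1}{2}\,\frac{\partial}{\partial \theta}\left\{\sum_{i=1}^{n}(x_{i}-\theta )^{\top}(x_{i}-\theta )\right\}\\
&=& n(\overline x-\theta).
\end{eqnarray*}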

Hence, the information matrix is

\begin{displaymath}\data{F}_{n} =\Var\{n(\overline x-\theta)\} =n\data{I}_{p}.\end{displaymath}
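The result of Example 6.4 can be checked by simulation. The sketch below is only an illustration under stated assumptions: it takes $p=2$, $n=5$, a hypothetical true mean $\theta=(1,-2)^{\top}$ and a fixed random seed, draws repeated samples from $N_2(\theta,\data{I})$, evaluates the score $n(\overline x-\theta)$ at the true parameter, and averages $ss^{\top}$. The function names are invented for this sketch.

```python
import random


def score_normal_mean(sample, theta):
    """Score n * (xbar - theta) for an i.i.d. N_p(theta, I) sample,
    computed as sum_i x_i - n * theta."""
    n, p = len(sample), len(theta)
    return [sum(x[j] for x in sample) - n * theta[j] for j in range(p)]


def simulate_fisher_information(theta, n, reps, seed=0):
    """Monte Carlo estimate of E(s s^T) at the true parameter theta."""
    rng = random.Random(seed)
    p = len(theta)
    acc = [[0.0] * p for _ in range(p)]
    for _ in range(reps):
        sample = [[rng.gauss(theta[j], 1.0) for j in range(p)] for _ in range(n)]
        s = score_normal_mean(sample, theta)
        for j in range(p):
            for k in range(p):
                acc[j][k] += s[j] * s[k]
    return [[acc[j][k] / reps for k in range(p)] for j in range(p)]


# Hypothetical settings: p = 2, n = 5, 20000 replications.
F_hat = simulate_fisher_information(theta=[1.0, -2.0], n=5, reps=20000)
for row in F_hat:
    print(row)
```

With these settings the diagonal entries of `F_hat` should be close to $n=5$ and the off-diagonal entries close to zero, in line with $\data{F}_n=n\data{I}_p$; the agreement is only up to Monte Carlo error.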

How well can we estimate $\theta$? The answer is given in the following theorem, which is due to Cramer and Rao. As pointed out above, this theorem gives a lower bound for the variance of unbiased estimators. Hence, every unbiased estimator that attains this lower bound is a minimum variance estimator.

THEOREM 6.2 (Cramer-Rao)   If $\hat\theta= t=t(\data{X})$ is any unbiased estimator for $\theta$, then under regularity conditions
\begin{displaymath}
\Var(t)\ge \data{F}_{n}^{-1},
\end{displaymath} (6.13)

where

\begin{displaymath}
\data{F}_{n} = E\{s(\data{X};\theta )s(\data{X};\theta )^{\top}\} = \Var\{s(\data{X};\theta)\}
\end{displaymath} (6.14)

is the Fisher information matrix.

PROOF: Consider the correlation $\rho_{Y,Z}$ between $Y$ and $Z$, where $Y =a^{\top}t$ and $Z =c^{\top}s$. Here $s$ is the score and $a$, $c\in \mathbb{R}^p$ are arbitrary vectors. By Corollary 6.1, $\Cov(s,t)=\data{I}$ and thus

\begin{eqnarray*}
\Cov(Y,Z) &=& a^{\top}\Cov(t,s)c=a^{\top}c,\\
\Var(Z) &=& c^{\top}\Var(s)c=c^{\top}\data{F}_{n}c.
\end{eqnarray*}

Hence,

\begin{displaymath}
\rho^2_{Y,Z} = \frac{\Cov^2(Y,Z)}{\Var(Y)\Var(Z)} = \frac{(a^{\top}c)^2}{a^{\top}\Var(t)a\cdotp c^{\top}\data{F}_{n}c}\le 1.
\end{displaymath} (6.15)

In particular, this holds for any $c \neq 0$. Therefore it holds also for the maximum of the left-hand side of (6.15) with respect to $c$. Since

\begin{displaymath}\max_{c} \frac{c^{\top}aa^{\top}c}{c^{\top}\data{F}_{n}c} = \max_{c^{\top}\data{F}_{n}c=1} c^{\top}aa^{\top}c \end{displaymath}

and

\begin{displaymath}\max_{c^{\top}\data{F}_{n}c=1} c^{\top}aa^{\top}c= a^{\top}\data{F}_{n}^{-1}a \end{displaymath}

by our maximization Theorem 2.5 we have

\begin{displaymath}\frac{a^{\top} \data{F}_{n}^{-1}a}{a^{\top}\Var(t)a }\le 1 \quad \forall \ a \in \mathbb{R}^p, \quad a \neq 0, \end{displaymath}

i.e.,

\begin{displaymath}a^{\top}\{\Var(t)-\data{F}_{n}^{-1}\}a\ge 0\ \quad\forall\ a\in\mathbb{R}^p,\quad a\neq 0, \end{displaymath}

which is equivalent to $\Var(t) \ge \data{F}_{n}^{-1}$. ${\Box}$
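To see the bound at work, one can compare two unbiased estimators of the mean of a univariate normal distribution. The following sketch assumes i.i.d. $N(\theta,1)$ data, for which $\data{F}_n=n$ by Example 6.4 with $p=1$, and estimates by simulation the variances of the sample mean and of the sample median (which is unbiased here because the density is symmetric about $\theta$). The sample size, number of replications, seed and function name are assumptions made for illustration.

```python
import random
import statistics


def estimator_variances(theta, n, reps, seed=0):
    """Monte Carlo variances of the sample mean and the sample median
    for i.i.d. N(theta, 1) samples of size n."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.pvariance(means), statistics.pvariance(medians)


n = 25                                # hypothetical sample size
cramer_rao_bound = 1.0 / n            # F_n^{-1} = 1/n for N(theta, 1)
var_mean, var_median = estimator_variances(theta=0.0, n=n, reps=20000)

print(cramer_rao_bound, var_mean, var_median)
```

In this setting the simulated variance of the mean fluctuates around the bound $1/n$, since $\overline x$ attains it, while the simulated variance of the median is visibly larger. The comparison is subject to Monte Carlo error and is not a substitute for the proof.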

Maximum likelihood estimators (MLEs) attain the lower bound as the sample size $n$ goes to infinity. Theorem 6.3 states this and, in addition, gives the asymptotic sampling distribution of the maximum likelihood estimator, which turns out to be multinormal.

THEOREM 6.3   Suppose that the sample $\{x_i\}_{i=1}^n$ is i.i.d. If $\widehat \theta$ is the MLE for $\theta \in \mathbb{R}^k$, i.e., $\widehat\theta =\arg\max\limits_\theta L(\data{X};\theta)$, then under some regularity conditions, as $n\to\infty$:
\begin{displaymath}
\sqrt n(\widehat{\theta} -\theta ) \mathrel{\mathop{\longrightarrow}\limits_{}^{\cal L}} N_{k}(0,\data{F}_{1}^{-1})
\end{displaymath} (6.16)

where $\data{F}_{1}$ denotes the Fisher information for sample size $n=1$.

As a consequence of Theorem 6.3 we see that under regularity conditions the MLE is asymptotically unbiased, efficient (minimum variance) and normally distributed. Also it is a consistent estimator of $\theta$.
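A simple simulation can illustrate the statement of Theorem 6.3 in a scalar case. The sketch below assumes an exponential model with density $\lambda e^{-\lambda x}$, which is not discussed in the text: there the log-likelihood is $n\log\lambda-\lambda\sum_i x_i$, the MLE is $\widehat\lambda=1/\overline x$ and $\data{F}_1=1/\lambda^2$, so the limiting variance in (6.16) is $\lambda^2$. The rate, sample size, number of replications, seed and function name are all hypothetical choices.

```python
import math
import random
import statistics


def scaled_mle_errors(rate, n, reps, seed=0):
    """Draw `reps` samples of size n from Exp(rate) and return
    sqrt(n) * (mle - rate), where the MLE of the rate is 1 / xbar."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        sample = [rng.expovariate(rate) for _ in range(n)]
        xbar = statistics.fmean(sample)
        out.append(math.sqrt(n) * (1.0 / xbar - rate))
    return out


rate = 2.0                            # hypothetical true parameter
z = scaled_mle_errors(rate, n=200, reps=4000)
asymptotic_variance = rate**2         # F_1^{-1}, since F_1 = 1 / rate^2

print(statistics.fmean(z), statistics.pvariance(z), asymptotic_variance)
```

For moderately large $n$ the simulated mean of $\sqrt n(\widehat\lambda-\lambda)$ should be close to zero and its variance close to $\lambda^2$. A small positive finite-sample bias remains, since the MLE is only asymptotically unbiased.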

Note that from property (5.4) of the multinormal it follows that asymptotically

\begin{displaymath}
n(\widehat{\theta}-\theta)^{\top}\data{F}_1(\widehat{\theta}-\theta) \stackrel{\cal L}{\to}\chi^2_p.
\end{displaymath} (6.17)

If $\widehat{\data{F}}_1$ is a consistent estimator of $\data{F}_1$ (e.g., $\widehat{\data{F}}_1=\data{F}_1(\widehat{\theta})$), we have equivalently
\begin{displaymath}
n(\widehat{\theta}-\theta)^{\top}\widehat{\data{F}}_1(\widehat{\theta}-\theta) \stackrel{\cal L}{\to}\chi^2_p.
\end{displaymath} (6.18)

This expression is sometimes useful in testing hypotheses about $\theta$ and in constructing confidence regions for $\theta$ in a very general setup. These issues will be discussed in more detail in the next chapter, but from (6.18) it can be seen, for instance, that when $n$ is large,

\begin{displaymath}
P\left( n(\widehat{\theta}-\theta)^{\top}\widehat{\data{F}}_1(\widehat{\theta}-\theta) \leq \chi^2_{1-\alpha;p} \right) \approx 1- \alpha,
\end{displaymath}

where $\chi^2_{\nu;p}$ denotes the $\nu$-quantile of a $\chi^2_p$ random variable. So, the ellipsoid $n(\widehat{\theta}-\theta)^{\top}\widehat{\data{F}}_1(\widehat{\theta}-\theta)\leq \chi^2_{1-\alpha;p}$ provides in $\mathbb{R}^p$ an asymptotic $(1-\alpha)$-confidence region for $\theta$.
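The coverage of such a confidence region can be checked by simulation in the setting of Example 6.4. The sketch below assumes $p=2$ and i.i.d. $N_2(\theta,\data{I})$ data, so that $\data{F}_1=\data{I}_2$ and the MLE is $\overline x$. It also uses the fact that a $\chi^2_2$ variable is exponential with mean 2, so its $(1-\alpha)$-quantile is $-2\log\alpha$ and no statistical library is needed. The parameter values, seed and function name are assumptions made for illustration.

```python
import math
import random


def ellipsoid_coverage(theta, n, alpha, reps, seed=0):
    """Share of simulated samples for which
    n * (xbar - theta)^T (xbar - theta) <= chi-square(2) quantile,
    for i.i.d. N_2(theta, I) data (so F_1 = I and the MLE is xbar)."""
    rng = random.Random(seed)
    quantile = -2.0 * math.log(alpha)   # (1 - alpha)-quantile of chi^2_2
    hits = 0
    for _ in range(reps):
        xbar = [sum(rng.gauss(theta[j], 1.0) for _ in range(n)) / n
                for j in range(2)]
        stat = n * sum((xbar[j] - theta[j]) ** 2 for j in range(2))
        hits += stat <= quantile
    return hits / reps


# Hypothetical settings: theta = (1, -2), n = 10, alpha = 0.05.
coverage = ellipsoid_coverage(theta=[1.0, -2.0], n=10, alpha=0.05, reps=10000)
print(coverage)
```

In this particular model the statistic is exactly $\chi^2_2$ distributed for every $n$, so the simulated coverage should be close to $1-\alpha=0.95$ up to Monte Carlo error. In general the statement holds only approximately for large $n$.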

Summary
The score function is the derivative $s(\data{X};\theta)=\frac{\partial}{\partial\theta}\ell(\data{X};\theta)$ of the log-likelihood with respect to $\theta$. The covariance matrix of $s(\data{X};\theta )$ is the Fisher information matrix.
The score function has mean zero: $E\{s(\data{X};\theta )\} = 0$.
The Cramer-Rao bound says that any unbiased estimator $\hat\theta= t=t(\data{X})$ has a variance that is bounded from below by the inverse of the Fisher information. Thus, an unbiased estimator that attains this lower bound is a minimum variance estimator.
For i.i.d. data $\{x_i\}_{i=1}^n$ the Fisher information matrix is $\data{F}_n = n\data{F}_1$.
MLEs attain the lower bound in an asymptotic sense, i.e.,

\begin{displaymath}\sqrt n(\widehat \theta -\theta ) \mathrel{\mathop{\longrightarrow}\limits_{}^{\cal L}} N_k(0,\data{F}_{1}^{-1})\end{displaymath}

if $\widehat \theta$ is the MLE for $\theta \in \mathbb{R}^k$, i.e., $\widehat\theta
=\arg\max\limits_\theta L(\data{X};\theta)$.