6.1 The Likelihood Function

Suppose that $\{x_i\}^n_{i=1}$ is an i.i.d. sample from a population with pdf $f(x;\theta)$. The aim is to estimate the vector of unknown parameters $\theta \in \mathbb{R}^k$. The likelihood function is defined as the joint density $L(\data{X};\theta )$ of the observations $x_{i}$, considered as a function of $\theta$:

\begin{displaymath}
L(\data{X};\theta )=\prod ^n_{i=1}f(x_i;\theta ),
\end{displaymath} (6.1)

where $\data{X}$ denotes the sample data matrix with the observations ${x_1^{\top}},\ldots,{x_n^{\top}}$ in its rows. The maximum likelihood estimator (MLE) of $\theta$ is defined as

\begin{displaymath}\widehat \theta =\arg\max_\theta L(\data{X};\theta). \end{displaymath}

Often it is easier to maximize the log-likelihood function
\begin{displaymath}
\ell (\data{X};\theta )=\log L(\data{X};\theta),
\end{displaymath} (6.2)

which is equivalent since the logarithm is a monotone one-to-one function. Hence

\begin{displaymath}\widehat \theta =\arg\max_\theta L(\data{X};\theta)
=\arg\max_\theta \ell(\data{X};\theta ).\end{displaymath}

The following examples illustrate cases where the maximization process can be performed analytically, i.e., we will obtain an explicit analytical expression for $\widehat \theta$. Unfortunately, in other situations, the maximization process can be more intricate, involving nonlinear optimization techniques. In the latter case, given a sample $\data{X}$ and the likelihood function, numerical methods will be used to determine the value of $\theta$ maximizing $L(\data{X};\theta)$ or $\ell(\data{X};\theta)$. These numerical methods are typically based on Newton-Raphson iterative techniques.
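
When the maximization has to be carried out numerically, a short computational sketch can make the idea concrete. The following Python fragment is only an illustration and not part of the text: it assumes NumPy and SciPy are available, simulates a univariate $N(\mu,\sigma^2)$ sample, and lets SciPy's BFGS quasi-Newton routine stand in for a hand-coded Newton-Raphson iteration.

\begin{verbatim}
# Illustrative sketch: numerical maximization of a log-likelihood.
# A univariate N(mu, sigma^2) sample is simulated; BFGS (a quasi-Newton
# method) is used in place of a hand-coded Newton-Raphson iteration.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)     # simulated sample

def neg_loglik(theta):
    # theta = (mu, log sigma); the log-parametrization keeps sigma > 0
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                   - 0.5 * ((x - mu) / sigma) ** 2)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # close to x.mean() and x.std()
\end{verbatim}

In this simple case the numerical optimum can be checked against the analytical MLE's, which is a useful sanity check before applying the same routine to models without closed-form solutions.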

EXAMPLE 6.1   Consider a sample $ \{x_{i}\}_{i=1}^n $ from $N_{p}(\mu,\data{I})$, i.e., from the pdf

\begin{displaymath}f(x;\theta )=\left (2\pi\right )^{-p/2}
\exp \left\{-\frac{ 1}{2}(x-\theta )^{\top}(x-\theta )\right\}\end{displaymath}

where $\theta =\mu\in\mathbb{R}^p$ is the mean vector parameter. The log-likelihood is in this case
\begin{displaymath}
\ell (\data{X};\theta )= \sum_{i=1}^n \log\{f(x_{i};\theta )\}
=\log\left\{(2\pi )^{-np/2}\right\}-\frac{1}{2} \sum_{i=1}^n
(x_{i}-\theta )^{\top} (x_{i}-\theta).
\end{displaymath} (6.3)

The term $(x_i-\theta )^{\top}(x_i-\theta )$ equals

\begin{displaymath}(x_i-\overline x)^{\top}(x_i-\overline x)
+(\overline x-\theta )^{\top}(\overline x-\theta )
+2(\overline x-\theta )^{\top} (x_i-\overline x).\end{displaymath}

Summing this term over $i=1,\ldots ,n$ and noting that the cross terms vanish (since $\sum^n_{i=1}(x_i-\overline x)=0$), we see that

\begin{displaymath}\sum^n_{i=1}(x_i-\theta )^{\top}(x_i-\theta )
=\sum ^n_{i=1}(x_i-\overline x)^{\top}(x_i-\overline x)
+n(\overline x-\theta)^{\top}(\overline x-\theta ).\end{displaymath}

Hence

\begin{displaymath}\ell(\data{X}; \theta) = \log (2\pi)^{-np/2} - \frac{1}{2} \sum^n_{i=1}
(x_i-\overline x)^{\top}(x_i-\overline x) - \frac{n}{2} (\overline x-
\theta )^{\top}(\overline x-\theta ). \end{displaymath}

Only the last term depends on $\theta$ and is obviously maximized for

\begin{displaymath}\widehat{\theta} = \widehat{\mu} = \overline x. \end{displaymath}

Thus $\overline x$ is the MLE of $\theta$ for this family of pdfs $f(x;\theta)$.
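
As a small numerical illustration of this result (not part of the proof), one can verify on simulated data that the log-likelihood (6.3) is largest at $\theta =\overline x$. The Python snippet below assumes NumPy and uses arbitrary illustrative values for $p$, $n$ and $\mu$.

\begin{verbatim}
# Illustrative check of Example 6.1: the log-likelihood of an
# N_p(mu, I) sample is maximized at the sample mean x-bar.
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 500
mu_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p)) + mu_true          # rows x_1', ..., x_n'

def loglik(theta):
    # l(X; theta) = -np/2 log(2 pi) - 1/2 sum_i (x_i - theta)'(x_i - theta)
    resid = X - theta
    return -0.5 * n * p * np.log(2 * np.pi) - 0.5 * np.sum(resid ** 2)

x_bar = X.mean(axis=0)
print(loglik(x_bar) >= loglik(x_bar + 0.1))    # True: moving away lowers l
print(loglik(x_bar) >= loglik(mu_true))        # True
\end{verbatim}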

A more complex example is the following, where we derive the MLE's for $\mu$ and $\Sigma$.

EXAMPLE 6.2   Suppose $\{x_i\}^n_{i=1}$ is a sample from a normal distribution $N_p(\mu,\Sigma)$. Here $\theta =(\mu ,\Sigma )$ with $\Sigma$ interpreted as a vector. Due to the symmetry of $\Sigma$ the unknown parameter $\theta$ is in fact $\{p+\frac{1}{2}p(p+1)\}$-dimensional. Then
\begin{displaymath}
L(\data{X};\theta ) = \vert 2\pi \Sigma \vert^{-n/2}
\exp\left\{-\frac{1}{2}\sum ^n_{i=1}
(x_i-\mu)^{\top}\Sigma ^{-1}(x_i-\mu )\right\}
\end{displaymath} (6.4)

and
\begin{displaymath}
\ell (\data{X};\theta )=-\frac{n}{2}\log\vert 2\pi \Sigma \vert
-\frac{1}{2}\sum
^n_{i=1}(x_i-\mu )^{\top}\Sigma ^{-1}(x_i-\mu ).
\end{displaymath} (6.5)

The term $(x_i-\mu )^{\top}\Sigma ^{-1}(x_i-\mu )$ equals

\begin{displaymath}(x_i-\overline x)^{\top}\Sigma^{-1}(x_i-\overline x)
+(\overline x-\mu )^{\top}\Sigma^{-1}(\overline x-\mu )
+2(\overline x-\mu )^{\top} \Sigma^{-1}(x_i-\overline x).\end{displaymath}

Summing this term over $i=1,\ldots ,n$, where again the cross terms vanish, we see that

\begin{displaymath}\sum^n_{i=1}(x_i-\mu )^{\top}\Sigma^{-1}(x_i-\mu )
=\sum ^n_{i=1}(x_i-\overline x)^{\top}\Sigma^{-1}(x_i-\overline x)
+n(\overline x-\mu)^{\top}\Sigma^{-1}(\overline x-\mu ).\end{displaymath}

Note that from (2.14)

\begin{eqnarray*}
(x_i-\overline x)^{\top}\Sigma ^{-1}(x_i-\overline x) &=&
\mathop{\hbox{tr}}\left\{(x_i-\overline x)^{\top}\Sigma ^{-1}(x_i-\overline x)\right\}\\
&=& \mathop{\hbox{tr}}\left\{\Sigma ^{-1}(x_i-\overline x)(x_i-\overline x)^{\top}\right\}.
\end{eqnarray*}



Therefore, by summing over the index $i$ we finally arrive at

\begin{eqnarray*}
\sum^n_{i=1}(x_i-\mu )^{\top}\Sigma^{-1}(x_i-\mu)&=&\mathop{\hbox{tr}}
\left\{\Sigma^{-1}\sum^n_{i=1}(x_i-\overline x)(x_i-\overline x)^{\top}
\right\}+n(\overline x-\mu )^{\top}
\Sigma^{-1}(\overline x-\mu ).
\end{eqnarray*}



Thus, using $\sum^n_{i=1}(x_i-\overline x)(x_i-\overline x)^{\top}=n\data{S}$, the log-likelihood function for $N_p(\mu,\Sigma)$ is
\begin{displaymath}
\ell (\data{X};\theta )=-\frac{n}{2}\log \vert 2\pi \Sigma \vert
-\frac{n}{2}\mathop{\hbox{tr}}\{\Sigma^{-1}\data{S}\}
-\frac{n}{2}(\overline x-\mu )^{\top}\Sigma ^{-1}
(\overline x-\mu).
\end{displaymath} (6.6)

We can easily see that the third term is maximized by $\mu =
\overline x$. In fact the MLE's are given by

\begin{displaymath}\widehat \mu =\overline x,\quad \widehat \Sigma=\data{S}.\end{displaymath}

The derivation of $\widehat\Sigma$ is a lot more complicated. It involves derivatives with respect to matrices with their notational complexities and will not be presented here; for a more elaborate proof see Mardia et al. (1979, pp. 103-104). Note that the unbiased covariance estimator $\data{S}_{u}=\frac{n}{n-1} \data{S}$ is not the MLE of $\Sigma$!
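
The following Python sketch is again only an illustration: it assumes NumPy, simulates an arbitrary bivariate normal sample, computes $\widehat\mu =\overline x$ and $\widehat\Sigma =\data{S}$ (the empirical covariance with divisor $n$), and checks numerically that the log-likelihood evaluated at $\data{S}$ is at least as large as at the unbiased estimator $\data{S}_u$.

\begin{verbatim}
# Illustrative check of Example 6.2: x-bar and S (divisor n) are the
# MLE's; the unbiased estimator S_u = n/(n-1) S gives a smaller likelihood.
import numpy as np

rng = np.random.default_rng(2)
n = 200
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=Sigma_true, size=n)

mu_hat = X.mean(axis=0)                        # MLE of mu
S = (X - mu_hat).T @ (X - mu_hat) / n          # MLE of Sigma (divisor n)
S_u = n / (n - 1) * S                          # unbiased, but not the MLE

def loglik(mu, Sigma):
    # l(X; mu, Sigma) = -n/2 log|2 pi Sigma|
    #                   - 1/2 sum_i (x_i - mu)' Sigma^{-1} (x_i - mu)
    resid = X - mu
    quad = np.einsum("ij,jk,ik->", resid, np.linalg.inv(Sigma), resid)
    return -0.5 * n * np.log(np.linalg.det(2 * np.pi * Sigma)) - 0.5 * quad

print(loglik(mu_hat, S) >= loglik(mu_hat, S_u))   # True
\end{verbatim}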

EXAMPLE 6.3   Consider the linear regression model $y_{i} = \beta^{\top}x_{i} +
\varepsilon_{i}$ for $i=1,\ldots ,n$, where the $\varepsilon_{i}$ are i.i.d. $N(0,\sigma^2)$ and where $x_{i} \in \mathbb{R}^p$. Here $\theta = (\beta^{\top},
\sigma)$ is a $(p+1)$-dimensional parameter vector. Denote

\begin{displaymath}y = \left( \begin{array}{c} y_{1} \\ \vdots \\ y_{n} \end{array} \right),\quad
\data{X} = \left( \begin{array}{c} x_{1}^{\top} \\ \vdots
\\ x_{n}^{\top} \end{array} \right). \end{displaymath}

Then

\begin{displaymath}L(y,{\data X}; \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}
\exp\left\{ -\frac{1}{2\sigma^2} (y_{i} -\beta^{\top}x_{i})^2 \right\} \end{displaymath}

and

\begin{eqnarray*}
\ell (y,{\data X};\theta) & = & \log \left( \frac{1}{(2\pi)^{n/2}\sigma^{n}} \right)
- \frac{1}{2\sigma^2}\sum_{i=1}^n (y_{i}-\beta^{\top}x_{i})^2 \\
& = & -\frac{n}{2}\log(2\pi) - n\log\sigma
- \frac{1}{2\sigma^2}(y-\data{X}\beta)^{\top}(y-\data{X}\beta) \\
& = & -\frac{n}{2}\log(2\pi) - n\log\sigma
- \frac{1}{2\sigma^2}(y^{\top}y + \beta^{\top}\data{X}^{\top} \data{X} \beta - 2\beta^{\top} \data{X}^{\top} y).
\end{eqnarray*}



Differentiating w.r.t. the parameters yields

\begin{eqnarray*}
\frac{\partial}{\partial\beta}\ell & = & -\frac{1}{2\sigma^2}
(2\data{X}^{\top}\data{X}\beta - 2\data{X}^{\top}y), \\
\frac{\partial}{\partial\sigma}\ell & = & -\frac{n}{\sigma}
+ \frac{1}{\sigma^3} \big\{ (y-\data{X}\beta)^{\top}(y-\data{X}\beta) \big \}.
\end{eqnarray*}



Note that $ \frac{\partial}{\partial\beta} \ell $ denotes the vector of the derivatives w.r.t. all components of $\beta$ (the gradient). Since setting the first equation to zero yields a condition on $\beta$ alone, we start by deriving $\widehat{\beta}$:

\begin{displaymath}\data{X}^{\top} \data{X} \widehat{\beta} = \data{X}^{\top} y
\quad\Longrightarrow\quad
\widehat{\beta} = (\data{X}^{\top}\data{X})^{-1} \data{X}^{\top}y. \end{displaymath}

Plugging $\widehat{\beta}$ into the second equation gives

\begin{displaymath}\frac{n}{\widehat{\sigma}} = \frac{1}{\widehat{\sigma}^3}
(y-\data{X}\widehat{\beta})^{\top}(y-\data{X}\widehat{\beta})
\quad\Longrightarrow\quad
\widehat{\sigma}^2 = \frac{1}{n} \vert\vert y-\data{X}\widehat{\beta}\vert\vert^2, \end{displaymath}

where $\vert\vert\bullet\vert\vert$ denotes the Euclidean vector norm from Section 2.6. We see that the MLE $\widehat{\beta}$ is identical to the least squares estimator (3.52). The variance estimator

\begin{displaymath}\widehat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n
(y_{i} - \widehat{\beta}^{\top}x_{i})^2 \end{displaymath}

is nothing else than the residual sum of squares (RSS) from (3.37), divided by $n$ and generalized to the case of multivariate $x_{i}$.
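
A minimal computational sketch of Example 6.3 (illustrative only; it assumes NumPy and uses arbitrary simulated values for $\data{X}$, $\beta$ and $\sigma$) computes the closed-form estimators $\widehat{\beta}$ and $\widehat{\sigma}^2$ directly from the formulas above.

\begin{verbatim}
# Illustrative computation of the MLE's in the normal linear model:
# beta-hat = (X'X)^{-1} X'y and sigma-hat^2 = ||y - X beta-hat||^2 / n.
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.normal(size=(n, p))                    # design matrix, rows x_i'
beta_true = np.array([1.0, -0.5, 2.0])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # least squares / ML estimator

resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n                 # RSS divided by n

print(beta_hat)     # close to beta_true
print(sigma2_hat)   # close to 0.7**2
\end{verbatim}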

Note that when the $x_{i}$ are considered to be fixed we have

\begin{displaymath}E(y)=\data{X}\beta\ \textrm{and}\ \Var(y)=\sigma^2\data{I}_n.\end{displaymath}


Then, using the properties of moments from Section 4.2 we have
\begin{displaymath}
E(\widehat{\beta})=(\data{X}^{\top}\data{X})^{-1}\data{X}^{\top}E(y)=\beta,
\end{displaymath} (6.7)


\begin{displaymath}
\Var(\widehat{\beta})=\sigma^2(\data{X}^{\top}\data{X})^{-1}.
\end{displaymath} (6.8)
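
A small Monte Carlo experiment (again only an illustration, with an arbitrary but fixed design matrix $\data{X}$) makes (6.7) and (6.8) tangible: averaging $\widehat{\beta}$ over repeated error vectors approximates $\beta$, and the empirical covariance of $\widehat{\beta}$ approximates $\sigma^2(\data{X}^{\top}\data{X})^{-1}$.

\begin{verbatim}
# Illustrative Monte Carlo check of (6.7) and (6.8) for fixed X.
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 50, 2, 1.0
X = rng.normal(size=(n, p))               # design kept fixed across replications
beta = np.array([2.0, -1.0])
XtX_inv = np.linalg.inv(X.T @ X)

estimates = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates.append(XtX_inv @ X.T @ y)   # beta-hat for this replication
estimates = np.array(estimates)

print(estimates.mean(axis=0))             # approximates beta,             (6.7)
print(np.cov(estimates.T))                # approximates sigma^2 (X'X)^-1, (6.8)
print(sigma**2 * XtX_inv)
\end{verbatim}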

Summary
$\ast$
If $ \{x_{i}\}_{i=1}^n $ is an i.i.d. sample from a distribution with pdf $f(x;\theta)$, then $L(\data{X};\theta)=\prod_{i=1}^n
f(x_{i};\theta)$ is the likelihood function. The maximum likelihood estimator (MLE) is that value of $\theta$ which maximizes $L(\data{X};\theta )$. Equivalently one can maximize the log-likelihood $\ell(\data{X};\theta)$.
$\ast$
The MLE's of $\mu$ and $\Sigma$ from a $N_{p}(\mu,\Sigma)$ distribution are $\widehat\mu=\overline{x}$ and $\widehat\Sigma=\data{S}$. Note that the MLE of $\Sigma$ is not unbiased.
$\ast$
The MLE's of $\beta$ and $\sigma$ in the linear model $y = \data{X} \beta + \varepsilon, \; \varepsilon \sim N_n(0, \sigma^2 \data{I})$ are given by the least squares estimator $\widehat{\beta} = (\data{X}^{\top}\data{X})^{-1} \data{X}^{\top}y$ and $\widehat{\sigma}^2 = \frac{1}{n} \vert\vert y-\data{X} \widehat{\beta} \vert\vert^2$. Moreover, $E(\widehat{\beta})=\beta$ and $\Var(\widehat{\beta})=\sigma^2(\data{X}^{\top}\data{X})^{-1}$.