6.1 The Likelihood Function

Suppose that $\{x_i\}^n_{i=1}$ is an i.i.d. sample from a population with pdf $f(x;\theta)$. The aim is to estimate the vector of unknown parameters $\theta \in \mathbb{R}^k$. The likelihood function is defined as the joint density $L(\data{X};\theta )$ of the observations $x_{i}$, considered as a function of $\theta$:

\begin{displaymath}
L(\data{X};\theta )=\prod ^n_{i=1}f(x_i;\theta ),
\end{displaymath} (6.1)

where $\data{X}$ denotes the sample data matrix with the observations ${x_1^{\top}},\ldots,{x_n^{\top}}$ in its rows. The maximum likelihood estimator (MLE) of $\theta$ is defined as

\begin{displaymath}\widehat \theta =\arg\max_\theta L(\data{X};\theta). \end{displaymath}

Often it is easier to maximize the log-likelihood function
\begin{displaymath}
\ell (\data{X};\theta )=\log L(\data{X};\theta),
\end{displaymath} (6.2)

which is equivalent since the logarithm is a monotone one-to-one function. Hence

\begin{displaymath}\widehat \theta =\arg\max_\theta L(\data{X};\theta)
=\arg\max_\theta \ell(\data{X};\theta ).\end{displaymath}

The following examples illustrate cases where the maximization process can be performed analytically, i.e., we will obtain an explicit analytical expression for $\widehat \theta$. Unfortunately, in other situations, the maximization process can be more intricate, involving nonlinear optimization techniques. In the latter case, given a sample $\data{X}$ and the likelihood function, numerical methods will be used to determine the value of $\theta$ maximizing $L(\data{X};\theta)$ or $\ell(\data{X};\theta)$. These numerical methods are typically based on Newton-Raphson iterative techniques.
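
When the maximization has to be carried out numerically, a short computational sketch can make the idea concrete. The following Python fragment is only an illustration and not part of the text: it assumes NumPy and SciPy are available, simulates a univariate $N(\mu,\sigma^2)$ sample, and lets SciPy's BFGS quasi-Newton routine stand in for a hand-coded Newton-Raphson iteration.

\begin{verbatim}
# Illustrative sketch: numerical maximization of a log-likelihood.
# A univariate N(mu, sigma^2) sample is simulated; BFGS (a quasi-Newton
# method) is used in place of a hand-coded Newton-Raphson iteration.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)     # simulated sample

def neg_loglik(theta):
    # theta = (mu, log sigma); the log-parametrization keeps sigma > 0
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                   - 0.5 * ((x - mu) / sigma) ** 2)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # close to x.mean() and x.std()
\end{verbatim}

In this simple case the numerical optimum can be checked against the analytical MLE's, which is a useful sanity check before applying the same routine to models without closed-form solutions.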

EXAMPLE 6.1   Consider a sample $ \{x_{i}\}_{i=1}^n $ from $N_{p}(\mu,\data{I})$, i.e., from the pdf

\begin{displaymath}f(x;\theta )=\left (2\pi\right )^{-p/2}
\exp \left\{-\frac{ 1}{2}(x-\theta )^{\top}(x-\theta )\right\}\end{displaymath}

where $\theta =\mu\in\mathbb{R}^p$ is the mean vector parameter. The log-likelihood is in this case
\begin{displaymath}
\ell (\data{X};\theta )= \sum_{i=1}^n \log\{f(x_{i};\theta )\}
=\log\left\{(2\pi )^{-np/2}\right\}-\frac{1}{2} \sum_{i=1}^n
(x_{i}-\theta )^{\top} (x_{i}-\theta).
\end{displaymath} (6.3)

The term $(x_i-\theta )^{\top}(x_i-\theta )$ equals

\begin{displaymath}(x_i-\overline x)^{\top}(x_i-\overline x)
+(\overline x-\theta )^{\top}(\overline x-\theta )
+2(\overline x-\theta )^{\top} (x_i-\overline x).\end{displaymath}

Summing this term over $i=1,\ldots ,n$ and noting that the cross terms vanish (since $\sum^n_{i=1}(x_i-\overline x)=0$), we see that

\begin{displaymath}\sum^n_{i=1}(x_i-\theta )^{\top}(x_i-\theta )
=\sum ^n_{i=1}(x_i-\overline x)^{\top}(x_i-\overline x)
+n(\overline x-\theta)^{\top}(\overline x-\theta ).\end{displaymath}

Hence

\begin{displaymath}\ell(\data{X}; \theta) = \log (2\pi)^{-np/2} - \frac{1}{2} \sum^n_{i=1}
(x_i-\overline x)^{\top}(x_i-\overline x) - \frac{n}{2} (\overline x-
\theta )^{\top}(\overline x-\theta ). \end{displaymath}

Only the last term depends on $\theta$ and is obviously maximized for

\begin{displaymath}\widehat{\theta} = \widehat{\mu} = \overline x. \end{displaymath}

Thus $\overline x$ is the MLE of $\theta$ for this family of pdfs $f(x;\theta)$.
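
As a small numerical illustration of this result (not part of the proof), one can verify on simulated data that the log-likelihood (6.3) is largest at $\theta =\overline x$. The Python snippet below assumes NumPy and uses arbitrary illustrative values for $p$, $n$ and $\mu$.

\begin{verbatim}
# Illustrative check of Example 6.1: the log-likelihood of an
# N_p(mu, I) sample is maximized at the sample mean x-bar.
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 500
mu_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p)) + mu_true          # rows x_1', ..., x_n'

def loglik(theta):
    # l(X; theta) = -np/2 log(2 pi) - 1/2 sum_i (x_i - theta)'(x_i - theta)
    resid = X - theta
    return -0.5 * n * p * np.log(2 * np.pi) - 0.5 * np.sum(resid ** 2)

x_bar = X.mean(axis=0)
print(loglik(x_bar) >= loglik(x_bar + 0.1))    # True: moving away lowers l
print(loglik(x_bar) >= loglik(mu_true))        # True
\end{verbatim}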

A more complex example is the following, where we derive the MLE's for $\mu$ and $\Sigma$.

EXAMPLE 6.2   Suppose $\{x_i\}^n_{i=1}$ is a sample from a normal distribution $N_p(\mu,\Sigma)$. Here $\theta =(\mu ,\Sigma )$ with $\Sigma$ interpreted as a vector. Due to the symmetry of $\Sigma$ the unknown parameter $\theta$ is in fact $\{p+\frac{1}{2}p(p+1)\}$-dimensional. Then
\begin{displaymath}
L(\data{X};\theta ) = \vert 2\pi \Sigma \vert^{-n/2}
\exp\left\{-\frac{1}{2}\sum ^n_{i=1}
(x_i-\mu)^{\top}\Sigma ^{-1}(x_i-\mu )\right\}
\end{displaymath} (6.4)

and
\begin{displaymath}
\ell (\data{X};\theta )=-\frac{n}{2}\log\vert 2\pi \Sigma \vert
-\frac{1}{2}\sum
^n_{i=1}(x_i-\mu )^{\top}\Sigma ^{-1}(x_i-\mu ).
\end{displaymath} (6.5)

The term $(x_i-\mu )^{\top}\Sigma ^{-1}(x_i-\mu )$ equals

\begin{displaymath}(x_i-\overline x)^{\top}\Sigma^{-1}(x_i-\overline x)
+(\overline x-\mu )^{\top}\Sigma^{-1}(\overline x-\mu )
+2(\overline x-\mu )^{\top} \Sigma^{-1}(x_i-\overline x).\end{displaymath}

Summing this term over $i=1,\ldots ,n$, where again the cross terms vanish, we see that

\begin{displaymath}\sum^n_{i=1}(x_i-\mu )^{\top}\Sigma^{-1}(x_i-\mu )
=\sum ^n_{i=1}(x_i-\overline x)^{\top}\Sigma^{-1}(x_i-\overline x)
+n(\overline x-\mu)^{\top}\Sigma^{-1}(\overline x-\mu ).\end{displaymath}

Note that from (2.14)

\begin{eqnarray*}
(x_i-\overline x)^{\top}\Sigma ^{-1}(x_i-\overline x) &=&
\mathop{\hbox{tr}}\left\{(x_i-\overline x)^{\top}\Sigma ^{-1}(x_i-\overline x)\right\}\\
&=& \mathop{\hbox{tr}}\left\{\Sigma ^{-1}(x_i-\overline x)(x_i-\overline x)^{\top}\right\}.
\end{eqnarray*}



Therefore, by summing over the index $i$ we finally arrive at

\begin{eqnarray*}
\sum^n_{i=1}(x_i-\mu )^{\top}\Sigma^{-1}(x_i-\mu)&=&\mathop{\hbox{tr}}
\left\{\Sigma^{-1}\sum^n_{i=1}(x_i-\overline x)(x_i-\overline x)^{\top}
\right\}+n(\overline x-\mu )^{\top}
\Sigma^{-1}(\overline x-\mu ).
\end{eqnarray*}



Thus, using $\sum^n_{i=1}(x_i-\overline x)(x_i-\overline x)^{\top}=n\data{S}$, the log-likelihood function for $N_p(\mu,\Sigma)$ is
\begin{displaymath}
\ell (\data{X};\theta )=-\frac{n}{2}\log \vert 2\pi \Sigma \vert
-\frac{n}{2}\mathop{\hbox{tr}}\{\Sigma^{-1}\data{S}\}
-\frac{n}{2}(\overline x-\mu )^{\top}\Sigma ^{-1}
(\overline x-\mu).
\end{displaymath} (6.6)

We can easily see that the third term is maximized by $\mu =
\overline x$. In fact the MLE's are given by

\begin{displaymath}\widehat \mu =\overline x,\quad \widehat \Sigma=\data{S}.\end{displaymath}

The derivation of $\widehat\Sigma$ is a lot more complicated. It involves derivatives with respect to matrices with their notational complexities and will not be presented here; for a more elaborate proof see Mardia et al. (1979, pp. 103-104). Note that the unbiased covariance estimator $\data{S}_{u}=\frac{n}{n-1} \data{S}$ is not the MLE of $\Sigma$!
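
The following Python sketch is again only an illustration: it assumes NumPy, simulates an arbitrary bivariate normal sample, computes $\widehat\mu =\overline x$ and $\widehat\Sigma =\data{S}$ (the empirical covariance with divisor $n$), and checks numerically that the log-likelihood evaluated at $\data{S}$ is at least as large as at the unbiased estimator $\data{S}_u$.

\begin{verbatim}
# Illustrative check of Example 6.2: x-bar and S (divisor n) are the
# MLE's; the unbiased estimator S_u = n/(n-1) S gives a smaller likelihood.
import numpy as np

rng = np.random.default_rng(2)
n = 200
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=Sigma_true, size=n)

mu_hat = X.mean(axis=0)                        # MLE of mu
S = (X - mu_hat).T @ (X - mu_hat) / n          # MLE of Sigma (divisor n)
S_u = n / (n - 1) * S                          # unbiased, but not the MLE

def loglik(mu, Sigma):
    # l(X; mu, Sigma) = -n/2 log|2 pi Sigma|
    #                   - 1/2 sum_i (x_i - mu)' Sigma^{-1} (x_i - mu)
    resid = X - mu
    quad = np.einsum("ij,jk,ik->", resid, np.linalg.inv(Sigma), resid)
    return -0.5 * n * np.log(np.linalg.det(2 * np.pi * Sigma)) - 0.5 * quad

print(loglik(mu_hat, S) >= loglik(mu_hat, S_u))   # True
\end{verbatim}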

EXAMPLE 6.3   Consider the linear regression model $y_{i} = \beta^{\top}x_{i} +
\varepsilon_{i}$ for $i=1,\ldots ,n$, where the $\varepsilon_{i}$ are i.i.d. $N(0,\sigma^2)$ and where $x_{i} \in \mathbb{R}^p$. Here $\theta = (\beta^{\top},
\sigma)$ is a $(p+1)$-dimensional parameter vector. Denote

\begin{displaymath}y = \left( \begin{array}{c} y_{1} \\ \vdots \\ y_{n} \end{array} \right),\quad
\data{X} = \left( \begin{array}{c} x_{1}^{\top} \\ \vdots
\\ x_{n}^{\top} \end{array} \right). \end{displaymath}

Then

\begin{displaymath}L(y,{\data X}; \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}
\exp\left\{ -\frac{1}{2\sigma^2} (y_{i} -\beta^{\top}x_{i})^2 \right\} \end{displaymath}

and

\begin{eqnarray*}
\ell (y,{\data X};\theta) & = & \log \left( \frac{1}{(2\pi)^{n/2}\sigma^{n}} \right)
- \frac{1}{2\sigma^2}\sum_{i=1}^n (y_{i}-\beta^{\top}x_{i})^2 \\
& = & -\frac{n}{2}\log(2\pi) - n\log\sigma
- \frac{1}{2\sigma^2}(y-\data{X}\beta)^{\top}(y-\data{X}\beta) \\
& = & -\frac{n}{2}\log(2\pi) - n\log\sigma
- \frac{1}{2\sigma^2}(y^{\top}y + \beta^{\top}\data{X}^{\top} \data{X} \beta - 2\beta^{\top} \data{X}^{\top} y).
\end{eqnarray*}



Differentiating w.r.t. the parameters yields

\begin{eqnarray*}
\frac{\partial}{\partial\beta}\ell & = & -\frac{1}{2\sigma^2}
(2\data{X}^{\top}\data{X}\beta - 2\data{X}^{\top}y), \\
\frac{\partial}{\partial\sigma}\ell & = & -\frac{n}{\sigma}
+ \frac{1}{\sigma^3} \big\{ (y-\data{X}\beta)^{\top}(y-\data{X}\beta) \big \}.
\end{eqnarray*}



Note that $ \frac{\partial}{\partial\beta} \ell $ denotes the vector of the derivatives w.r.t. all components of $\beta$ (the gradient). Since setting the first equation to zero yields a condition on $\beta$ alone, we start by deriving $\widehat{\beta}$:

\begin{displaymath}\data{X}^{\top} \data{X} \widehat{\beta} = \data{X}^{\top} y
\quad\Longrightarrow\quad
\widehat{\beta} = (\data{X}^{\top}\data{X})^{-1} \data{X}^{\top}y. \end{displaymath}

Plugging $\widehat{\beta}$ into the second equation gives

\begin{displaymath}\frac{n}{\widehat{\sigma}} = \frac{1}{\widehat{\sigma}^3}
(y-\data{X}\widehat{\beta})^{\top}(y-\data{X}\widehat{\beta})
\quad\Longrightarrow\quad
\widehat{\sigma}^2 = \frac{1}{n} \vert\vert y-\data{X}\widehat{\beta}\vert\vert^2, \end{displaymath}

where $\vert\vert\bullet\vert\vert$ denotes the Euclidean vector norm from Section 2.6. We see that the MLE $\widehat{\beta}$ is identical to the least squares estimator (3.52). The variance estimator

\begin{displaymath}\widehat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n
(y_{i} - \widehat{\beta}^{\top}x_{i})^2 \end{displaymath}

is nothing else than the residual sum of squares (RSS) from (3.37), divided by $n$ and generalized to the case of multivariate $x_{i}$.
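
A minimal computational sketch of Example 6.3 (illustrative only; it assumes NumPy and uses arbitrary simulated values for $\data{X}$, $\beta$ and $\sigma$) computes the closed-form estimators $\widehat{\beta}$ and $\widehat{\sigma}^2$ directly from the formulas above.

\begin{verbatim}
# Illustrative computation of the MLE's in the normal linear model:
# beta-hat = (X'X)^{-1} X'y and sigma-hat^2 = ||y - X beta-hat||^2 / n.
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.normal(size=(n, p))                    # design matrix, rows x_i'
beta_true = np.array([1.0, -0.5, 2.0])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # least squares / ML estimator

resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n                 # RSS divided by n

print(beta_hat)     # close to beta_true
print(sigma2_hat)   # close to 0.7**2
\end{verbatim}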

Note that when the $x_{i}$ are considered to be fixed we have

\begin{displaymath}E(y)=\data{X}\beta\ \textrm{and}\ \Var(y)=\sigma^2\data{I}_n.\end{displaymath}


Then, using the properties of moments from Section 4.2 we have
\begin{displaymath}
E(\widehat{\beta})=(\data{X}^{\top}\data{X})^{-1}\data{X}^{\top}E(y)=\beta,
\end{displaymath} (6.7)


\begin{displaymath}
\Var(\widehat{\beta})=\sigma^2(\data{X}^{\top}\data{X})^{-1}.
\end{displaymath} (6.8)
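
A small Monte Carlo experiment (again only an illustration, with an arbitrary but fixed design matrix $\data{X}$) makes (6.7) and (6.8) tangible: averaging $\widehat{\beta}$ over repeated error vectors approximates $\beta$, and the empirical covariance of $\widehat{\beta}$ approximates $\sigma^2(\data{X}^{\top}\data{X})^{-1}$.

\begin{verbatim}
# Illustrative Monte Carlo check of (6.7) and (6.8) for fixed X.
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 50, 2, 1.0
X = rng.normal(size=(n, p))               # design kept fixed across replications
beta = np.array([2.0, -1.0])
XtX_inv = np.linalg.inv(X.T @ X)

estimates = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates.append(XtX_inv @ X.T @ y)   # beta-hat for this replication
estimates = np.array(estimates)

print(estimates.mean(axis=0))             # approximates beta,             (6.7)
print(np.cov(estimates.T))                # approximates sigma^2 (X'X)^-1, (6.8)
print(sigma**2 * XtX_inv)
\end{verbatim}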

Summary
$\ast$
If $ \{x_{i}\}_{i=1}^n $ is an i.i.d. sample from a distribution with pdf $f(x;\theta)$, then $L(\data{X};\theta)=\prod_{i=1}^n
f(x_{i};\theta)$ is the likelihood function. The maximum likelihood estimator (MLE) is that value of $\theta$ which maximizes $L(\data{X};\theta )$. Equivalently one can maximize the log-likelihood $\ell(\data{X};\theta)$.
$\ast$
The MLE's of $\mu$ and $\Sigma$ from a $N_{p}(\mu,\Sigma)$ distribution are $\widehat\mu=\overline{x}$ and $\widehat\Sigma=\data{S}$. Note that the MLE of $\Sigma$ is not unbiased.
$\ast$
The MLE's of $\beta$ and $\sigma$ in the linear model $y = \data{X} \beta + \varepsilon, \; \varepsilon \sim N_n(0, \sigma^2 \data{I})$ are given by the least squares estimator $\widehat{\beta} = (\data{X}^{\top}\data{X})^{-1} \data{X}^{\top}y$ and $\widehat{\sigma}^2 = \frac{1}{n} \vert\vert y-\data{X} \widehat{\beta} \vert\vert^2$. Moreover, $E(\widehat{\beta})=\beta$ and $\Var(\widehat{\beta})=\sigma^2(\data{X}^{\top}\data{X})^{-1}$.