


7.3 Estimation

Recall that the least squares estimator for the ordinary linear regression model is also the maximum-likelihood estimator in the case of normally distributed error terms. By assuming that the distribution of $ Y$ belongs to the exponential family, it is possible to derive maximum-likelihood estimates for the coefficients of a GLM. Moreover, we will see that even though the estimation requires a numerical iteration, each step of that iteration amounts to a weighted least squares fit. Since the weights vary over the iterations, the likelihood is optimized by an iteratively reweighted least squares algorithm.


7.3.1 Properties of the Exponential Family

To derive the details of the maximum-likelihood algorithm we need to discuss some properties of the probability mass or density function $ f(\bullet)$. For the sake of brevity we consider $ f$ to be a density function in the following derivation. However, the conclusions will hold for a probability mass function as well.

First, we start from the fact that $ \int f(y,\theta,\psi)\,\mathrm{d}y=1.$ Under suitable regularity conditions (it is possible to exchange differentiation and integration) this implies

$\displaystyle 0 = \frac{\partial}{\partial\theta} \int f(y,\theta,\psi)\,\mathrm{d}y = \int \frac{\partial}{\partial\theta} f(y,\theta,\psi)\,\mathrm{d}y = \int \left\{\frac{\partial}{\partial\theta} \log f(y,\theta,\psi)\right\} f(y,\theta,\psi)\,\mathrm{d}y = E\left\{\frac{\partial}{\partial\theta} \ell(y,\theta,\psi) \right\}\,,$

where $ \ell(y,\theta,\psi)=\log f(y,\theta,\psi)$ denotes the log-likelihood function. The derivative of $ \ell$ with respect to $ \theta$ is typically called the score function, for which it is known that

$\displaystyle E\left\{ \frac{\partial^2}{\partial\theta^2} \ell(y,\theta,\psi) \right\} = -E \left\{ \frac{\partial}{\partial\theta} \ell(y,\theta,\psi) \right\}^2\,.$

Applying these identities to the first and second derivatives of the log-likelihood obtained from (7.1) results in

$\displaystyle 0 = E \left\{ \frac{Y-b'(\theta)}{a(\psi)}\right\}, \quad\textrm{and}\quad -E\left\{\frac{b''(\theta)}{a(\psi)}\right\} = - E \left\{ \frac{Y-b'(\theta)}{a(\psi)}\right\}^2\,,$

such that we can conclude
$\displaystyle E(Y) = \mu = b'(\theta),$ (7.2)
$\displaystyle {\text{Var}}(Y) = V(\mu)\, a(\psi) = b''(\theta)\, a(\psi).$ (7.3)

Note that as a consequence of (7.1) the expectation of $ Y$ depends only on the parameter of interest $ \theta$. We also assume that the factor $ a(\psi)$ is identical over all observations.
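To fix ideas, relations (7.2) and (7.3) can be checked numerically. The following short Python sketch (illustrative only; the Poisson family and all numerical values are chosen ad hoc) uses that for the Poisson distribution $ a(\psi)=1$ and $ b(\theta)=\exp(\theta)$, so that $ b'(\theta)=b''(\theta)=\exp(\theta)$:

import numpy as np

# Minimal check of (7.2) and (7.3) by simulation for the Poisson family:
# a(psi) = 1, b(theta) = exp(theta), hence b'(theta) = b''(theta) = exp(theta).
rng = np.random.default_rng(0)
theta = 0.7
mu = np.exp(theta)                       # b'(theta), the claimed mean
y = rng.poisson(lam=mu, size=200_000)    # sample from the corresponding distribution

print(y.mean(), mu)                      # sample mean approximates b'(theta)              (7.2)
print(y.var(), np.exp(theta))            # sample variance approximates b''(theta)*a(psi)  (7.3)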


7.3.2 Maximum-Likelihood and Deviance Minimization

As pointed out before, the estimation method of choice for $ \boldsymbol {\beta }$ is maximum likelihood. As an alternative, the literature refers to the minimization of the deviance. The following derivation will show that both approaches are identical.

Suppose that we have observed a sample of independent pairs $ (Y_i,\boldsymbol{X}_i)$ where $ i=1,\ldots,n$. For a more compact notation denote now the vector of all response observations by $ \boldsymbol{Y}=(Y_1,\ldots,Y_n)^\top $ and their conditional expectations (given $ \boldsymbol{X}_i$) by $ \boldsymbol{\mu}=(\mu_1,\ldots,\mu_n)^\top $. Recall that we study

$\displaystyle E(Y_i\vert\boldsymbol{X}_i)= \mu_i= G(\boldsymbol{X}_i^\top \boldsymbol{\beta}) = G(\eta_i).$

The sample log-likelihood of the vector $ \boldsymbol{Y}$ is then given by

$\displaystyle \ell(\boldsymbol{Y},\boldsymbol{\mu},\psi) = \sum_{i=1}^n \ell(Y_i,\theta_i,\psi).$ (7.4)

Here $ \theta _i$ is a function of $ \eta_i=\boldsymbol{X}_i^\top \boldsymbol{\beta}$ and we use $ \ell(Y_i,\theta_i,\psi)=\log f(Y_i,\theta_i,\psi)$ to denote the individual log-likelihood contributions for all observations $ i$.

Example 5 (Normal log-likelihood)  
For normal responses $ Y_i\sim N(\mu_i,\sigma^2)$ we have $ \ell(Y_i,\theta_i,\psi)= -(Y_i-\mu_i)^2/(2\sigma^2)- \log\left(\sqrt{2\pi}\sigma\right).$ This gives the sample log-likelihood

$\displaystyle \ell(\boldsymbol{Y},\boldsymbol{\mu},\sigma)= n\log\left(\frac{1}{\sqrt{2\pi}\sigma}\right) -\frac {1}{2\sigma^2} \sum_{i=1}^n (Y_i-\mu_i)^2.$ (7.5)

Obviously, maximizing this log-likelihood is equivalent to minimizing the least squares criterion.

Example 6 (Bernoulli log-likelihood)  
The calculation in Example 3 shows that the individual log-likelihoods for the binary responses equal $ \ell(Y_i,\theta_i,\psi)=Y_i\log(\mu_i)+(1-Y_i)\log(1-\mu_i)$. This leads to

$\displaystyle \ell(\boldsymbol{Y},\boldsymbol{\mu},\psi)=\sum_{i=1}^n \left\{Y_i\log(\mu_i)+(1-Y_i)\log(1-\mu_i)\right\}$ (7.6)

for the sample version.
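As a quick numerical cross-check (a Python sketch with made-up values of $ \mu_i$, not taken from the text), the sample version (7.6) coincides with the sum of the individual Bernoulli log-probabilities:

import numpy as np
from scipy.stats import bernoulli

# Evaluate the Bernoulli sample log-likelihood (7.6) and compare it with
# the sum of individual log-probabilities computed by scipy.
y = np.array([1, 0, 0, 1, 1])
mu = np.array([0.8, 0.3, 0.4, 0.6, 0.9])          # hypothetical fitted probabilities mu_i

loglik = np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))
print(loglik, bernoulli.logpmf(y, mu).sum())      # both values agree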

The deviance provides an alternative objective function for the optimization. Let us first introduce the scaled deviance, defined as

$\displaystyle D(\boldsymbol{Y},\boldsymbol{\mu},\psi) = 2 \left\{ \ell(\boldsymbol{Y},\boldsymbol{\mu}^{\max},\psi) -\ell(\boldsymbol{Y},\boldsymbol{\mu},\psi) \right\}.$ (7.7)

Here $ \boldsymbol{\mu}^{\max}$ (which typically equals $ \boldsymbol{Y}$) is the vector that maximizes the log-likelihood of the saturated model, i.e. the function $ \ell(\boldsymbol{Y},\boldsymbol{\mu},\psi)$ without imposing any restriction on $ \boldsymbol{\mu}$. Since the term $ \ell(\boldsymbol{Y},\boldsymbol{\mu}^{\max},\psi)$ does not depend on the parameter $ \boldsymbol {\beta }$, we see that minimizing the scaled deviance is indeed equivalent to maximizing the sample log-likelihood (7.4).

If we now plug the exponential family form (7.1) into (7.4), we obtain

$\displaystyle \ell(\boldsymbol{Y},\boldsymbol{\mu},\psi) = \sum_{i=1}^n \left\{\frac{Y_i\theta_i-b(\theta_i)}{a(\psi)} - c(Y_i,\psi)\right\}.$ (7.8)

Obviously, neither $ a(\psi)$ nor $ c(Y_i,\psi)$ depend on the unknown parameter vector $ \boldsymbol {\beta }$. Therefore, it is sufficient to consider

$\displaystyle \sum_{i=1}^n \left\{Y_i\theta_i-b(\theta_i)\right\}$ (7.9)

for the maximization. The deviance analog of (7.9) is the (non-scaled) deviance function

$\displaystyle D(\boldsymbol{Y},\boldsymbol{\mu}) = D(\boldsymbol{Y},\boldsymbol{\mu},\psi)\,a(\psi).$ (7.10)

The (non-scaled) deviance $ D(\boldsymbol{Y},\boldsymbol{\mu})$ can be seen as the GLM equivalent of the residual sum of squares (RSS) in linear regression as it compares the log-likelihood $ \ell$ for the ``model'' $ \mu$ with the maximal achievable value of $ \ell$.
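For instance, in the Poisson case (with $ a(\psi)=1$ and $ \boldsymbol{\mu}^{\max}=\boldsymbol{Y}$) the non-scaled deviance (7.10) becomes $ D(\boldsymbol{Y},\boldsymbol{\mu}) = 2\sum_{i=1}^n \left\{Y_i\log(Y_i/\mu_i) - (Y_i-\mu_i)\right\}$, using the convention $ 0\cdot\log 0 = 0$. A small Python sketch of this (with arbitrary illustrative values) is:

import numpy as np

# Non-scaled Poisson deviance, the GLM analogue of the residual sum of squares.
def poisson_deviance(y, mu):
    # convention: the term y*log(y/mu) is zero when y = 0
    term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

y  = np.array([0, 2, 5, 1, 3])
mu = np.array([0.5, 1.8, 4.2, 1.1, 3.5])   # hypothetical fitted means
print(poisson_deviance(y, mu))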


7.3.3 Iteratively Reweighted Least Squares Algorithm

We will now minimize the deviance with respect to $ \boldsymbol {\beta }$. If we denote the gradient of (7.10) by

$\displaystyle \nabla(\boldsymbol{\beta}) =\frac{\partial}{\partial\boldsymbol{\beta}} D(\boldsymbol{Y},\boldsymbol{\mu}) = -2\sum_{i=1}^n \left\{Y_i-b'(\theta_i)\right\} \frac{\partial}{\partial\boldsymbol{\beta}} \theta_i\,,$ (7.11)

our optimization problem consists in solving

$\displaystyle \nabla(\boldsymbol{\beta}) = 0.$ (7.12)

Note that this is (in general) a nonlinear system of equations in $ \boldsymbol {\beta }$ and an iterative solution has to be computed. The smoothness of the link function allows us to compute the Hessian of $ D(\boldsymbol{Y},\boldsymbol{\mu})$, which we denote by $ {\mathcal{H}}(\boldsymbol{\beta})$. Now a Newton-Raphson algorithm can be applied which determines the optimal $ \widehat{\boldsymbol{\beta}}$ using the following iteration steps:

$\displaystyle \widehat{\boldsymbol{\beta}}^{\text{new}} = \widehat{\boldsymbol{\beta}}^{\text{old}} - \left\{{\mathcal{H}}\left(\widehat{\boldsymbol{\beta}}^{\text{old}}\right) \right\}^{-1} \nabla\left(\widehat{\boldsymbol{\beta}}^{\text{old}}\right).$

A variant of the Newton-Raphson method is the Fisher scoring algorithm, which replaces the Hessian by its expectation with respect to the observations $ Y_i$:

$\displaystyle \widehat{\boldsymbol{\beta}}^{\text{new}} = \widehat{\boldsymbol{\beta}}^{\text{old}} - \left\{E{\mathcal{H}}\left(\widehat{\boldsymbol{\beta}}^{\text{old}}\right) \right\}^{-1} \nabla\left(\widehat{\boldsymbol{\beta}}^{\text{old}}\right).$

To find simpler representations for these iterations, recall that we have $ \mu_i=G(\eta_i)=G(\boldsymbol{X}_i^\top \boldsymbol{\beta}) =b'(\theta_i)$. Taking derivatives with respect to $ \boldsymbol {\beta }$ on both sides of the last equality implies

$\displaystyle b''(\theta_i)\frac{\partial}{\partial\boldsymbol{\beta}} \theta_i = G'(\boldsymbol{X}_i^\top \boldsymbol{\beta})\, \boldsymbol{X}_i.$

Using that $ b''(\theta_i)=V(\mu_i)$ as established in (7.3) and taking derivatives again, we finally obtain
$\displaystyle \frac{\partial}{\partial\boldsymbol{\beta}} \theta_i = \frac{G'(\eta_i)}{V(\mu_i)} \, \boldsymbol{X}_i\,, \qquad \frac{\partial^2}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}^\top } \theta_i = \frac{G''(\eta_i)V(\mu_i) - G'(\eta_i)^2V'(\mu_i)}{V(\mu_i)^2} \, \boldsymbol{X}_i \boldsymbol{X}_i^\top .$

From this we can express the gradient and the Hessian of the deviance by
$\displaystyle \nabla(\boldsymbol{\beta}) = -2 \sum_{i=1}^n \left\{Y_i-\mu_i\right\} \frac{G'(\eta_i)}{V(\mu_i)} \, \boldsymbol{X}_i\,,$
$\displaystyle {\mathcal{H}}(\boldsymbol{\beta}) = 2 \sum_{i=1}^n \left[\frac{G'(\eta_i)^2}{V(\mu_i)} - \{Y_i-\mu_i\}\,\frac{G''(\eta_i)V(\mu_i) - G'(\eta_i)^2V'(\mu_i)}{V(\mu_i)^2} \right]\boldsymbol{X}_i \boldsymbol{X}_i^\top .$

Since $ E(Y_i-\mu_i\vert\boldsymbol{X}_i)=0$, the expectation of $ {\mathcal{H}}(\boldsymbol{\beta})$ used in the Fisher scoring algorithm equals

$\displaystyle E{\mathcal{H}}(\boldsymbol{\beta}) = 2 \sum_{i=1}^n \left\{\frac{G'(\eta_i)^2}{V(\mu_i)}\right\}\boldsymbol{X}_i \boldsymbol{X}_i^\top .$

Let us consider only the Fisher scoring algorithm for the moment. We define the weight matrix

$\displaystyle {\boldsymbol{W}} = {\text{diag}}\left( \frac{G'(\eta_1)^2}{V(\mu_1)},\ldots, \frac{G'(\eta_n)^2}{V(\mu_n)}\right)$

and the vectors $ \widetilde{\boldsymbol{Y}}=(\widetilde{Y}_1,\ldots,\widetilde{Y}_n)^\top $, $ \boldsymbol{Z}=(Z_1,\ldots,Z_n)^\top $ by

$\displaystyle \widetilde{Y}_i = \frac{Y_i-\mu_i}{G'(\eta_i)}\,,\quad Z_i= \eta_i + \widetilde{Y}_i = \boldsymbol{X}_i^\top \boldsymbol{\beta}^{\text{old}} + \frac{Y_i-\mu_i}{G'(\eta_i)}\,.$

Denote further by $ {\boldsymbol{X}}$ the design matrix with rows $ \boldsymbol{X}_i^\top$. Then, the Fisher scoring iteration step for $ \boldsymbol {\beta }$ can be rewritten as

$\displaystyle \boldsymbol{\beta}^{\text{new}}=\boldsymbol{\beta}^{\text{old}}+ ({\boldsymbol{X}}^\top {\boldsymbol{W}}{\boldsymbol{X}})^{-1} {\boldsymbol{X}}^\top {\boldsymbol{W}}\widetilde{\boldsymbol{Y}} = ({\boldsymbol{X}}^\top {\boldsymbol{W}}{\boldsymbol{X}})^{-1} {\boldsymbol{X}}^\top {\boldsymbol{W}}{\boldsymbol{Z}}\,.$ (7.13)

This immediately shows that each Fisher scoring iteration step is the result of a weighted least squares regression of the adjusted dependent variables $ Z_i$ on the explanatory variables $ \boldsymbol{X}_i$. Since the weights are recalculated in each step we speak of the iteratively reweighted least squares (IRLS) algorithm. For the Newton-Raphson algorithm a representation equivalent to (7.13) can be found; only the weight matrix $ {\boldsymbol{W}}$ differs.

The iteration will be stopped when the parameter estimate and/or the deviance do not change significantly anymore. We denote the final parameter estimate by $ \widehat{\boldsymbol{\beta}}$.
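To illustrate the algorithm, here is a minimal Python sketch of the IRLS iteration (7.13) for the logit model, i.e. the canonical link with $ G(\eta)=1/(1+e^{-\eta})$ and $ G'(\eta)=V(\mu)=\mu(1-\mu)$, so that the weights reduce to $ \mu_i(1-\mu_i)$. The simulated data and the chosen stopping tolerance are for illustration only.

import numpy as np

def irls_logit(X, y, max_iter=25, tol=1e-8):
    """Fisher scoring / IRLS step (7.13) for the logit model (canonical link)."""
    n, p = X.shape
    beta = np.zeros(p)                           # simple starting value
    for _ in range(max_iter):
        eta = X @ beta                           # eta_i = X_i' beta
        mu = 1.0 / (1.0 + np.exp(-eta))          # mu_i = G(eta_i)
        w = mu * (1.0 - mu)                      # W_ii = G'(eta_i)^2 / V(mu_i) = mu_i (1 - mu_i)
        z = eta + (y - mu) / w                   # adjusted dependent variable Z_i
        # weighted least squares step: beta_new = (X'WX)^{-1} X'WZ
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new                      # stop when the estimate no longer changes
        beta = beta_new
    return beta

# illustration on simulated data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
beta_true = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
print(irls_logit(X, y))                          # should be close to beta_true

Note that in this canonical-link example the term $ G''(\eta_i)V(\mu_i)-G'(\eta_i)^2V'(\mu_i)$ in $ {\mathcal{H}}(\boldsymbol{\beta})$ vanishes, so Newton-Raphson and Fisher scoring coincide here.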

7.3.4 Remarks on the Algorithm

Let us first note two special cases for the algorithm:

There are several further remarks on the algorithm which concern in particular starting values and the computation of relevant terms for the statistical analysis:

7.3.5 Model Inference

The resulting estimator $ \widehat{\boldsymbol{\beta}}$ has an asymptotic normal distribution (except for the normal linear regression case, in which the distribution is exactly normal).

Theorem 1  
Under regularity conditions we have for the estimated coefficient vector

$\displaystyle \sqrt{n} (\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \to N(0,{\boldsymbol{\mathit\Sigma}}) \quad \textrm{ as }n\to\infty\,.$    

As a consequence, the scaled deviance and the log-likelihood approximately satisfy $ D(\boldsymbol{Y},\widehat{\boldsymbol{\mu}},\psi)\sim \chi^2_{n-p}$ and $ 2\{\ell(\boldsymbol{Y},\widehat{\boldsymbol{\mu}},\psi)-\ell(\boldsymbol{Y},{\boldsymbol{\mu}},\psi)\}\sim\chi^2_{p}$.

For details on the necessary conditions see for example [12]. Note also that the asymptotic covariance $ {\boldsymbol{\mathit{\Sigma}}}$ for the coefficient estimator $ \widehat{\boldsymbol{\beta}}$ is the inverse of the Fisher information matrix, i.e.

$\displaystyle {\boldsymbol{I}}= - E\left\{\frac{\partial^2}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}^\top}\ell(Y,\mu,\psi) \right\}\,.$

Since $ {\boldsymbol{I}}$ can be estimated by the negative Hessian of the log-likelihood or its expectation, this suggests the estimator

$\displaystyle \widehat{{\boldsymbol{\mathit{\Sigma}}}} = a(\widehat{\psi})\left[\frac{1}{n}\sum_{i=1}^n \left\{\frac{G'(\widehat{\eta}_{i,\text{last}})^2}{V(\widehat{\mu}_{i,\text{last}})} \right\}\boldsymbol{X}_i \boldsymbol{X}_i^\top \right]^{-1}\,.$

Using the estimated covariance we are able to test hypotheses about the components of $ \boldsymbol {\beta }$.
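Continuing the logit sketch above (again purely for illustration, with $ a(\psi)=1$ for the Bernoulli case), the estimated covariance yields standard errors and Wald statistics for the individual coefficients; note that $ ({\boldsymbol{X}}^\top{\boldsymbol{W}}{\boldsymbol{X}})^{-1}$ estimates the covariance of $ \widehat{\boldsymbol{\beta}}$ itself, i.e. $ \widehat{{\boldsymbol{\mathit{\Sigma}}}}/n$ in the notation of Theorem 1.

import numpy as np
from scipy.stats import norm

def wald_table(X, beta_hat):
    """Standard errors and Wald statistics from (X'WX)^{-1} for the logit fit."""
    mu = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    w = mu * (1.0 - mu)                              # weights at the final iteration
    cov = np.linalg.inv(X.T @ (w[:, None] * X))      # estimated covariance of beta_hat
    se = np.sqrt(np.diag(cov))
    z = beta_hat / se                                # Wald statistic for H0: beta_j = 0
    p_value = 2.0 * norm.sf(np.abs(z))
    return se, z, p_value

# usage with the fit from the previous sketch:
# se, z, p_value = wald_table(X, irls_logit(X, y))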

For model choice between two nested models a likelihood ratio test (LR test) is used. Assume that $ {\mathcal{M}}_0$ (with $ p_0$ parameters) is a submodel of the model $ {\mathcal{M}}$ (with $ p$ parameters) and that we have estimated them as $ \widehat{\boldsymbol{\mu}}_0$ and $ \widehat{\boldsymbol{\mu}}$. For one-parameter exponential families (without a nuisance parameter $ \psi $) we use that asymptotically

$\displaystyle D(\boldsymbol{Y},\boldsymbol{\mu}_0) - D(\boldsymbol{Y},\boldsymbol{\mu}) \sim \chi^2_{p-p_0}.$ (7.16)

The left hand side of (7.16) is indeed a function of the ratio of the two likelihoods: the deviance difference equals minus twice the difference of the log-likelihoods. In a two-parameter exponential family ($ \psi $ is to be estimated) one can approximate the likelihood ratio test statistic by

$\displaystyle \frac{(n-p) \{D(\boldsymbol{Y},\boldsymbol{\mu}_0) - D(\boldsymbol{Y},\boldsymbol{\mu})\}}{(p-p_0)\,D(\boldsymbol{Y},\boldsymbol{\mu}) } \sim F_{p-p_0,\,n-p}$ (7.17)

using the analogy to the normal linear regression case ([36], Chap. 7).
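Both nested-model tests can be written down directly; the following Python sketch (with deviance values supplied by the user) evaluates (7.16) and the F approximation (7.17):

from scipy.stats import chi2, f

def lr_test(D0, D, p, p0):
    """One-parameter family: deviance difference and chi^2_{p-p0} p-value, cf. (7.16)."""
    stat = D0 - D
    return stat, chi2.sf(stat, p - p0)

def f_test(D0, D, n, p, p0):
    """Nuisance parameter psi estimated: F approximation, cf. (7.17)."""
    stat = (n - p) * (D0 - D) / ((p - p0) * D)
    return stat, f.sf(stat, p - p0, n - p)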

Model selection procedures for possibly non-nested models can be based on Akaike's information criterion [3]

$\displaystyle AIC=D(\boldsymbol{Y},\widehat{\boldsymbol{\mu}},\widehat{\psi}) + 2p,$

or Schwarz' Bayes information criterion [32]

$\displaystyle BIC=D(\boldsymbol{Y},\widehat{\boldsymbol{\mu}},\widehat{\psi}) + \log(n)p,$

where again $ p$ denotes the number of estimated parameters. For a general overview on model selection techniques see also Chap. III.1 of this handbook.
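Both criteria are straightforward to compute once the deviance of a fitted model is available; a minimal Python sketch (matching the deviance-based definitions above) is:

import numpy as np

def aic(deviance, p):
    """Akaike's information criterion, D + 2p."""
    return deviance + 2 * p

def bic(deviance, n, p):
    """Schwarz' Bayes information criterion, D + log(n) p."""
    return deviance + np.log(n) * p

# smaller values of AIC or BIC indicate the preferred model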

