7.5 Statistical Evaluation and Presentation

To assess the estimated model it is useful to check the significance of single parameter values or of linear combinations of parameters. To compare two nested models, a likelihood ratio test can be performed. Last but not least, an optimal submodel can be chosen by model selection via Akaike's AIC or Schwarz' BIC.


7.5.1 Statistical Characteristics


stat = glmstat (code, x, y, b, bv {, opt})
computes statistical characteristics for an estimated GLM

The functions doglm and glmest provide a number of statistical characteristics of the estimated model in the output component stat. Alternatively, the function glmstat can be used to compute the above-mentioned statistics by hand. Suppose we have input x, y and have estimated the vector of coefficients b with covariance bv for the model "nopow". Then the list of statistics is obtained by

  stat=glmstat("nopow",x,y,b,bv)
Of course, a list of options opt can be added. If options from opt were used for the estimation, the same options should be passed to glmstat .

The following characteristics are contained in the output stat, which is itself a list with the components

df
degrees of freedom (typically sample size minus number of estimated parameters).
deviance
the deviance of the estimated model.
pearson
the Pearson statistic.
loglik
the log-likelihood of the estimated model, using the estimated dispersion parameter.
dispersion
an estimate for the dispersion parameter (deviance/df).
aic, bic
Akaike's AIC and Schwarz' BIC criterion, respectively.
r2, adr2
the (pseudo) coefficient of determination and its adjusted version, respectively.
it
the number of iterations needed.
ret
the return code, which is 0 if everything went without problems, 1 if the maximal number of iterations was reached, and -1 if missing values were encountered. In the latter case, the parameter estimates and their covariance come from the penultimate iteration step.
nr
the number of replicated observations in x, if they were searched for.
Sometimes a statistic may not be available because it is not applicable. Which statistics are present can always be checked by listing the components of stat:
  names(stat)
The function names will report all components of the list stat.
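To make the scalar components concrete, here is a minimal Python sketch (NumPy only; glm_stats is a hypothetical helper of ours, not part of XploRe) that computes df, dispersion, AIC and BIC from a given log-likelihood and deviance, using the standard definitions:

```python
import numpy as np

def glm_stats(loglik, deviance, n, p):
    """Mimic a few components of glmstat's output list (illustrative only)."""
    df = n - p                                 # degrees of freedom
    return {
        "df": df,
        "deviance": deviance,
        "dispersion": deviance / df,           # moment estimate deviance/df
        "loglik": loglik,
        "aic": -2.0 * loglik + 2.0 * p,        # Akaike's criterion
        "bic": -2.0 * loglik + np.log(n) * p,  # Schwarz' criterion
    }

stats = glm_stats(loglik=-100.0, deviance=42.0, n=50, p=4)
print(sorted(stats))  # list the available components, like names(stat)
```

The numbers fed to glm_stats are made up; in practice, loglik and deviance come from the fitted model.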


7.5.2 Output Display


glmout (code, x, y, b, bv, stat {, opt})
creates a nice output display for an estimated GLM

Recall the example XLGglm03.xpl , which estimated the lizard data. The last line of this quantlet creates the output display shown in Figure 7.1:

  glmout("bilo",x,y,g.b,g.bv,g.stat,opt)

Figure 7.1: Output display for the lizard example.

Note that the option list opt should also be given to glmout to adjust the resulting estimated curve by the weights. In the binomial case (as in our lizard example), the right panel shows the predicted probabilities. For all other distributions, the estimated regression function (the index $ x^T\beta$ vs. $ G(x^T\beta)$) is shown.


7.5.3 Significance of Parameters

Let us continue with the lizard example from Subsection 7.5.2. As a result of the estimation, we have an output list g containing the components g.b (the estimated parameter vector), g.bv (the estimated covariance of g.b), and g.stat (containing the statistics).

The significance of coefficients can be measured by a $ t$-test. To obtain $ t$-values and $ p$-values, simply calculate:

  tvalue=g.b/sqrt(xdiag(g.bv))
  pvalue=2.*cdfn(-abs(tvalue))
XLGglm05.xpl

The function xdiag extracts the diagonal of a square matrix, and sqrt takes the square root. The function cdfn (or cdft for small samples) provides the cumulative distribution function of the Gaussian distribution ($ t$-distribution). In our running example, the latter instruction yields
  Content of object pvalue
  [1,] 1.2155e-06
  [2,] 1.0972e-05
  [3,] 0.0003059
  [4,] 0.0085538
  [5,] 0.3639
  [6,] 0.013712
which means that all except the 5th coefficient are significant (at the 5% level).
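For readers who want to reproduce this computation outside XploRe, the two instructions translate almost literally to Python. The sketch below uses NumPy and the standard library's error function; b and bv are made-up numbers for illustration, not the lizard estimates:

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function (plays the role of cdfn)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# made-up estimates and covariance, for illustration only
b  = np.array([1.2, -0.8, 0.05])
bv = np.diag([0.04, 0.09, 0.25])

tvalue = b / np.sqrt(np.diag(bv))                             # coefficient / std. error
pvalue = np.array([2.0 * norm_cdf(-abs(t)) for t in tvalue])  # two-sided p-values
print(pvalue)
```

As in the XploRe version, small p-values flag significant coefficients; here the third coefficient (t-value 0.1) would clearly not be significant.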

For linear hypotheses on the parameters, a Wald test can be used. Suppose we want to test if $ 2 \beta_1=\beta_2$. This can be written as $ Ab=a$ with $ A=(2,-1,0,...,0)$. Hence, define

  A=2 ~ (-1) ~ 0.*matrix(rows(g.b)-2)'
  a=0
  W=(A*g.b-a)'*inv(A*g.bv*A')*(A*g.b-a)
  pvalue=1-cdfc(W,1)
XLGglm05.xpl

W denotes the test statistic, which has an asymptotic $ \chi^2$ distribution with rank($ A$) degrees of freedom. The $ p$-value for this test problem is calculated by means of cdfc, the cumulative distribution function of the $ \chi^2$ distribution; the rank of the matrix A is 1 in this case. For our running example, this produces a $ p$-value of
  Content of object pvalue
  [1,] 0.033318
i.e. the hypothesis $ 2 \beta_1=\beta_2$ is rejected at the 5% level (the $ p$-value is 3.3318%).
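The Wald computation can be sketched in Python as well (NumPy plus the standard library's error function; the estimates b and bv are made-up numbers for illustration). Note the matrix inverse of $ A \,\widehat{\mathrm{Cov}}(b)\, A^T$ at the heart of the statistic:

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# made-up estimates and covariance, for illustration only
b  = np.array([0.5, 1.2, -0.3])
bv = 0.01 * np.eye(3)

A = np.array([[2.0, -1.0, 0.0]])  # H0: 2*beta_1 = beta_2, i.e. A b = a
a = np.array([0.0])

r = A @ b - a
W = float(r @ np.linalg.inv(A @ bv @ A.T) @ r)  # Wald statistic
# chi^2 survival function with rank(A) = 1 degree of freedom
pvalue = 2.0 * (1.0 - norm_cdf(sqrt(W)))
print(W, pvalue)
```

The closed form for the p-value exploits that a $ \chi^2_1$ variable is a squared standard normal; for more than one restriction, a general $ \chi^2$ survival function would be needed instead.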


7.5.4 Likelihood Ratio Tests for Comparing Nested Models


{lr,alpha} = glmlrtest (loglik0,dim0,loglik1,dim1)
computes a likelihood ratio test for two nested GLMs on the basis of the $ \chi^2$ distribution

Suppose now we have two (nested) models estimated and obtained two estimation results c (the smaller model) and g (the larger model). To compare both models, one needs to calculate the likelihood ratio test statistic.

In some cases, the distribution of this test statistic can be derived exactly. Otherwise, the (negative) doubled logarithm of the likelihood ratio can be computed, which has an asymptotic $ \chi^2$ distribution. In this case, the test statistic lr and the $ p$-value pvalue can be obtained from glmlrtest . Recall the lizard example, where we estimated the full model g and the constrained model c. Now we determine if the difference between both models is significant. Computing

  lc=c.stat.loglik
  lg=g.stat.loglik
  pc=rows(c.b)
  pg=rows(g.b)
  {lr,pvalue}=glmlrtest(lc,pc,lg,pg)
XLGglm05.xpl

gives
  Contents of pvalue
  [1,]  0.37944
i.e. there is no statistically significant difference between the two models.
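The same computation is easy to write down directly; here is a Python sketch with made-up log-likelihoods (not the lizard values). It uses the fact that for two degrees of freedom, as in this made-up example, the $ \chi^2$ survival function has the closed form $ e^{-x/2}$:

```python
from math import exp

# made-up log-likelihoods and dimensions of two nested fits
lc, pc = -105.3, 3  # smaller (constrained) model
lg, pg = -104.1, 5  # larger (full) model

lr = 2.0 * (lg - lc)  # likelihood ratio statistic, asymptotically chi^2
df = pg - pc          # degrees of freedom of the test
# for df = 2 the chi^2 survival function is exp(-x/2)
pvalue = exp(-lr / 2.0)
print(lr, pvalue)
```

A p-value above the chosen level (here about 0.30) would, as in the lizard example, indicate no significant difference between the two models.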


7.5.5 Subset Selection


select = glmselect (code, x, y {,opt})
performs a complete search model selection by choosing the best of all subset models with respect to the AIC or BIC criterion
select = glmforward (code, x, y {,opt})
performs a forward search model selection with respect to the AIC or BIC criterion
select = glmbackward (code, x, y {,opt})
performs a backward search model selection with respect to the AIC or BIC criterion
The model selection functions glmselect , glmforward and glmbackward have essentially the same syntax as glmest . Additionally, the optional parameters shm (show the model selection in progress), crit ("aic" or "bic" for the Akaike or Schwarz criterion) and fix (columns of x to be held fixed) are recognized.

In the following we generate a $ 200\times 4$ matrix x and a response y which only depends on the first two columns of x:

  randomize(0)
  n=200
  b=1|2|0|0
  p=rows(b)
  x=normal(n,p)
  y=x*b+normal(n)
XLGglm06.xpl

Now add a constant column to x and set the options shm and fix. The optional parameter fix is set such that the first two columns of x (the constant and the first explanatory variable) are always in the model. The last line of the following quantlet
  x=matrix(n)~x
  opt=glmopt("shm",1,"fix",1|2)
  g=glmselect("noid",x,y,opt)
XLGglm06.xpl

now starts the model selection. With many possible submodels, this can take a while. In our case we have 3 free variables (we fixed two out of five), hence the total number of models to estimate is 7. The output list g consists of 5 components:
best
the 5 best models.
bestcrit
a list containing bestcrit.aic and bestcrit.bic, the Akaike and the Schwarz criteria for the 5 best models.
bestord
the best models of each order.
bestordcrit
like bestcrit, but for the best model of each order.
bestfit
containing bestfit.b, bestfit.bv and bestfit.stat, the estimation results for the best model.

Hence, g.best will display the five best models in our example. The contents of g.best read columnwise:

  Content of object g.best
  [1,]    1   1  1   1   1
  [2,]    2   2  2   2   2
  [3,]    3   3  3   3   0
  [4,]    0   0  4   4   4
  [5,]    0   5  0   5   0
Those components which are not in a submodel are indicated by the value 0. Hence the model selection procedure indeed found that the last two columns of x do not explain y.
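The exact search strategy of glmselect is internal to XploRe, but a complete search over the free columns is simple to sketch. The following Python example (NumPy only; the aic helper is ours) mirrors the setup above for a Gaussian ("noid") model: it generates the same kind of data, fixes the constant and the first explanatory variable, and scans all subsets of the remaining columns by AIC:

```python
import numpy as np
from itertools import chain, combinations

rng = np.random.default_rng(0)
n = 200
beta = np.array([1.0, 2.0, 0.0, 0.0])  # only the first two columns matter
X = rng.standard_normal((n, beta.size))
y = X @ beta + rng.standard_normal(n)
X = np.column_stack([np.ones(n), X])   # prepend a constant column

fixed = [0, 1]                         # always in the model (constant, 1st regressor)
free = [2, 3, 4]                       # candidates for selection

def aic(cols):
    """Gaussian AIC: n*log(RSS/n) + 2*(number of parameters)."""
    Z = X[:, cols]
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(np.sum((y - Z @ coef) ** 2))
    return n * np.log(rss / n) + 2.0 * len(cols)

subsets = chain.from_iterable(combinations(free, k) for k in range(len(free) + 1))
best = min((fixed + list(s) for s in subsets), key=aic)
print(best)  # the selected columns; 0 is the constant
```

Since the response depends strongly on column 2 (the second explanatory variable), any reasonable criterion keeps it in the selected model, while the two irrelevant columns are usually dropped.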

The functions glmforward and glmbackward have the same functionality as glmselect , except that they perform a forward or backward search, respectively.