For further reading on GLM we refer to the textbooks of [11], [27] and [19] (the latter with a special focus on STATA). [36, Chap. 7] and [15] present the topic of generalized linear models in a very compact form. [7], [2], [9], and [5] are standard references for analyzing categorical responses. We recommend the monographs of [13] and [25] for a detailed introduction to GLM with a focus on multivariate, longitudinal and spatial data. In the following sections we will briefly review some specific variants and enhancements of the GLM.
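Prior weights $w_i$ can be incorporated into the generalized linear model by considering the exponential density in the form
$$f(y_i,\theta_i,\psi) = \exp\left\{\frac{w_i\,\{y_i\theta_i - b(\theta_i)\}}{a(\psi)} + c(y_i,\psi,w_i)\right\}.$$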
In the simplest case the weights $w_i$ are 0 or 1, which is what one needs to exclude specific observations from the estimation. The typical case of applying weights is that of repeated independent realizations.
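The role of weights for repeated independent realizations can be illustrated with a small sketch. It assumes a Poisson model (so the dispersion term equals one and the term $c(\cdot)$ does not depend on the mean), and the data, group labels and candidate means are hypothetical. Under these assumptions, the log-likelihood kernel of the raw replicates coincides with the weighted kernel of the group averages when the weight is the number of replicates.

```python
import math

def poisson_kernel(y, mu, w=1.0):
    """Weighted Poisson log-likelihood kernel w * (y * log(mu) - mu)."""
    return w * (y * math.log(mu) - mu)

# Hypothetical raw data: repeated independent realizations in two groups.
raw = {"a": [2, 3, 1, 4], "b": [7, 5]}
mu = {"a": 2.5, "b": 6.0}  # arbitrary candidate means, one per group

# Kernel evaluated on every raw replicate (all weights equal to 1).
ll_raw = sum(poisson_kernel(y, mu[g]) for g, ys in raw.items() for y in ys)

# Same kernel from the group averages, weighted by the group sizes.
ll_weighted = sum(
    poisson_kernel(sum(ys) / len(ys), mu[g], w=len(ys))
    for g, ys in raw.items()
)

# Multiplying by a 0/1 weight excludes group "b" from the criterion.
keep = {"a": 1, "b": 0}
ll_only_a = sum(
    poisson_kernel(sum(ys) / len(ys), mu[g], w=len(ys) * keep[g])
    for g, ys in raw.items()
)
```

Since both criteria agree for every choice of the means, maximizing either one gives the same estimate.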
Overdispersion may occur in one-parameter exponential families where the variance is supposed to be a function of the mean. This concerns in particular the binomial or Poisson families where we have $E(Y)=\mu$ and $\mathrm{Var}(Y)=\mu(1-\mu/k)$ (with $k$ the number of binomial trials) or $\mathrm{Var}(Y)=\mu$, respectively. Overdispersion means that the variance actually observed in the data is larger than the variance imposed by the model. The source for this may be a lack of independence in the data or a misspecification of the model. One possible approach is to use alternative models that allow for a nuisance parameter in the variance; an example is the negative binomial instead of the Poisson distribution. For detailed discussions on overdispersion see [7] and [1].
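A common informal check for overdispersion compares the Pearson statistic with its degrees of freedom. The following minimal sketch does this for an intercept-only Poisson model; the counts are hypothetical and the function name is chosen only for illustration. A ratio clearly above one suggests that the observed variance exceeds the Poisson variance $\mu$.

```python
def pearson_dispersion(counts):
    """Pearson chi-square divided by (n - 1) for an intercept-only Poisson model."""
    n = len(counts)
    mu_hat = sum(counts) / n  # Poisson ML estimate of the common mean
    # Pearson statistic with the Poisson variance function V(mu) = mu.
    chi2 = sum((y - mu_hat) ** 2 / mu_hat for y in counts)
    return chi2 / (n - 1)

# Hypothetical counts with many zeros and a few large values.
counts = [0, 1, 0, 12, 3, 0, 9, 1, 0, 14]
phi_hat = pearson_dispersion(counts)  # roughly 7.6 for these counts
```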
Let us remark that if the distribution of $Y$ itself is unknown but its first two moments can be specified, the quasi-likelihood function may replace the log-likelihood function. This means we still assume that $E(Y)=\mu$ and $\mathrm{Var}(Y)=a(\psi)\,V(\mu)$.
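To make the idea concrete, the sketch below uses the standard quasi-likelihood definition $Q(y,\mu)=\int_y^{\mu}(y-s)/V(s)\,ds$ with the dispersion set to one (an assumption made here for simplicity). For the variance function $V(\mu)=\mu$ the integral can be compared with the Poisson log-likelihood difference $y\log(\mu/y)-(\mu-y)$. The helper name and the midpoint rule are illustrative choices only.

```python
import math

def quasi_likelihood(y, mu, variance, steps=20000):
    """Midpoint-rule approximation of the integral of (y - s) / V(s) for s from y to mu."""
    h = (mu - y) / steps
    total = 0.0
    for k in range(steps):
        s = y + (k + 0.5) * h  # midpoint of the k-th subinterval
        total += (y - s) / variance(s)
    return total * h

y, mu = 3.0, 5.0
q_num = quasi_likelihood(y, mu, variance=lambda s: s)  # V(mu) = mu
q_poisson = y * math.log(mu / y) - (mu - y)            # Poisson log-likelihood difference
```

Both numbers agree up to the integration error, which illustrates that specifying only the mean-variance relationship of the Poisson family leads to the same estimating criterion.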
A multinomial model (or nominal logistic regression) is applied if the response for each observation is one out of more than two alternatives (categories). For identification one of the categories has to be chosen as reference category; without loss of generality we use here the first category. Denote by $\pi_j$ the probability $P(Y=j\mid X)$; then we can consider the logits with respect to the first category, i.e. $\log(\pi_j/\pi_1) = X^\top\beta_j$ for every category $j\neq 1$.
If the categories are ordered in some natural way then this additional information can be taken into account. A latent variable approach leads to the cumulative logit model or the ordered probit model. We refer here to [11, Sect. 8.4] and [18, Chap. 21] for ordinal logistic regression and ordered probit analysis, respectively.
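For the nominal case, the logits with respect to the first category determine all category probabilities. The following minimal sketch shows the back-transformation; the covariate vector `x`, the coefficient vectors in `betas` and the function name are hypothetical values chosen for illustration.

```python
import math

def baseline_logit_probs(x, betas):
    """Category probabilities from logits log(pi_j / pi_1) = x' beta_j, j = 2, ..., J.

    `betas` holds one coefficient vector per non-reference category;
    the reference (first) category has linear predictor 0.
    """
    etas = [0.0] + [sum(xi * bi for xi, bi in zip(x, b)) for b in betas]
    m = max(etas)  # subtract the maximum for numerical stability
    expo = [math.exp(e - m) for e in etas]
    total = sum(expo)
    return [e / total for e in expo]

x = [1.0, 0.5]                      # hypothetical covariates (incl. constant)
betas = [[0.2, -1.0], [-0.3, 0.8]]  # hypothetical coefficients, categories 2 and 3
probs = baseline_logit_probs(x, betas)
```

The probabilities sum to one, and the log-ratio of each probability to the first one reproduces the corresponding linear predictor.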
The simplest form of a contingency table
Category | 1 | 2 | … | $J$ | $\Sigma$
Frequency | $Y_1$ | $Y_2$ | … | $Y_J$ | $n$
with one factor and a predetermined sample size $n$ of observations is appropriately described by a multinomial distribution and can hence be fitted by the multinomial logit model introduced in Sect. 7.5.4. We could, for instance, be interested in comparing the trivial model, in which all categories have the same expected frequency, to a model with category-specific parameters (again we use the first category as reference). As before, further explanatory variables can be included in the model.
Two-way or higher dimensional contingency tables involve a large variety of possible models. Let us explain this with the help of the following two-way setup:
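Category | 1 | 2 | … | $K$ | $\Sigma$
1 | $Y_{11}$ | $Y_{12}$ | … | $Y_{1K}$ | $n_{1\bullet}$
2 | $Y_{21}$ | $Y_{22}$ | … | $Y_{2K}$ | $n_{2\bullet}$
⋮ | ⋮ | ⋮ | | ⋮ | ⋮
$J$ | $Y_{J1}$ | $Y_{J2}$ | … | $Y_{JK}$ | $n_{J\bullet}$
$\Sigma$ | $n_{\bullet 1}$ | $n_{\bullet 2}$ | … | $n_{\bullet K}$ | $n$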
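Here we assume two factors, one with realizations $1,\ldots,J$, the other with realizations $1,\ldots,K$. If the $Y_{jk}$ are independent Poisson variables with parameters $\mu_{jk}$, then their sum is a Poisson variable with parameter $E(n)=\mu=\sum_{j,k}\mu_{jk}$. The Poisson assumption implies that the number of observations $n$ is a random variable. Conditional on $n$, the joint distribution of the $Y_{jk}$ is the multinomial distribution. Without additional explanatory variables, one is typically interested in estimating models of the type $\log(E\,Y_{jk}) = \beta_0 + \beta_j + \beta_k$, i.e. log-linear models with one main effect per factor, which correspond to independence of the two factors.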
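For a log-linear model with only main effects of the two factors (the independence model), the fitted Poisson means have the well-known closed form "row total times column total divided by $n$". The sketch below computes these fitted values and the deviance for a hypothetical 2×2 table; the function name and the data are assumptions for illustration only.

```python
import math

def independence_fit(table):
    """Fitted means and deviance of the main-effects (independence) log-linear model."""
    row = [sum(r) for r in table]          # row totals
    col = [sum(c) for c in zip(*table)]    # column totals
    n = sum(row)
    fitted = [[rj * ck / n for ck in col] for rj in row]
    # Poisson deviance 2 * sum y * log(y / mu); the term sum (y - mu)
    # vanishes because the fitted margins equal the observed margins.
    deviance = 2.0 * sum(
        y * math.log(y / m)
        for obs_row, fit_row in zip(table, fitted)
        for y, m in zip(obs_row, fit_row)
        if y > 0
    )
    return fitted, deviance

table = [[20, 30], [30, 20]]  # hypothetical observed frequencies
fitted, deviance = independence_fit(table)
```

A large deviance relative to its degrees of freedom, here $(J-1)(K-1)=1$, speaks against independence of the two factors.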
Survival data are characterized by non-negative observations which typically have a skewed distribution. An additional complication arises because the observation period may end before the individual fails, so that censored data may occur. The exponential distribution with density $f(y,\theta)=\theta e^{-\theta y}$ is a very simple example of a survival distribution. In this special case the survivor function (the probability to survive beyond $y$) is given by $S(y)=e^{-\theta y}$ and the hazard function (the probability of death within $y$ and $y+dy$ after survival up to $y$) equals $h(y,\theta)=\theta$. Given additional explanatory variables this function is typically modeled by $h(y,\theta)=\exp(X^\top\beta)$.
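The exponential case without covariates already shows how censoring enters the likelihood. In the sketch below, which assumes independent right censoring, observed failures contribute the log-density and censored observations the log-survivor function; the maximizer is then the number of failures divided by the total observed time. Data and function names are hypothetical.

```python
import math

def exp_loglik(theta, times, events):
    """Censored exponential log-likelihood.

    Failures (event = 1) contribute log f(y) = log(theta) - theta * y,
    censored observations (event = 0) contribute log S(y) = -theta * y.
    """
    return sum(d * math.log(theta) - theta * t for t, d in zip(times, events))

def exp_mle(times, events):
    """Closed-form maximizer: number of failures / total observed time."""
    return sum(events) / sum(times)

def survivor(y, theta):
    """Probability to survive beyond y under a constant hazard theta."""
    return math.exp(-theta * y)

times = [2.0, 5.0, 1.5, 8.0, 3.5]  # hypothetical observation times
events = [1, 0, 1, 1, 0]           # 1 = failure observed, 0 = censored
theta_hat = exp_mle(times, events)
```

Ignoring the censoring indicator and treating all times as failures would overstate the hazard, since censored individuals are known to have survived their whole observation period.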
Clustered data in relation to regression models mean that data from known groups (``clusters'') are observed. Often these are the result of repeated measurements on the same individuals at different time points. For example, imagine the analysis of the effect of a medical treatment on patients or the repeated surveying of households in socio-economic panel studies. Here, all observations on the same individual form a cluster. We speak of longitudinal or panel data in that case. The latter term is typically used in the econometric literature.
When using clustered data we have to take into account that observations from the same cluster are correlated. Using a model designed for independent data may lead to biased results or at least significantly reduce the efficiency of the estimates.
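A simple individual model equation could be written as $E(Y_{ij}\mid X_{ij}) = G(X_{ij}^\top\beta_j)$, where $Y_{ij}$ denotes the $i$-th observation in cluster $j$ and $G$ the inverse link function.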
There is a vast amount of literature which deals with many different possible model specifications. A comprehensive resource for linear and nonlinear mixed effect models (LME, NLME) for continuous responses is [30]. The term ``mixed'' here refers to the fact that these models include additional random and/or fixed effect components to allow for correlation within and heterogeneity between the clusters.
For generalized linear mixed models (GLMM), i.e. clustered observations with responses from a GLM-type distribution, several approaches are possible. For repeated observations, [24] and [38] propose to use generalized estimating equations (GEE) which result in a quasi-likelihood estimator. They show that the correlation matrix of $Y_j$, the response observations from one cluster, can be replaced by a ``working correlation'' as long as the moments of $Y_j$ are correctly specified. Useful working correlations depend on a small number of parameters. For longitudinal data, for example, an autoregressive working correlation can be used. For more details on GEE see also the monograph by [10]. In the econometric literature longitudinal or panel data are analyzed with a focus on continuous and binary responses. Standard references for econometric panel data analyses are [22] and [4]. Models for clustered data with complex hierarchical structure are often denoted as multilevel models. We refer to the monograph of [16] for an overview.
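As an illustration of a working correlation that depends on a single parameter, the following sketch builds a first-order autoregressive (AR(1)) matrix with entries $\rho^{|s-t|}$ for equally spaced time points. The function name and the value of $\rho$ are assumptions for illustration; a GEE fit would estimate $\rho$ alongside the regression coefficients.

```python
def ar1_working_correlation(n_times, rho):
    """AR(1) working correlation: entry (s, t) equals rho ** |s - t|."""
    return [[rho ** abs(s - t) for t in range(n_times)] for s in range(n_times)]

# Hypothetical cluster with four repeated measurements and rho = 0.5.
R = ar1_working_correlation(4, 0.5)
```

The correlation between two measurements decays geometrically with their distance in time, which is often a plausible default for longitudinal data.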
Nonparametric components can be incorporated into the GLM at different places. For example, it is possible to estimate a single index model $E(Y\mid X) = g(X^\top\beta)$, in which the link function $g$ is an unknown smooth function.
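Local regression in combination with likelihood-based estimation is introduced in [26]. This concerns models of the form $E(Y\mid X) = G\{m(X)\}$, where $m$ is an unknown smooth function and $G$ a known inverse link function.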
Some more material on semiparametric regression can be found in Chaps. III.5 and III.10 of this handbook. For a detailed introduction to semiparametric extensions of GLM we refer to the textbooks by [21], [20], [31], and [17].