One of the classical formulations of the two-group discriminant analysis or the statistical pattern recognition problem involves a mixture of two $p$-dimensional normal distributions with a common covariance matrix. The problem of two-group cluster analysis with multiple continuous observations has also been formulated in this way. Here, we have $n$ independent observations $w_1, \ldots, w_n$ from the mixture density

$$f(w; \Psi) = \pi_1\,\phi(w; \mu_1, \Sigma) + \pi_2\,\phi(w; \mu_2, \Sigma),$$

where $\phi(w; \mu_i, \Sigma)$ denotes the $p$-variate normal density with mean $\mu_i$ and common covariance matrix $\Sigma$, the mixing proportions $\pi_1$ and $\pi_2 = 1 - \pi_1$ are nonnegative, and $\Psi$ denotes the vector of unknown parameters, consisting of $\pi_1$, $\mu_1$, $\mu_2$, and the distinct elements of $\Sigma$.
Consider the corresponding ''supervised learning problem'', where the observations on the random vector are $(w_1, z_1), \ldots, (w_n, z_n)$. Here $z_j$ is an indicator variable which identifies the $j$th observation as coming from the first ($z_j = 0$) or the second ($z_j = 1$) component. The MLE problem is far simpler here, with easy closed-form MLEs. The classificatory variable $z$ could be called the ''missing variable'' and the data $z_1, \ldots, z_n$ the missing data. The unsupervised learning problem could be called the incomplete-data problem and the supervised learning problem the complete-data problem. A relatively simple iterative method for computing the MLE for the unsupervised problem can be given by exploiting the simplicity of the MLE for the supervised problem. This is the essence of the EM algorithm.
The complete-data log likelihood function for $\Psi$ is given by

$$\log L_c(\Psi) = \sum_{j=1}^{n} \left\{ (1 - z_j)\log[\pi_1\,\phi(w_j; \mu_1, \Sigma)] + z_j \log[\pi_2\,\phi(w_j; \mu_2, \Sigma)] \right\}. \qquad (5.10)$$

Maximizing (5.10) gives the closed-form complete-data MLEs

$$\pi_2 = \sum_{j=1}^{n} z_j / n, \qquad (5.11)$$

$$\mu_1 = \frac{\sum_{j=1}^{n} (1 - z_j)\,w_j}{\sum_{j=1}^{n} (1 - z_j)}, \qquad \mu_2 = \frac{\sum_{j=1}^{n} z_j\,w_j}{\sum_{j=1}^{n} z_j}, \qquad (5.12)$$

$$\Sigma = \frac{1}{n}\sum_{j=1}^{n} \left\{ (1 - z_j)(w_j - \mu_1)(w_j - \mu_1)^\top + z_j (w_j - \mu_2)(w_j - \mu_2)^\top \right\}. \qquad (5.13)$$
Now the EM algorithm for this problem starts with some initial value $\Psi^{(0)}$ for the parameters. As the complete-data log likelihood in (5.10) is a linear function of the unobservable data $z_1, \ldots, z_n$ for this problem, the calculation of $Q(\Psi; \Psi^{(k)})$ on the E-step is effected simply by replacing $z_j$ by its current conditional expectation given the observed data, which is the usual posterior probability $\tau_j^{(k)}$ of the $j$th observation arising from component 2,

$$\tau_j^{(k)} = \frac{\pi_2^{(k)}\,\phi(w_j; \mu_2^{(k)}, \Sigma^{(k)})}{\pi_1^{(k)}\,\phi(w_j; \mu_1^{(k)}, \Sigma^{(k)}) + \pi_2^{(k)}\,\phi(w_j; \mu_2^{(k)}, \Sigma^{(k)})}.$$
The M-step then consists of substituting these values $\tau_j^{(k)}$ for the $z_j$ in (5.11) to (5.13). The E- and M-steps are then iterated until convergence. Unlike in the MLE for the supervised problem, the M-step of the unsupervised problem uses the posterior probabilities $\tau_j^{(k)}$, which lie between 0 and 1, in place of the 0-1 indicators $z_j$: the mean vectors and the covariance matrix are computed using the $\tau_j^{(k)}$ as weights in weighted averages.
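To make the two steps concrete, here is a minimal Python sketch of this EM algorithm for a two-component normal mixture with a common covariance matrix; the function name, the interface (initial posterior probabilities supplied by the user), and the use of scipy's multivariate_normal density are our own choices, not part of the original formulation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_two_component(w, tau0, n_iter=500, tol=1e-8):
    """EM for a two-component p-variate normal mixture with a common
    covariance matrix.  tau0 holds initial posterior probabilities of
    component-2 membership (e.g. a random 0-1 partition of the data)."""
    n, _ = w.shape
    tau = np.asarray(tau0, dtype=float)
    loglik_old = -np.inf
    for _ in range(n_iter):
        # M-step: substitute tau for the unobserved z_j, so the means
        # and the pooled covariance are weighted averages.
        pi2 = tau.mean()
        mu1 = (1 - tau) @ w / (1 - tau).sum()
        mu2 = tau @ w / tau.sum()
        r1, r2 = w - mu1, w - mu2
        sigma = (((1 - tau)[:, None] * r1).T @ r1
                 + (tau[:, None] * r2).T @ r2) / n
        # E-step: posterior probability that observation j arises
        # from component 2 at the current parameter values.
        f1 = multivariate_normal.pdf(w, mu1, sigma)
        f2 = multivariate_normal.pdf(w, mu2, sigma)
        mix = (1 - pi2) * f1 + pi2 * f2
        tau = pi2 * f2 / mix
        loglik = np.log(mix).sum()
        if loglik - loglik_old < tol:
            break
        loglik_old = loglik
    return pi2, mu1, mu2, sigma
```

Starting from an initial set of posterior probabilities, the M- and E-steps alternate until the observed-data log likelihood stops increasing.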
It is easy to extend the above method to a mixture of $g > 2$ multivariate normal components, or even to a mixture of distributions from other (identifiable) parametric families. For a detailed discussion of the applications of the EM algorithm in the resolution of finite mixtures and of other issues with finite mixtures, see McLachlan and Peel (2000).
In survival or reliability analyses, the focus is the distribution of the time $T$ to the occurrence of some event that represents failure (for computational methods in survival analysis, see also Chap. III.12). In many situations, there will be individuals who have not failed by the end of the study, or individuals who withdraw from the study before it ends. Such observations are censored, as we know only that their failure times are greater than particular values. We let $y = ((c_1, \delta_1), \ldots, (c_n, \delta_n))$ denote the observed failure-time data, where $\delta_j = 0$ or $1$ according as the $j$th observation $T_j$ is censored or uncensored at $c_j$. That is, if $T_j$ is uncensored, $t_j = c_j$, whereas if $t_j > c_j$, it is censored at $c_j$.
In the particular case where the p.d.f. for $T$ is exponential with mean $\mu$, we have

$$f(t; \mu) = \mu^{-1}\exp(-t/\mu) \quad (t > 0),$$

so that the unknown parameter is $\Psi = \mu$.
The complete-data vector $x$ can be declared to be $x = (y^\top, z^\top)^\top$, where $z$ contains the unobservable realizations of the censored random variables, that is, the actual failure times of the censored individuals. The complete-data log likelihood is given by

$$\log L_c(\mu) = -n\log\mu - \mu^{-1}\sum_{j=1}^{n} t_j. \qquad (5.17)$$
From (5.17), it can be seen that $\log L_c(\mu)$ has the exponential family form (5.4) with canonical parameter $\mu^{-1}$ and sufficient statistic $\sum_{j=1}^{n} T_j$. Hence, from (5.18), the E-step requires the calculation of

$$s^{(k)} = E_{\mu^{(k)}}\!\left(\sum_{j=1}^{n} T_j \,\Big|\, y\right) = \sum_{j=1}^{n} c_j + (n - r)\,\mu^{(k)},$$

where $r$ denotes the number of uncensored observations and the second equality follows from the lack-of-memory property of the exponential distribution. The M-step then yields $\mu^{(k+1)}$ as the value of $\mu$ that satisfies the equation

$$s^{(k)} = n\mu,$$

so that $\mu^{(k+1)} = s^{(k)}/n$.
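The resulting iteration is short enough to write out in full. The following Python sketch (function name and interface are ours) implements the E- and M-steps just described; its fixed point satisfies $r\mu = \sum_j c_j$, so it converges to the incomplete-data MLE $\sum_j c_j / r$.

```python
import numpy as np

def em_exponential_censored(c, delta, mu0=1.0, n_iter=500, tol=1e-10):
    """EM for exponentially distributed failure times with right
    censoring.  c[j] is the observed time; delta[j] = 1 if the jth
    observation is uncensored, 0 if it is censored at c[j]."""
    c = np.asarray(c, dtype=float)
    n = len(c)
    r = int(np.sum(delta))          # number of uncensored observations
    mu = mu0
    for _ in range(n_iter):
        # E-step: by lack of memory, E[T_j | T_j > c_j] = c_j + mu, so
        # the conditional expectation of the sufficient statistic is
        s = c.sum() + (n - r) * mu
        # M-step: equate the sufficient statistic to its expectation
        # under the exponential model, giving mu = s / n.
        mu_new = s / n
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu
```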
Examples 1 and 2 may have given the impression that the E-step consists in replacing the missing data by their conditional expectations given the observed data at current parameter values. Although in many examples this may effectively be the case, since $\log L_c(\Psi)$ is a linear function of the missing data $z$, it is not quite so in general. Rather, as should be clear from the general theory described in Sect. 5.2.1, the E-step consists in replacing $\log L_c(\Psi)$ by its conditional expectation given the observed data at current parameter values. Flury and Zoppé (2000) give the following interesting example to demonstrate the point that the E-step does not always consist in plugging in ''estimates'' for missing data. This is also an example where the E-step cannot be correctly executed at all, since the expected value of the complete-data log likelihood does not exist, showing thereby that the EM algorithm is not applicable to this problem, at least for this formulation of the complete-data problem.
Let the lifetimes of electric light bulbs of a certain type have a uniform distribution on the interval $(0, \theta)$, $\theta > 0$ unknown. A total of $N = n + m$ bulbs are tested in two independent experiments. The observed data consist of $y = (y_1, \ldots, y_n)$ and $e = (e_1, \ldots, e_m)$, where $y_1, \ldots, y_n$ are exact lifetimes of a random sample of $n$ bulbs and $e_1, \ldots, e_m$ are indicator observations on a random sample of $m$ bulbs, taking value 1 if the bulb is still burning at a fixed time point $t_0 > 0$ and 0 if it has expired. The missing data is $z = (z_1, \ldots, z_m)$, the unobserved lifetimes of the second sample of bulbs. Let $s$ be the number of $e_i$'s with value 1 and $y_{(n)} = \max(y_1, \ldots, y_n)$.
In this example, the unknown parameter vector $\Psi$ is a scalar, being equal to $\theta$. Let us first work out the MLE of $\theta$ directly. The likelihood is

$$L(\theta) = \theta^{-n}\left(\frac{\theta - t_0}{\theta}\right)^{s}\left(\frac{t_0}{\theta}\right)^{m-s} I_{[y_{(n)}, \infty)}(\theta),$$

since each indicator bulb is still burning at $t_0$ with probability $(\theta - t_0)/\theta$. In particular, for $s = m$ the MLE works out to be $\hat{\theta} = \max\{y_{(n)}, N t_0/n\}$.
Now let us try the EM algorithm for the case $s = m$. The complete data can be formulated as $x = (y^\top, z^\top)^\top$, and the complete-data MLE is

$$\hat{\theta}_c = \max\{y_{(n)}, z_{(m)}\},$$

where $z_{(m)} = \max(z_1, \ldots, z_m)$, so that $\hat{\theta}_c$ is the largest of all $N$ lifetimes. If the E-step is (incorrectly) taken to consist of replacing each missing $z_i$ by its conditional expectation $(t_0 + \theta^{(k)})/2$, the resulting iteration $\theta^{(k+1)} = \max\{y_{(n)}, (t_0 + \theta^{(k)})/2\}$ converges to $\max\{y_{(n)}, t_0\}$, which is not the MLE.
The reason for the apparent EM algorithm not resulting in the MLE is that the E-step is wrong. In the E-step, we are supposed to find the conditional expectation of $\log L_c(\theta)$ given $y$ at the current parameter value $\theta^{(k)}$, not to plug conditional expectations of the missing data into $\log L_c(\theta)$. Now given the data with $e_i = 1$, we have $z_i > t_0$, and hence the conditional distributions of the $z_i$ are uniform on $(t_0, \theta^{(k)})$. Thus for any $\theta < \theta^{(k)}$, the complete-data likelihood takes the value 0 with positive probability (whenever $z_{(m)} > \theta$), and hence the conditional expected value of the complete-data log likelihood we are seeking does not exist.
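A small numerical sketch (with arbitrary illustrative values for $n$, $m$, $t_0$, and $\theta$, not taken from the original example) makes the failure concrete: the plug-in iteration settles at $\max\{y_{(n)}, t_0\}$, whereas the actual MLE is $\max\{y_{(n)}, N t_0/n\}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary illustrative values, purely for demonstration.
theta_true, t0, n, m = 10.0, 6.0, 20, 20
y = rng.uniform(0.0, theta_true, size=n)   # exact lifetimes
y_max = y.max()
# Consider the case s = m: all m indicator bulbs still burn at t0.

# Direct MLE from the likelihood above (case s = m).
mle = max(y_max, (n + m) * t0 / n)

# "Apparent" EM: replace each missing z_i by its conditional mean
# (t0 + theta) / 2, i.e. the mean of U(t0, theta), then apply the
# complete-data MLE max(y_max, z_(m)) with the z_i so replaced.
theta = max(y_max, t0) + 1.0               # arbitrary starting value
for _ in range(200):
    theta = max(y_max, (t0 + theta) / 2.0)

print(mle, theta)   # theta has converged to max(y_max, t0) < mle
```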
The EM algorithm will converge very slowly if a poor choice of initial value $\Psi^{(0)}$ is used. Indeed, in some cases where the likelihood is unbounded on the edge of the parameter space, the sequence of estimates generated by the EM algorithm may diverge if $\Psi^{(0)}$ is chosen too close to the boundary. Also, in applications where the likelihood equation has multiple roots corresponding to local maxima, the EM algorithm should be applied from a wide choice of starting values in any search for all local maxima. A variation of the EM algorithm (Wright and Kennedy, 2000) uses interval analysis methods to locate multiple stationary points of a log likelihood within any designated region of the parameter space.
Here, we illustrate different ways of specifying an initial value within the mixture models framework. For independent data in the case of a mixture model with $g$ components, the effect of the E-step is to update the posterior probabilities of component membership. Hence the first E-step can be performed by specifying a value $\tau_j^{(0)}$ for each $j$, where $\tau_j$ is the vector containing the $g$ posterior probabilities of component membership for the $j$th observation. The latter is usually undertaken by setting $\tau_j^{(0)} = z_j^{(0)}$, where $z^{(0)} = (z_1^{(0)\top}, \ldots, z_n^{(0)\top})^\top$ is a set of indicator vectors defining an initial partition of the data into $g$ groups (obtained, for example, from an ad hoc clustering of the data).
Another way of specifying an initial partition $z^{(0)}$ of the data is to divide the data randomly into $g$ groups corresponding to the $g$ components of the mixture model. With random starts, the effect of the central limit theorem is that the component parameters tend to be initially similar, at least in large samples. One way to reduce this effect is first to select a small random subsample from the data, which is then randomly assigned to the $g$ components. The first M-step is then performed on the basis of this subsample. The subsample has to be sufficiently large to ensure that the first M-step is able to produce a nondegenerate estimate of the parameter vector (McLachlan and Peel, 2000, Sect. 2.12). In the context of normal components, a method of specifying a random start is to generate the component means $\mu_i^{(0)}$ independently as

$$\mu_i^{(0)} \sim N(\bar{w}, V) \quad (i = 1, \ldots, g),$$

where $\bar{w}$ and $V$ denote the sample mean and sample covariance matrix of the observed data; the component-covariance matrices can then be initialized to $V$ and the mixing proportions to $1/g$.
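In code, one such random start might be produced as follows (a sketch under the assumptions just stated; the function name and interface are our own). In practice, several random starts are generated and the fit with the highest converged log likelihood is retained.

```python
import numpy as np

def random_start(w, g, rng):
    """One random start for a g-component normal mixture: means drawn
    from N(w_bar, V), where w_bar and V are the sample mean and
    covariance matrix of the data; component covariances set to V and
    mixing proportions to 1/g."""
    w_bar = w.mean(axis=0)
    V = np.cov(w, rowvar=False)
    mus = rng.multivariate_normal(w_bar, V, size=g)
    sigmas = np.stack([V] * g)
    pis = np.full(g, 1.0 / g)
    return pis, mus, sigmas
```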
Ueda and Nakano (1998) considered a deterministic annealing EM (DAEM) algorithm in order for the EM iterative process to be able to recover from a poor choice of starting value. They proposed using the principle of maximum entropy and the statistical mechanics analogy, whereby a parameter, say $\beta$, is introduced, with $1/\beta$ corresponding to the ''temperature'' in an annealing sense. With their DAEM algorithm, the E-step is effected by averaging over a distribution taken to be proportional to the current estimate of the conditional density of the complete data (given the observed data) raised to the power $\beta$; see for example McLachlan and Peel (2000, pp. 58-60).
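For a mixture model, the tempered E-step is a one-line modification of the usual posterior-probability calculation. The sketch below (our own function, for a generic mixture) raises the component terms to the power $\beta$ before normalizing.

```python
import numpy as np

def tempered_responsibilities(log_comp, beta):
    """DAEM-style E-step for a mixture model (a sketch).  log_comp is
    an (n, g) array holding log(pi_i * f_i(w_j)) for observation j and
    component i.  Raising these terms to the power beta before
    normalising flattens the posterior probabilities: beta near 0
    corresponds to a high temperature, beta = 1 recovers the usual
    E-step."""
    a = beta * log_comp
    a -= a.max(axis=1, keepdims=True)    # guard against overflow
    tau = np.exp(a)
    return tau / tau.sum(axis=1, keepdims=True)
```

In the annealing schedule, $\beta$ is increased slowly from a value near 0 towards 1 over the course of the iterations.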
Several methods have been suggested in the EM literature for augmenting the EM computation with some computation for obtaining an estimate of the covariance matrix of the computed MLE. Many such methods attempt to exploit the computations already made in the EM steps. These methods are based on the observed information matrix $I(\hat{\Psi}; y)$, the expected information matrix $\mathcal{I}(\Psi)$, or on resampling methods. Baker (1992) reviews such methods and also develops a method for computing the observed information matrix in the case of categorical data. Jamshidian and Jennrich (2000) review more recent methods, including the Supplemented EM (SEM) algorithm of Meng and Rubin (1991), and suggest some newer methods based on numerical differentiation.
Theoretically, one may compute the asymptotic covariance matrix by inverting the observed or expected information matrix at the MLE. In practice, however, this may be tedious analytically or computationally, defeating one of the advantages of the EM approach. Louis (1982) extracts the observed information matrix in terms of the conditional moments of the gradient and curvature of the complete-data log likelihood function introduced within the EM framework. These conditional moments are generally easier to work out than the corresponding derivatives of the incomplete-data log likelihood function. An alternative approach is to numerically differentiate the likelihood function to obtain the Hessian. In an EM-aided differentiation approach, Meilijson (1989) suggests perturbation of the incomplete-data score vector to compute the observed information matrix. In the SEM algorithm (Meng and Rubin, 1991), numerical techniques are used to compute the derivative of the EM operator to obtain the observed information matrix. The basic idea is to use the fact that the rate of convergence of EM is governed by the fraction of missing information, and thereby to find the increase in variability due to the missing information to add to the assessed complete-data covariance matrix. More specifically, let $V$ denote the asymptotic covariance matrix of the MLE $\hat{\Psi}$. Meng and Rubin (1991) show that

$$V = I_c^{-1}(\hat{\Psi}; y) + \Delta V, \quad \text{where} \quad \Delta V = \{I_d - J(\hat{\Psi})\}^{-1} J(\hat{\Psi})\, I_c^{-1}(\hat{\Psi}; y),$$

$I_c(\hat{\Psi}; y)$ is the conditional expected complete-data information matrix, $J(\hat{\Psi})$ is the Jacobian of the EM map evaluated at $\hat{\Psi}$, and $I_d$ is the $d \times d$ identity matrix.
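The following Python sketch outlines a simplified forward-difference variant of this idea; `em_map` (one full EM iteration as a function of $\Psi$) and `I_c_inv` (the inverse of $I_c(\hat{\Psi}; y)$) are assumed to be supplied by the user, and the componentwise differencing scheme of Meng and Rubin (1991) is more careful than this.

```python
import numpy as np

def sem_covariance(em_map, psi_hat, I_c_inv, eps=1e-5):
    """Sketch of the SEM idea: numerically differentiate the EM map at
    the MLE to get J, then inflate the inverse complete-data
    information by the missing-information factor,
    V = I_c_inv + (I - J)^{-1} J I_c_inv."""
    d = len(psi_hat)
    J = np.empty((d, d))
    m0 = em_map(psi_hat)             # equals psi_hat at a fixed point
    for i in range(d):
        psi = psi_hat.copy()
        psi[i] += eps
        # Column i of J: rate of change of one EM step when the ith
        # component of the parameter is perturbed.
        J[:, i] = (em_map(psi) - m0) / eps
    I = np.eye(d)
    delta_v = np.linalg.solve(I - J, J @ I_c_inv)  # (I-J)^{-1} J I_c^{-1}
    return I_c_inv + delta_v
```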
It is important to emphasize that estimates of the covariance matrix of the MLE based on the expected or observed information matrices are guaranteed to be valid inferentially only asymptotically. In particular for mixture models, it is well known that the sample size has to be very large before the asymptotic theory of maximum likelihood applies. A resampling approach, the bootstrap (Efron, 1979; Efron and Tibshirani, 1993), has been considered to tackle this problem. Basford et al. (1997) compared the bootstrap and information-based approaches for some normal mixture models and found that unless the sample size was very large, the standard errors obtained by an information-based approach were too unstable to be recommended.
The bootstrap is a powerful technique that permits the variability in a random quantity to be assessed using just the data at hand. Standard error estimation of $\hat{\Psi}$ may be implemented according to the bootstrap as follows.

Step 1. A new set of data, $y^*$, called the bootstrap sample, is generated according to $\hat{F}$, an estimate of the distribution function of $Y$ formed from the original observed data $y$.

Step 2. The EM algorithm is applied to the bootstrap observed data $y^*$ to compute the MLE for this data set, $\hat{\Psi}^*$.

Step 3. The bootstrap covariance matrix of $\hat{\Psi}^*$ is approximated by the sample covariance matrix of $\hat{\Psi}_1^*, \ldots, \hat{\Psi}_B^*$, where $\hat{\Psi}_b^*$ is the MLE computed from the $b$th of $B$ independent repetitions of Steps 1 and 2.

Further discussion of bootstrap and resampling methods can be found in Chaps. III.2 and III.3 of this handbook.
In Step 1 above, the nonparametric version of the bootstrap would take $\hat{F}$ to be the empirical distribution function formed from the observed data $y$. Situations where we may wish to use the nonparametric version include problems where the observed data are censored or are missing in the conventional sense. In these cases the use of the nonparametric bootstrap avoids having to postulate a suitable model for the underlying mechanism that controls the censorship or the absence of the data. A generalization of the nonparametric version of the bootstrap, known as the weighted bootstrap, has been studied by Newton and Raftery (1994).
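A minimal nonparametric-bootstrap sketch of Steps 1-3 follows; the function name and interface are ours, and `fit` is assumed to run the EM algorithm to convergence on a data set and return the MLE as a 1-D parameter vector.

```python
import numpy as np

def bootstrap_se(y, fit, B=100, rng=None):
    """Nonparametric bootstrap standard errors for an EM-based MLE."""
    rng = rng or np.random.default_rng()
    n = len(y)
    psi_stars = []
    for _ in range(B):
        # Step 1: draw a bootstrap sample from the empirical
        # distribution function of the observed data.
        y_star = y[rng.integers(0, n, size=n)]
        # Step 2: compute the MLE for the bootstrap sample via EM.
        psi_stars.append(fit(y_star))
    # Step 3: the sample standard deviations over the B replications
    # estimate the standard errors of the components of the MLE.
    return np.asarray(psi_stars).std(axis=0, ddof=1)
```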