5.1 Introduction


5.1.1 Maximum Likelihood Estimation

The Expectation-Maximization (EM) algorithm is a broadly applicable approach to the iterative computation of maximum likelihood (ML) estimates, useful in a variety of incomplete-data problems. Maximum likelihood estimation and likelihood-based inference are of central importance in statistical theory and data analysis. Maximum likelihood estimation is a general-purpose method with attractive properties. It is the most often used estimation technique in the frequentist framework; it is also relevant in the Bayesian framework (Chap. III.11). Bayesian solutions are often justified with the help of likelihoods and maximum likelihood estimates (MLEs), and are similar to penalized likelihood estimates. Maximum likelihood estimation is a ubiquitous technique and is used extensively in every area where statistical techniques are used.

We assume that the observed data $\mathbf{y}$ has probability density function (p.d.f.) $g(\mathbf{y}; \mathbf{\Psi})$, where $\mathbf{\Psi}$ is the vector containing the unknown parameters in the postulated form for the p.d.f. of $\mathbf{Y}$. Our objective is to maximize the likelihood $L(\mathbf{\Psi}) = g(\mathbf{y}; \mathbf{\Psi})$ as a function of $\mathbf{\Psi}$ over the parameter space $\mathbf{\Omega}$. That is, we seek a solution of

$\displaystyle \partial L(\mathbf{\Psi}) / \partial \mathbf{\Psi} = \boldsymbol{0}\,,$

or equivalently, in terms of the log likelihood,

$\displaystyle \partial \log L(\mathbf{\Psi}) / \partial \mathbf{\Psi} = \boldsymbol{0}\,.$ (5.1)

The aim of ML estimation is to determine an estimate $\widehat{\mathbf{\Psi}}$ so that it defines a sequence of roots of (5.1) that is consistent and asymptotically efficient. Such a sequence is known to exist under suitable regularity conditions (Cramér, 1946). With probability tending to one, these roots correspond to local maxima in the interior of $\mathbf{\Omega}$. For many models used in practice, the likelihood has a global maximum in the interior of $\mathbf{\Omega}$. Then typically a sequence of roots of (5.1) with the desired asymptotic properties is provided by taking $\widehat{\mathbf{\Psi}}$ to be the root that globally maximizes $L(\mathbf{\Psi})$; in this case, $\widehat{\mathbf{\Psi}}$ is the MLE. We shall henceforth refer to $\widehat{\mathbf{\Psi}}$ as the MLE, even in situations where it may not globally maximize the likelihood. Indeed, in some of the examples on mixture models (McLachlan and Peel, 2000, Chap. 3), the likelihood is unbounded. However, for these models there may still exist under the usual regularity conditions a sequence of roots of (5.1) with the properties of consistency, efficiency, and asymptotic normality (McLachlan and Basford, 1988, Chap. 12).
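As a simple illustration of (5.1), and not one of the examples considered later in this chapter, suppose that $y_1, \ldots, y_n$ are independent observations from the exponential density $g(y; \psi) = \psi e^{-\psi y}$, $y > 0$. The log likelihood is

$\displaystyle \log L(\psi) = n \log \psi - \psi \sum_{j=1}^{n} y_j\,,$

and the likelihood equation $\partial \log L(\psi) / \partial \psi = n/\psi - \sum_{j=1}^{n} y_j = 0$ has the unique root $\widehat{\psi} = n / \sum_{j=1}^{n} y_j = 1/\bar{y}$, which is also the global maximizer and hence the MLE. The situations of interest in this chapter are those in which no such explicit solution is available.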

When the likelihood or log likelihood is quadratic in the parameters, as in the case of independent normally distributed observations, its maximum can be obtained by solving a system of linear equations in the parameters. Often in practice, however, the likelihood function is not quadratic, giving rise to nonlinearity problems in ML estimation. Examples of such situations are: (a) models leading to means that are nonlinear in the parameters; (b) models in which, despite a possibly linear structure, the likelihood is not quadratic in the parameters due to, for instance, non-normal errors, missing data, or dependence.
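To make this contrast concrete (the following is a standard illustration of ours, not one treated in this chapter): for the linear model $\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with independent normal errors, the log likelihood is quadratic in $\boldsymbol{\beta}$ and the likelihood equation reduces to the linear system $\mathbf{X}^{\top} \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^{\top} \mathbf{y}$, whereas for a Poisson log-linear model with means $\exp(\mathbf{x}_j^{\top} \boldsymbol{\beta})$ the likelihood equation $\sum_j \mathbf{x}_j \{ y_j - \exp(\mathbf{x}_j^{\top} \boldsymbol{\beta}) \} = \boldsymbol{0}$ is nonlinear in $\boldsymbol{\beta}$ and must be solved iteratively.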

Traditionally, ML estimation in these situations has been carried out using numerical iterative methods for the solution of equations, such as the Newton-Raphson (NR) method and its variants like Fisher's method of scoring. Under reasonable assumptions on $L(\mathbf{\Psi})$ and with a sufficiently accurate starting value, the sequence of iterates $\{\mathbf{\Psi}^{(k)}\}$ produced by the NR method enjoys local quadratic convergence to a solution $\mathbf{\Psi}^{\ast}$ of (5.1). Quadratic convergence is regarded as the major strength of the NR method. In applications, however, these methods can be tedious analytically and computationally, even in fairly simple cases; see McLachlan and Krishnan (1997, Sect. 1.3) and Meng and van Dyk (1997). The EM algorithm offers an attractive alternative in a variety of settings. It is now a popular tool for iterative ML estimation in a variety of problems involving missing data or incomplete information.
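As a small, self-contained sketch of the NR iteration for solving (5.1) in a one-parameter problem, consider ML estimation of the location $\theta$ of a Cauchy sample, a well-known case in which the score equation has no closed-form root and a sufficiently accurate starting value, such as the sample median, matters. The sketch below is ours, not an example from the chapter, and the function name and simulated data are purely illustrative.

import numpy as np

def newton_raphson_cauchy(y, theta0, max_iter=50, tol=1e-10):
    # Newton-Raphson iterations for the location theta of a Cauchy(theta, 1) sample.
    # Solves the likelihood equation d log L / d theta = 0, where
    # log L(theta) = -sum(log(1 + (y - theta)**2)) - n*log(pi).
    theta = float(theta0)
    for _ in range(max_iter):
        u = y - theta
        score = np.sum(2.0 * u / (1.0 + u**2))                    # first derivative
        curvature = np.sum(2.0 * (u**2 - 1.0) / (1.0 + u**2)**2)  # second derivative
        step = score / curvature
        theta -= step                                             # NR update
        if abs(step) < tol:
            break
    return theta

# Illustrative usage: the sample median serves as a sufficiently accurate starting value.
rng = np.random.default_rng(0)
y = 5.0 + rng.standard_cauchy(100)
theta_hat = newton_raphson_cauchy(y, theta0=np.median(y))

Each iteration moves $\theta$ by the ratio of the score to the second derivative of the log likelihood; replacing the second derivative by its expected value would give Fisher's method of scoring.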


5.1.2 EM Algorithm: Incomplete-Data Structure

In the application of statistical methods, one is often faced with the problem of estimating parameters when the likelihood function is complicated in structure, resulting in difficult-to-compute maximization problems. This difficulty could be analytical or computational or both. Some examples are grouped, censored or truncated data, multivariate data with some missing observations, multiway frequency data with a complex cell probability structure, and data from mixtures of distributions. In many of these problems, it is often possible to formulate an associated statistical problem with the same parameters but with "augmented data", from which it is possible to work out the MLE in an analytically and computationally simpler manner. The augmented data could be called the "complete data" and the available data the "incomplete data"; the corresponding likelihoods are the "complete-data likelihood" and the "incomplete-data likelihood", and the corresponding ML estimation problems the "complete-data problem" and the "incomplete-data problem". The EM algorithm is a generic method for computing the MLE of an incomplete-data problem by formulating an associated complete-data problem, and exploiting the simplicity of the MLE of the latter to compute the MLE of the former. The augmented part of the data could also be called "missing data", with respect to the actual incomplete-data problem on hand. The missing data need not necessarily be missing in the practical sense of the word; they may just be a conceptually convenient technical device. Thus the phrase "incomplete data" is used quite broadly to represent a variety of statistical data models, including mixtures, convolutions, random effects, grouping, censoring, truncation, and missing observations.

The EM algorithm is an iterative algorithm, each iteration of which consists of two steps, the Expectation Step (E-step) and the Maximization Step (M-step). A brief history of the EM algorithm can be found in McLachlan and Krishnan (1997, Sect. 1.8). The name EM algorithm was coined by Dempster et al. (1977), who synthesized earlier formulations of the algorithm in many particular cases, presented a general formulation of this method of finding MLEs in a variety of problems, and provided an initial catalogue of problems where the method could be profitably applied. Since then the EM algorithm has been applied in a staggering variety of general statistical problems, such as resolution of mixtures, multiway contingency tables, variance components estimation, and factor analysis, as well as in specialized applications in such areas as genetics, medical imaging, and neural networks.
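To fix ideas about how the two steps interlock before the formal development in Sect. 5.2, the following sketch (ours, with illustrative function and variable names; the fitting of normal mixtures is treated properly in Sect. 5.3) implements EM for a two-component univariate normal mixture. The E-step computes the posterior probability that each observation belongs to the first component given the current parameter values, and the M-step updates the parameters by the corresponding weighted complete-data MLEs.

import numpy as np

def normal_pdf(y, mu, sigma):
    # Univariate normal density evaluated elementwise.
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def em_two_component_normal(y, pi1, mu, sigma, n_iter=200):
    # EM for the mixture pi1 * N(mu[0], sigma[0]^2) + (1 - pi1) * N(mu[1], sigma[1]^2).
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    for _ in range(n_iter):
        # E-step: posterior probability of membership of component 1.
        f1 = pi1 * normal_pdf(y, mu[0], sigma[0])
        f2 = (1.0 - pi1) * normal_pdf(y, mu[1], sigma[1])
        tau = f1 / (f1 + f2)
        # M-step: weighted complete-data MLEs of the mixing proportion, means and variances.
        pi1 = tau.mean()
        w = np.stack([tau, 1.0 - tau])                 # component weights, shape (2, n)
        mu = (w @ y) / w.sum(axis=1)
        sigma = np.sqrt((w * (y - mu[:, None]) ** 2).sum(axis=1) / w.sum(axis=1))
    return pi1, mu, sigma

# Illustrative usage on simulated data.
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(4.0, 1.5, 200)])
pi1_hat, mu_hat, sigma_hat = em_two_component_normal(y, 0.5, [0.0, 3.0], [1.0, 1.0])

Each pass through the loop is one EM iteration; as discussed in Sect. 5.2, such an iteration cannot decrease the incomplete-data log likelihood.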


5.1.3 Overview of the Chapter

In Sect. 5.2, the basic theory of the EM algorithm is presented. In particular, the monotonicity of the algorithm, its convergence, and its rate of convergence are systematically examined. In Sect. 5.3, the EM methodology presented in this chapter is illustrated in some commonly occurring situations, such as the fitting of normal mixtures and missing observations in the form of censored failure times. We also provide an example in which the EM algorithm may not be applicable. Consideration is also given to two important issues associated with the use of the EM algorithm, namely its initialization and the provision of standard errors.

We discuss further modifications and extensions of the EM algorithm in Sect. 5.4. In particular, the extensions of the EM algorithm known as the Monte Carlo EM, ECM, ECME, AECM, and PX-EM algorithms are considered. Given the considerable attention being paid to the analysis of large data sets, as in typical data mining applications, recent work on speeding up the implementation of the EM algorithm is also discussed. This includes the IEM and SPIEM algorithms, scalable EM algorithms, and the use of multiresolution kd-trees.

In Sect. 5.5, the relationship of the EM algorithm to data augmentation techniques such as the Gibbs sampler and other Markov chain Monte Carlo (MCMC) methods is briefly presented. The Bayesian perspective is also included by showing how the EM algorithm and its variants can be adapted to compute the maximum a posteriori (MAP) estimate. We conclude the chapter with a brief account of the applications of the EM algorithm in such topical and interesting areas as bioinformatics and image analysis.

