5.5 Miscellaneous Topics on the EM Algorithm


5.5.1 EM Algorithm for MAP Estimation

Although we have focussed on the application of the EM algorithm for computing MLEs in a frequentist framework, it can equally be applied to find the mode of the posterior distribution in a Bayesian framework. This problem is analogous to MLE, and hence the EM algorithm and its variants can be adapted to compute MAP estimates. The computation of the MAP estimate in a Bayesian framework via the EM algorithm corresponds to the consideration of some prior density for $\boldsymbol{\Psi}$. The E-step is effectively the same as for the computation of the MLE of $\boldsymbol{\Psi}$ in a frequentist framework, requiring the calculation of the $Q$-function. The M-step differs in that the objective function for the maximization process is equal to the $Q$-function augmented by the log prior density. The combination of prior and sample information provides a posterior distribution of the parameter on which the estimation is based.
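In symbols, with $p(\boldsymbol{\Psi})$ denoting the prior density, the E-step again yields the $Q$-function $Q(\boldsymbol{\Psi};\boldsymbol{\Psi}^{(k)})$, and the M-step on the $(k+1)$th iteration chooses

$\boldsymbol{\Psi}^{(k+1)} = \arg\max_{\boldsymbol{\Psi}} \left\{ Q(\boldsymbol{\Psi};\boldsymbol{\Psi}^{(k)}) + \log p(\boldsymbol{\Psi}) \right\},$

so that each iteration increases the log posterior (the log likelihood plus the log prior) in the same way that the EM sequence for ML estimation increases the log likelihood.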

The advent of inexpensive high-speed computers and the simultaneous rapid development of posterior simulation techniques such as Markov chain Monte Carlo (MCMC) methods (Gelfand and Smith, 1990) enable Bayesian estimation to be undertaken. In particular, posterior quantities of interest can be approximated through the use of MCMC methods such as the Gibbs sampler. Such methods allow the construction of an ergodic Markov chain with stationary distribution equal to the posterior distribution of the parameter of interest. A detailed description of MCMC technology can be found in Chap. II.3.

Although the application of MCMC methods is now routine, there are some difficulties that have to be addressed with the Bayesian approach, particularly in the context of mixture models. One main hindrance is that improper priors yield improper posterior distributions. Another hindrance is that when the number of components $g$ is unknown, the parameter space is simultaneously ill-defined and of infinite dimension. This prevents the use of classical testing procedures and priors (McLachlan and Peel, 2000, Chap. 4). A fully Bayesian approach with $g$ taken to be an unknown parameter has been considered by Richardson and Green (1997). Their MCMC methods allow jumps between parameter spaces of different dimensions and can thus handle an unspecified $g$. A further hindrance is the effect of label switching, which arises when there is no real prior information that allows one to discriminate between the components of a mixture model belonging to the same parametric family. This effect is particularly important when the solution is computed iteratively, since the labels of the components may be switched on different iterations (McLachlan and Peel, 2000, Chap. 4).

5.5.2 Iterative Simulation Algorithms

In computing Bayesian solutions to incomplete-data problems, iterative simulation techniques have been adopted to find MAP estimates or to estimate the entire posterior density. These iterative simulation techniques are conceptually similar to the EM algorithm, simply replacing the E- and M-steps by draws from the current conditional distribution of the missing data and of $\boldsymbol{\Psi}$, respectively. However, in some methods, such as the MCEM algorithm described in Sect. 5.4.1, only the E-step is so implemented. Many of these methods can be interpreted as iterative simulation analogs of the various versions of the EM algorithm and its extensions. Some examples are the Stochastic EM algorithm, the Data Augmentation algorithm, and MCMC methods such as the Gibbs sampler (McLachlan and Krishnan, 1997, Chap. 6). Here, we give a very brief outline of the Gibbs sampler; see also Chap. II.3 of this handbook and the references therein.

The Gibbs sampler is extensively used in many Bayesian problems where the joint distribution is too complicated to handle but the conditional distributions are often easy enough to draw from; see Casella and George (1992). With the Gibbs sampler, an approximate sample from $p(\boldsymbol{\Psi} \mid \boldsymbol{y})$ is obtained by simulating directly from the (full) conditional distribution of a subvector of $\boldsymbol{\Psi}$ given all the other parameters in $\boldsymbol{\Psi}$ and $\boldsymbol{y}$. Writing $\boldsymbol{\Psi}=(\boldsymbol{\Psi}_1,\ldots,\boldsymbol{\Psi}_d)$ in component form, a $d$-dimensional Gibbs sampler makes a Markov transition from $\boldsymbol{\Psi}^{(k)}$ to $\boldsymbol{\Psi}^{(k+1)}$ via $d$ successive simulations as follows:


(1) Draw $\Psi_1^{(k+1)}$ from $p\left(\Psi_1 \mid \boldsymbol{y}; \Psi_2^{(k)}, \ldots, \Psi_d^{(k)}\right)$.

(2) Draw $\Psi_2^{(k+1)}$ from $p\left(\Psi_2 \mid \boldsymbol{y}; \Psi_1^{(k+1)}, \Psi_3^{(k)}, \ldots, \Psi_d^{(k)}\right)$.

$\vdots$

(d) Draw $\Psi_d^{(k+1)}$ from $p\left(\Psi_d \mid \boldsymbol{y}; \Psi_1^{(k+1)}, \ldots, \Psi_{d-1}^{(k+1)}\right)$.


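As a minimal illustration of this cycle, the following sketch (in Python; the target is a hypothetical bivariate normal "posterior" with correlation rho, chosen only because its two full conditionals are univariate normals that can be sampled exactly) implements the scheme above for $d = 2$.

import numpy as np

# Minimal sketch of a two-block Gibbs sampler (d = 2) for a hypothetical
# bivariate normal "posterior" with zero means, unit variances and
# correlation rho; each full conditional is then univariate normal.
rng = np.random.default_rng(0)
rho = 0.8
n_iter, burn_in = 5000, 500

psi = np.zeros(2)                      # current state (Psi_1^(k), Psi_2^(k))
draws = np.empty((n_iter, 2))

for k in range(n_iter):
    # Step (1): draw Psi_1^(k+1) from p(Psi_1 | y; Psi_2^(k))
    psi[0] = rng.normal(rho * psi[1], np.sqrt(1 - rho**2))
    # Step (2): draw Psi_2^(k+1) from p(Psi_2 | y; Psi_1^(k+1))
    psi[1] = rng.normal(rho * psi[0], np.sqrt(1 - rho**2))
    draws[k] = psi

# After a burn-in period, the retained draws approximate the joint posterior.
print(draws[burn_in:].mean(axis=0), np.corrcoef(draws[burn_in:].T)[0, 1])

In genuine incomplete-data problems the blocks $\Psi_1, \ldots, \Psi_d$ would typically be of higher dimension and would often include the missing data themselves, but the structure of the cycle is unchanged.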
The vector sequence $\{\boldsymbol{\Psi}^{(k)}\}$ thus generated is known to be a realization of a homogeneous Markov chain. Many interesting properties of such a Markov sequence have been established, including geometric convergence, as $k \rightarrow \infty$, to a unique stationary distribution that is the posterior density $p(\Psi_1, \ldots, \Psi_d \mid \boldsymbol{y})$, under certain conditions; see Roberts and Polson (1994). Among other sampling methods, there is the Metropolis-Hastings algorithm (Hastings, 1970), which, in contrast to the Gibbs sampler, accepts the candidate simulated component of $\boldsymbol{\Psi}$ only with some defined acceptance probability (McLachlan and Peel, 2000, Chap. 4).

The Gibbs sampler and other such iterative simulation techniques, being Bayesian in their point of view, treat both parameters and missing values as random variables, and both are subjected to random draws. In iterative algorithms under a frequentist framework, such as the EM-type algorithms, the parameters are subjected to a maximization operation and the missing values to an averaging operation. Thus the various versions of the Gibbs sampler can be viewed as stochastic analogs of the EM, ECM, and ECME algorithms. Besides these connections, the EM-type algorithms also come in useful as providers of starting points for iterative simulation algorithms, where regions of high density are typically not known a priori (McLachlan and Krishnan, 1997, Sect. 6.7.3). The relationship between the EM algorithm and the Gibbs sampler and the connection between their convergence properties have been examined by Sahu and Roberts (1999).

5.5.3 Further Applications of the EM Algorithm

Since the publication of Dempster et al. (1977), the number, variety, and range of applications of the EM algorithm and its extensions have been tremendous. Applications in many different contexts can be found in the monographs of Little and Rubin (2002), McLachlan and Krishnan (1997), and McLachlan and Peel (2000). We conclude the chapter with a quick summary of some of the more interesting and topical applications of the EM algorithm.


5.5.3.1 Bioinformatics: Mixture of Factor Analyzers

The analysis of gene expression microarray data using clustering techniques has an important role to play in the discovery, validation, and understanding of various classes of cancer; see, for example, Alon et al. (1999) and van't Veer et al. (2002). Clustering algorithms can be applied to the problem of clustering genes and tumour tissues (McLachlan et al., 2002) and also to the discovery of motif patterns in DNA sequences (Bailey and Elkan, 1995); see also Chap. IV.3 for a description of biomolecular sequences and structures. The EM algorithm and its variants have been applied to tackle some of the problems arising in such applications. For example, the clustering of tumour tissues on the basis of gene expression is a nonstandard cluster analysis problem, since the dimension of each tissue sample is much greater than the number of tissues. In McLachlan et al. (2002), a mixture of factor analyzers is adopted to effectively reduce the dimension of the feature space of genes. The AECM algorithm (Meng and van Dyk, 1997) can be used to fit the mixture of factor analyzers by ML (McLachlan and Peel, 2000, Chap. 8).
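For reference, under a mixture of factor analyzers the $p$-dimensional feature vector is modelled by a $g$-component normal mixture in which, in an obvious notation, the $i$th component-covariance matrix is constrained to have the factor-analytic form

$\boldsymbol{\Sigma}_i = \boldsymbol{B}_i\boldsymbol{B}_i^{T} + \boldsymbol{D}_i \qquad (i=1,\ldots,g),$

where $\boldsymbol{B}_i$ is a $p \times q$ matrix of factor loadings with $q \ll p$ and $\boldsymbol{D}_i$ is a diagonal matrix; it is this constraint that effects the reduction in the number of covariance parameters to be estimated.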


5.5.3.2 Image Analysis: Hidden Markov Models

In image analysis, where the observed data $\boldsymbol{y}_j$ refer to intensities measured on $n$ pixels in a scene, the associated component-indicator vectors $\boldsymbol{z}_j$ will not be independently distributed, as the intensities of neighbouring pixels are spatially correlated. The set of hidden states $\boldsymbol{z}_j$ is viewed as missing data (McLachlan and Peel, 2000, Chap. 13; van Dyk and Meng, 2001), and a stationary Markovian model over a finite state space is generally formulated for the distribution of the hidden variable $\boldsymbol{Z}$. In one dimension, this Markovian model is a Markov chain, and in two and higher dimensions it is a Markov random field (MRF) (Besag, 1986).

The use of the EM algorithm for a hidden Markov chain, known in the hidden Markov model literature as the Baum-Welch algorithm (Baum et al., 1970), was formulated long before Dempster et al. (1977). Also, Robert et al. (1993) consider a stochastic Bayesian approach to parameter estimation for a hidden Markov chain. Lystig and Hughes (2002) provide a means of implementing an NR approach to obtain parameter estimates and an exact computation of the observed information matrix for hidden Markov models.
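To indicate what the E-step involves in this setting, the sketch below (our own illustration for a discrete-state, discrete-observation hidden Markov chain; the function and variable names are not taken from the works cited above) uses scaled forward and backward recursions to compute the posterior state probabilities on which the Baum-Welch re-estimation formulas are based.

import numpy as np

def forward_backward(y, pi, A, B):
    """Scaled forward-backward recursions for a discrete HMM.

    y  : observed sequence of symbol indices, length T
    pi : initial state distribution, shape (g,)
    A  : transition matrix, A[i, j] = P(Z_{t+1} = j | Z_t = i), shape (g, g)
    B  : emission matrix, B[i, m] = P(Y_t = m | Z_t = i), shape (g, M)

    Returns gamma[t, i] = P(Z_t = i | y), the conditional expectations of
    the state indicators needed in the Baum-Welch E-step.
    """
    T, g = len(y), len(pi)
    alpha = np.zeros((T, g))
    beta = np.zeros((T, g))
    c = np.zeros(T)                          # scaling factors (avoid underflow)

    alpha[0] = pi * B[:, y[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):                    # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):           # backward pass
        beta[t] = (A @ (B[:, y[t + 1]] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

The corresponding M-step re-estimates the initial distribution, the transition matrix, and the emission probabilities from these posterior probabilities, together with the analogous pairwise probabilities for consecutive hidden states.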

The EM algorithm for the hidden MRF is considerably more difficult; see McLachlan (1992, Chap. 13) and the references therein. Even in the exponential family case (see Sect. 5.2.1), the E- and M-steps are difficult to perform, even by numerical methods, except in some very simple cases such as the one-parameter case; in some cases they may be implemented by suitable Gibbs sampler algorithms. A variety of practical procedures has been considered in the literature; they are reviewed by Qian and Titterington (1992), who also suggest a Monte Carlo restoration-estimation algorithm. An approximation to the E-step, based on a fractional-weight version of Besag's iterated conditional modes (ICM) algorithm (Besag, 1986), has been adopted for the segmentation of magnetic resonance images (McLachlan and Peel, 2000, Sect. 13.4). An alternative approach is a Bayesian one, where the likelihood is regularized using a prior, resulting in a better-conditioned log likelihood; this can also be interpreted as a penalized likelihood approach. Random field models such as Gibbs priors are often used in this context to capture the local smooth structure of the images (Geman and Geman, 1984).
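To give a flavour of the kind of approximation involved, the sketch below (a simplified illustration of a single ICM sweep, assuming normal component densities for the pixel intensities and a Potts-type MRF prior with smoothing parameter beta over a first-order neighbourhood; it is not the fractional-weight scheme itself) updates each pixel label in turn to the value that maximizes its conditional posterior given the current labels of its neighbours.

import numpy as np
from scipy.stats import norm

def icm_sweep(y, labels, mu, sigma, beta):
    """One ICM sweep for a hidden Potts/normal model on a 2-D lattice.

    y         : observed intensities, shape (nrow, ncol)
    labels    : current label image, integer entries in {0, ..., g-1}
    mu, sigma : component means and standard deviations, each shape (g,)
    beta      : MRF smoothing parameter (larger values favour smoother labels)
    """
    nrow, ncol = y.shape
    g = len(mu)
    new = labels.copy()
    for r in range(nrow):
        for c in range(ncol):
            # count the 4-neighbours currently carrying each label
            nbr = np.zeros(g)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrow and 0 <= cc < ncol:
                    nbr[new[rr, cc]] += 1
            # conditional log posterior of each possible label for this pixel
            score = norm.logpdf(y[r, c], loc=mu, scale=sigma) + beta * nbr
            new[r, c] = int(np.argmax(score))
    return new

Alternating such restoration sweeps with re-estimation of the component parameters gives restoration-estimation procedures of the kind reviewed by Qian and Titterington (1992); roughly speaking, the fractional-weight version mentioned above replaces the hard label counts by the current posterior probabilities of component membership.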

