5.2 Basic Theory of the EM Algorithm


5.2.1 The E- and M-Steps

Within the incomplete-data framework of the EM algorithm, we let $ \mathbf{x}$ denote the vector containing the complete data and $ \mathbf{z}$ the vector containing the missing data. Even when a problem does not at first appear to be an incomplete-data one, computation of the MLE is often greatly facilitated by artificially formulating it as one. This is because the EM algorithm exploits the reduced complexity of ML estimation given the complete data; for many statistical problems the complete-data likelihood has a particularly simple form.

We let $ g_c(\mathbf{x}; \mathbf{\Psi})$ denote the p.d.f. of the random vector $ \mathbf{X}$ corresponding to the complete-data vector $ \mathbf{x}$. Then the complete-data log likelihood function that could be formed for $ \mathbf{\Psi}$ if $ \mathbf{x}$ were fully observable is given by

$\displaystyle \notag \log L_c(\mathbf{\Psi}) = \log g_c(\mathbf{x}; \mathbf{\Psi})\,.$    

The EM algorithm approaches the problem of solving the incomplete-data likelihood equation (5.1) indirectly by proceeding iteratively in terms of $ \log L_c(\mathbf{\Psi})$. Since the complete-data log likelihood is unobservable, it is replaced by its conditional expectation given $ \mathbf{y}$, using the current fit for $ \mathbf{\Psi}$. On the $ (k+1)$th iteration of the EM algorithm,

E-Step: Compute $ Q\left(\mathbf{\Psi}; \mathbf{\Psi}^{(k)}\right)$, where

$\displaystyle Q\left(\mathbf{\Psi}; \mathbf{\Psi}^{(k)}\right) = E_{\mathbf{\Psi}^{(k)}} \{ \log L_c(\mathbf{\Psi}) \vert \mathbf{y} \}\,.$ (5.2)

M-Step: Choose $ \mathbf{\Psi}^{(k+1)}$ to be any value of $ \mathbf{\Psi} \in \mathbf{\Omega}$ that maximizes $ Q\left(\mathbf{\Psi}; \mathbf{\Psi}^{(k)}\right)$:

$\displaystyle Q\left(\mathbf{\Psi}^{(k+1)}; \mathbf{\Psi}^{(k)}\right) \geq Q\left(\mathbf{\Psi}; \mathbf{\Psi}^{(k)}\right) \quad \forall \mathbf{\Psi} \in \mathbf{\Omega}\,.$ (5.3)

In (5.2) and elsewhere in this chapter, the operator $ E_{\mathbf{\Psi}^{(k)}}$ denotes expectation using the parameter vector $ \mathbf{\Psi}^{(k)}$. The E- and M-steps are alternated repeatedly until convergence, which may be determined, for instance, by a stopping rule such as $ \Vert\mathbf{\Psi}^{(k+1)} - \mathbf{\Psi}^{(k)}\Vert < \varepsilon$ for some $ \varepsilon > 0$ and an appropriate norm $ \Vert \cdot \Vert$, or by requiring that the difference $ L(\mathbf{\Psi}^{(k+1)}) - L(\mathbf{\Psi}^{(k)})$ between successive values in the sequence of likelihood values $ \{ L(\mathbf{\Psi}^{(k)}) \}$ be arbitrarily small.
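
To fix ideas, the following sketch shows how the E-step, the M-step, and a stopping rule of the kind just described are typically organized in code. It is only a minimal template in Python; the functions e_step, m_step, and log_lik are hypothetical placeholders for the problem-specific computations in (5.2) and (5.3).

import numpy as np

def em(y, psi0, e_step, m_step, log_lik, eps=1e-8, max_iter=1000):
    """Generic EM template: alternate E- and M-steps until the change in
    the parameter estimate or in the log likelihood falls below eps.
    e_step(y, psi)    -> quantities defining Q(. ; psi)      (problem specific)
    m_step(y, stats)  -> a maximizer of Q(. ; psi)           (problem specific)
    log_lik(y, psi)   -> incomplete-data log likelihood      (used to monitor)
    """
    psi = np.asarray(psi0, dtype=float)
    old_ll = log_lik(y, psi)
    for k in range(max_iter):
        stats = e_step(y, psi)          # E-step: form Q(. ; psi^(k))
        psi_new = m_step(y, stats)      # M-step: maximize Q over psi
        new_ll = log_lik(y, psi_new)
        if (np.linalg.norm(psi_new - psi) < eps
                or abs(new_ll - old_ll) < eps):   # stopping rule
            return psi_new, new_ll, k + 1
        psi, old_ll = psi_new, new_ll
    return psi, old_ll, max_iter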

It can be shown that both the E- and M-steps will have particularly simple forms when $ g_c(\mathbf{x}; \mathbf{\Psi})$ is from an exponential family:

$\displaystyle g_c(\mathbf{x}; \mathbf{\Psi}) = b(\mathbf{x}) \exp \left\{ \mathbf{c}^{\top}(\mathbf{\Psi}) \mathbf{t}(\mathbf{x}) \right\} / a(\mathbf{\Psi})\,,$ (5.4)

where $ \mathbf{t}(\mathbf{x})$ is a $ k \times 1 \ (k \geq d)$ vector of complete-data sufficient statistics and $ \mathbf{c}(\mathbf{\Psi})$ is a $ k \times 1$ vector function of the $ d \times 1$ parameter vector $ \mathbf{\Psi}$, and $ a(\mathbf{\Psi})$ and $ b(\mathbf{x})$ are scalar functions. Members of the exponential family include most common distributions, such as the multivariate normal, Poisson, multinomial and others. For exponential families, the E-step can be written as

$\displaystyle \notag Q\left(\mathbf{\Psi}; \mathbf{\Psi}^{(k)}\right) = E_{\mathbf{\Psi}^{(k)}} \{ \log b(\mathbf{X})\vert\mathbf{y} \} + \mathbf{c}^{\top}(\mathbf{\Psi}) \mathbf{t}^{(k)} - \log a(\mathbf{\Psi})\,,$    

where $ \mathbf{t}^{(k)} = E_{\mathbf{\Psi}^{(k)}} \{ \mathbf{t}(\mathbf{X})\vert\mathbf{y} \}$ is the conditional expectation of the complete-data sufficient statistic given $ \mathbf{y}$. The M-step maximizes the $ Q$-function with respect to $ \mathbf{\Psi}$; since $ E_{\mathbf{\Psi}^{(k)}} \{ \log b(\mathbf{X})\vert\mathbf{y} \}$ does not depend on $ \mathbf{\Psi}$, it suffices to write:
E-Step: Compute

$\displaystyle \notag \mathbf{t}^{(k)} = E_{\mathbf{\Psi}^{(k)}} \{ \mathbf{t}(\mathbf{X})\vert\mathbf{y} \}\,.$    

M-Step: Compute

$\displaystyle \notag \mathbf{\Psi}^{(k+1)} = \arg \mathop{\max}_{\mathbf{\Psi}} \left[ \mathbf{c}^{\top}(\mathbf{\Psi}) \mathbf{t}^{(k)} - \log a(\mathbf{\Psi})\right]\,.$    

In Example 2 of Sect. 5.3.2, the complete-data p.d.f. has an exponential family representation. We shall show how the implementation of the EM algorithm can be simplified.
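
As a toy illustration of this simplification (separate from Example 2, and purely for exposition), consider exponentially distributed lifetimes with mean $ \mu$ subject to right censoring. The complete-data density is an exponential family whose sufficient statistic is the total of the lifetimes; by the memoryless property, the E-step replaces each lifetime censored at $ c_i$ by $ c_i + \mu^{(k)}$, and the M-step is a closed-form average. The Python sketch below is written under these assumptions, with illustrative data.

import numpy as np

def em_censored_exponential(times, censored, mu0=1.0, eps=1e-10, max_iter=500):
    """EM for exponential lifetimes (mean mu) with right censoring.
    Complete-data sufficient statistic: t(x) = sum of all lifetimes.
    E-step: t^(k) = sum(observed lifetimes) + sum(c_i + mu^(k)) over
            censored units, since E[X | X > c; mu] = c + mu.
    M-step: mu^(k+1) = t^(k) / n (closed form)."""
    times = np.asarray(times, dtype=float)
    censored = np.asarray(censored, dtype=bool)
    n = times.size
    mu = mu0
    for _ in range(max_iter):
        t_k = times[~censored].sum() + (times[censored] + mu).sum()  # E-step
        mu_new = t_k / n                                             # M-step
        if abs(mu_new - mu) < eps:
            break
        mu = mu_new
    return mu

# 6 observed failure times and 2 lifetimes censored at time 5.0
t = [1.2, 0.7, 3.4, 2.2, 0.5, 1.9, 5.0, 5.0]
c = [False] * 6 + [True] * 2
print(em_censored_exponential(t, c))   # converges to 19.9 / 6

At its fixed point the iteration returns the familiar total-time-on-test estimate, the sum of all recorded times divided by the number of observed failures.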


5.2.2 Generalized EM Algorithm

Often in practice, the solution to the M-step exists in closed form. In those instances where it does not, it may not be feasible to attempt to find the value of $ \mathbf{\Psi}$ that globally maximizes the function $ Q(\mathbf{\Psi}; \mathbf{\Psi}^{(k)})$. For such situations, Dempster et al. (1977) defined a generalized EM (GEM) algorithm for which the M-Step requires $ \mathbf{\Psi}^{(k+1)}$ to be chosen such that

$\displaystyle Q\left(\mathbf{\Psi}^{(k+1)}; \mathbf{\Psi}^{(k)}\right) \geq Q(\mathbf{\Psi}^{(k)}; \mathbf{\Psi}^{(k)})$ (5.5)

holds. That is, one chooses $ \mathbf{\Psi}^{(k+1)}$ to increase the $ Q$-function, $ Q(\mathbf{\Psi}; \mathbf{\Psi}^{(k)})$, over its value at $ \mathbf{\Psi}=\mathbf{\Psi}^{(k)}$, rather than to maximize it over all $ \mathbf{\Psi} \in \mathbf{\Omega}$ in (5.3).
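
One simple way to realize such an M-step in practice is to take a single gradient ascent step on the $ Q$-function and shrink the step until (5.5) holds. The Python sketch below illustrates this idea; Q and grad_Q are hypothetical problem-specific callables, and backtracking is only one of many devices that satisfy (5.5).

import numpy as np

def gem_m_step(Q, grad_Q, psi_k, step=1.0, shrink=0.5, max_tries=30):
    """One GEM update: a gradient ascent step on Q(. ; psi_k), backtracked
    until Q(psi_new; psi_k) >= Q(psi_k; psi_k), i.e. until (5.5) holds.
    Q(psi, psi_k) and grad_Q(psi, psi_k) are problem-specific callables."""
    q_old = Q(psi_k, psi_k)
    g = grad_Q(psi_k, psi_k)
    for _ in range(max_tries):
        psi_new = psi_k + step * g
        if Q(psi_new, psi_k) >= q_old:   # condition (5.5) satisfied
            return psi_new
        step *= shrink                   # shrink the step and retry
    return psi_k                         # fall back: no change (still a GEM step)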

It is of interest to note that the EM (GEM) algorithm as described above implicitly defines a mapping $ \mathbf{\Psi} \rightarrow \mathbf{M}(\mathbf{\Psi})$ from the parameter space $ \mathbf{\Omega}$ to itself such that

$\displaystyle \notag \mathbf{\Psi}^{(k+1)} = \mathbf{M}\left(\mathbf{\Psi}^{(k)}\right) \qquad (k=0,1,2,\ldots)\,.$    

The function $ \mathbf{M}$ is called the EM mapping. We shall use it in the subsequent discussion of the convergence properties of the EM algorithm.


5.2.3 Convergence of the EM Algorithm

Let $ k(\mathbf{x}\vert\mathbf{y}; \mathbf{\Psi}) = g_c(\mathbf{x}; \mathbf{\Psi}) / g(\mathbf{y}; \mathbf{\Psi})$ be the conditional density of $ \mathbf{X}$ given $ \mathbf{Y}=\mathbf{y}$. Then the complete-data log likelihood can be expressed by

$\displaystyle \log L_c(\mathbf{\Psi}) = \log g_c(\mathbf{x}; \mathbf{\Psi}) = \log L(\mathbf{\Psi}) + \log k(\mathbf{x}\vert\mathbf{y}; \mathbf{\Psi})\,.$ (5.6)

Taking expectations on both sides of (5.6) with respect to the conditional distribution of $ \mathbf{X}$ given $ \mathbf{y}$, using the fit $ \mathbf{\Psi}^{(k)}$ for $ \mathbf{\Psi}$, we have

$\displaystyle Q\left(\mathbf{\Psi}; \mathbf{\Psi}^{(k)}\right) = \log L(\mathbf{\Psi}) + H\left(\mathbf{\Psi}; \mathbf{\Psi}^{(k)}\right)\,,$ (5.7)

where $ H\left(\mathbf{\Psi}; \mathbf{\Psi}^{(k)}\right)=E_{\mathbf{\Psi}^{(k)}} \{ \log k(\mathbf{X}\vert\mathbf{y}; \mathbf{\Psi})\vert\mathbf{y} \}.$ It follows from (5.7) that

$\displaystyle \log L\left(\mathbf{\Psi}^{(k+1)}\right)-\log L\left(\mathbf{\Psi}^{(k)}\right) = \left\{ Q\left(\mathbf{\Psi}^{(k+1)}; \mathbf{\Psi}^{(k)}\right) - Q\left(\mathbf{\Psi}^{(k)}; \mathbf{\Psi}^{(k)}\right) \right\} - \left\{ H\left(\mathbf{\Psi}^{(k+1)}; \mathbf{\Psi}^{(k)}\right) - H\left(\mathbf{\Psi}^{(k)}; \mathbf{\Psi}^{(k)}\right) \right\}\,.$ (5.8)

By Jensen's inequality, we have $ H(\mathbf{\Psi}^{(k+1)}; \mathbf{\Psi}^{(k)}) \leq H(\mathbf{\Psi}^{(k)}; \mathbf{\Psi}^{(k)})$. From (5.3) or (5.5), the first difference on the right-hand side of (5.8) is nonnegative. Hence, the likelihood function is not decreased after an EM or GEM iteration:

$\displaystyle L\left(\mathbf{\Psi}^{(k+1)}\right) \geq L\left(\mathbf{\Psi}^{(k)}\right) \qquad (k=0,1,2,\ldots)\,.$ (5.9)

A consequence of (5.9) is the self-consistency of the EM algorithm. Thus for a bounded sequence of likelihood values $ \{ L(\mathbf{\Psi}^{(k)}) \}$, $ L(\mathbf{\Psi}^{(k)})$ converges monotonically to some $ L^{\ast}$. Now questions naturally arise as to the conditions under which $ L^{\ast}$ corresponds to a stationary value and when this stationary value is at least a local maximum if not a global maximum. Examples are known where the EM algorithm converges to a local minimum and to a saddle point of the likelihood (McLachlan and Krishnan, 1997, Sect. 3.6). There are also questions of convergence of the sequence of EM iterates, that is, of the sequence of parameter values $ \{ \mathbf{\Psi}^{(k)} \}$ to the MLE.
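
The monotonicity property (5.9) is easy to verify numerically. The following Python sketch fits a two-component normal mixture with unit component variances (a toy model chosen only to keep the E- and M-steps explicit, not one of the examples treated in this chapter) and checks that the observed-data log likelihood never decreases across iterations.

import numpy as np

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(2.0, 1.0, 100)])

def log_phi(y, mu):                       # log N(y; mu, 1) density
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * (y - mu) ** 2

def log_lik(y, pi, mu1, mu2):             # incomplete-data log likelihood
    return np.sum(np.log(pi * np.exp(log_phi(y, mu1))
                         + (1.0 - pi) * np.exp(log_phi(y, mu2))))

pi, mu1, mu2 = 0.5, -1.0, 1.0              # starting value Psi^(0)
ll = [log_lik(y, pi, mu1, mu2)]
for k in range(200):
    # E-step: posterior probability (responsibility) of component 1
    p1 = pi * np.exp(log_phi(y, mu1))
    p2 = (1.0 - pi) * np.exp(log_phi(y, mu2))
    tau = p1 / (p1 + p2)
    # M-step: closed-form weighted updates
    pi = tau.mean()
    mu1 = np.sum(tau * y) / np.sum(tau)
    mu2 = np.sum((1.0 - tau) * y) / np.sum(1.0 - tau)
    ll.append(log_lik(y, pi, mu1, mu2))

# (5.9): the likelihood sequence never decreases from one iteration to the next
assert all(ll[k + 1] >= ll[k] - 1e-10 for k in range(len(ll) - 1))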

Before the general formulation of the EM algorithm by Dempster et al. (1977), convergence results had been obtained for several special cases, notably by Baum et al. (1970) for what is now called the hidden Markov model. Dempster et al. (1977) themselves give some convergence results, but it is Wu (1983) who investigates in detail the convergence of the EM algorithm in its generality, examining these issues through their relationship to other optimization methods. He shows that when the complete data are from a curved exponential family with compact parameter space, and the $ Q$-function satisfies a certain mild differentiability condition, any EM sequence converges to a stationary point (not necessarily a maximum) of the likelihood function. If $ L(\mathbf{\Psi})$ has multiple stationary points, the type of stationary point (local or global maximizer, saddle point) to which the EM sequence converges depends on the starting value $ \mathbf{\Psi}^{(0)}$ for $ \mathbf{\Psi}$. If $ L(\mathbf{\Psi})$ is unimodal in $ \mathbf{\Omega}$ and satisfies the same differentiability condition, then any sequence $ \{ \mathbf{\Psi}^{(k)} \}$ converges to the unique MLE of $ \mathbf{\Psi}$, irrespective of its starting value.

To be more specific, one of the basic convergence results of the EM algorithm is the following:

$\displaystyle \notag \log L(\mathbf{M}(\mathbf{\Psi})) \geq \log L(\mathbf{\Psi})$    

with equality if and only if

$\displaystyle \notag Q(\mathbf{M}(\mathbf{\Psi}); \mathbf{\Psi}) = Q(\mathbf{\Psi}; \mathbf{\Psi}) \quad$   and   $\displaystyle \quad k(\mathbf{x}\vert\mathbf{y}; \mathbf{M}(\mathbf{\Psi})) = k(\mathbf{x}\vert\mathbf{y}; \mathbf{\Psi})\,.$    

This means that the likelihood function increases at each iteration of the EM algorithm, until the condition for equality is satisfied and a fixed point of the iteration is reached. If $ \widehat{\mathbf{\Psi}}$ is an MLE, so that $ \log L(\widehat{\mathbf{\Psi}}) \geq \log L(\mathbf{\Psi}),\ \forall\ \mathbf{\Psi} \in \mathbf{\Omega}$, then $ \log L(\mathbf{M}(\widehat{\mathbf{\Psi}})) = \log L(\widehat{\mathbf{\Psi}})$. Thus MLEs are fixed points of the EM algorithm. If the likelihood function is bounded (as in many cases of practical interest), the EM sequence $ \{ \mathbf{\Psi}^{(k)} \}$ yields a bounded, nondecreasing sequence $ \{\log L(\mathbf{\Psi}^{(k)})\}$, which must converge as $ k \to \infty$.

The theorem does not quite imply that fixed points of the EM algorithm are in fact MLEs. This is, however, true under fairly general conditions. For proofs and other details, see McLachlan and Krishnan (1997, Sect. 3.5) and Wu (1983). Furthermore, if a sequence of EM iterates $ \{ \mathbf{\Psi}^{(k)} \}$ satisfies the conditions

  1. $ [\partial Q(\mathbf{\Psi}; \mathbf{\Psi}^{(k)}) / \partial \mathbf{\Psi}]_{\mathbf{\Psi}=\mathbf{\Psi}^{(k+1)}} = \boldsymbol{0}$, and
  2. the sequence $ \{ \mathbf{\Psi}^{(k)} \}$ converges to some value $ \mathbf{\Psi}^{\ast}$ and $ \log k(\mathbf{x}\vert\mathbf{y}; \mathbf{\Psi})$ is sufficiently smooth,
then we have $ [\partial \log L(\mathbf{\Psi}) / \partial \mathbf{\Psi}]_{\mathbf{\Psi}=\mathbf{\Psi}^{\ast}} = \boldsymbol{0}$; see Little and Rubin (2002) and Wu (1983). Thus, despite the earlier convergence results, there is no guarantee that the convergence will be to a global maximum. For likelihood functions with multiple maxima, convergence will be to a local maximum which depends on the starting value $ \mathbf{\Psi}^{(0)}$.

Nettleton (1999) extends Wu's convergence results to the case of constrained parameter spaces and establishes stricter conditions that guarantee convergence of the EM likelihood sequence to some local maximum and of the EM parameter iterates to the MLE.


5.2.4 Rate of Convergence of the EM Algorithm

The rate of convergence of the EM algorithm is also of practical interest. Its convergence is usually slower than the quadratic convergence typically available with Newton-type methods. Dempster et al. (1977) show that the rate of convergence of the EM algorithm is linear, with the rate depending on the proportion of information about $ \mathbf{\Psi}$ contained in the observed data: relative to the formulated complete-data problem, if a large portion of the data is missing, convergence can be quite slow.

Recall the EM mapping $ \mathbf{M}$ defined in Sect. 5.2.2. If $ \mathbf{\Psi}^{(k)}$ converges to some point  $ \mathbf{\Psi}^{\ast}$ and $ \mathbf{M}(\mathbf{\Psi})$ is continuous, then $ \mathbf{\Psi}^{\ast}$ is a fixed point of the algorithm; that is, $ \mathbf{\Psi}^{\ast}$ must satisfy $ \mathbf{\Psi}^{\ast} = \mathbf{M}(\mathbf{\Psi}^{\ast}).$ By a Taylor series expansion of $ \mathbf{\Psi}^{(k+1)}=\mathbf{M}(\mathbf{\Psi}^{(k)})$ about the point $ \mathbf{\Psi}^{(k)}=\mathbf{\Psi}^{\ast}$, we have in a neighborhood of $ \mathbf{\Psi}^{\ast}$ that

$\displaystyle \notag \mathbf{\Psi}^{(k+1)}-\mathbf{\Psi}^{\ast} \approx \mathbf{J}\left(\mathbf{\Psi}^{\ast}\right) (\mathbf{\Psi}^{(k)}-\mathbf{\Psi}^{\ast})\,,$    

where $ \mathbf{J}(\mathbf{\Psi})$ is the $ d \times d$ Jacobian matrix for $ \mathbf{M}(\mathbf{\Psi})=(M_1(\mathbf{\Psi}),\ldots,M_d(\mathbf{\Psi}))^{\top}$, having $ (i,j)$th element $ r_{ij}(\mathbf{\Psi})$ equal to

$\displaystyle \notag r_{ij}(\mathbf{\Psi}) = \partial M_i(\mathbf{\Psi}) / \partial \Psi_j\,,$    

where $ \Psi_j=(\mathbf{\Psi})_j$ and $ d$ is the dimension of $ \mathbf{\Psi}$. Thus, in a neighborhood of $ \mathbf{\Psi}^{\ast}$, the EM algorithm is essentially a linear iteration with rate matrix $ \mathbf{J}(\mathbf{\Psi}^{\ast})$, since $ \mathbf{J}(\mathbf{\Psi}^{\ast})$ is typically nonzero. For this reason, $ \mathbf{J}(\mathbf{\Psi}^{\ast})$ is often referred to as the matrix rate of convergence. For vector $ \mathbf{\Psi}$, a measure of the actual observed convergence rate is the global rate of convergence, which is defined as

$\displaystyle \notag r = \lim_{k \rightarrow \infty} \parallel \mathbf{\Psi}^{(k+1)} - \mathbf{\Psi}^{\ast} \parallel / \parallel \mathbf{\Psi}^{(k)} - \mathbf{\Psi}^{\ast} \parallel\,,$    

where $ \parallel \cdot \parallel$ is any norm on $ d$-dimensional Euclidean space $ \Re^d$. Under certain regularity conditions, this rate equals the largest eigenvalue of $ \mathbf{J}(\mathbf{\Psi}^{\ast})$ (Meng and van Dyk, 1997). As a large value of $ r$ implies slow convergence, the global speed of convergence is defined to be $ s=1-r$ (Meng, 1994).
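
In practice, $ r$ and $ s$ can be estimated directly from the stored EM iterates by taking ratios of successive distances to the limit, with the final iterate standing in for the unknown $ \mathbf{\Psi}^{\ast}$. The Python sketch below is only a rough numerical diagnostic; the array psis of iterates and the tolerance are assumptions made for illustration.

import numpy as np

def global_rate(psis):
    """Estimate the global rate r (and speed s = 1 - r) from stored EM
    iterates. psis has shape (n_iter, d); the final iterate is used in
    place of the unknown limit Psi*, so the estimate is approximate."""
    psis = np.asarray(psis, dtype=float)
    dist = np.linalg.norm(psis - psis[-1], axis=1)   # ||Psi^(k) - Psi*||
    # keep successive pairs whose distances are not yet numerically zero
    ok = (dist[:-1] > 1e-12) & (dist[1:] > 1e-12)
    ratios = dist[1:][ok] / dist[:-1][ok]
    r = float(np.median(ratios[-10:])) if ratios.size else 0.0
    return r, 1.0 - r

For the mixture example given earlier, for instance, one could collect the triples $ (\pi^{(k)}, \mu_1^{(k)}, \mu_2^{(k)})$ over the iterations into psis and pass them to this function.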

5.2.5 Properties of the EM Algorithm

The EM algorithm has several appealing properties, some of which are:
  1. It is numerically stable with each EM iteration increasing the likelihood.
  2. Under fairly general conditions, it has reliable global convergence.
  3. It is easily implemented, analytically and computationally. In particular, it is generally easy to program and requires little storage space. By watching the monotone increase in likelihood (if it is easily evaluated) over iterations, it is easy to monitor convergence and to detect programming errors (McLachlan and Krishnan, 1997, Sect. 1.7).
  4. The cost per iteration is generally low, which can offset the larger number of iterations needed for the EM algorithm compared to other competing procedures.
  5. It can be used to provide estimates of missing data.
Some of its drawbacks are:
  1. It does not automatically provide an estimate of the covariance matrix of the parameter estimates. However, this disadvantage can be easily removed by using appropriate methodology associated with the EM algorithm (McLachlan and Krishnan, 1997, Chap. 4).
  2. It is sometimes very slow to converge.
  3. In some problems, the E- or M-steps may be analytically intractable.
We shall briefly address these three issues in Sects. 5.3.5 and 5.4.

