41 EM Algorithm

41.1 Rationale

For any likelihood function, \(L({\boldsymbol{\theta}}; {\boldsymbol{x}}) = f({\boldsymbol{x}}; {\boldsymbol{\theta}})\), there are many general-purpose optimization methods that can be used to find the MLE or MAP. However:

  • General-purpose optimization methods can be messy to implement
  • There may be probabilistic structure that we can exploit to simplify the optimization and to provide theoretical guarantees on its convergence
  • Optimization is not the only possible goal: one may also be interested in point estimates of the latent variable values

41.2 Requirement

The expectation-maximization (EM) algorithm allows us to calculate MLEs and MAPs when the probabilistic model satisfies certain structural properties.

For the EM algorithm to be a practical approach, we should have a latent variable model \(f({\boldsymbol{x}}, {\boldsymbol{z}}; {\boldsymbol{\theta}})\) that is used to do inference on \(f({\boldsymbol{x}}; {\boldsymbol{\theta}})\) or \(f({\boldsymbol{\theta}}| {\boldsymbol{x}})\).
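That is, the observed-data likelihood is recovered from the complete-data model by marginalizing over the latent variables; for discrete \({\boldsymbol{z}}\),

\[f({\boldsymbol{x}}; {\boldsymbol{\theta}}) = \sum_{{\boldsymbol{z}}} f({\boldsymbol{x}}, {\boldsymbol{z}}; {\boldsymbol{\theta}})\]

(with the sum replaced by an integral when \({\boldsymbol{z}}\) is continuous). EM is attractive precisely when the complete-data likelihood \(f({\boldsymbol{x}}, {\boldsymbol{z}}; {\boldsymbol{\theta}})\) is easier to work with than this marginal.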

Note: Sometimes \(({\boldsymbol{x}}, {\boldsymbol{z}})\) is called the complete data and \({\boldsymbol{x}}\) is called the observed data when we are using the EM as a method for dealing with missing data.

41.3 The Algorithm

  1. Choose initial value \({\boldsymbol{\theta}}^{(0)}\)

  2. Calculate \(f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)})\)

  3. (E-step) Calculate \[Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)}) = {\operatorname{E}}_{{\boldsymbol{Z}}|{\boldsymbol{X}}={\boldsymbol{x}}}\left[\log f({\boldsymbol{x}}, {\boldsymbol{Z}}; {\boldsymbol{\theta}}); {\boldsymbol{\theta}}^{(t)}\right]\]

  4. (M-step) Set \[{\boldsymbol{\theta}}^{(t+1)} = {\text{argmax}}_{{\boldsymbol{\theta}}} Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)})\]

  5. Iterate steps 2–4 until convergence and set \(\widehat{{\boldsymbol{\theta}}} = {\boldsymbol{\theta}}^{(\infty)}\)
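The steps above can be sketched for a concrete model. Here is a minimal illustration for a two-component Gaussian mixture with unit variances, where \({\boldsymbol{\theta}}= (\pi, \mu_1, \mu_2)\); the simulated data, initial values, and function names are assumptions made for this example, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n draws from pi * N(-2, 1) + (1 - pi) * N(2, 1) with pi = 0.4
n = 500
z = rng.random(n) < 0.4
x = np.where(z, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))

def em_mixture(x, n_iter=200, tol=1e-8):
    """EM for a two-component N(mu_k, 1) mixture; theta = (pi, mu1, mu2)."""
    pi, mu1, mu2 = 0.5, -1.0, 1.0                      # step 1: theta^(0)
    for _ in range(n_iter):
        # step 2 (E-step posterior): g_i = f(z_i = 1 | x_i; theta^(t))
        w1 = pi * np.exp(-0.5 * (x - mu1) ** 2)
        w2 = (1 - pi) * np.exp(-0.5 * (x - mu2) ** 2)
        g = w1 / (w1 + w2)
        # steps 3-4 (M-step): argmax of Q(theta, theta^(t)) in closed form
        pi_new = g.mean()
        mu1_new = (g * x).sum() / g.sum()
        mu2_new = ((1 - g) * x).sum() / (1 - g).sum()
        done = max(abs(pi_new - pi), abs(mu1_new - mu1), abs(mu2_new - mu2)) < tol
        pi, mu1, mu2 = pi_new, mu1_new, mu2_new        # step 5: iterate
        if done:
            break
    return pi, mu1, mu2

pi_hat, mu1_hat, mu2_hat = em_mixture(x)
```

For this model the M-step maximizer happens to be available in closed form (weighted means); in general, step 4 may itself require numerical maximization of \(Q\).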

41.4 \(Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)})\)

Continuous \({\boldsymbol{Z}}\):

\[Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)}) = \int \log f({\boldsymbol{x}}, {\boldsymbol{z}}; {\boldsymbol{\theta}}) f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)}) d{\boldsymbol{z}}\]

Discrete \({\boldsymbol{Z}}\):

\[Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)}) = \sum_{{\boldsymbol{z}}} \log f({\boldsymbol{x}}, {\boldsymbol{z}}; {\boldsymbol{\theta}}) f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)})\]

41.5 EM for MAP

If we wish to calculate the MAP we replace \(Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)})\) with

\[Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)}) = {\operatorname{E}}_{{\boldsymbol{Z}}|{\boldsymbol{X}}={\boldsymbol{x}}}\left[\log f({\boldsymbol{x}}, {\boldsymbol{Z}}; {\boldsymbol{\theta}}); {\boldsymbol{\theta}}^{(t)}\right] + \log f({\boldsymbol{\theta}})\]

where \(f({\boldsymbol{\theta}})\) is the prior distribution on \({\boldsymbol{\theta}}\).