# 43 Theory of EM

## 43.1 Decomposition

Let $$q({\boldsymbol{z}})$$ be a probability distribution over the latent variables $${\boldsymbol{z}}$$, and consider the following decomposition of the log-likelihood:

$\log f({\boldsymbol{x}}; {\boldsymbol{\theta}}) = \mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}}) + {\text{KL}}(q({\boldsymbol{z}}) \|f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}))$

where

$\mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}}) = \int q({\boldsymbol{z}}) \log\left(\frac{f({\boldsymbol{x}}, {\boldsymbol{z}}; {\boldsymbol{\theta}})}{q({\boldsymbol{z}})}\right) d{\boldsymbol{z}}$

${\text{KL}}(q({\boldsymbol{z}}) \| f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}})) = - \int q({\boldsymbol{z}}) \log\left(\frac{f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}})}{q({\boldsymbol{z}})}\right) d{\boldsymbol{z}}$
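As a sanity check, the decomposition can be verified numerically for a discrete latent variable, where the integrals become sums. The joint probabilities and the choice of $$q({\boldsymbol{z}})$$ below are arbitrary illustrative values, not part of the text:

```python
import numpy as np

# Hypothetical discrete example: z ∈ {0, 1}, with the joint f(x, z; θ)
# evaluated at the observed x (values chosen arbitrarily).
joint = np.array([0.15, 0.35])      # f(x, z=0), f(x, z=1)
log_fx = np.log(joint.sum())        # log f(x; θ) = log Σ_z f(x, z; θ)

posterior = joint / joint.sum()     # f(z | x; θ)
q = np.array([0.6, 0.4])            # any distribution q(z)

lower_bound = np.sum(q * np.log(joint / q))   # L(q, θ)
kl = -np.sum(q * np.log(posterior / q))       # KL(q || f(z | x; θ))

# The identity holds exactly: log f(x; θ) = L(q, θ) + KL(q || f(z|x; θ))
assert np.isclose(log_fx, lower_bound + kl)
```

The identity holds for any choice of $$q({\boldsymbol{z}})$$ with full support, since it follows from $$f({\boldsymbol{x}}, {\boldsymbol{z}}; {\boldsymbol{\theta}}) = f({\boldsymbol{z}} | {\boldsymbol{x}}; {\boldsymbol{\theta}}) f({\boldsymbol{x}}; {\boldsymbol{\theta}})$$ alone.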

## 43.2 Kullback-Leibler Divergence

The KL divergence provides an asymmetric measure of the difference between two probability distributions.

The KL divergence satisfies $${\text{KL}}(q \| f) \geq 0$$, with equality if and only if $$q = f$$. This property is known as Gibbs' inequality.
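These two properties, nonnegativity with equality only at $$q = f$$, and asymmetry, can be illustrated with a pair of arbitrary discrete distributions (the specific probability values below are hypothetical):

```python
import numpy as np

def kl(q, f):
    """KL(q || f) for discrete distributions with full support."""
    return np.sum(q * np.log(q / f))

# Two arbitrary distributions on three outcomes.
q = np.array([0.7, 0.2, 0.1])
f = np.array([0.1, 0.3, 0.6])

assert kl(q, f) > 0 and kl(f, q) > 0      # nonnegativity (here strict, q ≠ f)
assert not np.isclose(kl(q, f), kl(f, q)) # asymmetry: KL(q||f) ≠ KL(f||q)
assert np.isclose(kl(q, q), 0.0)          # Gibbs' inequality: equality at q = f
```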

## 43.3 Lower Bound

Since the KL divergence is nonnegative, $$\mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}})$$ provides a lower bound on the log-likelihood:

$\log f({\boldsymbol{x}}; {\boldsymbol{\theta}}) \geq \mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}})$

If we set $$q({\boldsymbol{z}}) = f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)})$$, then for a fixed $${\boldsymbol{\theta}}^{(t)}$$ and as a function of $${\boldsymbol{\theta}}$$,

\begin{aligned} \mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}}) & = \int f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)}) \log f({\boldsymbol{x}}, {\boldsymbol{z}}; {\boldsymbol{\theta}}) d{\boldsymbol{z}} - \int f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)}) \log f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)}) d{\boldsymbol{z}}\\ & = Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)}) + \text{const,} \end{aligned}

where the constant (the entropy of $$q$$) does not depend on $${\boldsymbol{\theta}}$$, so maximizing $$\mathcal{L}$$ over $${\boldsymbol{\theta}}$$ is equivalent to maximizing $$Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)})$$.
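The fact that $$\mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}})$$ and $$Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)})$$ differ only by a term constant in $${\boldsymbol{\theta}}$$ can be checked on a small hypothetical model (a Bernoulli latent $$z$$ with arbitrary emission probabilities, not from the text):

```python
import numpy as np

# Hypothetical model: z ~ Bernoulli(θ), P(x=1 | z=1) = 0.8, P(x=1 | z=0) = 0.2.
# With observed x = 1, the joint is f(x=1, z; θ) = [(1-θ)·0.2, θ·0.8].
def joint(theta):
    return np.array([(1 - theta) * 0.2, theta * 0.8])

theta_t = 0.3
q = joint(theta_t) / joint(theta_t).sum()   # q(z) = f(z | x; θ^(t))

def lower_bound(theta):
    return np.sum(q * np.log(joint(theta) / q))   # L(q, θ)

def Q(theta):
    return np.sum(q * np.log(joint(theta)))       # Q(θ, θ^(t))

# L(q, θ) − Q(θ, θ^(t)) equals the entropy of q — the same for every θ.
diffs = [lower_bound(th) - Q(th) for th in (0.1, 0.4, 0.7, 0.9)]
assert np.allclose(diffs, diffs[0])
```

Because the gap is constant in $$\theta$$, the two functions share the same maximizer, which is why the M-step can work with $$Q$$ alone.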

## 43.4 EM Increases Likelihood

Since $${\boldsymbol{\theta}}^{(t+1)} = {\text{argmax}}_{{\boldsymbol{\theta}}} Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)})$$, it follows that

$Q({\boldsymbol{\theta}}^{(t+1)}, {\boldsymbol{\theta}}^{(t)}) \geq Q({\boldsymbol{\theta}}^{(t)}, {\boldsymbol{\theta}}^{(t)}).$

Also, since $$q({\boldsymbol{z}}) = f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)})$$ is held fixed, the properties of the KL divergence stated above give

${\text{KL}}(f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)}) \| f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t+1)})) \geq {\text{KL}}(f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)}) \| f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)})) = 0.$

Putting these together via the decomposition, with $$q({\boldsymbol{z}}) = f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)})$$,

$\log f({\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t+1)}) = \mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}}^{(t+1)}) + {\text{KL}}(q({\boldsymbol{z}}) \| f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t+1)})) \geq \mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}}^{(t)}) = \log f({\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)}),$

so each EM iteration cannot decrease the log-likelihood:

$\log f({\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t+1)}) \geq \log f({\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)}).$
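The monotonicity guarantee can be observed directly by running EM on a toy two-component Gaussian mixture with known unit variances (the data, starting values, and iteration count below are all hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: a two-component Gaussian mixture, unit variances.
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

def log_lik(x, pi, mu):
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    return np.log(dens.sum(axis=1)).sum()

pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])  # θ^(0)
ll = [log_lik(x, pi, mu)]
for _ in range(50):
    # E-step: responsibilities r = f(z | x; θ^(t))
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: θ^(t+1) = argmax_θ Q(θ, θ^(t)), in closed form here
    pi = r.mean(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    ll.append(log_lik(x, pi, mu))

# Each EM step increases (or leaves unchanged) the log-likelihood.
assert all(b >= a - 1e-10 for a, b in zip(ll, ll[1:]))
```

The assertion checks exactly the inequality derived above, at every iteration; note that monotonicity guarantees convergence of the log-likelihood sequence, but not convergence to a global maximum.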