43 Theory of EM
43.1 Decomposition
Let \(q({\boldsymbol{z}})\) be a probability distribution on the latent variables, \({\boldsymbol{z}}\). Consider the following decomposition:
\[ \log f({\boldsymbol{x}}; {\boldsymbol{\theta}}) = \mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}}) + {\text{KL}}(q({\boldsymbol{z}}) \|f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}})) \]
where
\[ \mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}}) = \int q({\boldsymbol{z}}) \log\left(\frac{f({\boldsymbol{x}}, {\boldsymbol{z}}; {\boldsymbol{\theta}})}{q({\boldsymbol{z}})}\right) d{\boldsymbol{z}}\]
\[ {\text{KL}}(q({\boldsymbol{z}}) \| f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}})) = - \int q({\boldsymbol{z}}) \log\left(\frac{f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}})}{q({\boldsymbol{z}})}\right) d{\boldsymbol{z}}\]
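To make the decomposition concrete, here is a minimal numerical sketch (not part of the notes) for a two-component Gaussian mixture with a single observation; the weights, means, and standard deviations are arbitrary illustrative values. It checks that \(\mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}})\) and the KL term sum to \(\log f({\boldsymbol{x}}; {\boldsymbol{\theta}})\) for an arbitrary choice of \(q\).

```python
# Numerical check of log f(x; theta) = L(q, theta) + KL(q || f(z | x; theta))
# for a two-component Gaussian mixture with one observation x (illustrative values).
import numpy as np
from scipy.stats import norm

x = 1.3                                   # one observed data point
weights = np.array([0.4, 0.6])            # mixture weights (part of theta)
means = np.array([-1.0, 2.0])             # component means (part of theta)
sds = np.array([1.0, 0.5])                # component standard deviations

# Joint f(x, z; theta) for z = 1, 2, marginal f(x; theta), posterior f(z | x; theta)
joint = weights * norm.pdf(x, means, sds)
marginal = joint.sum()
posterior = joint / marginal

q = np.array([0.7, 0.3])                  # an arbitrary distribution over z

lower_bound = np.sum(q * np.log(joint / q))       # L(q, theta)
kl = -np.sum(q * np.log(posterior / q))           # KL(q || f(z | x; theta))

print(np.log(marginal), lower_bound + kl)         # the two numbers agree
```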
43.2 Kullback-Leibler Divergence
The KL divergence provides an asymmetric measure of the difference between two probability distributions.
The KL divergence is such that \({\text{KL}}(q \| f) \geq 0\), with \({\text{KL}}(q \| f) = 0\) if and only if \(q=f\) (almost everywhere). This property is known as Gibbs' inequality.
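As a quick illustration (assumed values, not from the notes), the sketch below computes the discrete KL divergence for two small probability vectors, showing that it is nonnegative, asymmetric, and exactly zero when the two distributions coincide.

```python
# Discrete KL divergence: nonnegative, asymmetric, zero only when q = f.
import numpy as np

def kl(q, f):
    """KL(q || f) = sum_z q(z) log(q(z) / f(z)) for discrete distributions."""
    q, f = np.asarray(q, float), np.asarray(f, float)
    return np.sum(q * np.log(q / f))

q = [0.5, 0.3, 0.2]
f = [0.2, 0.5, 0.3]

print(kl(q, f), kl(f, q))   # both positive, but not equal (asymmetry)
print(kl(q, q))             # 0.0: the equality case of Gibbs' inequality
```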
43.3 Lower Bound
Note that \(\mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}})\) provides a lower bound on the log-likelihood:
\[ \log f({\boldsymbol{x}}; {\boldsymbol{\theta}}) \geq \mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}}) \]
If we set \(q({\boldsymbol{z}}) = f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)})\), then for a fixed \({\boldsymbol{\theta}}^{(t)}\) and as a function of \({\boldsymbol{\theta}}\),
\[ \begin{aligned} \mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}}) & = \int f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)}) \log f({\boldsymbol{x}}, {\boldsymbol{z}}; {\boldsymbol{\theta}}) d{\boldsymbol{z}}+ \text{const} \\ & = Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)}) + \text{const}, \end{aligned} \]
where the constant is the entropy of \(q({\boldsymbol{z}})\) and does not depend on \({\boldsymbol{\theta}}\). Maximizing \(\mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}})\) over \({\boldsymbol{\theta}}\) is therefore equivalent to maximizing \(Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)})\).
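The sketch below (continuing the illustrative mixture example above, with the first component mean treated as the only free parameter) checks numerically that \(\mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}}) = Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)}) + \text{const}\) never exceeds \(\log f({\boldsymbol{x}}; {\boldsymbol{\theta}})\) and touches it at \({\boldsymbol{\theta}} = {\boldsymbol{\theta}}^{(t)}\).

```python
# Lower bound L(q, theta) = Q(theta, theta_t) + entropy(q) vs. log f(x; theta),
# with q(z) = f(z | x; theta_t), varying only the first component mean mu1.
import numpy as np
from scipy.stats import norm

x = 1.3
weights, sds = np.array([0.4, 0.6]), np.array([1.0, 0.5])
mu_t = np.array([-1.0, 2.0])                       # current estimate theta^(t)

def joint(mu):                                     # f(x, z; theta) for z = 1, 2
    return weights * norm.pdf(x, mu, sds)

q = joint(mu_t) / joint(mu_t).sum()                # q(z) = f(z | x; theta^(t))
entropy_q = -np.sum(q * np.log(q))                 # constant in theta

for mu1 in [-2.0, -1.0, 0.0, 1.0]:                 # candidate values of mu_1
    mu = np.array([mu1, mu_t[1]])
    log_lik = np.log(joint(mu).sum())              # log f(x; theta)
    Q = np.sum(q * np.log(joint(mu)))              # Q(theta, theta^(t))
    bound = Q + entropy_q                          # L(q, theta)
    print(f"mu1={mu1:5.1f}  log-lik={log_lik:8.4f}  bound={bound:8.4f}")
# The bound never exceeds the log-likelihood and equals it at mu1 = -1.0.
```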
43.4 EM Increases Likelihood
Since \({\boldsymbol{\theta}}^{(t+1)} = {\text{argmax}}_{{\boldsymbol{\theta}}} Q({\boldsymbol{\theta}}, {\boldsymbol{\theta}}^{(t)})\), it follows that
\[Q({\boldsymbol{\theta}}^{(t+1)}, {\boldsymbol{\theta}}^{(t)}) \geq Q({\boldsymbol{\theta}}^{(t)}, {\boldsymbol{\theta}}^{(t)}).\]
Also, with \(q({\boldsymbol{z}}) = f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)})\), the properties of the KL divergence stated above give
\[ {\text{KL}}(f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)}) \| f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t+1)})) \geq 0 = {\text{KL}}(f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)}) \| f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)})). \]
Putting these together, with \(q({\boldsymbol{z}}) = f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)})\),
\[ \begin{aligned} \log f({\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t+1)}) & = \mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}}^{(t+1)}) + {\text{KL}}(q({\boldsymbol{z}}) \| f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t+1)})) \\ & \geq \mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}}^{(t+1)}) \\ & \geq \mathcal{L}(q({\boldsymbol{z}}), {\boldsymbol{\theta}}^{(t)}) = \log f({\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)}), \end{aligned} \]
where the second inequality follows from the increase in \(Q\) and the final equality holds because \({\text{KL}}(q({\boldsymbol{z}}) \| f({\boldsymbol{z}}| {\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)})) = 0\). Hence
\[ \log f({\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t+1)}) \geq \log f({\boldsymbol{x}}; {\boldsymbol{\theta}}^{(t)}). \]
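As a sanity check (a simple sketch under assumed settings, not an implementation from the notes), the following EM iteration for a two-component Gaussian mixture with known unit variances asserts at every step that the observed-data log-likelihood has not decreased.

```python
# EM for a two-component Gaussian mixture with known unit variances,
# verifying that the observed-data log-likelihood never decreases.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 100)])

pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])     # initial theta^(0)
prev_ll = -np.inf
for t in range(25):
    # E-step: responsibilities f(z | x_i; theta^(t))
    joint = pi * norm.pdf(x[:, None], mu, 1.0)           # shape (n, 2)
    ll = np.sum(np.log(joint.sum(axis=1)))               # log f(x; theta^(t))
    assert ll >= prev_ll - 1e-10, "log-likelihood decreased"
    prev_ll = ll
    r = joint / joint.sum(axis=1, keepdims=True)
    # M-step: maximize Q(theta, theta^(t))
    pi = r.mean(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    print(f"iter {t:2d}  log-lik {ll:.4f}")
```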