45 Markov Chain Monte Carlo

45.1 Motivation

When performing Bayesian inference, it is often (but not always) possible to calculate

\[f({\boldsymbol{\theta}}| {\boldsymbol{x}}) \propto L({\boldsymbol{\theta}}; {\boldsymbol{x}}) f({\boldsymbol{\theta}})\]

but it is typically much more difficult to calculate

\[f({\boldsymbol{\theta}}| {\boldsymbol{x}}) = \frac{L({\boldsymbol{\theta}}; {\boldsymbol{x}}) f({\boldsymbol{\theta}})}{f({\boldsymbol{x}})}.\]

Markov chain Monte Carlo is a method for simulating data approximately from \(f({\boldsymbol{\theta}}| {\boldsymbol{x}})\) with knowledge of only \(L({\boldsymbol{\theta}}; {\boldsymbol{x}}) f({\boldsymbol{\theta}})\).

45.2 Note

MCMC can be used to approximately simulate data from any distribution that is characterized only up to a constant of proportionality, but it is probably best known for doing so in the context of Bayesian inference.

We will explain MCMC in the context of Bayesian inference.

45.3 Big Picture

We draw a Markov chain of \({\boldsymbol{\theta}}\) values so that, in some asymptotic sense, these are equivalent to iid draws from \(f({\boldsymbol{\theta}}| {\boldsymbol{x}})\).

The draws are made sequentially, so that the next realization of \({\boldsymbol{\theta}}\) depends on the current value.

The Markov chain is set up so that it only depends on \(L({\boldsymbol{\theta}}; {\boldsymbol{x}}) f({\boldsymbol{\theta}})\).

A lot of practical decisions need to be made by the user, so utilize MCMC carefully.

45.4 Metropolis-Hastings Algorithm

  1. Initialize \({\boldsymbol{\theta}}^{(0)}\)

  2. Generate \({\boldsymbol{\theta}}^{*} \sim q({\boldsymbol{\theta}}| {\boldsymbol{\theta}}^{(b)})\) for some pdf or pmf \(q(\cdot | \cdot)\)

  3. With probability \[A({\boldsymbol{\theta}}^{*}, {\boldsymbol{\theta}}^{(b)}) = \min\left( 1, \frac{L({\boldsymbol{\theta}}^{*}; {\boldsymbol{x}}) f({\boldsymbol{\theta}}^{*}) q({\boldsymbol{\theta}}^{(b)} | {\boldsymbol{\theta}}^{*})}{L({\boldsymbol{\theta}}^{(b)}; {\boldsymbol{x}}) f({\boldsymbol{\theta}}^{(b)}) q({\boldsymbol{\theta}}^{*} | {\boldsymbol{\theta}}^{(b)})} \right)\] set \({\boldsymbol{\theta}}^{(b+1)} = {\boldsymbol{\theta}}^{*}\). Otherwise, set \({\boldsymbol{\theta}}^{(b+1)} = {\boldsymbol{\theta}}^{(b)}\)

  4. Continue for \(b = 1, 2, \ldots, B\) iterations and carefully select which \({\boldsymbol{\theta}}^{(b)}\) are utilized to approximate iid observations from \(f({\boldsymbol{\theta}}| {\boldsymbol{x}})\)
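The steps above can be sketched in a few lines of Python. The target here is a hypothetical one-dimensional unnormalized density standing in for \(L({\boldsymbol{\theta}}; {\boldsymbol{x}}) f({\boldsymbol{\theta}})\); the function names and proposal are illustrative choices, not part of any particular library.

```python
import math
import random

def metropolis_hastings(log_target, propose, log_q, theta0, B):
    """Metropolis-Hastings sampler.

    log_target : log of L(theta; x) * f(theta), known only up to a constant
    propose    : draws theta_star ~ q(. | theta)
    log_q(a, b): log of q(a | b), up to a constant
    """
    theta = theta0
    chain = []
    for _ in range(B):
        star = propose(theta)
        # log of the acceptance probability A(theta_star, theta^(b))
        log_A = (log_target(star) - log_target(theta)
                 + log_q(theta, star) - log_q(star, theta))
        if random.random() < math.exp(min(0.0, log_A)):
            theta = star  # accept; otherwise keep the current value
        chain.append(theta)
    return chain

# Hypothetical target proportional to exp(-theta^2 / 2) (a standard Normal),
# sampled with a Normal(theta, 1) random-walk proposal.
random.seed(0)
chain = metropolis_hastings(
    log_target=lambda t: -0.5 * t * t,
    propose=lambda t: random.gauss(t, 1.0),
    log_q=lambda a, b: -0.5 * (a - b) ** 2,
    theta0=0.0,
    B=5000,
)
```

Working on the log scale avoids numerical underflow when the likelihood and prior are products of many small terms.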

45.5 Metropolis Algorithm

The Metropolis algorithm restricts \(q(\cdot | \cdot)\) to be symmetric so that \(q({\boldsymbol{\theta}}^{(b)} | {\boldsymbol{\theta}}^{*}) = q({\boldsymbol{\theta}}^{*} | {\boldsymbol{\theta}}^{(b)})\) and

\[ A({\boldsymbol{\theta}}^{*}, {\boldsymbol{\theta}}^{(b)}) = \min\left( 1, \frac{L({\boldsymbol{\theta}}^{*}; {\boldsymbol{x}}) f({\boldsymbol{\theta}}^{*})}{L({\boldsymbol{\theta}}^{(b)}; {\boldsymbol{x}}) f({\boldsymbol{\theta}}^{(b)})} \right). \]

45.6 Utilizing MCMC Output

Two common uses of the output from MCMC are as follows:

  1. \({\operatorname{E}}[f({\boldsymbol{\theta}}) | {\boldsymbol{x}}]\) is approximated by \[ \hat{{\operatorname{E}}}[f({\boldsymbol{\theta}}) | {\boldsymbol{x}}] = \frac{1}{B} \sum_{b=1}^B f\left({\boldsymbol{\theta}}^{(b)}\right). \]

  2. Some subsequence \({\boldsymbol{\theta}}^{(b_1)}, {\boldsymbol{\theta}}^{(b_2)}, \ldots, {\boldsymbol{\theta}}^{(b_m)}\) from \(\left\{{\boldsymbol{\theta}}^{(b)}\right\}_{b=1}^{B}\) is utilized as an empirical approximation to iid draws from \(f({\boldsymbol{\theta}}| {\boldsymbol{x}})\).
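Both uses can be combined in one small helper. This is a sketch under the assumption that the chain is stored as a Python list; the function name and the burn-in/thinning values are illustrative.

```python
def summarize_chain(chain, burn_in, thin, g=lambda t: t):
    """Approximate E[g(theta) | x] from MCMC output and extract a
    thinned subsequence to use as approximately iid draws."""
    kept = chain[burn_in::thin]  # drop burn-in, then keep every `thin`-th draw
    estimate = sum(g(t) for t in kept) / len(kept)
    return estimate, kept

# Hypothetical short chain, discarding the first 2 draws and thinning by 2.
chain = [0.1, 0.5, 0.5, -0.2, 0.3, 0.3, 0.0, -0.1, 0.4, 0.2]
post_mean, draws = summarize_chain(chain, burn_in=2, thin=2)
```

In practice the ergodic average in use 1 is often computed on the same thinned, post-burn-in draws rather than on the full chain.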

45.7 Remarks

  • The random draw \({\boldsymbol{\theta}}^{*} \sim q({\boldsymbol{\theta}}| {\boldsymbol{\theta}}^{(b)})\) perturbs the current value \({\boldsymbol{\theta}}^{(b)}\) to the next value \({\boldsymbol{\theta}}^{(b+1)}\). It is often a Normal distribution for continuous \({\boldsymbol{\theta}}\).
  • Choosing the variance of \(q({\boldsymbol{\theta}}| {\boldsymbol{\theta}}^{(b)})\) involves a trade-off: the variance must be large enough that the chain explores the target distribution within a reasonable number of iterations, but not so large that proposed values of \({\boldsymbol{\theta}}^{(b+1)}\) are rarely accepted.
  • \(A({\boldsymbol{\theta}}^{*}, {\boldsymbol{\theta}}^{(b)})\) is called the acceptance probability.
  • The algorithm must be run for a certain number of iterations (“burn in”) before observed \({\boldsymbol{\theta}}^{(b)}\) can be utilized.
  • The generated \({\boldsymbol{\theta}}^{(b)}\) are typically “thinned” (only sampled every so often) to reduce Markov dependence.
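The proposal-variance trade-off in the remarks above can be seen empirically. The following sketch assumes a standard Normal target (a stand-in for \(L({\boldsymbol{\theta}}; {\boldsymbol{x}}) f({\boldsymbol{\theta}})\)) and tracks the acceptance rate of a random-walk Metropolis sampler at several proposal scales:

```python
import math
import random

def acceptance_rate(log_target, scale, B=4000):
    """Fraction of accepted proposals for a symmetric Normal random-walk
    Metropolis sampler with the given proposal standard deviation."""
    theta, accepted = 0.0, 0
    for _ in range(B):
        star = random.gauss(theta, scale)
        if random.random() < math.exp(min(0.0, log_target(star) - log_target(theta))):
            theta, accepted = star, accepted + 1
    return accepted / B

# Tiny scales accept nearly every move but explore slowly; very large
# scales propose wild jumps that are usually rejected.
random.seed(2)
log_target = lambda t: -0.5 * t * t  # standard Normal, up to a constant
rates = {s: acceptance_rate(log_target, s) for s in (0.1, 1.0, 10.0)}
```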

45.8 Full Conditionals

Suppose that \({\boldsymbol{\theta}}= (\theta_1, \theta_2, \ldots, \theta_K)\). Define the subset vector as \({\boldsymbol{\theta}}_{a:b} = (\theta_a, \theta_{a+1}, \ldots, \theta_{b-1}, \theta_b)\) for any \(1 \leq a \leq b \leq K\).

The full conditional of \(\theta_k\) is

\[ \Pr(\theta_k | {\boldsymbol{\theta}}_{1:k-1}, {\boldsymbol{\theta}}_{k+1:K}, {\boldsymbol{x}}) \]
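For a concrete (hypothetical) example, suppose \(K = 2\) and the posterior of \((\theta_1, \theta_2)\) is bivariate Normal with zero means, unit variances, and correlation \(\rho\). Then each full conditional is a univariate Normal:

\[ \theta_1 | \theta_2, {\boldsymbol{x}} \sim \operatorname{Normal}(\rho \theta_2, 1 - \rho^2), \qquad \theta_2 | \theta_1, {\boldsymbol{x}} \sim \operatorname{Normal}(\rho \theta_1, 1 - \rho^2). \]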

45.9 Gibbs Sampling

Gibbs sampling is a special type of Metropolis-Hastings MCMC. The algorithm samples one coordinate of \({\boldsymbol{\theta}}\) at a time.

  1. Initialize \({\boldsymbol{\theta}}^{(0)}\).
  2. Sample:
    \(\theta_1^{(b+1)} \sim \Pr(\theta_1 | {\boldsymbol{\theta}}_{2:K}^{(b)}, {\boldsymbol{x}})\)
    \(\theta_2^{(b+1)} \sim \Pr(\theta_2 | \theta_{1}^{(b+1)}, {\boldsymbol{\theta}}_{3:K}^{(b)}, {\boldsymbol{x}})\)
    \(\theta_3^{(b+1)} \sim \Pr(\theta_3 | {\boldsymbol{\theta}}_{1:2}^{(b+1)}, {\boldsymbol{\theta}}_{4:K}^{(b)}, {\boldsymbol{x}})\)
    \(\vdots\)
    \(\theta_K^{(b+1)} \sim \Pr(\theta_K | {\boldsymbol{\theta}}_{1:K-1}^{(b+1)}, {\boldsymbol{x}})\)
  3. Continue for \(b = 1, 2, \ldots, B\) iterations.
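The steps above can be sketched for a hypothetical bivariate Normal posterior (zero means, unit variances, correlation \(\rho\)), whose full conditionals are the univariate Normals \(\operatorname{Normal}(\rho \cdot \text{other}, 1 - \rho^2)\); the function name and parameter values are illustrative.

```python
import random

def gibbs_bivariate_normal(rho, B):
    """Gibbs sampler for a bivariate Normal target with zero means,
    unit variances, and correlation rho."""
    sd = (1.0 - rho ** 2) ** 0.5
    t1, t2 = 0.0, 0.0
    chain = []
    for _ in range(B):
        t1 = random.gauss(rho * t2, sd)  # theta_1 | theta_2, x
        t2 = random.gauss(rho * t1, sd)  # theta_2 | theta_1, x
        chain.append((t1, t2))
    return chain

random.seed(1)
chain = gibbs_bivariate_normal(rho=0.5, B=5000)
```

Note that each coordinate is updated using the most recent values of the other coordinates, exactly as in step 2 above.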

45.10 Gibbs and MH

As an exercise, show that Gibbs sampling is a special case of the Metropolis-Hastings algorithm where \(A({\boldsymbol{\theta}}^{*}, {\boldsymbol{\theta}}^{(b)}) = 1\).

45.11 Latent Variables

Note that MCMC is often used to calculate a posterior distribution on latent variables.

This makes sense because unobserved random parameters are a special type of latent variable.

45.12 Theory

The goal of MCMC is to construct a Markov chain that converges to a stationary distribution that is equivalent to the target probability distribution.

Under reasonably general assumptions, one can show that the Metropolis-Hastings algorithm produces a homogeneous Markov chain that achieves detailed balance with respect to the target distribution. This implies the Markov chain is ergodic, so that \({\boldsymbol{\theta}}^{(B)}\) converges in distribution to \(f({\boldsymbol{\theta}}| {\boldsymbol{x}})\) as \(B \rightarrow \infty\) and that

\[ \hat{{\operatorname{E}}}[f({\boldsymbol{\theta}}) | {\boldsymbol{x}}] = \frac{1}{B} \sum_{b=1}^B f\left({\boldsymbol{\theta}}^{(b)}\right) \stackrel{B \rightarrow \infty}{\longrightarrow} {\operatorname{E}}[f({\boldsymbol{\theta}}) | {\boldsymbol{x}}]. \]

45.13 Software

Stan is probably the most popular current software for Bayesian computation, including MCMC and variational inference.

There are also popular R packages, such as MCMCpack.