Markov Chain Monte Carlo Methods


Resources

Introduction

  • This idea of selecting a statistical sample to approximate a hard combinatorial problem by a much simpler problem is at the heart of modern Monte Carlo simulation.
  • We should emphasize that, in order to obtain the best results from this class of algorithms, it is important not to treat them as black boxes, but instead to incorporate as much domain-specific knowledge as possible into their design.
  • MCMC algorithms typically require the design of proposal mechanisms to generate candidate hypotheses. Many existing machine learning algorithms can be adapted to become proposal mechanisms (de Freitas et al., 2001). This is often essential to obtain MCMC algorithms that converge quickly.

Where is it used in Bayesian inference & learning?

  • Normalization
  • Marginalisation
  • Expectation
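
Each of these reduces to a high-dimensional integral that is typically intractable in closed form. As a minimal illustration (standard Bayesian identities; here y denotes the data and z a nuisance variable, both introduced only for illustration):

\[ p(x|y) = \frac{p(y|x)\,p(x)}{\int p(y|x')\,p(x')\,dx'}, \qquad p(x|y) = \int p(x,z|y)\,dz, \qquad \mathbb{E}_{p(x|y)}[f(x)] = \int f(x)\,p(x|y)\,dx \]

Monte Carlo methods approximate such integrals with averages over samples.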

The Monte Carlo principle

  • The goal is to draw N i.i.d. samples from a "target" density \(p(x)\) defined on a high-dimensional space \(\mathcal{X}\). These N samples can then be used to approximate the target density and expectations under it.
  • The resulting estimate of an integral is unbiased and, by the strong law of large numbers, converges almost surely to the true value; a minimal sketch follows.
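
A minimal sketch of the Monte Carlo principle, assuming NumPy, a toy target \(p(x) = \mathcal{N}(0,1)\) and \(f(x) = x^2\) (so the true expectation is 1); these choices are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N i.i.d. samples from the target density p(x) = N(0, 1).
N = 100_000
samples = rng.standard_normal(N)

# Monte Carlo estimate of E[f(x)] = integral of f(x) p(x) dx, with f(x) = x^2.
# The estimator is unbiased and, by the strong law of large numbers,
# converges almost surely to the true value 1 as N grows.
estimate = np.mean(samples ** 2)
print(estimate)  # close to 1.0
```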

Strong Law of Large Numbers

  • The probability that the average of the observations converges to the expected value, as the number of trials n goes to infinity, is equal to one; this is known as almost sure convergence.

Advantage of Monte Carlo Integration over deterministic integration.

  • The former places the integration grid (the samples) in regions of high probability, rather than spending function evaluations where \(p(x)\) is negligible.

TODO How?

Sampling from standard distributions

  • When p(x) has a standard form, e.g. Gaussian, it is straightforward to sample from it using readily available routines. However, when this is not the case, we need to introduce more sophisticated techniques based on rejection sampling, importance sampling and MCMC.
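
For example, with NumPy (an assumed dependency) sampling from standard densities is a single routine call:

```python
import numpy as np

rng = np.random.default_rng(0)
gaussian = rng.normal(loc=0.0, scale=1.0, size=1000)  # N(0, 1)
gamma = rng.gamma(shape=2.0, scale=1.0, size=1000)    # Gamma(2, 1)
```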

Rejection Sampling

Goal: to sample from p(x). Here q(x) is a proposal distribution that is easier to sample from, chosen such that p(x) ≤ Mq(x) for some finite M. The steps (a code sketch follows the list):

  1. Sample \(x_i\) from the proposal distribution q(x).
  2. Sample u from U(0,1).
  3. ACCEPT \(x_i\) if \(u\,Mq(x_i) < p(x_i)\); else REJECT.
  4. Go to step 1.
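
A minimal sketch of these steps, assuming NumPy, a standard-normal target p(x), a Laplace(0, 1) proposal q(x), and a hand-picked bound M ≈ 1.6 (all illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):                        # target density: N(0, 1)
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def q(x):                        # proposal density: Laplace(0, 1)
    return 0.5 * np.exp(-np.abs(x))

M = 1.6                          # chosen so that p(x) <= M q(x) for all x

def rejection_sample(n):
    out = []
    while len(out) < n:
        x = rng.laplace(0.0, 1.0)        # step 1: sample x_i ~ q(x)
        u = rng.uniform(0.0, 1.0)        # step 2: sample u ~ U(0, 1)
        if u * M * q(x) < p(x):          # step 3: accept if u M q(x_i) < p(x_i)
            out.append(x)
    return np.array(out)                 # step 4: repeat until n samples accepted

samples = rejection_sample(10_000)       # approximately N(0, 1) distributed
```

The Laplace proposal has heavier tails than the Gaussian target, which is what makes a finite M possible here.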

Limitations

  • Difficult to bound p(x) with suitable M.
  • M often too large => Acceptance probability too low (acceptance probability = 1/M).
  • The method suffers in high dimensions: a valid bound M typically grows exponentially with the dimension, so the acceptance probability 1/M becomes vanishingly small and almost all proposals are rejected.

Importance Sampling

https://www.youtube.com/watch?v=S3LAOZxGcnk

Importance weight

\begin{equation} w(x) = \frac{p(x)}{q(x)} \end{equation}

Posterior Density Approximation

\begin{equation} \hat{p}_N(x) = \sum^N_{i=1} w(x_i)\delta_{x_i}(x) \end{equation}

\[ E_p(f(x)) = \int f(x)w(x)q(x)dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) w(x_i) \]

  • Idea: correct for the fact that we are sampling from a different proposal distribution q(x) instead of from p(x). This correction using importance weights keeps the estimator unbiased.
  • Often q(x) is an approximation of p(x).
  • Whether or not we can sample from p(x) directly, we want to reduce the variance of basic Monte Carlo sampling through an appropriate choice of q(x):

\[ \sigma^2(\hat{I}_N) = \frac{1}{N}\sigma^2\bigg(f(x)\frac{p(x)}{q(x)}\bigg) \leq \frac{1}{N} \sigma^2(f(x)) \quad \text{for a well-chosen } q(x) \]

Note: Not all q(x) reduce the variance. And the q(x) which minimizes the variance may not be feasible to sample from.

Here the \(x_i\)'s are sampled from the proposal distribution \(q(x)\). A minimal code sketch follows.
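
A minimal sketch under the same toy setup as before (standard-normal target \(p(x)\), \(f(x) = x^2\), and a wider Gaussian proposal \(q(x)\); all choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):                        # target density: N(0, 1)
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def q(x):                        # proposal density: N(0, 2^2), wider tails than p
    return np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

n = 100_000
x = rng.normal(0.0, 2.0, size=n)          # x_i ~ q(x)
w = p(x) / q(x)                           # importance weights w(x_i) = p(x_i) / q(x_i)

# Unbiased estimate of E_p[f(x)] with f(x) = x^2 (true value 1).
estimate = np.mean((x ** 2) * w)
print(estimate)
```

The wider proposal keeps the weights p(x)/q(x) bounded; a proposal with lighter tails than the target can make the weights, and hence the variance, blow up.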

How to choose the optimal proposal distribution q(x)?

What makes a good choice of q(x)?

What makes a bad choice of q(x)?

Adaptive Importance Sampling

Dirac Delta Function

The Dirac delta \(\delta_{x_i}(x)\) is equal to zero everywhere except at \(x_i\), and its integral over the entire real line is equal to one.
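
Its sifting property is what turns the weighted point-mass approximation \(\hat{p}_N(x)\) above into a weighted sample average:

\[ \int f(x)\,\delta_{x_i}(x)\,dx = f(x_i) \quad \Rightarrow \quad \int f(x)\,\hat{p}_N(x)\,dx = \sum_{i=1}^{N} w(x_i)\,f(x_i) \]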

Limitations

MCMC Methods

  • Sampling from space \(\mathcal{X}\) using a Markov Chain whose limiting distribution on its states is p(x), the target distribution.
  • We assume that we cannot directly sample from p(x) but can evaluate p(x) up to a normalization constant.
  • The Markov chain is designed to spend more time in the higher-probability regions of the space \(\mathcal{X}\).
  • In order for the Markov Chain to have the capability to explore the entire state space, it needs to be irreducible.
  • In order for the Markov Chain to not get stuck on cycles of states, it needs to be aperiodic.
  • In order to make it aperiodic, add noise to the transition matrix.
  • A sufficient but not necessary condition for the Markov chain to have a particular invariant (limiting) distribution p(x) is the detailed balance (reversibility) condition \(p(x_{i-1})T(x_i|x_{i-1}) = p(x_i)T(x_{i-1}|x_i)\), which implies the invariance condition:

\[ p(x_i) = \sum_{x_{i-1}} p(x_{i-1})T(x_i|x_{i-1}) \]

where p(x) is the invariant (target) distribution and T is the transition matrix. This can be seen as an eigenvalue equation with eigenvalue 1 and eigenvector p(x).

Using the Perron-Frobenius theorem from linear algebra, we can say:

  • The remaining eigenvalues have absolute value less than 1.
  • The second-largest eigenvalue (in absolute value) determines the rate of convergence of the chain, and should be as small as possible; a small numerical illustration follows.
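
A minimal sketch on a small discrete state space, assuming NumPy and a hand-written row-stochastic transition matrix T (an illustrative choice):

```python
import numpy as np

# Row-stochastic transition matrix T[i, j] = T(x_j | x_i) on 3 states.
T = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Invariant distribution: left eigenvector of T with eigenvalue 1,
# i.e. the eigenvector of T.T for eigenvalue 1, normalized to sum to one.
vals, vecs = np.linalg.eig(T.T)
p = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
p = p / p.sum()

# All entries of T are positive, so the chain is irreducible and aperiodic:
# from any starting distribution, repeated application of T converges to p,
# at a rate governed by the second-largest eigenvalue magnitude.
mu = np.array([1.0, 0.0, 0.0])
for _ in range(50):
    mu = mu @ T

print(p, mu)                          # the two vectors agree up to numerical error
print(np.sort(np.abs(vals))[::-1])    # eigenvalue magnitudes: 1 > |lambda_2| >= ...
```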

In the case of a continuous state space, \[ p(x_i) = \int p(x_{i-1})T(x_i|x_{i-1}) dx_{i-1} \]

where T, the transition kernel, is a conditional density and p is its eigenfunction with eigenvalue 1.

Metropolis-Hastings algorithm

In general, MH proposes a candidate \(x^* \sim q(x^*|x^{(i)})\) and accepts it with probability

\[ \mathcal{A}(x^{(i)}, x^*) = \min \Bigg\{ 1, \frac{p(x^*)q(x^{(i)}|x^*)}{p(x^{(i)})q(x^*|x^{(i)})} \Bigg\} \]

otherwise the chain stays at \(x^{(i)}\). Two simple instances of the MH algorithm follow.

Independent Sampler

Proposal independent of the current state, i.e., \(q(x^*| x^{(i)}) = q(x^*)\).

\[ \mathcal{A}(x^{(i)}, x^*) = \min \Bigg\{ 1, \frac{p(x^*)q(x^{(i)})}{p(x^{(i)})q(x^*)} \Bigg\} = \min \Bigg\{ 1, \frac{w(x^*)}{w(x^{(i)})} \Bigg\} \]

Metropolis Algorithm

Assumes a symmetric random-walk proposal, i.e., \(q(x^*| x^{(i)}) = q( x^{(i)} | x^*)\), so the proposal terms cancel in the acceptance ratio: \[ \mathcal{A}(x^{(i)}, x^*) = \min \Bigg\{ 1, \frac{p(x^*)}{p(x^{(i)})} \Bigg\} \]
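
A minimal sketch of the Metropolis algorithm with a symmetric Gaussian random-walk proposal, targeting an unnormalized two-component Gaussian mixture (the target, step size and chain length are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    # Unnormalized target: mixture of N(-2, 1) and N(2, 1).
    return np.exp(-0.5 * (x + 2.0)**2) + np.exp(-0.5 * (x - 2.0)**2)

def metropolis(n_samples, step=1.0, x0=0.0):
    x = x0
    chain = np.empty(n_samples)
    for i in range(n_samples):
        x_star = x + step * rng.standard_normal()              # symmetric proposal q(x*|x)
        accept_prob = min(1.0, p_unnorm(x_star) / p_unnorm(x)) # A(x, x*) = min{1, p(x*)/p(x)}
        if rng.uniform() < accept_prob:
            x = x_star                                         # accept the candidate
        chain[i] = x                                           # on rejection the chain repeats x
    return chain

chain = metropolis(50_000)
print(chain.mean(), chain.var())   # roughly 0 and 5 for this mixture
```

Note that only the ratio \(p(x^*)/p(x^{(i)})\) is needed, so the target only has to be known up to its normalization constant.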

Simulated annealing for global optimization

Mixtures and cycles of MCMC kernels

Gibbs Sampler

Author: Sharan Yalburgi

Created: 2020-02-24 Mon 14:07
