Markov Chain Monte Carlo Methods


Resources

Introduction

  • This idea of selecting a statistical sample to approximate a hard combinatorial problem by a much simpler problem is at the heart of modern Monte Carlo simulation.
  • We should emphasize that, in order to obtain the best results from this class of algorithms, it is important not to treat them as black boxes, but instead to incorporate as much domain-specific knowledge as possible into their design.
  • MCMC algorithms typically require the design of proposal mechanisms to generate candidate hypotheses. Many existing machine learning algorithms can be adapted to become proposal mechanisms (de Freitas et al., 2001). This is often essential to obtain MCMC algorithms that converge quickly.

Where is it used in Bayesian inference & learning?

  • Normalization
  • Marginalisation
  • Expectation
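
Each of these reduces to a high-dimensional integral that is typically intractable in closed form. As a minimal illustration (standard Bayesian identities; here y denotes the data and z a nuisance variable, both introduced only for illustration):

\[ p(x|y) = \frac{p(y|x)\,p(x)}{\int p(y|x')\,p(x')\,dx'}, \qquad p(x|y) = \int p(x,z|y)\,dz, \qquad \mathbb{E}_{p(x|y)}[f(x)] = \int f(x)\,p(x|y)\,dx \]

Monte Carlo methods approximate such integrals with averages over samples.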

The Monte Carlo principle

  • The goal is to draw N i.i.d. samples from a "target" density \(p(x)\) defined on a high-dimensional space \(\mathcal{X}\). These N samples can then be used to approximate the target density and expectations under it.
  • The resulting estimate of an integral is unbiased and, by the strong law of large numbers, converges almost surely to the true value; a minimal sketch follows.
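
A minimal sketch of the Monte Carlo principle, assuming NumPy, a toy target \(p(x) = \mathcal{N}(0,1)\) and \(f(x) = x^2\) (so the true expectation is 1); these choices are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N i.i.d. samples from the target density p(x) = N(0, 1).
N = 100_000
samples = rng.standard_normal(N)

# Monte Carlo estimate of E[f(x)] = integral of f(x) p(x) dx, with f(x) = x^2.
# The estimator is unbiased and, by the strong law of large numbers,
# converges almost surely to the true value 1 as N grows.
estimate = np.mean(samples ** 2)
print(estimate)  # close to 1.0
```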

Strong Law of Large Numbers

  • The probability that the average of the observations converges to the expected value, as the number of trials n goes to infinity, is equal to one; this is known as almost sure convergence.

Advantage of Monte Carlo Integration over deterministic integration.

  • The former places the integration grid (the samples) in regions of high probability, rather than spending function evaluations where \(p(x)\) is negligible.

TODO How?

Sampling from standard distributions

  • When p(x) has a standard form, e.g. Gaussian, it is straightforward to sample from it using readily available routines. However, when this is not the case, we need to introduce more sophisticated techniques based on rejection sampling, importance sampling and MCMC.
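
For example, with NumPy (an assumed dependency) sampling from standard densities is a single routine call:

```python
import numpy as np

rng = np.random.default_rng(0)
gaussian = rng.normal(loc=0.0, scale=1.0, size=1000)  # N(0, 1)
gamma = rng.gamma(shape=2.0, scale=1.0, size=1000)    # Gamma(2, 1)
```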

Rejection Sampling

Goal: to sample from p(x). Here q(x) is a proposal distribution that is easier to sample from, chosen such that p(x) ≤ Mq(x) for some finite M. The steps (a code sketch follows the list):

  1. Sample \(x_i\) from the proposal distribution q(x).
  2. Sample u from U(0,1).
  3. ACCEPT \(x_i\) if \(u\,Mq(x_i) < p(x_i)\); else REJECT.
  4. Go to step 1.
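
A minimal sketch of these steps, assuming NumPy, a standard-normal target p(x), a Laplace(0, 1) proposal q(x), and a hand-picked bound M ≈ 1.6 (all illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):                        # target density: N(0, 1)
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def q(x):                        # proposal density: Laplace(0, 1)
    return 0.5 * np.exp(-np.abs(x))

M = 1.6                          # chosen so that p(x) <= M q(x) for all x

def rejection_sample(n):
    out = []
    while len(out) < n:
        x = rng.laplace(0.0, 1.0)        # step 1: sample x_i ~ q(x)
        u = rng.uniform(0.0, 1.0)        # step 2: sample u ~ U(0, 1)
        if u * M * q(x) < p(x):          # step 3: accept if u M q(x_i) < p(x_i)
            out.append(x)
    return np.array(out)                 # step 4: repeat until n samples accepted

samples = rejection_sample(10_000)       # approximately N(0, 1) distributed
```

The Laplace proposal has heavier tails than the Gaussian target, which is what makes a finite M possible here.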

Limitations

  • Difficult to bound p(x) with suitable M.
  • M often too large => Acceptance probability too low (acceptance probability = 1/M).
  • The method suffers in high dimensions: a valid bound M typically grows exponentially with the dimension, so the acceptance probability 1/M becomes vanishingly small and almost all proposals are rejected.

Importance Sampling

https://www.youtube.com/watch?v=S3LAOZxGcnk

Importance weight

\begin{equation} w(x) = \frac{p(x)}{q(x)} \end{equation}

Posterior Density Approximation

\begin{equation} \hat{p}_N(x) = \sum^N_{i=1} w(x_i)\delta_{x_i}(x) \end{equation}

\[ E_p(f(x)) = \int f(x)w(x)q(x)dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) w(x_i) \]

  • Idea: correct for the fact that we are sampling from a different proposal distribution q(x) instead of from p(x). This correction using importance weights keeps the estimator unbiased.
  • Often q(x) is an approximation of p(x).
  • Whether or not we can sample from p(x) directly, we want to reduce the variance of basic Monte Carlo sampling through an appropriate choice of q(x):

\[ \sigma^2(\hat{I}_N) = \frac{1}{N}\sigma^2\bigg(f(x)\frac{p(x)}{q(x)}\bigg) \leq \frac{1}{N} \sigma^2(f(x)) \quad \text{for a well-chosen } q(x) \]

Note: Not all q(x) reduce the variance. And the q(x) which minimizes the variance may not be feasible to sample from.

Here the \(x_i\)'s are sampled from the proposal distribution \(q(x)\). A minimal code sketch follows.
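
A minimal sketch under the same toy setup as before (standard-normal target \(p(x)\), \(f(x) = x^2\), and a wider Gaussian proposal \(q(x)\); all choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):                        # target density: N(0, 1)
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def q(x):                        # proposal density: N(0, 2^2), wider tails than p
    return np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

n = 100_000
x = rng.normal(0.0, 2.0, size=n)          # x_i ~ q(x)
w = p(x) / q(x)                           # importance weights w(x_i) = p(x_i) / q(x_i)

# Unbiased estimate of E_p[f(x)] with f(x) = x^2 (true value 1).
estimate = np.mean((x ** 2) * w)
print(estimate)
```

The wider proposal keeps the weights p(x)/q(x) bounded; a proposal with lighter tails than the target can make the weights, and hence the variance, blow up.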

How to choose the optimal proposal distribution q(x)?

What makes a good choice of q(x)?

What makes a bad choice of q(x)?

Adaptive Importance Sampling

Dirac Delta Function

The Dirac delta \(\delta_{x_i}(x)\) is equal to zero everywhere except at \(x_i\), and its integral over the entire real line is equal to one.
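
Its sifting property is what turns the weighted point-mass approximation \(\hat{p}_N(x)\) above into a weighted sample average:

\[ \int f(x)\,\delta_{x_i}(x)\,dx = f(x_i) \quad \Rightarrow \quad \int f(x)\,\hat{p}_N(x)\,dx = \sum_{i=1}^{N} w(x_i)\,f(x_i) \]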

Limitations

MCMC Methods

  • Sampling from space \(\mathcal{X}\) using a Markov Chain whose limiting distribution on its states is p(x), the target distribution.
  • We assume that we cannot directly sample from p(x) but can evaluate p(x) up to a normalization constant.
  • The Markov chain is designed to spend more time in the higher-probability regions of the space \(\mathcal{X}\).
  • In order for the Markov Chain to have the capability to explore the entire state space, it needs to be irreducible.
  • In order for the Markov Chain to not get stuck on cycles of states, it needs to be aperiodic.
  • In order to make it aperiodic, add noise to the transition matrix.
  • A sufficient but not necessary condition for the Markov chain to have a particular invariant (limiting) distribution p(x) is the detailed balance (reversibility) condition \(p(x_{i-1})T(x_i|x_{i-1}) = p(x_i)T(x_{i-1}|x_i)\), which implies the invariance condition:

\[ p(x_i) = \sum_{x_{i-1}} p(x_{i-1})T(x_i|x_{i-1}) \]

where p(x) is the invariant (target) distribution and T is the transition matrix. This can be seen as an eigenvalue equation with eigenvalue 1 and eigenvector p(x).

Using the Perron-Frobenius theorem from linear algebra, we can say:

  • The remaining eigenvalues have absolute value less than 1.
  • The second-largest eigenvalue (in absolute value) determines the rate of convergence of the chain, and should be as small as possible; a small numerical illustration follows.
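
A minimal sketch on a small discrete state space, assuming NumPy and a hand-written row-stochastic transition matrix T (an illustrative choice):

```python
import numpy as np

# Row-stochastic transition matrix T[i, j] = T(x_j | x_i) on 3 states.
T = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Invariant distribution: left eigenvector of T with eigenvalue 1,
# i.e. the eigenvector of T.T for eigenvalue 1, normalized to sum to one.
vals, vecs = np.linalg.eig(T.T)
p = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
p = p / p.sum()

# All entries of T are positive, so the chain is irreducible and aperiodic:
# from any starting distribution, repeated application of T converges to p,
# at a rate governed by the second-largest eigenvalue magnitude.
mu = np.array([1.0, 0.0, 0.0])
for _ in range(50):
    mu = mu @ T

print(p, mu)                          # the two vectors agree up to numerical error
print(np.sort(np.abs(vals))[::-1])    # eigenvalue magnitudes: 1 > |lambda_2| >= ...
```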

In the case of a continuous state space, \[ p(x_i) = \int p(x_{i-1})T(x_i|x_{i-1}) dx_{i-1} \]

where T, the transition kernel, is a conditional density and p is its eigenfunction with eigenvalue 1.

Metropolis-Hastings algorithm

In general, MH proposes a candidate \(x^* \sim q(x^*|x^{(i)})\) and accepts it with probability

\[ \mathcal{A}(x^{(i)}, x^*) = \min \Bigg\{ 1, \frac{p(x^*)q(x^{(i)}|x^*)}{p(x^{(i)})q(x^*|x^{(i)})} \Bigg\} \]

otherwise the chain stays at \(x^{(i)}\). Two simple instances of the MH algorithm follow.

Independent Sampler

Proposal independent of the current state, i.e., \(q(x^*| x^{(i)}) = q(x^*)\).

\[ \mathcal{A}(x^{(i)}, x^*) = \min \Bigg\{ 1, \frac{p(x^*)q(x^{(i)})}{p(x^{(i)})q(x^*)} \Bigg\} = \min \Bigg\{ 1, \frac{w(x^*)}{w(x^{(i)})} \Bigg\} \]

Metropolis Algorithm

Assumes a symmetric random-walk proposal, i.e., \(q(x^*| x^{(i)}) = q( x^{(i)} | x^*)\), so the proposal terms cancel in the acceptance ratio: \[ \mathcal{A}(x^{(i)}, x^*) = \min \Bigg\{ 1, \frac{p(x^*)}{p(x^{(i)})} \Bigg\} \]
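
A minimal sketch of the Metropolis algorithm with a symmetric Gaussian random-walk proposal, targeting an unnormalized two-component Gaussian mixture (the target, step size and chain length are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    # Unnormalized target: mixture of N(-2, 1) and N(2, 1).
    return np.exp(-0.5 * (x + 2.0)**2) + np.exp(-0.5 * (x - 2.0)**2)

def metropolis(n_samples, step=1.0, x0=0.0):
    x = x0
    chain = np.empty(n_samples)
    for i in range(n_samples):
        x_star = x + step * rng.standard_normal()              # symmetric proposal q(x*|x)
        accept_prob = min(1.0, p_unnorm(x_star) / p_unnorm(x)) # A(x, x*) = min{1, p(x*)/p(x)}
        if rng.uniform() < accept_prob:
            x = x_star                                         # accept the candidate
        chain[i] = x                                           # on rejection the chain repeats x
    return chain

chain = metropolis(50_000)
print(chain.mean(), chain.var())   # roughly 0 and 5 for this mixture
```

Note that only the ratio \(p(x^*)/p(x^{(i)})\) is needed, so the target only has to be known up to its normalization constant.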

Simulated annealing for global optimization

Mixtures and cycles of MCMC kernels

Gibbs Sampler

Author: Sharan Yalburgi

Created: 2020-02-24 Mon 14:07
