
PRML Reading Chapter 11 - Sampling Method


Page 1: PRML Reading Chapter 11 - Sampling Method

C.M. Bishop’s PRML

Tran Quoc Hoan

@k09ht haduonght.wordpress.com/

10 January 2016, PRML Reading, Hasegawa lab., Tokyo

The University of Tokyo

Chapter 11: Sampling Methods

Page 2: PRML Reading Chapter 11 - Sampling Method

Introduction

Introduction 2

Generating a random number is not easy!

• True random numbers: gather “entropy”, seemingly random data from the physical world.

• Pseudo-random numbers: a PRG expands a seed into a stream of random numbers (e.g. the Mersenne Twister).

http://www.howtogeek.com/183051/htg-explains-how-computers-generate-random-numbers/

Generating a number that follows a given probability distribution is even more difficult!

Page 3: PRML Reading Chapter 11 - Sampling Method

For today’s meeting

Agendas 3

• Cover PRML chapter 11

• Understand the general concept of sampling from desired distribution

• Introduction to MCMC world

• More about MCMC

• Details in Papers Alert

Page 4: PRML Reading Chapter 11 - Sampling Method

Outline

Sampling Methods 4

Basic Sampling Algorithms

Markov Chain Monte Carlo

Gibbs Sampling

Slice Sampling

Hybrid Monte Carlo

Estimating the Partition Function

Part I: General concept of basic sampling

Part II:

MCMC world

Page 5: PRML Reading Chapter 11 - Sampling Method

Progress…

5

Basic SamplingAlgorithms

Markov ChainMonte Carlo

Gibbs Sampling

Slice Sampling

Hybrid Monte Carlo

Estimating thePartition Function

Sampling Methods

Page 6: PRML Reading Chapter 11 - Sampling Method

Standard distributions

11.1 Basic Sampling Algorithms 6

• Goal: sampling from a desired distribution p(y).

• Assumption: we can generate uniform random numbers z ~ U[0, 1].

Generate z uniformly, then transform y = h⁻¹(z), where h is the cumulative distribution function of p:

h(y) = ∫_{-∞}^{y} p(x) dx

This works whenever h⁻¹ is easy to compute, since p(y) = p(z) |dz/dy|.

Ex. Consider the exponential distribution

p(y) = λ exp(-λy),   0 ≤ y < ∞

Then h(y) = 1 - exp(-λy), and y = -λ⁻¹ ln(1 - z).
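A minimal sketch of this inverse-CDF transformation for the exponential example above (λ = 2.0 is my choice for checkability; the helper name `sample_exponential` is mine, not from the slides):

```python
import math
import random

def sample_exponential(lam, rng):
    """Draw one sample from p(y) = lam * exp(-lam * y) via the inverse CDF.

    h(y) = 1 - exp(-lam * y), so y = -ln(1 - z) / lam for z ~ U[0, 1].
    """
    z = rng.random()
    return -math.log(1.0 - z) / lam

rng = random.Random(0)
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
# The exponential distribution with rate lam = 2 has mean 1/lam = 0.5,
# so the sample mean should be close to 0.5.
```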

Page 7: PRML Reading Chapter 11 - Sampling Method

Transformation method

711.1 Basic Sampling Algorithms

Page 8: PRML Reading Chapter 11 - Sampling Method

Rejection Sampling

811.1 Basic Sampling Algorithms

• Assumption 1: sampling from p(z) directly is difficult, but we can evaluate p(z) for any given value of z up to a normalizing constant Z_p:

p(z) = (1/Z_p) p̃(z)    (11.13)

• Assumption 2: we know how to sample from a proposal distribution q(z), and there exists a constant k such that

k q(z) ≥ p̃(z)   for all z

• Then there is an algorithm to obtain independent samples from p(.).

Page 9: PRML Reading Chapter 11 - Sampling Method

Rejection Sampling

911.1 Basic Sampling Algorithms

Choose a constant k such that k q(z) covers p̃(z) everywhere. Then:

1. Generate z0 from the proposal q(.).

2. Generate u0 from U[0, k q(z0)].

3. Keep z0 if u0 ≤ p̃(z0); reject z0 if u0 > p̃(z0).

• The efficiency of the method depends on the ratio between the grey (rejected) area and the white (accepted) area in the figure.

• Proof:

p(accept) = ∫ {p̃(z) / (k q(z))} q(z) dz = (1/k) ∫ p̃(z) dz
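A minimal sketch of rejection sampling under these assumptions (my choices for checkability: target p̃(z) = exp(-z²/2), an unnormalized standard normal, with a Cauchy proposal; k is set just above the smallest valid constant 2π e^{-1/2}):

```python
import math
import random

def rejection_sample(n, rng):
    """Draw n samples from p(z) proportional to exp(-z^2/2) using a
    Cauchy proposal q(z) = 1 / (pi * (1 + z^2)).

    The smallest constant with k*q(z) >= p_tilde(z) is 2*pi*exp(-1/2);
    we add a small margin for safety.
    """
    k = 2.0 * math.pi * math.exp(-0.5) + 0.1
    out = []
    while len(out) < n:
        z0 = math.tan(math.pi * (rng.random() - 0.5))           # Cauchy draw
        u0 = rng.uniform(0.0, k / (math.pi * (1.0 + z0 * z0)))  # U[0, k q(z0)]
        if u0 <= math.exp(-0.5 * z0 * z0):                      # under p_tilde: keep
            out.append(z0)
    return out

rng = random.Random(1)
samples = rejection_sample(50_000, rng)
mean = sum(samples) / len(samples)
var = sum(s * s for s in samples) / len(samples) - mean ** 2
# Accepted samples should look standard normal: mean near 0, variance near 1.
```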

Page 10: PRML Reading Chapter 11 - Sampling Method

Rejection Sampling Example

1011.1 Basic Sampling Algorithms

• Sampling from the Gamma distribution (green curve)

Gam(z|a, b) = b^a z^{a-1} exp(-bz) / Γ(a),   with mode at z = (a-1)/b

• Proposal distribution: the Cauchy distribution (red curve)

q(z) = c0 / (1 + (z - z0)² / d0²)

samples of which are obtained by transforming z = d0 tan(πu) + z0, where u is drawn uniformly from [0, 1].

• We need to find z0, c0, d0 such that q(z) is everywhere greater than or equal to Gam(z|a, b), with the smallest c0 d0 (which defines the area):

z0 = (a-1)/b,   d0² = 2a - 1,   c0 = 1/(π d0)

Page 11: PRML Reading Chapter 11 - Sampling Method

Adaptive Rejection Sampling

1111.1 Basic Sampling Algorithms

• The proposal distribution q(.) may be difficult to construct; build the envelope on the fly instead.

Fig 11.6: if a sample point is rejected, it is added to the set of grid points and used to refine the envelope distribution.

1. Construct q(z) from an initial set of grid points.

2. Generate z4 from q(z).

3. Generate u0 from U[0, q(z4)].

4. Keep z4 if u0 ≤ p̃(z4); reject z4 if u0 > p̃(z4), but use the rejected point to refine the envelope.

• Rejection sampling methods are inefficient in high dimensions: the acceptance rate decreases exponentially with dimensionality.

Page 12: PRML Reading Chapter 11 - Sampling Method

Importance Sampling

1211.1 Basic Sampling Algorithms

• Basic idea: transform the integral into an expectation over a simple, known proposal distribution q(z).

• Conditions: q(z) > 0 wherever f(z)p(z) ≠ 0, and q(z) is easy to sample from.

E[f] = ∫ f(z) p(z) dz = ∫ f(z) {p(z)/q(z)} q(z) dz ≈ (1/S) Σ_s w^(s) f(z^(s))   (Monte Carlo)

where z^(s) ~ q(z) and the normalized importance weights

w^(s) ∝ p(z^(s)) / q(z^(s))

correct the bias introduced by sampling from the wrong distribution.

• All of the generated samples are retained.
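A small self-normalized importance-sampling sketch (my choices: target p̃(z) = exp(-z²/2), proposal q = N(0, 2²), and f(z) = z², whose true expectation under p is 1; unnormalized densities suffice because the weights are normalized):

```python
import math
import random

def importance_estimate(n, rng):
    """Estimate E_p[z^2] for p(z) proportional to exp(-z^2/2) using a wider
    Gaussian proposal q = N(0, 2^2) and self-normalized importance weights."""
    num = den = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 2.0)                              # z ~ q
        w = math.exp(-0.5 * z * z) / math.exp(-z * z / 8.0)  # p_tilde / q_tilde
        num += w * z * z
        den += w
    return num / den                                         # weights normalized here

rng = random.Random(2)
est = importance_estimate(200_000, rng)
# The true value is Var_p[z] = 1 for the standard normal target.
```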

Page 13: PRML Reading Chapter 11 - Sampling Method

SIR(sampling-importance-resampling)

1311.1 Basic Sampling Algorithms

• Rejection sampling requires choosing q(z) and the constant k, which is often not easy.

• SIR is also based on a proposal distribution q(z), but avoids having to determine the constant k.

1. Draw L samples z^(1), z^(2), ..., z^(L) from q(z).

2. Calculate the importance weights p̃(z^(l)) / q(z^(l)) for l = 1, ..., L.

3. Normalize the weights to obtain w_1, ..., w_L.

4. Draw a second set of L samples from the discrete distribution over (z^(1), z^(2), ..., z^(L)) with probabilities (w_1, ..., w_L).

• The resulting L samples are distributed according to p(z) in the limit L → ∞.
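The four steps can be sketched directly (my choices for checkability: target p̃(z) = exp(-z²/2) and a uniform proposal on [-5, 5]; note that no constant k appears anywhere):

```python
import math
import random

def sir(n, rng):
    """Sampling-importance-resampling: draw from p(z) proportional to
    exp(-z^2/2) with a uniform proposal on [-5, 5], no normalizer needed."""
    zs = [rng.uniform(-5.0, 5.0) for _ in range(n)]   # step 1: z ~ q
    ws = [math.exp(-0.5 * z * z) for z in zs]         # step 2: p_tilde/q (q constant)
    total = sum(ws)
    ws = [w / total for w in ws]                      # step 3: normalize weights
    return rng.choices(zs, weights=ws, k=n)           # step 4: resample

rng = random.Random(3)
samples = sir(100_000, rng)
mean = sum(samples) / len(samples)
var = sum(s * s for s in samples) / len(samples) - mean ** 2
# The resampled set should look approximately standard normal.
```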

Page 14: PRML Reading Chapter 11 - Sampling Method

SIR(sampling-importance-resampling)

1411.1 Basic Sampling Algorithms

• Proof (univariate case; I(F) = 1 if F is TRUE, else 0). The resampled values satisfy

p(z ≤ a) = Σ_{l: z^(l) ≤ a} w_l = [Σ_l I(z^(l) ≤ a) p̃(z^(l))/q(z^(l))] / [Σ_l p̃(z^(l))/q(z^(l))]

If L → ∞, then

p(z ≤ a) = ∫ I(z ≤ a) {p̃(z)/q(z)} q(z) dz / ∫ {p̃(z)/q(z)} q(z) dz

         = ∫ I(z ≤ a) p̃(z) dz / ∫ p̃(z) dz

         = ∫ I(z ≤ a) p(z) dz

Page 15: PRML Reading Chapter 11 - Sampling Method

Sampling and the EM algorithm

1511.1 Basic Sampling Algorithms

• Use a Monte Carlo method to approximate the expectation in the E-step.

Monte Carlo EM algorithm

• The expected complete-data log likelihood (Z: hidden; X: observed; θ: parameters) is given by

Q(θ, θ^old) = ∫ p(Z|X, θ^old) ln p(Z, X|θ) dZ    (11.28)

and may be approximated by

Q(θ, θ^old) ≈ (1/L) Σ_{l=1}^{L} ln p(Z^(l), X|θ)    (11.29)

where the Z^(l) are drawn from the current estimate of the posterior, p(Z|X, θ^old).

Stochastic EM algorithm

• For a finite mixture model, a single sample Z may be drawn at each E-step (this makes a hard assignment of each data point to one of the components).

Page 16: PRML Reading Chapter 11 - Sampling Method

IP Algorithm

1611.1 Basic Sampling Algorithms

• For a full Bayesian treatment, we wish to draw samples from the joint posterior p(θ, Z|X).

IP algorithm

• I-step. We wish to sample from p(Z|X), but we cannot do this directly. Notice that

p(Z|X) = ∫ p(Z|θ, X) p(θ|X) dθ    (11.30)

so for l = 1, ..., L we first draw a sample θ^(l) from the current estimate of p(θ|X), and then use it to draw a sample Z^(l) from p(Z|θ^(l), X).

• P-step. Given the relation

p(θ|X) = ∫ p(θ|Z, X) p(Z|X) dZ    (11.31)

we use the samples {Z^(l)} obtained from the I-step to compute

p(θ|X) ≈ (1/L) Σ_{l=1}^{L} p(θ|Z^(l), X)    (11.32)

Page 17: PRML Reading Chapter 11 - Sampling Method

In Reviews…

1711.1 Basic Sampling Algorithms

• Inverse function method: analytically reliable, but unable to deal with complicated distributions.

• Rejection sampling: handles complicated distributions, but the proposal distribution and constant k are difficult to choose, and samples are wasted in the rejection process.

• Adaptive rejection sampling: uses an envelope function to reduce rejected samples; still struggles with high-dimensional and sharply peaked distributions.

• Importance sampling: approximates expectations using weighted samples from the proposal distribution; it does not sample from the desired distribution itself.

• SIR: combines ideas from rejection sampling and importance sampling.

• Monte Carlo EM and the IP algorithm use sampling for data augmentation.

Page 18: PRML Reading Chapter 11 - Sampling Method

Progress…

18

Basic Sampling Algorithms

Markov Chain Monte Carlo

Gibbs Sampling

Slice Sampling

Hybrid Monte Carlo

Estimating the Partition Function

Part I: General concept of basic sampling

Part II:

Welcome to MCMC world

Sampling Methods

Page 19: PRML Reading Chapter 11 - Sampling Method

Markov Chain Monte Carlo (MCMC)

1911.2 Markov Chain Monte Carlo

• MCMC: a general strategy that allows sampling from a large class of distributions (based on the mechanism of Markov chains).

• MCMC scales well with the dimensionality of the sample space.

Comparison:

• MLE works on the likelihood function and estimates a single value, climbing to the top of the mountain; the estimate depends on the initial value and may be wrong.

• MCMC estimates the whole posterior distribution, approaching the global structure without depending on the initial value.

Method families:

• Basic sampling: inverse function, rejection sampling, adaptive rejection sampling, importance sampling, SIR, data-augmentation sampling.

• Markov Chain Monte Carlo: Metropolis method, Metropolis-Hastings method, Gibbs sampling, slice sampling.

Page 20: PRML Reading Chapter 11 - Sampling Method

MCMC: the idea

2011.2 Markov Chain Monte Carlo

• Goal: to generate a set of samples from p(z)

• Idea: to generate samples from a Markov Chain whose invariant distribution is p(z)

1. Knowing the current sample z^(τ), generate a candidate sample z* from a proposal distribution q(z|z^(τ)).

2. Accept the sample according to an appropriate criterion.

3. If the candidate sample is accepted then z^(τ+1) = z*, otherwise z^(τ+1) = z^(τ).

• The proposal distribution depends on the current state.

• The samples z^(1), z^(2), ... form a Markov chain, and the distribution of z^(τ) tends to p(z) as τ → ∞.

• Assumption: we know how to evaluate p̃(z) = Z_p p(z) (but not Z_p itself).

Page 21: PRML Reading Chapter 11 - Sampling Method

Metropolis Algorithm

2111.2 Markov Chain Monte Carlo

• The proposal distribution is symmetric: q(z_A|z_B) = q(z_B|z_A).

• The candidate sample is accepted with probability

A(z*, z^(τ)) = min(1, p̃(z*) / p̃(z^(τ)))    (11.33)

Fig 11.9: the proposal distribution is an isotropic Gaussian whose standard deviation is 0.2. Accepted steps in green, rejected steps in red; the one-standard-deviation contour of the target is the ellipse. Of 150 candidate samples, 43 are rejected.
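The acceptance rule above can be sketched for a one-dimensional target (a standard normal, my choice so the result is easy to check; p̃(z) = exp(-z²/2) drops the normalizer, and the function name `metropolis` is illustrative):

```python
import math
import random

def metropolis(n, step, rng):
    """Metropolis sampler for p(z) proportional to exp(-z^2/2), with a
    symmetric Gaussian proposal q(z*|z) = N(z, step^2).

    Accept with probability min(1, p_tilde(z*) / p_tilde(z))."""
    z = 0.0
    chain = []
    for _ in range(n):
        z_star = rng.gauss(z, step)
        a = min(1.0, math.exp(0.5 * (z * z - z_star * z_star)))
        if rng.random() <= a:
            z = z_star          # accept: move to the candidate
        chain.append(z)         # on rejection the current state is repeated
    return chain

rng = random.Random(4)
chain = metropolis(200_000, 1.0, rng)
mean = sum(chain) / len(chain)
var = sum(z * z for z in chain) / len(chain) - mean ** 2
# Long-run statistics should match the standard normal target.
```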

Page 22: PRML Reading Chapter 11 - Sampling Method

Markov Chains

2211.2 Markov Chain Monte Carlo

• Q: under what circumstances will a Markov chain converge to the desired distribution?

• First-order Markov chain: a series of random variables z^(1), ..., z^(M) such that

p(z^(m+1)|z^(1), ..., z^(m)) = p(z^(m+1)|z^(m))   for all m    (11.37)

• A Markov chain is specified by p(z^(0)) and the transition probabilities

T_m(z^(m), z^(m+1)) = p(z^(m+1)|z^(m))

• A distribution p*(z) is said to be invariant for a Markov chain if

p*(z) = Σ_{z'} T(z', z) p*(z')

A sufficient condition is to choose transitions that satisfy the property of detailed balance:

p*(z) T(z, z') = T(z', z) p*(z')    (11.40)

Page 23: PRML Reading Chapter 11 - Sampling Method

Markov Chains

2311.2 Markov Chain Monte Carlo

Page 24: PRML Reading Chapter 11 - Sampling Method

Ergodicity

2411.2 Markov Chain Monte Carlo

The chain has a unique invariant distribution if it “forgets” its starting point z^(0).

Image source: Murray, MLSS 2009 slides

Page 25: PRML Reading Chapter 11 - Sampling Method

Markov Chains

2511.2 Markov Chain Monte Carlo

• Goal: to generate a set of samples from p(z).

• Idea: to generate samples from a Markov chain whose invariant distribution is p(z).

• How: choose the transition probability T(z, z') to satisfy detailed balance for p(z):

p(z) T(z, z') = T(z', z) p(z')    (11.40)

• T(z', z) can be constructed from a set of “base” transitions B_1, B_2, ..., B_K, either as a mixture

T(z', z) = Σ_{k=1}^{K} α_k B_k(z', z)    (11.42)

or as a composition

T(z', z) = Σ_{z_1} ... Σ_{z_{K-1}} B_1(z', z_1) ... B_{K-1}(z_{K-2}, z_{K-1}) B_K(z_{K-1}, z)    (11.43)

Page 26: PRML Reading Chapter 11 - Sampling Method

The Metropolis-Hastings Algorithm

2611.2 Markov Chain Monte Carlo

• Generalization of the Metropolis algorithm: the proposal distribution q is no longer required to be symmetric.

• Knowing the current sample z^(τ), generate a candidate sample z* from a proposal distribution q_k(z|z^(τ)).

• Accept it with probability

A_k(z*, z^(τ)) = min(1, [p̃(z*) q_k(z^(τ)|z*)] / [p̃(z^(τ)) q_k(z*|z^(τ))])    (11.44)

where k labels the members of the set of possible transitions being considered.

Page 27: PRML Reading Chapter 11 - Sampling Method

The Metropolis-Hastings Algorithm

2711.2 Markov Chain Monte Carlo

• Prove that p(z) = p̃(z)/Z_p is the invariant distribution of the chain.

• Notice that the transition probability of this chain is T_k(z, z') = q_k(z'|z) A_k(z', z), with

A_k(z', z) = min(1, [p̃(z') q_k(z|z')] / [p̃(z) q_k(z'|z)])

• We need to prove detailed balance: p(z) T_k(z, z') = T_k(z', z) p(z').

Proof

p(z) q_k(z'|z) A_k(z', z) = min(p(z) q_k(z'|z), p(z') q_k(z|z'))

                          = min(p(z') q_k(z|z'), p(z) q_k(z'|z))

                          = p(z') q_k(z|z') A_k(z, z')

(Q.E.D.)

Page 28: PRML Reading Chapter 11 - Sampling Method

The Metropolis-Hastings Algorithm

2811.2 Markov Chain Monte Carlo

• A common choice for q: a Gaussian centered on the current state.

✓ Small variance: high acceptance rate, but slow exploration of the state space and strongly dependent samples.

✓ Large variance: high rejection rate.

Fig 11.10: use of an isotropic Gaussian proposal (blue circle) to sample from a correlated Gaussian distribution (red). The scale ρ of the proposal should be of the order of σ_min, but then convergence is slow: exploring the state space in the elongated direction requires of the order of (σ_max/σ_min)² iterations.

Page 29: PRML Reading Chapter 11 - Sampling Method

Summary so far…

2911.2 Markov Chain Monte Carlo

• We need approximate methods to evaluate sums and integrals.

• Monte Carlo does not explicitly depend on dimension, although simple methods work only in low dimensions.

• Markov Chain Monte Carlo (MCMC) can make local moves. By assuming less, it is more applicable in higher dimensions.

• Simple computations => “easy” to implement (but harder to diagnose).

Page 30: PRML Reading Chapter 11 - Sampling Method

Progress…

30

Basic SamplingAlgorithms

Markov ChainMonte Carlo

Gibbs Sampling

Slice Sampling

Hybrid Monte Carlo

Estimating thePartition Function

Sampling Methods

Page 31: PRML Reading Chapter 11 - Sampling Method

Gibbs Sampling

3111.3 Gibbs Sampling

• Sample each variable in turn, conditioned on the values of all the other variables in the distribution (a method with no rejection).

✓ Initialize {z_1, z_2, ..., z_M}.

✓ For τ = 1, 2, ..., T: pick each variable in turn (sequentially or randomly) and resample

z_i^(τ+1) ~ p(z_i | z_{\i}^(τ))   for i = 1, ..., M

Proof of validity

• Consider a Metropolis-Hastings sampling step involving the variable z_k, in which the remaining variables z_{\k} remain fixed, with transition probability

q_k(z*|z) = p(z*_k | z_{\k})

Then the acceptance probability is

A_k(z*, z) = [p(z*) q_k(z|z*)] / [p(z) q_k(z*|z)]

           = [p(z*_k|z*_{\k}) p(z*_{\k}) p(z_k|z*_{\k})] / [p(z_k|z_{\k}) p(z_{\k}) p(z*_k|z_{\k})] = 1

since z*_{\k} = z_{\k}.
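A runnable sketch of this scheme on the correlated-Gaussian example of Fig 11.11 (for a zero-mean bivariate Gaussian with unit variances and correlation ρ, each conditional is p(z1|z2) = N(ρ z2, 1 - ρ²); the helper name is mine):

```python
import math
import random

def gibbs_bivariate(n, rho, rng):
    """Gibbs sampling for a zero-mean bivariate Gaussian with correlation rho.

    Each full conditional is Gaussian: p(z1|z2) = N(rho*z2, 1 - rho^2),
    and symmetrically for z2; no rejection step is needed."""
    z1 = z2 = 0.0
    sd = math.sqrt(1.0 - rho * rho)
    chain = []
    for _ in range(n):
        z1 = rng.gauss(rho * z2, sd)   # resample z1 | z2
        z2 = rng.gauss(rho * z1, sd)   # resample z2 | z1
        chain.append((z1, z2))
    return chain

rng = random.Random(5)
chain = gibbs_bivariate(100_000, 0.9, rng)
cov = sum(a * b for a, b in chain) / len(chain)
# The empirical covariance should approach rho = 0.9; note the slow mixing
# for strongly correlated variables, as the slide's O((L/l)^2) remark warns.
```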

Page 32: PRML Reading Chapter 11 - Sampling Method

Gibbs Sampling

3211.3 Gibbs Sampling

Fig 11.11: Illustration of Gibbs sampling, by alternate updates of two variables (blue steps) whose distribution is a correlated Gaussian (red). The step size is governed by the standard deviation of the conditional distribution (green curve), and is O(l), leading to slow progress. The number of steps needed to obtain an independent sample from the distribution is O((L/l)2)

Page 33: PRML Reading Chapter 11 - Sampling Method

Progress…

33

Basic SamplingAlgorithms

Markov ChainMonte Carlo

Gibbs Sampling

Slice Sampling

Hybrid Monte Carlo

Estimating thePartition Function

Sampling Methods

Page 34: PRML Reading Chapter 11 - Sampling Method

Auxiliary variables

3411.4 Slice Sampling

• Collapsing: analytically integrate variables out.

• Auxiliary variable methods: introduce extra variables and integrate them out by MCMC.

Explore π(θ, h), where ∫ π(θ, h) dh = π(θ).

Page 35: PRML Reading Chapter 11 - Sampling Method

Slice Sampling

3511.4 Slice Sampling

• Problems of the Metropolis algorithm (symmetric proposal q(z|z') = q(z'|z)):

✓ If the step size is too small: slow convergence (random walk behavior).

✓ If the step size is too large: high rejection rate and high estimator variance.

• Idea: adapt the step size automatically to a suitable value.

• Technique: introduce an auxiliary variable u and sample (z, u) jointly; ignoring u leaves the desired samples from p(z).

Page 36: PRML Reading Chapter 11 - Sampling Method

Slice Sampling

3611.4 Slice Sampling

• Sample z and u uniformly from the area under the distribution:

✓ Fix z, sample u uniformly from [0, p̃(z)].

✓ Fix u, sample z uniformly from the “slice” through the distribution, {z : p̃(z) > u}.

• How to sample z from the slice:

✓ Start with a region of width w containing z^(τ).

✓ If an end point lies in the slice, extend the region by w in that direction.

✓ Sample z' uniformly from the region.

✓ If z' is in the slice, accept it as z^(τ+1).

✓ If not, make z' the new end point of the region and resample z'.

Multivariate distributions: slice sampling within a Gibbs sampler.

See next slides for more details
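The stepping-out and shrinkage steps above can be sketched for a one-dimensional target (a standard normal, my choice for checkability; unbounded stepping out is assumed, i.e. no cap on the number of step-outs):

```python
import math
import random

def slice_sample(n, w, rng):
    """Slice sampler for p(z) proportional to exp(-z^2/2).

    Each iteration draws u ~ U[0, p_tilde(z)], then samples z uniformly from
    the slice {z : p_tilde(z) > u} via stepping out and shrinkage."""
    p = lambda z: math.exp(-0.5 * z * z)
    z = 0.0
    chain = []
    for _ in range(n):
        u = rng.uniform(0.0, p(z))
        # Step out: place a width-w bracket randomly around z, then extend
        # each end by w while it still lies inside the slice.
        left = z - w * rng.random()
        right = left + w
        while p(left) > u:
            left -= w
        while p(right) > u:
            right += w
        # Shrink: sample inside the bracket; a rejected point becomes the
        # new end point on its side, so the bracket contracts toward z.
        while True:
            z_new = rng.uniform(left, right)
            if p(z_new) > u:
                z = z_new
                break
            if z_new < z:
                left = z_new
            else:
                right = z_new
        chain.append(z)
    return chain

rng = random.Random(6)
chain = slice_sample(50_000, 1.0, rng)
m2 = sum(z * z for z in chain) / len(chain)
# Second moment should be close to 1 for the standard normal target.
```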

Page 37: PRML Reading Chapter 11 - Sampling Method

Slice Sampling Idea

3711.4 Slice Sampling

Sample (z, u) uniformly under the curve p̃(z) ∝ p(z):

p(u|z) = Uniform[0, p̃(z)]

p(z|u) ∝ 1 if p̃(z) ≥ u, 0 otherwise = uniform on the slice

Slide from MCMC NIPS2015 tutorial

Page 38: PRML Reading Chapter 11 - Sampling Method

Slice Sampling Idea

3811.4 Slice Sampling

Unimodal conditionals: rejection sampling of p(z|u) using a broader uniform proposal.

Slide from MCMC NIPS2015 tutorial

Page 39: PRML Reading Chapter 11 - Sampling Method

Slice Sampling Idea

3911.4 Slice Sampling

Unimodal conditionals: adaptive rejection sampling of p(z|u).

Slide from MCMC NIPS2015 tutorial

Page 40: PRML Reading Chapter 11 - Sampling Method

Slice Sampling Idea

4011.4 Slice Sampling

Unimodal conditionals: quickly find a new z, with no rejection recorded.

Slide from MCMC NIPS2015 tutorial

Page 41: PRML Reading Chapter 11 - Sampling Method

Slice Sampling Idea

4111.4 Slice Sampling

Multimodal conditionals: use updates that leave p(z|u) invariant:

- place a bracket randomly around the current point

- linearly step out until both ends are off the slice

- sample on the bracket, shrinking as before

Slide from MCMC NIPS2015 tutorial

Page 42: PRML Reading Chapter 11 - Sampling Method

Progress…

42

Basic SamplingAlgorithms

Markov ChainMonte Carlo

Gibbs Sampling

Slice Sampling

Hybrid Monte Carlo

Estimating thePartition Function

Sampling Methods

Page 43: PRML Reading Chapter 11 - Sampling Method

Hybrid Monte Carlo

4311.5 Hybrid Monte Carlo

• The problem of the Metropolis algorithm is the step-size trade-off.

• Hybrid Monte Carlo is suitable for continuous state spaces:

✓ It can make large jumps in state space with a low rejection rate.

✓ It adopts physical-system (Hamiltonian) dynamics, rather than a probability distribution, to propose future states of the Markov chain.

• Goal: to sample from

p(z) = (1/Z_p) exp(-E(z))

where E(z) is interpreted as the potential energy of the system at state z.

Page 44: PRML Reading Chapter 11 - Sampling Method

Hamiltonian dynamics

4411.5 Hybrid Monte Carlo

• Hamiltonian dynamics describe how kinetic energy is converted to potential energy (and vice versa) as an object moves through time.

• The state variable z = {z_i} evolves under continuous time τ.

• Momentum variables correspond to the rate of change of the state:

r_i = dz_i/dτ    (11.53)

The joint (z, r) space is called phase space.

• For each location z there is an associated potential energy E(z), and for each momentum r an associated kinetic energy K(r). The total energy of the system is constant and is known as the Hamiltonian:

H(z, r) = E(z) + K(r)

with dynamics

dz_i/dτ = ∂H/∂r_i = ∂K(r)/∂r_i   and   dr_i/dτ = -∂H/∂z_i = -∂E(z)/∂z_i

• The dynamics preserve volume in phase space: div V = 0 with V = (dz/dτ, dr/dτ).    (11.62)

Page 45: PRML Reading Chapter 11 - Sampling Method

Simulating Hamiltonian dynamics

4511.5 Hybrid Monte Carlo

• Given expressions for the partial derivatives ∂E/∂z_i and ∂K/∂r_i and a set of initial conditions (z_0, r_0), we can predict the location and momentum at any point in time.

Leapfrog method (run for L steps to simulate the dynamics over L × δ units of time):

1. Take a half step in time to update the momentum variable:

r_i(τ + δ/2) = r_i(τ) - (δ/2) ∂E/∂z_i (z(τ))

2. Take a full step in time to update the position variable:

z_i(τ + δ) = z_i(τ) + δ ∂K/∂r_i (r(τ + δ/2))

3. Take the remaining half step in time to finish updating the momentum variable:

r_i(τ + δ) = r_i(τ + δ/2) - (δ/2) ∂E/∂z_i (z(τ + δ))

Page 46: PRML Reading Chapter 11 - Sampling Method

Simulating Hamiltonian oscillator

4611.5 Hybrid Monte Carlo

Harmonic Oscillator

F = -kz,   E(z) = -∫ F dz = kz²/2

K(v) = (mv)²/(2m) = v²/2 = r²/2 = K(r)   (taking m = k = 1)

Leapfrog equations:

1. r(τ + δ/2) = r(τ) - (δ/2) z(τ)

2. z(τ + δ) = z(τ) + δ r(τ + δ/2)

3. r(τ + δ) = r(τ + δ/2) - (δ/2) z(τ + δ)

Figure panels: the energies E, K, and total H over time, and the trajectory in (z, r) phase space.

Img Ref. https://theclevermachine.wordpress.com/2012/11/18/mcmc-hamiltonian-monte-carlo-a-k-a-hybrid-monte-carlo/
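The three leapfrog equations above can be simulated directly (k = m = 1 as in the slide; `leapfrog_oscillator` is an illustrative name, and δ = 0.1 is my choice):

```python
def leapfrog_oscillator(z, r, delta, steps):
    """Leapfrog integration of the harmonic oscillator H = z^2/2 + r^2/2
    (k = m = 1), following the half/full/half update pattern above."""
    for _ in range(steps):
        r -= (delta / 2.0) * z   # half step: dE/dz = z
        z += delta * r           # full step: dK/dr = r
        r -= (delta / 2.0) * z   # remaining half step
    return z, r

z0, r0 = 1.0, 0.0
z, r = leapfrog_oscillator(z0, r0, 0.1, 1000)
h0 = 0.5 * (z0 ** 2 + r0 ** 2)
h = 0.5 * (z ** 2 + r ** 2)
# The Hamiltonian stays approximately conserved even over many steps,
# which is why leapfrog proposals can travel far with few rejections.
```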

Page 47: PRML Reading Chapter 11 - Sampling Method

Target distribution

4711.5 Hybrid Monte Carlo

• Consider the canonical distribution p(θ) = (1/Z_p) exp(-E(θ)).

• The canonical distribution for the Hamiltonian energy function is

p(z, r) ∝ exp(-H(z, r)) = exp(-E(z) - K(r)) ∝ p(z) p(r)

so the state z and the momentum r are independently distributed.

• We can use Hamiltonian dynamics to sample from the joint canonical distribution over (z, r) and simply ignore the momentum contributions: this is the idea of introducing auxiliary variables (r) to facilitate the Markov chain over z.

• A common choice:

K(r) = rᵀr / 2   and   E(z) = -log p(z)

Page 48: PRML Reading Chapter 11 - Sampling Method

Hybrid Monte Carlo

4811.5 Hybrid Monte Carlo

• Combination of the Metropolis algorithm and Hamiltonian dynamics.

Algorithm to draw M samples from a target distribution:

1. Set τ = 0.

2. Generate an initial position state z^(0) ~ π^(0).

3. Repeat until τ = M:

- Set τ = τ + 1.

- Sample a new initial momentum variable from the momentum canonical distribution, r_0 ~ p(r).

- Set z_0 = z^(τ-1).

- Run the leapfrog algorithm starting at (z_0, r_0) for L steps with step size δ, obtaining the proposed states z* and r*.

- Calculate the Metropolis acceptance probability α = min(1, exp{H(z_0, r_0) - H(z*, r*)}).

- Draw a random number u uniformly from [0, 1]. If u ≤ α, accept the position and set the next state z^(τ) = z*; else set z^(τ) = z^(τ-1).
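A compact sketch of the algorithm above for a one-dimensional target (a standard normal, my choice, so E(z) = z²/2, grad E(z) = z, and K(r) = r²/2; δ, L, and all names are illustrative):

```python
import math
import random

def hmc(n, delta, L, rng):
    """Hamiltonian Monte Carlo for p(z) proportional to exp(-E(z)) with
    E(z) = z^2/2 (standard normal) and kinetic energy K(r) = r^2/2."""
    grad_E = lambda z: z
    z = 0.0
    chain = []
    for _ in range(n):
        r = rng.gauss(0.0, 1.0)            # fresh momentum r ~ p(r) = N(0, 1)
        z_new, r_new = z, r
        # Leapfrog simulation: half momentum step, L position steps with
        # interleaved full momentum steps, then the closing half step.
        r_new -= (delta / 2.0) * grad_E(z_new)
        for step in range(L):
            z_new += delta * r_new
            if step != L - 1:
                r_new -= delta * grad_E(z_new)
        r_new -= (delta / 2.0) * grad_E(z_new)
        # Metropolis correction on the total energy H = E + K.
        h0 = 0.5 * z * z + 0.5 * r * r
        h1 = 0.5 * z_new * z_new + 0.5 * r_new * r_new
        if rng.random() <= min(1.0, math.exp(h0 - h1)):
            z = z_new
        chain.append(z)
    return chain

rng = random.Random(7)
chain = hmc(20_000, 0.2, 10, rng)
mean = sum(chain) / len(chain)
var = sum(z * z for z in chain) / len(chain) - mean ** 2
# Near-conservation of H keeps the acceptance rate high despite long jumps.
```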

Page 49: PRML Reading Chapter 11 - Sampling Method

Hybrid Monte Carlo simulation

4911.5 Hybrid Monte Carlo

Hamiltonian Monte Carlo for sampling a bivariate normal distribution

p(z) = N(μ, Σ) with μ = [0, 0],   so E(z) = -log p(z) = zᵀΣ⁻¹z / 2 + const

The MH algorithm converges much more slowly than HMC, and consecutive samples have much higher autocorrelation than samples drawn using HMC.

Img Source. https://theclevermachine.wordpress.com/2012/11/18/mcmc-hamiltonian-monte-carlo-a-k-a-hybrid-monte-carlo/

Page 50: PRML Reading Chapter 11 - Sampling Method

Detailed balance

5011.5 Hybrid Monte Carlo

Consider an update R → R' produced by a sequence of L leapfrog iterations of step size δ, choosing a positive step size δ or a negative step size -δ with probability 1/2 each. Since leapfrog integration is time-reversible and preserves phase-space volume, detailed balance holds:

(1/Z_H) exp(-H(R)) δV (1/2) min{1, exp(H(R) - H(R'))} = (1/Z_H) exp(-H(R')) δV (1/2) min{1, exp(H(R') - H(R))}

where the left-hand side is the transition probability from R to R' and the right-hand side is the transition probability from R' to R.

Page 51: PRML Reading Chapter 11 - Sampling Method

Progress…

51

Basic SamplingAlgorithms

Markov ChainMonte Carlo

Gibbs Sampling

Slice Sampling

Hybrid Monte Carlo

Estimating thePartition Function

Sampling Methods

Page 52: PRML Reading Chapter 11 - Sampling Method

Estimating the Partition Function

5211.6 Estimating the Partition Function

• Most sampling algorithms require the distribution only up to the partition function Z_E, which is not needed in order to draw samples from p(z):

p_E(z) = (1/Z_E) exp{-E(z)},   Z_E = Σ_z exp{-E(z)}

• The partition function is nevertheless useful for model comparison, because it represents the probability of the observed data:

p(hidden|observed) = p(hidden, observed) / p(observed)

• For model comparison, we are interested in ratios of partition functions.

Page 53: PRML Reading Chapter 11 - Sampling Method

Using importance sampling

5311.6 Estimating the Partition Function

• Use importance sampling with a proposal p_G whose energy function is G(z):

Z_E / Z_G = Σ_z exp(-E(z)) / Σ_z exp(-G(z))

          = Σ_z exp(-E(z) + G(z)) exp(-G(z)) / Σ_z exp(-G(z))

          = E_{p_G}[exp(-E(z) + G(z))] ≈ (1/L) Σ_l exp(-E(z^(l)) + G(z^(l)))    (11.72)

where the z^(l) are sampled from p_G. If Z_G is easy to compute, this yields an estimate of Z_E.

• Problem: p_G needs to match p_E closely.

• Idea: use samples z^(l) drawn from p_E by a Markov chain, and define

p_G(z) = (1/L) Σ_{l=1}^{L} T(z^(l), z)    (11.73)

where T gives the transition probabilities of the chain. We then set G(z) = -log p_G(z) and use it in (11.72).

Page 54: PRML Reading Chapter 11 - Sampling Method

Chaining

5411.6 Estimating the Partition Function

• Estimating a ratio of partition functions requires the two distributions to be closely matched.

• Z_G must be evaluated exactly, which restricts p_G to simple distributions and gives poor matching with complicated distributions.

• Idea: use a set of intermediate distributions between the simple p_1 and the complex p_M:

Z_M / Z_1 = (Z_2/Z_1)(Z_3/Z_2) ... (Z_M/Z_{M-1})    (11.74)

E_α(z) = (1 - α) E_1(z) + α E_M(z)    (11.75)

• The intermediate distributions interpolate from E_1 to E_M.

• Use a single Markov chain, run initially for the system p_1 and then, after some suitable number of steps, moved on to the next distribution in the sequence.

Page 55: PRML Reading Chapter 11 - Sampling Method

Summary

55

Basic Sampling Algorithms

Markov Chain Monte Carlo

Gibbs Sampling

Slice Sampling

Hybrid Monte Carlo

Estimating the Partition Function

Part I: General concept of basic sampling

Part II:

MCMC world

Sampling Methods

Page 56: PRML Reading Chapter 11 - Sampling Method

Papers Alert

56Sampling Methods

• Markov Chain Monte Carlo Method without Detailed Balance (2010), Hidemaro Suwa and Synge Todo. http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.105.120603

• Hamiltonian Annealed Importance Sampling for partition function estimation (2012), Jascha Sohl-Dickstein and Benjamin J. Culpepper. http://arxiv.org/abs/1205.1925

• Hamiltonian Monte Carlo with Reduced Momentum Flips (2012), Jascha Sohl-Dickstein. http://arxiv.org/abs/1205.1939

• Hamiltonian Monte Carlo Without Detailed Balance (2014), Jascha Sohl-Dickstein. http://jmlr.org/proceedings/papers/v32/sohl-dickstein14.pdf

• A Markov Jump Process for More Efficient Hamiltonian Monte Carlo (2015), Jascha Sohl-Dickstein. http://arxiv.org/abs/1509.03808

• Markov Chain Monte Carlo and Variational Inference: Bridging the Gap (2015), Tim Salimans. http://jmlr.org/proceedings/papers/v37/salimans15.pdf

Page 57: PRML Reading Chapter 11 - Sampling Method

Observing Dark Worlds

57Dark Matter Worlds Halo

Dark Matter bends the light from a background galaxy; in extreme cases the galaxy is seen as the two arcs surrounding it.

https://www.kaggle.com/c/DarkWorlds

Page 58: PRML Reading Chapter 11 - Sampling Method

Observing Dark Worlds

58Dark Matter Worlds Halo

https://www.kaggle.com/c/DarkWorlds

We observe that this stuff aggregates and forms massive structures called Dark Matter Halos. There are many galaxies behind a Dark Matter halo, and their shapes correlate with its position.

Page 59: PRML Reading Chapter 11 - Sampling Method

Observing Dark Worlds

59Dark Matter Worlds Halo

https://www.kaggle.com/c/DarkWorlds

The task is then to use this “bending of light” to estimate where in the sky this dark matter is located.

Page 60: PRML Reading Chapter 11 - Sampling Method

Observing Dark Worlds

60Dark Matter Worlds Halo

https://www.kaggle.com/c/DarkWorlds

• It is really a problem of statistics: given the noisy data (the elliptical galaxies), recover the model and parameters (position and mass of the dark matter) that generated them.

• Step 1: construct a prior distribution p(x) for halo positions (e.g. uniform).

• Step 2: construct a probabilistic model for the data (the observed ellipticities of the galaxies), p(e|x):

p(e_i|x) = N( Σ_{j ∈ halos} d_{i,j} m_j f(r_{i,j}),  σ² )

http://timsalimans.com/observing-dark-worlds/

✦ d_{i,j} is the tangential direction, i.e. the direction in which halo j bends the light of galaxy i.
✦ m_j is the mass of halo j.
✦ f(r_{i,j}) is a decreasing function of the Euclidean distance r_{i,j} between galaxy i and halo j.

✦ For the large halos, assign m a log-uniform distribution on [40, 180] and set f(r_{i,j}) = 1/max(r_{i,j}, 240).
✦ For the small halos, fix the mass at 20 and set f(r_{i,j}) = 1/max(r_{i,j}, 70).

Page 61: PRML Reading Chapter 11 - Sampling Method

Observing Dark Worlds

61Dark Matter Worlds Halo

• Step 3: get the posterior distribution for halo positions, p(x|e) = p(e|x)p(x)/p(e) (a simple random-walk Metropolis-Hastings sampler approximates the posterior distribution).

• Step 4: minimize the expected loss:

x̃ = argmin_{prediction} E_{p(x|e)}[ L(prediction, x) ]

http://timsalimans.com/observing-dark-worlds/

Page 62: PRML Reading Chapter 11 - Sampling Method

Dark Matter Worlds Halo Slide from MCMC NIPS2015 tutorial

Page 63: PRML Reading Chapter 11 - Sampling Method

Page 64: PRML Reading Chapter 11 - Sampling Method