Hastings paper discussion

Transcript
Page 1: Hastings paper discussion


Monte Carlo Sampling Methods Using Markov Chains and Their Applications

Hastings, University of Toronto

Reading seminar on classics: C. P. Robert; presented by: Donia Skanji

December 3, 2012


Page 2: Hastings paper discussion


Outline

1 Introduction

2 Monte Carlo Principle

3 Markov Chain Theory

4 MCMC

5 Conclusion


Page 3: Hastings paper discussion


Introduction to MCMC Methods


Page 4: Hastings paper discussion


Introduction:

Several numerical problems, such as integral computation and maximum evaluation in large-dimensional spaces, arise in practice.

Monte Carlo methods are often applied to solve integration and optimisation problems.

Markov chain Monte Carlo (MCMC) is one of the best-known Monte Carlo methods.

MCMC methods comprise a large class of sampling algorithms that have had a great influence on the development of science.


Page 5: Hastings paper discussion


To present some relevant theory and techniques of application related to MCMC methods.

To present a generalization of the Metropolis sampling method.

Study objective


Page 6: Hastings paper discussion

Next Steps

Monte Carlo Principle

Markov Chain

To introduce:

- MCMC Methods

- MCMC Algorithms

Page 11: Hastings paper discussion


Monte Carlo Methods


Page 12: Hastings paper discussion


Overview

The idea of Monte Carlo simulation is to draw an i.i.d. set of samples $\{x^{(i)}\}_{i=1}^{N}$ from a target density $\pi$.

These N samples can be used to approximate the target density with the following empirical point-mass function:

$\pi_N(x) = \frac{1}{N}\sum_{i=1}^{N} \delta_{x^{(i)}}(x)$

For independent samples, by the Law of Large Numbers, one can approximate the integrals $I(f)$ with tractable sums $I_N(f)$ that converge as follows:

$I_N(f) = \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)}) \rightarrow I(f) = \int f(x)\,\pi(x)\,dx \quad \text{a.s.}$

see example
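As a concrete illustration of this principle, here is a minimal R sketch; the target π is taken to be the standard normal and f(x) = x², so the exact value of I(f) is 1 (both choices are illustrative, not from the slides):

set.seed(1)
N <- 10000
x <- rnorm(N)        # i.i.d. samples from pi = N(0, 1)
I_N <- mean(x^2)     # I_N(f) = (1/N) * sum of f(x_i)
I_N                  # approaches I(f) = 1 as N grows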


Page 13: Hastings paper discussion


[Figure: N samples x1, x2, ..., xN drawn from the target π]

But independent sampling from π may be difficult, especially in a high-dimensional space.


Page 14: Hastings paper discussion


It turns out that $\frac{1}{N}\sum_{i=1}^{N} f(x^{(i)}) \rightarrow \int f(x)\,\pi(x)\,dx$ as $N \to \infty$ still applies if we generate the samples using a Markov chain (dependent samples).

The idea of MCMC is to use Markov chain convergence properties to overcome the dimensionality problems met by regular Monte Carlo methods.

But first, a brief review of Markov chains on a discrete state space χ.


Page 15: Hastings paper discussion


Markov Chain Theory


Page 16: Hastings paper discussion


Definition

Finite Markov Chain

A Markov chain is a mathematical system that undergoes transitions from one state to another, between a finite or countable number of possible states. It is a random process usually characterized as memoryless:

$P(X^{(t+1)} \mid X^{(0)}, X^{(1)}, \ldots, X^{(t)}) = P(X^{(t+1)} \mid X^{(t)})$


Page 17: Hastings paper discussion


Transition Matrix

Let $P = \{P_{ij}\}$ be the transition matrix of a Markov chain with states $0, 1, 2, \ldots, S$. Then, if $X^{(t)}$ denotes the state occupied by the process at time t, we have:

$\Pr(X^{(t+1)} = j \mid X^{(t)} = i) = P_{ij}$

$X^{(t+1)} = X^{(t)} P$ (reading $X^{(t)}$ here as the row vector of state probabilities at time t)
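For instance, iterating this relation in R with a small illustrative 3-state chain (the matrix P below is a made-up example, not from the paper):

# Evolve the state distribution by repeated right-multiplication with P.
P <- matrix(c(0.50, 0.50, 0.00,
              0.25, 0.50, 0.25,
              0.00, 0.50, 0.50),
            nrow = 3, byrow = TRUE)   # each row sums to 1
x <- c(1, 0, 0)                       # start in the first state
for (t in 1:100) x <- x %*% P         # X(t+1) = X(t) . P
x                                     # converges to the stationary distribution (0.25, 0.50, 0.25)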


Page 18: Hastings paper discussion


Properties

Stationarity

As $t \to \infty$, the Markov chain converges to its stationary (invariant) distribution: $\pi = \pi P$.

Irreducibility

Irreducible means any state can be reached from any other state in a finite number of moves ($p(i, j) > 0$ for every i and j).



Page 22: Hastings paper discussion


MCMC

The idea of the Markov chain Monte Carlo method is to choose the transition matrix P so that π (the target density, which is very difficult to sample from directly) is its unique stationary distribution.

Assume the Markov chain:

- has a stationary distribution π(x)

- is irreducible and aperiodic

Then we have an Ergodic Theorem:

Theorem (Ergodic Theorem)

If the Markov chain $x_t$ is irreducible, aperiodic and stationary, then for any function $h$ with $E|h| < \infty$:

$\frac{1}{N}\sum_{i} h(x_i) \rightarrow \int h(x)\, d\pi(x) \quad \text{as } N \to \infty$


Page 23: Hastings paper discussion


Recall that our goal is to build a Markov chain $(X_t)$ using a transition matrix $P$ so that the limiting distribution of $(X_t)$ is the target density $\pi$, and integrals can be approximated using the ergodic theorem.

Summary


Page 24: Hastings paper discussion


Question

How do we construct a Markov chain whose stationary distribution is the target distribution π?

Metropolis et al. (1953) showed how.

The method was generalized by Hastings (1970).


Page 25: Hastings paper discussion


Construction of the transition matrix

In order to construct a Markov chain with π as its stationary distribution, we have to consider a transition matrix P that satisfies the reversibility condition: for all i and j,

$\pi_i\, p(i \to j) = \pi_j\, p(j \to i)$, i.e. $\pi_i p_{ij} = \pi_j p_{ji}$

This property ensures that $\sum_i \pi_i p_{ij} = \pi_j$ (the definition of a stationary distribution) and hence that π is a stationary distribution of P.


Page 26: Hastings paper discussion


Construction of the transition matrix

How do we choose the transition matrix P so that the reversibility condition $\pi_i P_{ij} = \pi_j P_{ji}$ is verified?


Page 27: Hastings paper discussion


Overview

Suppose that we have a proposal matrix, denoted Q, where $\sum_j q_{ij} = 1$.

If it happens that Q itself satisfies the reversibility condition $\pi_i q_{ij} = \pi_j q_{ji}$ for all i and j, then our search is over; but most likely it will not.

We might find, for example, that for some i and j: $\pi_i q_{ij} > \pi_j q_{ji}$.

A convenient way to correct this is to reduce the number of moves from i to j by introducing a probability $\alpha_{ij}$ that the move is made.


Page 28: Hastings paper discussion


The choice of the transition matrix

We assume that the transition matrix P has this form:

$P_{ij} = q_{ij}\alpha_{ij}$ if $i \neq j$, and $P_{ii} = 1 - \sum_{j \neq i} P_{ij}$

where:

- $Q = (q_{ij})$ is the proposal matrix (or jumping matrix) of an arbitrary Markov chain on the states $0, 1, \ldots, S$, which suggests a new sample value j given a sample value i;

- $\alpha_{ij}$ is the acceptance probability to move from state i to state j.


Page 29: Hastings paper discussion


In order to obtain the reversibility condition, we have to verify:

$\pi_i p_{ij} = \pi_j p_{ji}$, i.e. $\pi_i \alpha_{ij} q_{ij} = \pi_j \alpha_{ji} q_{ji}$ (*)

The probabilities $\alpha_{ij}$ and $\alpha_{ji}$ are introduced to ensure that the two sides of (*) are in balance.

In his paper, Hastings defined a generic form of the acceptance probability:

$\alpha_{ij} = \dfrac{s_{ij}}{1 + \dfrac{\pi_i q_{ij}}{\pi_j q_{ji}}}$

where $s_{ij}$ is a symmetric function of i and j ($s_{ij} = s_{ji}$), chosen so that $0 \leq \alpha_{ij} \leq 1$ for all i and j.

With this form of $P_{ij}$ and $\alpha_{ij}$ suggested by Hastings, the reversibility condition is readily verified.
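To spell out the verification: substituting the generic form of $\alpha_{ij}$ into the left-hand side of (*) gives

$\pi_i q_{ij} \alpha_{ij} = \dfrac{s_{ij}\,\pi_i q_{ij}\,\pi_j q_{ji}}{\pi_j q_{ji} + \pi_i q_{ij}}$

which is symmetric in i and j (since $s_{ij} = s_{ji}$), and hence equals $\pi_j q_{ji} \alpha_{ji}$.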


Page 30: Hastings paper discussion


The acceptance probability α

Recall that in this paper, Hastings defined the acceptance probability $\alpha_{ij}$ as follows:

$\alpha_{ij} = \dfrac{s_{ij}}{1 + \dfrac{\pi_i q_{ij}}{\pi_j q_{ji}}}$

For specific choices of $s_{ij}$, we recognize the acceptance probabilities suggested by both:

- Metropolis et al. (1953)

- Barker (1965)

The choice of α


Page 31: Hastings paper discussion


The acceptance probability α

Two choices of $s_{ij}$ are given for all i and j by:

$s^{(M)}_{ij} = \begin{cases} 1 + \dfrac{\pi_i q_{ij}}{\pi_j q_{ji}} & \text{if } \dfrac{\pi_j q_{ji}}{\pi_i q_{ij}} \geq 1 \\ 1 + \dfrac{\pi_j q_{ji}}{\pi_i q_{ij}} & \text{if } \dfrac{\pi_j q_{ji}}{\pi_i q_{ij}} \leq 1 \end{cases}$

When $q_{ij} = q_{ji}$ and $s_{ij} = s^{(M)}_{ij}$, we have the method devised by Metropolis et al., with $\alpha^{(M)}_{ij} = \min\left(1, \dfrac{\pi_j}{\pi_i}\right)$.

When $q_{ij} = q_{ji}$ and $s_{ij} = s^{(B)}_{ij} = 1$, we have the method devised by Barker, with $\alpha^{(B)}_{ij} = \dfrac{\pi_j}{\pi_i + \pi_j}$.

The choice of $s_{ij}$
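For a symmetric proposal the two rules are easy to compare side by side; a small R sketch (the ratio values r are arbitrary illustrative inputs):

# Acceptance probabilities for a symmetric proposal (q_ij = q_ji),
# as functions of the target ratio r = pi_j / pi_i.
alpha_metropolis <- function(r) pmin(1, r)    # Metropolis et al. (1953)
alpha_barker     <- function(r) r / (1 + r)   # Barker (1965)
r <- c(0.1, 0.5, 1, 2, 10)
rbind(metropolis = alpha_metropolis(r), barker = alpha_barker(r))

Note that min(1, r) ≥ r/(1 + r) for all r > 0, so the Metropolis rule always accepts at least as often.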


Page 32: Hastings paper discussion


In this paper, Hastings mentioned that little is known about the merits of these two choices $s^{(M)}_{ij}$ and $s^{(B)}_{ij}$.

Remark


Page 33: Hastings paper discussion


The Proposal Matrix Q

It has been recognised that the choice of the proposal matrix/density is crucial to the success (rapid convergence) of an MCMC algorithm.

The proposal matrix can be chosen almost arbitrarily; in practice it should let the chain reach all states frequently and keep the acceptance rate high.

The choice of Q


Page 34: Hastings paper discussion


Algorithm

1 First, pick a proposal matrix Q(i, j) of an arbitrary Markov chain on the states 0, 1, ..., S, which suggests a new sample value j given a sample value i.

2 Also, start with some arbitrary point i0 as the first sample.

3 Then, to return a new sample j given the most recent sample i, we proceed as follows:

4 Generate a proposed new sample value j from the jumping distribution Q(i → j).

5 Accept the proposal with probability α(i → j):

- if the proposal is accepted, move to j; otherwise stay at i, and return to step 4;

- repeat until a sample of the desired size is obtained.
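A compact R sketch of these steps for a finite state space (states are indexed 1..S here because R indexes from 1; pi need only be known up to a normalizing constant, and the min(1, ·) acceptance corresponds to the Metropolis choice of s_ij):

# Generic Metropolis-Hastings sampler on a finite state space.
mh_sample <- function(pi, Q, i0, n) {
  S <- length(pi)
  chain <- integer(n)
  i <- i0
  for (t in 1:n) {
    j <- sample.int(S, 1, prob = Q[i, ])                    # step 4: propose j ~ Q(i, .)
    alpha <- min(1, (pi[j] * Q[j, i]) / (pi[i] * Q[i, j]))  # step 5: acceptance probability
    if (runif(1) < alpha) i <- j                            # accept, else stay at i
    chain[t] <- i
  }
  chain
}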



Page 41: Hastings paper discussion


An empirical way of checking convergence is to let two or more different chains run in parallel and see whether they concentrate on the same place.

The calculation of α does not require knowledge of the normalizing constant of π, because it appears in both the numerator and the denominator and therefore cancels.

Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density. As a result, a burn-in period is typically necessary.

Remarks


Page 42: Hastings paper discussion


Example: Poisson Distribution as the Target Distribution

Consider π as the Poisson distribution with intensity λ > 0:

$\pi_i = e^{-\lambda} \dfrac{\lambda^i}{i!}$ where $i = 0, 1, 2, \cdots$

Hastings (1970) suggests the following proposal transition matrix:

$q_{ij} = \begin{cases} \tfrac{1}{2} & \text{if } j = i - 1 \\ \tfrac{1}{2} & \text{if } j = i + 1 \\ 0 & \text{otherwise} \end{cases}$ with $q_{00} = q_{01} = \tfrac{1}{2}$ for the boundary state $i = 0$

$Q = \begin{pmatrix} \frac{1}{2} & \frac{1}{2} & 0 & 0 & \cdots \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 & \cdots \\ 0 & \frac{1}{2} & 0 & \frac{1}{2} & \cdots \\ 0 & 0 & \frac{1}{2} & 0 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}$

Q is in fact symmetric, and the algorithm reduces to that of Metropolis.


Page 43: Hastings paper discussion


$p_{ij} = q_{ij}\,\alpha^{(M)}_{ij} = \begin{cases} \tfrac{1}{2}\min\left(1, \tfrac{i}{\lambda}\right) & \text{if } j = i - 1 \\ \tfrac{1}{2}\min\left(1, \tfrac{\lambda}{i+1}\right) & \text{if } j = i + 1 \\ 1 - p_{i,i-1} - p_{i,i+1} & \text{if } j = i \\ 0 & \text{otherwise} \end{cases}$

For i = 0:

$p_{0j} = \begin{cases} \tfrac{1}{2}\min(1, \lambda) & \text{if } j = 1 \\ 1 - \tfrac{1}{2}\min(1, \lambda) & \text{if } j = 0 \\ 0 & \text{otherwise} \end{cases}$

This transition probability is aperiodic and irreducible. In practice, if λ is small, this choice of Q seems to work fairly well and quickly to approximate π.
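Detailed balance for this chain can be checked numerically; a short R sketch (λ and the range of states below are illustrative choices):

# Verify pi_i * p(i, i+1) = pi_{i+1} * p(i+1, i) for the Poisson chain.
lambda <- 3
pi_pois <- function(i) dpois(i, lambda)
p_up    <- function(i) 0.5 * min(1, lambda / (i + 1))   # p(i, i+1)
p_down  <- function(i) 0.5 * min(1, i / lambda)         # p(i, i-1)
i <- 0:10
max(abs(pi_pois(i) * sapply(i, p_up) - pi_pois(i + 1) * sapply(i + 1, p_down)))
# prints ~0 (up to floating-point error)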


Page 44: Hastings paper discussion


Algorithm

Given a starting point i, we take j = i + 1 with probability 1/2, or j = i - 1 with probability 1/2:

$q_{ij} = \tfrac{1}{2}\delta_{i-1}(j) + \tfrac{1}{2}\delta_{i+1}(j)$

We calculate the Metropolis-Hastings ratio:

$\alpha_{ij} = \min\left\{1, \dfrac{\pi(j)}{\pi(i)}\right\} = \min\left\{1, \lambda^{(j-i)} \times \dfrac{i!}{j!}\right\}$

Let $u \sim U[0, 1]$:

if $u \leq \alpha_{ij}$ then $X_{k+1} = j$, else $X_{k+1} = X_k = i$.



Page 53: Hastings paper discussion


R implementation

> library(mcsm)
> fact = function(n) { gamma(n + 1) }   # n! via the gamma function
> poissonf = function(n, lambda, x0) {
    x = x0
    xn = x0
    for (i in 1:n) {
      if (xn != 0)
        y = xn + (2 * rbinom(1, 1, 0.5) - 1)   # propose xn - 1 or xn + 1
      else
        y = rbinom(1, 1, 0.5)                  # at 0, propose 0 or 1
      alpha = min(1, lambda^(y - xn) * fact(xn) / fact(y))
      if (runif(1) < alpha) { xn = y }         # accept, else stay at xn
      x = c(x, xn)
    }
    x
  }
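One way to check the sampler against its target (a usage sketch; the run length and burn-in below are illustrative):

> out <- poissonf(10000, lambda = 3, x0 = 0)
> out <- out[-(1:1000)]                        # discard a burn-in period
> round(table(out) / length(out), 3)           # compare with dpois(0:10, 3)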



Page 55: Hastings paper discussion


Multivariate Target

If the distribution π is d-dimensional and the simulated process is $X(t) = \{X_1(t), \cdots, X_d(t)\}$, we may use the following techniques to construct the transition matrix P:

1 In the transition from t to t + 1, all co-ordinates of X(t) may be changed.

2 In the transition from t to t + 1, only one co-ordinate of X(t) may be changed, the selection being made at random among the d co-ordinates.

3 In the transition from time t to t + 1, only one co-ordinate may change in each transition, the co-ordinate being selected in a fixed rather than a random sequence.


Page 56: Hastings paper discussion


Hastings transformed the d-dimensional problem into one-dimensional problems.

The approach is based on updating one component at a time.

The transition matrix is defined as follows: $P = P_1 P_2 \cdots P_d$.

For each $k = 1, \ldots, d$, $P_k$ is constructed so that $\pi P_k = \pi$.

π will be a stationary distribution of P, since $\pi P = \pi P_1 \cdots P_d = \pi P_2 \cdots P_d = \cdots = \pi$.

Hastings' justification

Orthogonal Matrices
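A sketch of the one-component-at-a-time idea in R, for a continuous d-dimensional target (the random-walk proposal, its scale, and the example target are illustrative assumptions, not Hastings' exact construction):

# One-component-at-a-time Metropolis-Hastings, cycling k = 1..d in a fixed order.
# log_pi: log of the target density; x0: starting point; n: number of sweeps.
componentwise_mh <- function(log_pi, x0, n, sd = 1) {
  d <- length(x0)
  x <- x0
  out <- matrix(NA_real_, n, d)
  for (t in 1:n) {
    for (k in 1:d) {                          # each P_k leaves pi invariant
      y <- x
      y[k] <- x[k] + rnorm(1, 0, sd)          # propose a change in co-ordinate k only
      if (log(runif(1)) < log_pi(y) - log_pi(x)) x <- y
    }
    out[t, ] <- x
  }
  out
}
# Example: samples <- componentwise_mh(function(x) -sum(x^2) / 2, c(0, 0), 5000)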


Page 57: Hastings paper discussion


Conclusion

+ In this paper, Hastings gives a generalization of the Metropolis et al. (1953) approach.

+ He also introduced the Gibbs sampling strategy when he presented the multivariate target.

+ Hastings treated the continuous case using a discretization analogy.

- Little information is given about the merits of the Metropolis and Barker acceptance forms.


Page 58: Hastings paper discussion


Thank You For Your Attention


Page 59: Hastings paper discussion


Bibliography

[1] W. K. Hastings (1970). Monte Carlo Sampling Methods Using Markov Chains and Their Applications.
[2] Christian P. Robert (2010). Introducing Monte Carlo Methods with R.
[3] Kenneth Lange (2010). Numerical Analysis for Statisticians.
[4] Siddhartha Chib (1995). Understanding the Metropolis-Hastings Algorithm.
[5] Robert Gray (2001). Advanced Statistical Computing.


Page 60: Hastings paper discussion


Random orthogonal Matrices

Hastings suggests an interesting chain on the space of n × n orthogonal matrices ($H'H = I$, $\det(H) = 1$).

The proposal stage of Hastings' algorithm consists of choosing at random two indices i and j and an angle $\theta \in [0, 2\pi]$.

The proposed replacement for the current rotation matrix H is then $H^\ast = E_{ij}(\theta)\, H$.

$E_{ij}(\theta)$ coincides with the identity matrix except for some entries.

Since $E_{ij}(\theta)^{-1} = E_{ij}(-\theta)$, the transition density is symmetric and the induced Markov chain is reversible.
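In code, one proposal step might look like this (a sketch; the sign convention chosen for $E_{ij}(\theta)$ is one standard form of a Givens rotation):

# E_ij(theta): identity matrix except in rows/columns i and j.
givens <- function(n, i, j, theta) {
  E <- diag(n)
  E[i, i] <- cos(theta); E[j, j] <- cos(theta)
  E[i, j] <- -sin(theta); E[j, i] <- sin(theta)
  E
}
# Propose a replacement for the current rotation matrix H.
propose <- function(H) {
  ij <- sample(nrow(H), 2)           # two random indices i != j
  theta <- runif(1, 0, 2 * pi)       # random angle in [0, 2*pi]
  givens(nrow(H), ij[1], ij[2], theta) %*% H
}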


Page 61: Hastings paper discussion


Estimating π using Monte Carlo methods (SAS output)

Problem: Estimate π using Monte Carlo integration.

Strategy: The equation of a circle with radius 1 is $x^2 + y^2 = 1$, which can be written $y = \sqrt{1 - x^2}$.

Area of this circle = π.

Area of this circle in the first quadrant = π/4.

Generate $U_x \sim \text{Uniform}(0, 1)$ and $U_y \sim \text{Uniform}(0, 1)$.

Check to see if $U_y \leq \sqrt{1 - U_x^2}$.

The proportion of generated points for which this condition is true is an estimate of π/4.

Based on 10,000 simulated points using SAS: $\hat{\pi}$ (SE) = 3.1056 (0.016).
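The slide reports SAS output; an equivalent sketch in R (the sample size N and the seed are arbitrary):

set.seed(1)
N <- 10000
ux <- runif(N); uy <- runif(N)                 # uniform points in the unit square
inside <- as.numeric(uy <= sqrt(1 - ux^2))     # 1 if under the quarter circle
pi_hat <- 4 * mean(inside)                     # estimate of pi
se <- 4 * sd(inside) / sqrt(N)                 # its standard error
c(pi_hat, se)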
