
Propp-Wilson Algorithm (and sampling the Ising model)

Danny Leshem, Nov 2009

References:

Häggström, O. (2002), Finite Markov Chains and Algorithmic Applications, Cambridge University Press, ch. 10–11.

Propp, J. G. & Wilson, D. B. (1996), “Exact sampling with coupled Markov chains and applications to statistical mechanics”, Random Structures and Algorithms 9, 223–252.

2 / 26

Problems with MCMC (e.g. Gibbs sampler)

Propp-Wilson algorithm

Formal description

Simple example

Proof of correctness

Counter-examples for algorithm variations

Sandwiching technique

Simple example

Ising model

Description

Exact sampling using Propp-Wilson with Sandwiching

3 / 26

We want to sample from distribution π

Construct a (finite, irreducible, aperiodic) Markov chain with

(unique) stationary distribution π

MCMC (e.g. the Gibbs sampler)

Starts from some initial state s

Returns a state s distributed approximately according to π

Must be run for a sufficiently long time

Propp-Wilson

s is distributed exactly according to π (“perfect simulation”)

The algorithm stops automatically

4 / 26

We want to simulate (on a computer) the running of a Markov chain with state space S = {s_1, …, s_k} and transition matrix P

Assume we can generate U ~ Uniform[0, 1]

Not feasible on a real computer, but let’s ignore that

We say φ: S × [0, 1] → S is a valid update function for P if

∀ s_i, s_j ∈ S:  ℙ(φ(s_i, U) = s_j) = P_{i,j}

Of course, update functions are not unique

Given an initial state s and a sequence U_0, U_1, … we can simulate a run of the chain
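To make this concrete, here is one way to build a valid update function from a transition matrix, sketched in Python (the 2-state matrix P and the function names are arbitrary illustrations of mine, not from the slides):

```python
import random

# An arbitrary 2-state transition matrix, just for illustration.
P = [[0.3, 0.7],
     [0.6, 0.4]]

def phi(state, u):
    """A valid update function for P: split [0, 1) into intervals whose
    lengths are the transition probabilities out of `state`, and map u
    to the state whose interval it falls in."""
    acc = 0.0
    for j, p in enumerate(P[state]):
        acc += p
        if u < acc:
            return j
    return len(P[state]) - 1  # guard against floating-point rounding at u near 1

def simulate(start, us):
    """Simulate a run of the chain from `start`, driven by the numbers us."""
    s = start
    for u in us:
        s = phi(s, u)
    return s
```

For example, `simulate(0, [random.random() for _ in range(100)])` runs the chain for 100 steps; by construction ℙ(φ(s_i, U) = s_j) = P_{i,j}.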

5 / 26

We want to sample from a distribution π on S = {s_1, …, s_k}

Construct a reversible, irreducible & aperiodic Markov chain with state space S and stationary distribution π

Let P be the transition matrix of the chain

Let φ: S × [0, 1] → S be some update function for P

Let (N_1, N_2, …) be an increasing sequence of positive integers

The negative numbers (−N_1, −N_2, …) will be used as starting times for the Markov chain

Usually (N_1, N_2, …) = (1, 2, 4, 8, …)

Let U_0, U_{−1}, … be a sequence of i.i.d. random variables ~ Uniform[0, 1]

6 / 26

1. Set m = 1

2. For each s ∈ S, simulate the Markov chain starting at time −N_m in state s and running up to time 0, using the update function φ with the sequence U_{−N_m+1}, U_{−N_m+2}, …, U_{−1}, U_0

3. If all k chains end up in the same state s at time 0, output s and stop

4. Set m := m + 1 and go to step 2

Note: The same sequence U_0, U_{−1}, … is used for all k chains.
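The four steps can be sketched directly in code. A minimal sketch for a chain on states {0, …, k−1}, where `phi` is any valid update function (the function names are mine; the doubling sequence (1, 2, 4, 8, …) and the reuse of the U's follow the slides):

```python
import random

def propp_wilson(k, phi):
    """Coupling from the past: an exact sample from the stationary
    distribution of the chain whose dynamics `phi` encodes."""
    us = []   # us[t] plays the role of U_{-t}; REUSED across iterations
    n = 1     # N_m, doubled each iteration
    while True:
        while len(us) < n:
            us.append(random.random())
        # Step 2: run every initial state from time -n up to time 0,
        # feeding all k chains the SAME random numbers.
        states = list(range(k))
        for t in range(n - 1, -1, -1):      # U_{-n+1}, ..., U_{-1}, U_0
            states = [phi(s, us[t]) for s in states]
        # Step 3: coalescence check at time 0.
        if all(s == states[0] for s in states):
            return states[0]
        n *= 2                              # step 4: look further back
```

With an update function that sends both states of a 2-state chain to the same place, coupled chains coalesce after a single step, so this returns almost immediately.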

7 / 26

[Figure: illustration of the algorithm’s runs from the past; see HAG, p. 78]

8 / 26

Is the algorithm guaranteed to stop?

Depends on the Markov chain and choice of update function

Example [HAG, problem 10.2]:

Consider the Markov chain with P = [ 0.5 0.5 ; 0.5 0.5 ]

φ_good(s_i, U) = s_1 if U ∈ [0, 0.5), s_2 if U ∈ [0.5, 1]

φ_bad(s_1, U) = s_1 if U ∈ [0, 0.5), s_2 if U ∈ [0.5, 1]

φ_bad(s_2, U) = s_2 if U ∈ [0, 0.5), s_1 if U ∈ [0.5, 1]

φ_independent(s_1, U) = s_1 if U ∈ [0, 0.25) ∪ [0.5, 0.75), s_2 if U ∈ [0.25, 0.5) ∪ [0.75, 1]

φ_independent(s_2, U) = s_2 if U ∈ [0, 0.5), s_1 if U ∈ [0.5, 1]
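These three update functions are easy to express in code; a small sketch, with state s_i represented as i − 1:

```python
def phi_good(s, u):
    # Both states are sent to the same place, so coupled chains
    # coalesce after a single step.
    return 0 if u < 0.5 else 1

def phi_bad(s, u):
    # Stay put for u < 0.5, swap states otherwise. Still a valid
    # update function for P, but the two coupled chains can never meet.
    return s if u < 0.5 else 1 - s

def phi_independent(s, u):
    # The two chains' moves are independent of each other.
    if s == 0:
        return 0 if (0 <= u < 0.25 or 0.5 <= u < 0.75) else 1
    return 1 if u < 0.5 else 0
```

With phi_bad, the Propp-Wilson algorithm never terminates; with phi_independent it terminates almost surely, but more slowly than with phi_good.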

9 / 26

“0-1 law” for termination of the Propp-Wilson algorithm

If every two initial states have a positive probability of meeting within R steps, then they will meet almost surely by the Borel–Cantelli lemma (the meeting time is dominated by a geometric distribution)

Special case: if each of the 𝑘 Markov chains is (ergodic and)

independent of the others, the algorithm will (almost surely) stop

But the right dependence can speed things up considerably

10 / 26

[HAG] Theorem 10.1: Let S, P, φ, π, (N_1, N_2, …) be as before. Suppose the Propp-Wilson algorithm terminates almost surely, and let Y be its output. Then for every i ∈ {1, …, k} we have:

ℙ(Y = s_i) = π_i

Proof: Let s_i be any state and let ε > 0. It is enough to show

|ℙ(Y = s_i) − π_i| ≤ ε

The algorithm terminates almost surely, so we can pick M large enough so that

ℙ(the algorithm ran for at most M iterations) ≥ 1 − ε

11 / 26

We now use a coupling argument (“coupling from the past”): imagine running two chains from time −N_M to time 0 (using the same update function and random variables)

(X_{−N_M}, …, X_0) with initial state set to, say, s_1

(X̃_{−N_M}, …, X̃_0) with initial state chosen according to π

Since π is stationary, X̃_0 is also distributed according to π

The punchline: for our large M, we have ℙ(X_0 = X̃_0) ≥ 1 − ε, since whenever the algorithm has coalesced by iteration M, all chains started at time −N_M agree at time 0 regardless of their initial states

And by the coupling lemma: X_0 is distributed like π (up to total variation distance ε)

To complete the proof, we note that Y = X_0 (for “M = ∞”)

We showed that for every ε > 0, Y is distributed like π (up to total variation distance ε), meaning Y is distributed exactly π

12 / 26

Why bother with running the chains further and further into the

past? Let’s start all chains at time 0, run them forward until the

first time they “merge” and output that value!

[HAG, counter-example 10.1]:

Consider the Markov chain with P = [ 0.5 0.5 ; 1 0 ]

It is reversible with stationary distribution π = (π_1, π_2) = (2/3, 1/3)

Consider running two chains starting at time 0, with initial states s_1, s_2. Say they merge for the first time at time N. By definition, at time N − 1 they don’t have the same value, so one must be in s_2. Surely, then, that chain will be in s_1 at time N… and so we have sampled from the distribution (1, 0).
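This flawed forward variant can be checked empirically; a sketch (names are mine):

```python
import random

def phi(s, u):
    """Update function for P = [0.5 0.5; 1 0] (states 0 and 1)."""
    if s == 0:
        return 0 if u < 0.5 else 1
    return 0          # s_2 always moves to s_1

def forward_coupling():
    """The broken variant: run both chains forward from time 0 and
    output the state at which they first merge."""
    a, b = 0, 1
    while a != b:
        u = random.random()
        a, b = phi(a, u), phi(b, u)
    return a
```

Every run returns state 0 (i.e. s_1), exactly as the argument above predicts, even though π = (2/3, 1/3).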

13 / 26

Must we really save and reuse the random variables

𝑈−𝑁𝑚 +1,𝑈−𝑁𝑚+2,… ,𝑈−1,𝑈0 when restarting the chain at

time −𝑁𝑚+1? Let’s generate a new sequence in each iteration!

[HAG, counter-example 10.2]:

Consider the Markov chain with P = [ 0.5 0.5 ; 1 0 ]

Run Propp-Wilson with a new sequence of random variables

every iteration, and take (𝑁1,𝑁2,… ) = (1,2,4,8,… )

Let 𝑌 be the output of this algorithm, and let 𝑀 be the

iteration in which the algorithm stopped.

14 / 26

We have:

ℙ(Y = s_1) = Σ_{m=1}^∞ ℙ(Y = s_1, M = m) ≥ ℙ(Y = s_1, M = 1) + ℙ(Y = s_1, M = 2)

= ℙ(M = 1) · ℙ(Y = s_1 | M = 1) + ℙ(M = 2) · ℙ(Y = s_1 | M = 2)

= (1/2) · 1 + (3/8) · (2/3) = 3/4 > 2/3

The calculation is justified [HAG, problem 10.3]:

ℙ(M = 2) = ℙ(M ≠ 1) · ℙ(M = 2 | M ≠ 1) = (1/2) · (3/4) = 3/8
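The bias is easy to observe numerically. A sketch of the broken variant that redraws its random numbers at every restart (names are mine):

```python
import random

def phi(s, u):
    """Update function for P = [0.5 0.5; 1 0] (states 0 and 1)."""
    if s == 0:
        return 0 if u < 0.5 else 1
    return 0

def cftp_fresh_randomness():
    """Propp-Wilson with the bug: a NEW sequence of uniforms is drawn
    in every iteration instead of reusing the old one."""
    n = 1
    while True:
        us = [random.random() for _ in range(n)]   # fresh draws: the bug
        states = [0, 1]
        for u in us:
            states = [phi(s, u) for s in states]
        if states[0] == states[1]:
            return states[0]
        n *= 2
```

Averaging many runs gives ℙ(Y = s_1) noticeably above π_1 = 2/3, in line with the 3/4 lower bound computed above.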

15 / 26

Propp-Wilson dictates that we follow k different Markov chains (one for each state in S). For large k, this is infeasible.

Do we really need to run all chains?

Some Markov chains obey certain monotonicity properties (with respect to some ordering of the state space)

This allows us to run only a small subset of the k chains (say, 2)

16 / 26

Consider the “ladder walk” Markov chain with

P = [ 0.5 0.5 0   0   0
      0.5 0   0.5 0   0
      0   0.5 0   0.5 0
      0   0   0.5 0   0.5
      0   0   0   0.5 0.5 ]

“Try to take a step up or down the ladder with probability 0.5 each. If you can’t, stay where you are.”

The stationary distribution is uniform: π = (1/5, …, 1/5)

Define the following update function:

φ(s, U) = max(1, s − 1) if U ∈ [0, 0.5), min(5, s + 1) if U ∈ [0.5, 1]

It preserves the ordering between the states:

x ∈ [0, 1], i ≤ j ∈ {1, …, 5} ⇒ φ(i, x) ≤ φ(j, x)

17 / 26

All chains are bounded between the top and bottom chain.

When the two meet, all others join as well!

It’s enough to run just these two chains (2 ≪ 𝑘)
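For the ladder walk this gives a complete sandwiching sampler; a sketch tracking only the bottom and top chains (function names are mine):

```python
import random

def phi(s, u):
    """Monotone update function for the 5-state ladder walk."""
    return max(1, s - 1) if u < 0.5 else min(5, s + 1)

def cftp_sandwich():
    """Propp-Wilson running only the two extreme chains: when the
    bottom chain (from state 1) meets the top chain (from state 5),
    every in-between chain has coalesced with them too."""
    us = []   # us[t] plays the role of U_{-t}; reused across iterations
    n = 1
    while True:
        while len(us) < n:
            us.append(random.random())
        lo, hi = 1, 5
        for t in range(n - 1, -1, -1):   # run from time -n up to time 0
            lo, hi = phi(lo, us[t]), phi(hi, us[t])
        if lo == hi:
            return lo
        n *= 2
```

Samples come out uniform on {1, …, 5}, matching the stationary distribution.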

18 / 26

Let G = (V, E) be a graph

The Ising model is a way to pick a random ξ ∈ {−1, +1}^V

vertices are atoms in a ferromagnetic material

we assign a spin orientation to each vertex

Let β ≥ 0 be the fixed inverse temperature of the model

Define the energy of a spin configuration as

H(ξ) = − Σ_{⟨x,y⟩ ∈ E} ξ(x) · ξ(y)

“low energy = high agreement”

Define the probability measure π_{G,β} as

π_{G,β}(ξ) = (1 / Z_{G,β}) · e^{−β·H(ξ)}

Z_{G,β} is just a normalizing constant

19 / 26

The case β = 0 (“infinite temperature”)

Each configuration is equally probable, meaning each vertex independently takes a value in {−1, +1} w.p. 1/2 each

The case β → ∞ (“zero temperature”)

The probability mass is equally divided between the “all plus” and “all minus” configurations

Phase transition phenomena

The model depends qualitatively on whether β is above or below a certain threshold value β_c

e.g. for the m × m square lattice, β_c = (1/2)·log(1 + √2) ≈ 0.441

For β < β_c, the proportion of +1 spins tends to 1/2 as m → ∞

For β > β_c, one of the spins “takes over” as m → ∞

20 / 26

[Figure: sample configurations on the square lattice at β = 0, 0.15, 0.3, 0.5]

21 / 26

Given β and G = (V, E), define a Markov chain (the Gibbs sampler)

S = {−1, +1}^V

Given X_n we obtain X_{n+1} by picking a vertex x ∈ V at random and then picking X_{n+1}(x) according to the conditional distribution (under π_{G,β}) given X_n(V ∖ {x}). Other vertices remain unchanged.

Notation

For x ∈ V, ξ ∈ {−1, +1}^{V ∖ {x}} define

ξ⁺ ∈ {−1, +1}^V : the expanded configuration with ξ⁺(x) = +1

ξ⁻ ∈ {−1, +1}^V : the expanded configuration with ξ⁻(x) = −1

k⁺(x, ξ) : the number of +1 neighbors of x (in ξ)

k⁻(x, ξ) : the number of −1 neighbors of x (in ξ)

22 / 26

Lemma [HAG, problem 11.3]:

π_{G,β}(ξ⁺) / π_{G,β}(ξ⁻) = e^{2β(k⁺(x,ξ) − k⁻(x,ξ))}

Proof:

By definition:

π_{G,β}(ξ) = (1 / Z_{G,β}) · e^{−β·H(ξ)} = (1 / Z_{G,β}) · e^{β · Σ_{⟨y,z⟩ ∈ E} ξ(y)·ξ(z)}

So, cancelling identical factors (edges not incident to x):

π_{G,β}(ξ⁺) / π_{G,β}(ξ⁻) = e^{β(k⁺(x,ξ) − k⁻(x,ξ))} / e^{β(k⁻(x,ξ) − k⁺(x,ξ))} = e^{2β(k⁺(x,ξ) − k⁻(x,ξ))}

23 / 26

Corollary:

π_{G,β}(X(x) = +1 | X(V ∖ {x}) = ξ) = e^{2β(k⁺(x,ξ) − k⁻(x,ξ))} / (e^{2β(k⁺(x,ξ) − k⁻(x,ξ))} + 1)

Proof:

π_{G,β}(X(x) = +1 | X(V ∖ {x}) = ξ) = π_{G,β}(X(x) = +1 and X(V ∖ {x}) = ξ) / π_{G,β}(X(V ∖ {x}) = ξ)

= π_{G,β}(ξ⁺) / (π_{G,β}(ξ⁺) + π_{G,β}(ξ⁻))

Dividing numerator and denominator by π_{G,β}(ξ⁻):

= (π_{G,β}(ξ⁺) / π_{G,β}(ξ⁻)) / (π_{G,β}(ξ⁺) / π_{G,β}(ξ⁻) + 1)

And then apply the lemma.

24 / 26

Define a (partial) ordering on the state space

For two configurations ξ, η ∈ {−1, +1}^V we write η ≼ ξ if

∀x ∈ V: η(x) ≤ ξ(x)

We have a unique maximum ξ_max (“all +1”) and minimum ξ_min (“all −1”)

By the corollary, our previous update function can be based on

X_{n+1}(x, ξ) = +1 if U_{n+1} < e^{2β(k⁺(x,ξ) − k⁻(x,ξ))} / (e^{2β(k⁺(x,ξ) − k⁻(x,ξ))} + 1), and −1 otherwise

25 / 26

Claim: Given any vertex x and η ≼ ξ, define

ξ₂(y) = ξ(y) if y ≠ x; +1 if y = x and U < e^{2β(k⁺(x,ξ) − k⁻(x,ξ))} / (e^{2β(k⁺(x,ξ) − k⁻(x,ξ))} + 1); −1 otherwise

And similarly define η₂. Then η₂ ≼ ξ₂.

Proof: For y ≠ x this is a direct result of η ≼ ξ.

For x, we note that k⁺(x, ξ) ≥ k⁺(x, η) and k⁻(x, ξ) ≤ k⁻(x, η).

So 2β(k⁺(x, ξ) − k⁻(x, ξ)) ≥ 2β(k⁺(x, η) − k⁻(x, η)).

Finally,

e^{2β(k⁺(x,ξ) − k⁻(x,ξ))} / (e^{2β(k⁺(x,ξ) − k⁻(x,ξ))} + 1) ≥ e^{2β(k⁺(x,η) − k⁻(x,η))} / (e^{2β(k⁺(x,η) − k⁻(x,η))} + 1)

since z / (z + 1) is strictly increasing for z ≥ 0, so ξ₂(x) = +1 whenever η₂(x) = +1. This proves the claim.
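Combining this monotone single-site update with sandwiching yields an exact Ising sampler on a small grid; a sketch under my own naming (random-scan Gibbs updates, one (vertex, uniform) pair per past time step, all reused across restarts):

```python
import math
import random

def ising_cftp(n, beta):
    """Exact sample from pi_{G,beta} on the n x n square lattice via
    monotone coupling from the past (sandwiching between the all-plus
    and all-minus configurations)."""
    V = [(i, j) for i in range(n) for j in range(n)]

    def nbrs(v):
        i, j = v
        return [(a, b) for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                if 0 <= a < n and 0 <= b < n]

    def update(spins, v, u):
        # Heat-bath update from the corollary:
        # P(spin = +1 | rest) = r / (r + 1) with r = e^{2 beta (k+ - k-)}.
        kp = sum(spins[w] == +1 for w in nbrs(v))
        km = sum(spins[w] == -1 for w in nbrs(v))
        r = math.exp(2 * beta * (kp - km))
        spins[v] = +1 if u < r / (r + 1) else -1

    moves = []        # (vertex, uniform) pairs; moves[0] acts last (time 0)
    T = len(V)        # look-back horizon, doubled on failure
    while True:
        while len(moves) < T:
            moves.append((random.choice(V), random.random()))
        top = {v: +1 for v in V}       # chain started from xi_max
        bot = {v: -1 for v in V}       # chain started from xi_min
        for v, u in reversed(moves):   # run from time -T up to time 0
            update(top, v, u)
            update(bot, v, u)
        if top == bot:                 # sandwiched chains have coalesced
            return top
        T *= 2
```

By the claim above the two chains stay ordered, so their meeting forces coalescence of every chain in between, and the returned configuration is distributed exactly according to π_{G,β} (feasible in practice only when coalescence is fast, as the next slide notes).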

26 / 26

We showed that it is enough to run only two chains in order to sample a configuration from the Ising model.

But even so, do these chains “join” within a reasonable time?

Depends on the graph and on the temperature…

For the m × m square lattice

if β < β_c then this (expected) time grows like a low-degree polynomial in m.

if β > β_c it grows exponentially in m.

Propp & Wilson showed one can still use their algorithm

(with another twist) to obtain samples in both cases.