MLPI Lecture 3: Advanced Sampling Techniques
Dahua Lin, The Chinese University of Hong Kong


Overview

• Collapsed Gibbs Sampling

• Sampling with Auxiliary Variables

• Slice Sampling

• Simulated Tempering & Parallel Tempering

• Swendsen-Wang Algorithm

• Hamiltonian Monte Carlo

Collapsed Gibbs Sampling

Motivating Example

with

We want to sample from .

Gibbs Sampling

Draw  where :

with .

Gibbs Sampling (cont'd)

Draw :

• How well can this sampler perform when ?

Collapsed Gibbs Sampling

• Basic idea: replace the original conditional distribution with a conditional distribution of a marginal distribution, often called a reduced conditional distribution.

• Continuing the example above, consider a marginal distribution:

Collapsed Gibbs Sampling (cont'd)

• Draw , with  marginalized out, as:

• Draw 

• Can we exchange the order of these two steps? Why?
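The effect of collapsing can be seen in a minimal sketch. The model below is an illustrative assumption, not the slides' specific example: a bivariate normal with correlation rho. Plain Gibbs alternates the two full conditionals, while the collapsed sampler first draws x from its marginal (with y integrated out) and then draws y | x, so each collapsed draw is independent.

```python
import numpy as np

rho = 0.99          # strong correlation makes plain Gibbs mix slowly
rng = np.random.default_rng(0)
n = 20_000

# Plain Gibbs on a bivariate normal with unit variances and correlation rho:
# x | y ~ N(rho*y, 1-rho^2), y | x ~ N(rho*x, 1-rho^2).
x = y = 0.0
plain = np.empty(n)
for i in range(n):
    x = rho * y + np.sqrt(1 - rho**2) * rng.standard_normal()
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal()
    plain[i] = x

# Collapsed sampler: draw x from its marginal N(0, 1) (y integrated out),
# then draw y | x. Successive draws of x are independent.
collapsed = np.empty(n)
for i in range(n):
    x = rng.standard_normal()                     # reduced conditional = marginal
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal()
    collapsed[i] = x

# Lag-1 autocorrelation: near rho^2 for plain Gibbs, near 0 when collapsed.
def acf1(s):
    s = s - s.mean()
    return (s[:-1] * s[1:]).mean() / s.var()

print(acf1(plain), acf1(collapsed))
```

For this target the plain Gibbs chain has lag-1 autocorrelation close to rho^2 = 0.98, while the collapsed chain has essentially none.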

Basic Guidelines

• Order of steps matters!

• Generally, one can move components from "being sampled" to "being conditioned on".

• Replacing outputs with intermediates would change the stationary distribution.

• A variable can be updated multiple times in an iteration.

Why do collapsed samplers often perform better than full-fledged Gibbs samplers?

Rao-Blackwell Theorem

Consider an example  where we want to estimate . Suppose we have two tractable ways to do so:

(1) draw , and compute

Rao-Blackwell Theorem (cont'd)

(2) draw  where  is the marginal distribution, and compute

• Both are correct. By the Strong LLN, both  and  converge to  almost surely.

• Which one is better? Can you justify your answer?

Rao-Blackwell Theorem (cont'd)

• (Rao-Blackwell Theorem) Sample variance is reduced when some components are marginalized out. With the setting above, we have

• Generally, reducing sample variance also reduces the autocorrelation of the chain, thus improving the mixing performance.
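A small numerical check of this claim, under an assumed hierarchical model X ~ N(0, 1), Y | X ~ N(X, 1), with h(y) = y, chosen so that the Rao-Blackwellized term E[h(Y) | X] = X is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical model: X ~ N(0, 1), Y | X ~ N(X, 1), so E[Y] = 0.
x = rng.standard_normal(n)
y = x + rng.standard_normal(n)

# (1) Plain Monte Carlo: average h(Y) with h(y) = y.  Var(Y) = 2.
plain = y
# (2) Rao-Blackwellized: average E[h(Y) | X] = X, with Y integrated
#     out analytically.  Var(X) = 1, i.e. half the variance.
rb = x

print(plain.mean(), rb.mean())   # both near E[Y] = 0
print(plain.var(), rb.var())     # about 2 vs about 1
```

Both estimators are unbiased, but the per-sample variance of the Rao-Blackwellized version is strictly smaller, exactly as the theorem predicts.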

Sampling with Auxiliary Variables

• The Rao-Blackwell Theorem suggests that, to achieve better performance, one should try to marginalize out as many components as possible.

• However, in many cases one may want to do the opposite: introduce additional variables to facilitate the simulation.

• For example, when the target distribution is multimodal, one may use an auxiliary variable to help the chain escape from local traps.

Using Auxiliary Variables

• Specify an auxiliary variable  and the joint distribution  such that  for certain .

• Design a chain to update  using the M-H algorithm or the Gibbs sampler.

• The samples of  can then be obtained through marginalization or conditioning.

Slice Sampling

Slice Sampler

• Sampling  is equivalent to sampling uniformly from the area under : .

• Gibbs sampling is based on the uniform distribution over . Each iteration consists of two steps:

• Given , 

• Given , 
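For a target whose slice {x : f(x) >= u} can be solved in closed form, the two Gibbs steps can be sketched directly. The choice f(x) proportional to exp(-x^2/2) below is illustrative; its slice is the interval [-sqrt(-2 log u), sqrt(-2 log u)]:

```python
import numpy as np

def slice_sample_normal(n_samples, x0=0.0, seed=0):
    """Slice sampler for f(x) = exp(-x^2/2) (unnormalized N(0,1));
    the slice {x: f(x) >= u} is an interval with closed-form endpoints."""
    rng = np.random.default_rng(seed)
    x = x0
    out = np.empty(n_samples)
    for i in range(n_samples):
        f_x = np.exp(-0.5 * x * x)
        u = rng.uniform(0.0, f_x)          # vertical step: u | x ~ U(0, f(x))
        half = np.sqrt(-2.0 * np.log(u))   # slice is [-half, half]
        x = rng.uniform(-half, half)       # horizontal step: x | u uniform on slice
        out[i] = x
    return out

samples = slice_sample_normal(50_000)
print(samples.mean(), samples.std())       # close to 0 and 1
```

In general the horizontal step is the hard part; here it is trivial only because the slice is a single interval we can compute analytically.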

Slice Sampler (Illustration)

Slice Sampler (Discussion)

• The slice sampler can mix very rapidly, as it will not be locally trapped.

• The slice sampler is often nontrivial to implement in practice. Drawing  is sometimes very difficult.

• For distributions of certain forms, where there is an easy way to draw , slice sampling is a good strategy.

Simulated Tempering

Gibbs Measure

A Gibbs measure is a probability measure with a density of the following form:

Here,  is called the energy function,  is called the inverse temperature, and the normalizing constant  depends on .

Gibbs Measure (cont'd)

In the MCMC literature, we often parameterize a Gibbs measure using the temperature parameter , thus .

Tempered MCMC

Typical MCMC methods usually rely on local moves to explore the state space. What is the problem?

Tempered MCMC (cont'd)

Local traps often lead to very poor mixing. Can we improve this?

Simulated Tempering

Suppose we intend to sample from 

Basic idea: augment the target distribution by including a temperature index , with the joint distribution given by

Simulated Tempering (cont'd)

• We only collect samples at the lowest temperature, .

• The chain mixes much faster at high temperatures, but we want to collect samples at the lowest temperature, so we have to constantly switch between temperatures.

Simulated Tempering (Algorithm)

One iteration of Simulated Tempering has two steps:

• (Base transition): update  at the same temperature, i.e. holding  fixed.

• (Temperature switching): with  fixed, propose  with  such that 

• Accept the change with probability .

• Any drawbacks?
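The two steps can be sketched as follows. Everything concrete here is an illustrative assumption: a double-well energy U(x) = 8(x^2 - 1)^2 whose two modes trap a plain random walk, a four-level inverse-temperature ladder, and unit level weights (so the time spent per level is uneven):

```python
import numpy as np

def simulated_tempering(n_iters, betas, seed=0):
    """Simulated tempering on pi(x) ~ exp(-U(x)), U(x) = 8(x^2-1)^2,
    joint weight exp(-betas[k] * U(x)) with unit pseudo-priors."""
    U = lambda x: 8.0 * (x * x - 1.0) ** 2
    rng = np.random.default_rng(seed)
    x, k = 1.0, 0                      # state and temperature index
    base_samples = []
    for _ in range(n_iters):
        # Base transition: random-walk Metropolis at inverse temp betas[k].
        prop = x + 0.2 * rng.standard_normal()
        if rng.random() < np.exp(min(0.0, betas[k] * (U(x) - U(prop)))):
            x = prop
        # Temperature switching: propose a neighbouring level, x fixed.
        j = k + rng.choice((-1, 1))
        if 0 <= j < len(betas):
            # Acceptance ratio exp((beta_k - beta_j) U(x)) for unit weights.
            if rng.random() < np.exp(min(0.0, (betas[k] - betas[j]) * U(x))):
                k = j
        if k == 0:                     # collect only at the base level
            base_samples.append(x)
    return np.asarray(base_samples)

xs = simulated_tempering(200_000, betas=(1.0, 0.5, 0.25, 0.1))
print((xs > 0.5).mean(), (xs < -0.5).mean())  # both modes are visited
```

At beta = 1 a 0.2-step random walk essentially never crosses the barrier between the modes; with the ladder, the chain crosses at the hot levels and returns, so base-level samples cover both modes.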

Simulated Tempering (Discussion)

• Set . Given , we should set  such that uphill moves from ( ) have a considerable probability of being accepted.

• Build the temperature ladder step by step until we have a sufficiently smooth distribution at the top.

• The time spent at the base level  is around . If we have too many levels, only a very small portion of samples can be used.

Simulated Tempering (Discussion)

• All temperature levels play an important role, so it is desirable to spend a comparable amount of time at each level. Setting  for each , we have

• The normalizing constants  are typically unknown, and estimating them is very difficult and expensive.

Parallel Tempering

(Basic idea): rather than jumping between temperatures, simultaneously simulate multiple chains, each at a temperature level  (each chain is called a replica), and constantly swap samples between replicas.

Parallel Tempering (Algorithm)

Each iteration consists of the following steps:

• (Parallel update): simulate each replica with its own transition kernel.

• (Replica exchange): propose to swap states between two replicas (say the -th and -th, where ):

Parallel Tempering (Algorithm)

• The proposal is accepted with probability , where

• We collect samples from the base replica (the one with ).

• Why does this algorithm produce the desired distribution?
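A sketch of the two steps under illustrative assumptions (a double-well energy U(x) = 8(x^2 - 1)^2, random-walk kernels, and a four-level ladder); for a Gibbs-measure target, the swap acceptance ratio reduces to exp((beta_i - beta_j)(U(x_i) - U(x_j))):

```python
import numpy as np

def parallel_tempering(n_iters, betas, seed=0):
    """Parallel tempering on pi(x) ~ exp(-U(x)), U(x) = 8(x^2-1)^2,
    with one replica per inverse temperature in betas."""
    U = lambda x: 8.0 * (x * x - 1.0) ** 2
    rng = np.random.default_rng(seed)
    xs = np.ones(len(betas))           # one state per replica
    base = []
    for _ in range(n_iters):
        # Parallel update: random-walk Metropolis within each replica.
        for k, b in enumerate(betas):
            prop = xs[k] + 0.2 * rng.standard_normal()
            if rng.random() < np.exp(min(0.0, b * (U(xs[k]) - U(prop)))):
                xs[k] = prop
        # Replica exchange: propose swapping a random adjacent pair.
        i = rng.integers(len(betas) - 1)
        log_r = (betas[i] - betas[i + 1]) * (U(xs[i]) - U(xs[i + 1]))
        if rng.random() < np.exp(min(0.0, log_r)):
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
        base.append(xs[0])             # collect from the base replica
    return np.asarray(base)

samples = parallel_tempering(50_000, betas=(1.0, 0.5, 0.25, 0.1))
print((samples > 0.5).mean(), (samples < -0.5).mean())
```

Mode switches reach the cold replica through accepted swaps with the hotter replicas, so the base replica visits both wells even though its own kernel cannot cross the barrier.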

Parallel Tempering (Justification)

Let . We define

Obviously, the step of parallel update preserves the invariant distribution .

Parallel Tempering (Justification)

Note that the step of replica exchange is symmetric, i.e. the probabilities of going up and down are equal. Then, according to the Metropolis algorithm, we have  with

Parallel Tempering (Discussion)

• It is efficient and very easy to implement, especially in a parallel computing environment.

• Tuning a parallel tempering system is often an art rather than a technique.

• Parallel tempering is a special case of a large family of MCMC methods called Extended Ensemble Monte Carlo, which involve a collection of parallel Markov chains and switch the simulation between them.

Swendsen-Wang Algorithm

The Swendsen-Wang algorithm (R. Swendsen and J. Wang, 1987) is an efficient Gibbs sampling algorithm for sampling from the extended Ising model.

Standard Ising Model

The standard Ising model is defined as

where  for each  is called a spin, and .

• Gibbs sampling is extremely slow, especially when the temperature is low.

Extended Ising Model

• We extend the model by introducing additional bond variables , one for each edge. Each bond has two states:  indicating connected and  indicating disconnected.

• We define a joint distribution that couples the spins and bonds,

Extended Ising Model (cont'd)

Here,  is described as below:

• When ,  for every setting of 

• When , 

Extended Ising Model (cont'd)

With this setting,  can be written as:

where :

• when ,  must be 

• when ,  is set to zero with probability .

Swendsen-Wang Algorithm

Each iteration consists of two steps:

• (Clustering): conditioned on the spins , draw the bonds  independently. For an edge :

• If , set 

• If , set  with probability , or  otherwise.

Swendsen-Wang Algorithm

• (Swapping): conditioned on the bonds , draw the spins .

• For each connected component, draw  or  with equal chance, and assign the resulting value to all nodes in the component.
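The two steps can be sketched for the standard ferromagnetic Ising model on a grid with free boundaries; the bond probability p = 1 - exp(-2*beta) for equal neighbouring spins corresponds to unit coupling strength, an assumption of this sketch:

```python
import numpy as np

def sweep(spins, beta, rng):
    """One Swendsen-Wang iteration on an L x L Ising grid (free
    boundaries).  Bond probability for equal neighbours: 1 - exp(-2*beta)."""
    L = spins.shape[0]
    p = 1.0 - np.exp(-2.0 * beta)
    parent = np.arange(L * L)          # union-find over grid sites

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Clustering: a bond may open only between equal adjacent spins, w.p. p.
    for i in range(L):
        for j in range(L):
            for di, dj in ((1, 0), (0, 1)):
                ni, nj = i + di, j + dj
                if ni < L and nj < L and spins[i, j] == spins[ni, nj] \
                        and rng.random() < p:
                    parent[find(i * L + j)] = find(ni * L + nj)

    # Swapping: give every connected component a fresh +1/-1 spin.
    new_spin = {}
    out = np.empty_like(spins)
    for i in range(L):
        for j in range(L):
            r = find(i * L + j)
            if r not in new_spin:
                new_spin[r] = rng.choice((-1, 1))
            out[i, j] = new_spin[r]
    return out

rng = np.random.default_rng(0)
spins = rng.choice((-1, 1), size=(16, 16))
for _ in range(100):
    spins = sweep(spins, beta=0.3, rng=rng)
print(abs(spins.mean()))   # magnetization; small in the disordered phase
```

Because whole clusters flip at once, the update makes global moves that single-site Gibbs sampling cannot, which is the intuition behind its fast mixing.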

Swendsen-Wang Algorithm (Illustration)

In the case of a rectangular grid, this Gibbs sampling algorithm mixes very rapidly.

The following figures illustrate Gibbs sampling. Spin states up and down are shown by filled and empty circles. Bond states 1 and 0 are shown by thick lines and thin dotted lines. We start from a state with five connected components. (Remember that isolated spins count as connected components, albeit of size 1.)

First, let's update the bonds. The forbidden bonds are highlighted.

Bonds are forbidden from forming wherever the two adjacent spins are in opposite states. The bonds that are not forbidden are set to the 1 state with probability p.

After updating the bonds, we update the spins, then update the bonds again.

Other properties of the extended model

We already mentioned that the partition function Z is the same as that of the Ising model.

The marginal P(x) is correct, because when we sum the factor g_m over d_m, we get f_m. Summing over d_m is easy because it appears in only one factor.

Having summed out d, we obtain the Ising model. What if we sum out x? The marginal P(d) is called the random cluster model. Summing over x for a given d, all factors are constants, and the number of states is 2^(number of clusters). Thus

P(d) = (1/Z) ∏_m [ p^{d_m} (1 − p)^{1 − d_m} ] 2^{c(d)}    (10)

where c(d) is the number of connected components in the state d. Isolated spins whose neighbouring bonds are all zero count as single connected components.

The random cluster model can be generalized by replacing the number 2 by a parameter q:

P^{(q)}(d) ∝ ∏_m [ p^{d_m} (1 − p)^{1 − d_m} ] q^{c(d)}    (11)

The random cluster model can be simulated directly, just as the Ising model can be simulated directly; but the S-W method, augmenting the bonds with spins, is probably the most efficient way to simulate the model. For integer values of q, the appropriate spin system is the 'Potts model', the generalization of the Ising model from 2 spin states to q.


Swendsen-Wang Algorithm (Discussion)

• When  is large,  has a high probability of being set to one, i.e.  and  are likely to be connected.

• Experiments show that the Swendsen-Wang algorithm mixes very rapidly, especially on rectangular grids.

• Can you provide an intuitive explanation?

Swendsen-Wang Algorithm (Discussion)

• The Swendsen-Wang algorithm can be generalized to Potts models (nodes can take values from a finite set).

• The Swendsen-Wang algorithm has been widely used in image analysis applications, e.g. image segmentation (in this case, it is called Swendsen-Wang cut).

Hamiltonian Monte Carlo

• An MCMC method based on Hamiltonian dynamics. It was originally devised for molecular simulation.

• In 1987, a seminal paper by Duane et al. unified MCMC and molecular dynamics. They called it Hybrid Monte Carlo, which abbreviates to HMC.

• In many articles, people call it Hamiltonian Monte Carlo, as this name is considered more specific and informative, and it retains the same abbreviation "HMC".

Motivating Example: Free Fall

Motivating Example: Free Fall

• The change of momentum  is caused by the accumulation/release of the potential energy:

• The change of location  is caused by the velocity, the derivative of the kinetic energy w.r.t. the momentum:

Hamiltonian Dynamics

• Hamiltonian dynamics is a generalized theory of classical mechanics, which provides an elegant and flexible abstraction of a dynamic system in physics.

• In Hamiltonian dynamics, a physical system is described by , where  and  are respectively the position and momentum of the -th entity.

Hamilton's Equations

The dynamics of the system are characterized by Hamilton's equations:

Here,  is called the Hamiltonian, which can be interpreted as the total energy of the system.

Hamilton's Equations (cont'd)

• The Hamiltonian  is often formulated as the sum of the potential energy  and the kinetic energy :

• With this setting, Hamilton's equations become:

Conservation of the Hamiltonian

The Hamiltonian is conserved, i.e., it is invariant over time:

Intuitively, this reflects the law of energy conservation.

Hamiltonian Reversibility

• Hamiltonian dynamics are reversible.

• Let the initial states be  and the states at time  be . Then, if we reverse the process, starting at , the states at time  would be .

• In the context of MCMC, this leads to the reversibility of the underlying chain.

Simulation of Hamiltonian Dynamics

A natural idea for simulating Hamiltonian dynamics is to use Euler's method over discretized time steps:

Is this a good method?

Leapfrog Method

Better results can be obtained with the leapfrog method:

More importantly, the leapfrog update is reversible.
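A sketch of the leapfrog update for H(q, p) = U(q) + p^2/2 (unit mass), checked on a harmonic oscillator U(q) = q^2/2, where both near-conservation of H and reversibility are easy to verify:

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, n_steps):
    """Leapfrog integration of dq/dt = p, dp/dt = -grad U(q) (unit mass):
    half momentum step, alternating full steps, final half momentum step."""
    p = p - 0.5 * eps * grad_U(q)        # initial half step for momentum
    for _ in range(n_steps - 1):
        q = q + eps * p                  # full position step
        p = p - eps * grad_U(q)          # full momentum step
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)        # final half step for momentum
    return q, p

# Harmonic oscillator: U(q) = q^2/2, grad U(q) = q.  The exact flow is a
# circle in (q, p), and H = (q^2 + p^2)/2 should be nearly conserved.
q0, p0 = 1.0, 0.0
q1, p1 = leapfrog(q0, p0, lambda q: q, eps=0.1, n_steps=100)
H0 = 0.5 * (q0**2 + p0**2)
H1 = 0.5 * (q1**2 + p1**2)
print(H0, H1)                            # nearly equal

# Reversibility: run backwards from (q1, -p1) and recover (q0, -p0).
qr, pr = leapfrog(q1, -p1, lambda q: q, eps=0.1, n_steps=100)
print(qr, pr)
```

Unlike Euler's method, whose energy error grows along the trajectory, the leapfrog (a symplectic integrator) keeps the energy error bounded and oscillating, which is what makes long HMC trajectories feasible.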

Leapfrog Method (cont'd)

Example

Consider a Hamiltonian system:

Write down Hamilton's equations:

Derive the solution:

Example (Simulation)

Hamiltonian Monte Carlo

(Basic idea): consider the potential energy as the Gibbs energy, and introduce the "momenta" as auxiliary variables to control the dynamics.

Hamiltonian Monte Carlo (cont'd)

Suppose the target distribution is , then we form an augmented distribution as

Here, the locations  represent the variables of interest, and the momenta  control the dynamics of the simulation.

Hamiltonian Monte Carlo (cont'd)

In practice, the kinetic energy is often formalized as

Hamiltonian Monte Carlo (Algorithm)

Each iteration of HMC comprises two steps:

• Gibbs update: sample the momenta  from the Gaussian prior given by

Hamiltonian Monte Carlo (Algorithm)

• Metropolis update: use Hamiltonian dynamics to propose a new state. Starting from , simulate the dynamical system with the leapfrog method for  steps with step size , which yields . The proposed state is accepted with probability:
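The two steps can be sketched as follows, with an identity mass matrix and, as an illustrative target, a standard 2-D Gaussian (log pi(q) = -||q||^2/2, so U(q) = -log pi(q)):

```python
import numpy as np

def hmc_step(q, log_prob, log_prob_grad, eps, n_steps, rng):
    """One HMC iteration (identity mass matrix): Gibbs update of the
    momentum, a leapfrog trajectory, then a Metropolis accept/reject."""
    p = rng.standard_normal(q.shape)               # Gibbs update: p ~ N(0, I)
    # Leapfrog with dp/dt = -grad U(q) = grad log pi(q).
    q_new, p_new = q.copy(), p.copy()
    p_new = p_new + 0.5 * eps * log_prob_grad(q_new)
    for _ in range(n_steps - 1):
        q_new = q_new + eps * p_new
        p_new = p_new + eps * log_prob_grad(q_new)
    q_new = q_new + eps * p_new
    p_new = p_new + 0.5 * eps * log_prob_grad(q_new)
    # H = U + K with U = -log pi, K = p.p/2; accept w.p. min(1, e^{H - H'}).
    H_old = -log_prob(q) + 0.5 * p @ p
    H_new = -log_prob(q_new) + 0.5 * p_new @ p_new
    if rng.random() < np.exp(min(0.0, H_old - H_new)):
        return q_new
    return q

rng = np.random.default_rng(0)
q = np.zeros(2)
samples = []
for _ in range(5000):
    q = hmc_step(q, lambda q: -0.5 * q @ q, lambda q: -q,
                 eps=0.2, n_steps=10, rng=rng)
    samples.append(q)
samples = np.asarray(samples)
print(samples.mean(axis=0), samples.var(axis=0))  # near [0, 0] and [1, 1]
```

Because the leapfrog nearly conserves H, the acceptance rate stays high even though each proposal moves the full trajectory length away from the current state.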

HMC (Discussion)

• If the simulation were exact, we would have , and thus the proposed state would always be accepted.

• In practice, there can be some deviation due to discretization, so we have to use the Metropolis rule to guarantee correctness.

HMC (Discussion)

• HMC has a high acceptance rate while allowing large moves along less-constrained directions at each iteration.

• This is a key advantage over random-walk proposals, which, in order to maintain a reasonably high acceptance rate, have to keep a very small step size, resulting in substantial correlation between consecutive samples.

Tuning HMC

• For efficient simulation, it is important to choose appropriate values for both the leapfrog step size  and the number of leapfrog steps per iteration .

• Tuning HMC (and indeed many generic sampling methods) often requires preliminary runs with different trial settings and different initial values, as well as careful analysis of the energy trajectories.

Tuning HMC (cont'd)

• In most cases,  and  can be tuned independently.

• Too small a step size wastes computation time, while too large a step size causes unstable simulation and thus a low acceptance rate.

• One should choose  such that the energy trajectory is stable and the acceptance rate is maintained at a reasonably high level.

• One should choose  such that back-and-forth movement of the states can be observed.

Generic Sampling Systems

A number of software systems are available for sampling from models specified by the user.

• WinBUGS: based on BUGS (Bayesian inference Using Gibbs Sampling).

• Provides a friendly language for users to specify the model.

• Runs only on Windows.

• Note: development has stopped since 2007.

Generic Sampling Systems (cont'd)

• JAGS: "Just Another Gibbs Sampler"

• Cross-platform support

• Uses a dialect of BUGS

• Extensible: allows users to write customized functions, distributions, and samplers

Generic Sampling Systems (cont'd)

• Stan: "Sampling Through Adaptive Neighborhoods"

• Core written in C++, with ports available in Python, R, Matlab, and Julia

• A user-friendly language for model specification

• Uses Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS) as core algorithms

• Open source (GPLv3 licensed) and under active development on GitHub

Stan Example

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  for (n in 1:N)
    y[n] ~ normal(alpha + beta * x[n], sigma);
}

Generic Sampling Systems vs. Dedicated Algorithms

Generic               | Dedicated
Easy to use           | Requires knowledge and experience
High productivity     | Time-consuming to develop
Slow                  | Often remarkably more efficient
Limited flexibility   | Necessary for many new models