The Metropolis-Hastings algorithm
Peter Hoff
Song sparrow data

Data: subpopulation of n = 52 female song sparrows.
• y_i = number of offspring;
• x_i = age in years.

[Figure: scatterplot of offspring (0-7) against age (1-6 years).]
Poisson model

Let Y be the number of offspring for a randomly sampled bird of age x.

P(Y = y | x) = ?

Model: {Y | x} ∼ Poisson(θ_x)

Problem: We don't have much data for each x.

Solution: Assume E[Y | x] = θ_x varies smoothly with x.
Poisson regression

Identity link: E[Y | x] = β1 + β2 x + β3 x^2

Problem: The range of this regression function goes outside the range of Y (it can be negative, while Y is a nonnegative count).

Log link:

E[Y | x] = e^(β1 + β2 x + β3 x^2)
log E[Y | x] = β1 + β2 x + β3 x^2 = β^T x (here x denotes the vector (1, x, x^2)^T)

• β^T x is called the linear predictor;
• the log function links the linear predictor to E[Y | x];
• this is called a Poisson regression model with a log link.
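For comparison, the same log-link model can be fit by maximum likelihood with R's glm; a minimal sketch, assuming vectors y and age hold the offspring counts and ages:

## Non-Bayesian sanity check: Poisson regression with a log link.
## Assumes vectors y (offspring counts) and age are available.
fit <- glm(y ~ age + I(age^2), family = poisson(link = "log"))
coef(fit)   # MLEs of beta1, beta2, beta3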
Bayesian inference for Poisson regression

Simple conjugate priors for Poisson regression are not available.

Solution #1: Grid-based approximation.

Given a prior for β (say β ∼ N(0, 100 × I)):

1. Construct a grid of β-values:
   β1 ∈ {β1^(1), . . . , β1^(100)}
   β2 ∈ {β2^(1), . . . , β2^(100)}
   β3 ∈ {β3^(1), . . . , β3^(100)}
2. For each β = (β1^(i), β2^(j), β3^(k)), compute p_{i,j,k} = p(y | X, β) × p(β).
3. Use p_{i,j,k} / Σ_{i,j,k} p_{i,j,k} as an approximation to p(β | y, X).
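In R, the grid approximation might look like the following sketch; the grid ranges are illustrative assumptions chosen to cover the posterior mass, and y and X are the data (counts and the n × 3 design matrix with columns 1, age, age^2):

## Sketch of the grid-based approximation (slow: 10^6 evaluations).
g1 <- seq(-2, 2, length = 100)        # grid for beta1 (assumed range)
g2 <- seq(-1, 2, length = 100)        # grid for beta2 (assumed range)
g3 <- seq(-0.6, 0.4, length = 100)    # grid for beta3 (assumed range)
grid <- as.matrix(expand.grid(g1, g2, g3))   # 10^6 candidate beta values
lp <- apply(grid, 1, function(b)             # log p(y|X,b) + log p(b)
  sum(dpois(y, exp(X %*% b), log = TRUE)) +
  sum(dnorm(b, 0, 10, log = TRUE)))          # prior beta ~ N(0, 100 I)
p <- exp(lp - max(lp)) ; p <- p / sum(p)     # normalized posterior weights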
Grid-based approximation

[Figure: grid approximations to the marginal posterior densities p(β2 | y) and p(β3 | y), and a contour plot of the joint posterior of (β2, β3).]

This required calculation of p(y | X, β) × p(β) at 1 million grid points.
Two more predictors would require 10 billion grid points.
Metropolis algorithm

Objective: Construct a large collection of θ-values, {θ^(1), . . . , θ^(S)}, whose empirical distribution approximates p(θ | y).

Why?

mean(θ^(1), . . . , θ^(S)) ≈ E[θ | y]
sd(θ^(1), . . . , θ^(S)) ≈ SD[θ | y]

et cetera.
Metropolis algorithm

How? Roughly speaking, for any two different values θ_a and θ_b we need

#{θ^(s)'s in the collection equal to θ_a} / #{θ^(s)'s in the collection equal to θ_b} ≈ p(θ_a | y) / p(θ_b | y).

Task: Given {θ^(1), . . . , θ^(s)}, appropriately choose a new θ^(s+1) to add.

Idea: Let θ* be "near" θ^(s). Should θ* be added to the sample?

• If p(θ* | y) > p(θ^(s) | y), then yes:
  • we want more θ*'s than θ^(s)'s in the sample;
  • we already have θ^(s) in the sample;
  • so include θ* to balance things out.
• If p(θ* | y) < p(θ^(s) | y), then maybe.
Metropolis algorithm

r = p(θ* | y) / p(θ^(s) | y)
  = [ p(y | θ*) p(θ*) / p(y) ] × [ p(y) / ( p(y | θ^(s)) p(θ^(s)) ) ]
  = p(y | θ*) p(θ*) / [ p(y | θ^(s)) p(θ^(s)) ]

Note that the unknown marginal probability p(y) cancels, so r can be computed.

Having computed r, how should we proceed?

If r > 1:
Intuition: Since θ^(s) is already in our set, we should include θ*, as it has a higher probability than θ^(s).
Procedure: Accept θ* into our set, i.e. set θ^(s+1) = θ*.

If r < 1:
Intuition: The relative frequency of θ-values in our set equal to θ* compared to those equal to θ^(s) should be p(θ* | y) / p(θ^(s) | y) = r. This means that for every instance of θ^(s), we should have only a "fraction" of an instance of θ*.
Procedure: Set θ^(s+1) equal to either θ* or θ^(s), with probability r and 1 − r respectively.
Metropolis algorithm

Given θ^(s), generate a value θ^(s+1) as follows:

1. Sample θ* ∼ J(θ | θ^(s));
2. Compute the acceptance ratio

   r = p(θ* | y) / p(θ^(s) | y) = p(y | θ*) p(θ*) / [ p(y | θ^(s)) p(θ^(s)) ].

3. Let θ^(s+1) = θ* with probability min(r, 1), and θ^(s+1) = θ^(s) with probability 1 − min(r, 1).

Notes:

• J(θ* | θ^(s)) is a symmetric proposal distribution: J(θ_a | θ_b) = J(θ_b | θ_a).
• Step 3 can be implemented by simulating u ∼ uniform(0, 1) and setting θ^(s+1) = θ* if u < r, and θ^(s+1) = θ^(s) otherwise.
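In code, one update can be written as a short function; a minimal sketch, assuming a user-supplied helper log_post(theta) that returns log p(y | θ) + log p(θ) up to an additive constant, and a symmetric normal random-walk proposal (working on the log scale, as in the examples below):

## One Metropolis update with a symmetric normal random-walk proposal.
## log_post() is an assumed helper returning log p(y|theta) + log p(theta)
## up to a constant; delta is the proposal standard deviation.
metropolis_step <- function(theta, log_post, delta) {
  theta.star <- rnorm(1, theta, delta)              # 1. propose
  log.r <- log_post(theta.star) - log_post(theta)   # 2. log acceptance ratio
  if (log(runif(1)) < log.r) theta.star else theta  # 3. accept or stay
}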
Simple normal example

θ ∼ normal(µ, τ^2)
{y_1, . . . , y_n | θ} ∼ i.i.d. normal(θ, σ^2)

{θ | y_1, . . . , y_n} ∼ normal(µ_n, τ_n^2)

µ_n = ȳ (n/σ^2) / (n/σ^2 + 1/τ^2) + µ (1/τ^2) / (n/σ^2 + 1/τ^2)
τ_n^2 = 1 / (n/σ^2 + 1/τ^2).

Let's compare this known result to the Metropolis algorithm.
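The closed-form posterior parameters are easy to compute for checking; a sketch, using the data and the parameter names (y, s2, t2, mu) defined in the code further below:

## Exact posterior normal(mu.n, t2.n) for the conjugate normal model.
## Assumes y, s2 (sigma^2), t2 (tau^2), mu as in the Metropolis code below.
n <- length(y)
t2.n <- 1 / (n / s2 + 1 / t2)                  # posterior variance tau_n^2
mu.n <- t2.n * (mean(y) * n / s2 + mu / t2)    # posterior mean mu_n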
Simple normal example

The acceptance ratio is

r = p(θ* | y) / p(θ^(s) | y)
  = [ Π_{i=1}^n dnorm(y_i, θ*, σ) / Π_{i=1}^n dnorm(y_i, θ^(s), σ) ] × [ dnorm(θ*, µ, τ) / dnorm(θ^(s), µ, τ) ].

Problem: Computing r directly can be numerically unstable.

Solution: Take logs:

log r = Σ_{i=1}^n [ log dnorm(y_i, θ*, σ) − log dnorm(y_i, θ^(s), σ) ]
        + log dnorm(θ*, µ, τ) − log dnorm(θ^(s), µ, τ).

Accept the proposal if log u < log r, where u is a sample from the uniform distribution on (0, 1).
#### ---- Parameters and data
s2 <- 1 ; t2 <- 10 ; mu <- 5               # sigma^2, tau^2, prior mean
y <- c(9.37, 10.18, 9.16, 11.60, 10.33)    # data
theta <- 0                                 # starting value
delta <- 2                                 # proposal variance
S <- 10000 ; THETA <- NULL ; set.seed(1)
#### ----

#### ---- Metropolis algorithm
for (s in 1:S) {

  theta.star <- rnorm(1, theta, sqrt(delta))       # propose

  log.r <- ( sum(dnorm(y, theta.star, sqrt(s2), log = TRUE)) +
                 dnorm(theta.star, mu, sqrt(t2), log = TRUE) ) -
           ( sum(dnorm(y, theta, sqrt(s2), log = TRUE)) +
                 dnorm(theta, mu, sqrt(t2), log = TRUE) )    # log acceptance ratio

  if (log(runif(1)) < log.r) { theta <- theta.star }         # accept or stay

  THETA <- c(THETA, theta)                                   # save current state
}
#### ----
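The Monte Carlo approximations can then be checked against the exact values computed earlier (mu.n and t2.n as above):

mean(THETA) ; sd(THETA)    # Metropolis approximations to E[theta|y], SD[theta|y]
mu.n ; sqrt(t2.n)          # exact posterior mean and standard deviation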
[Figure: trace plot of θ over the 10,000 iterations, and the empirical density of the sampled θ-values.]
Tuning parameters

How do we choose δ?

Rerun the algorithm with δ ∈ {1/32, 2, 64}.

[Figure: trace plots of θ over the first 500 iterations for each of the three values of δ.]
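A sketch of this rerun, wrapping the sampler above in a function (the function and its arguments are an assumption, not shown on the original slides) and also recording the acceptance rate for each δ; y, s2, t2, mu are as defined earlier:

## Rerun the normal-model sampler for several proposal variances delta,
## tracking the acceptance rate. Assumes y, s2, t2, mu in the workspace.
run_mh <- function(delta, S = 500, theta = 0) {
  THETA <- numeric(S) ; acs <- 0
  for (s in 1:S) {
    theta.star <- rnorm(1, theta, sqrt(delta))
    log.r <- ( sum(dnorm(y, theta.star, sqrt(s2), log = TRUE)) +
               dnorm(theta.star, mu, sqrt(t2), log = TRUE) ) -
             ( sum(dnorm(y, theta, sqrt(s2), log = TRUE)) +
               dnorm(theta, mu, sqrt(t2), log = TRUE) )
    if (log(runif(1)) < log.r) { theta <- theta.star ; acs <- acs + 1 }
    THETA[s] <- theta
  }
  list(theta = THETA, acc.rate = acs / S)
}
sapply(c(1/32, 2, 64), function(d) run_mh(d)$acc.rate)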
Things to discuss

Choices:

• starting values: use ad hoc parameter estimates (e.g. MLE or OLS);
• proposal distribution: aim for an acceptance rate between 20% and 50%.

Assessment:

• convergence: inspect trace plots for stationarity; discard the initial "burn-in" iterations;
• autocorrelation: compute and report the effective sample size (see the sketch below).
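A sketch of these diagnostics for the normal example, assuming the vector THETA from the run above and an illustrative burn-in of 1000 iterations:

## Simple convergence and autocorrelation diagnostics.
library(coda)                       # for effectiveSize()
plot(THETA, type = "l")             # trace plot: check for stationarity
THETA.keep <- THETA[-(1:1000)]      # discard burn-in (illustrative choice)
acf(THETA.keep)                     # autocorrelation function
effectiveSize(THETA.keep)           # effective sample size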
Song sparrow example

library(mvtnorm)                       # for rmvnorm() proposals below
y <- yX.sparrow[,1] ; X <- yX.sparrow[,-1]
n <- length(y) ; p <- dim(X)[2]

pmn.beta <- rep(0, p)                  # prior expectation
psd.beta <- rep(10, p)                 # prior sd

var.prop <- var(log(y + 1/2)) * solve(t(X) %*% X)  # proposal variance
beta <- rep(0, p)                      # starting value
S <- 10000                             # number of iterations
BETA <- matrix(0, nrow = S, ncol = p)  # saved beta values
acs <- 0                               # acceptances
set.seed(1)                            # initialize RNG
Song sparrow example

for (s in 1:S) {

  # propose a new beta
  beta.p <- c(rmvnorm(1, beta, var.prop))

  # compute the log acceptance ratio
  lhr <- sum(dpois(y, exp(X %*% beta.p), log = TRUE)) -
         sum(dpois(y, exp(X %*% beta), log = TRUE)) +
         sum(dnorm(beta.p, pmn.beta, psd.beta, log = TRUE)) -
         sum(dnorm(beta, pmn.beta, psd.beta, log = TRUE))

  # accept or reject
  if (log(runif(1)) < lhr) { beta <- beta.p ; acs <- acs + 1 }

  BETA[s,] <- beta
}
apply(BETA,2,coda::effectiveSize)
## [1] 818.4049 778.4707 726.3633
[Figure: trace plot of β3 over the 10,000 iterations; autocorrelation functions of the β3 samples, unthinned and thinned to every 10th iteration; posterior expected number of offspring as a function of age (1-6).]
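The final panel can be reconstructed from the saved samples; a sketch, assuming the matrix BETA from the sampler above:

## Posterior summaries of E[Y|x] = exp(beta1 + beta2*age + beta3*age^2).
ages <- 1:6
Xpred <- cbind(1, ages, ages^2)                  # prediction design matrix
EY <- exp(BETA %*% t(Xpred))                     # S x 6 draws of E[Y|x]
qE <- apply(EY, 2, quantile, probs = c(0.025, 0.5, 0.975))
plot(ages, qE[2,], type = "l", ylim = range(qE),
     xlab = "age", ylab = "expected number of offspring")
lines(ages, qE[1,], lty = 2) ; lines(ages, qE[3,], lty = 2)  # 95% bands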