The Metropolis-Hastings algorithm
Peter Hoff
Song sparrow data

Data: subpopulation of n = 52 female song sparrows.
• y_i = number of offspring;
• x_i = age in years.

[Figure: scatterplot of offspring (0-7) against age (1-6 years).]
Poisson model

Let Y be the number of offspring for a randomly sampled bird of age x.

P(Y = y | x) = ?

Model: {Y | x} ∼ Poisson(θ_x)

Problem: We don't have much data for each x.

Solution: Assume E[Y | x] = θ_x varies smoothly with x.
Poisson regression

Identity link: E[Y | x] = β1 + β2 x + β3 x^2

Problem: The range of this regression function goes outside the range of Y (it can be negative, while Y is a nonnegative count).

Log link:

E[Y | x] = e^(β1 + β2 x + β3 x^2)
log E[Y | x] = β1 + β2 x + β3 x^2 = β^T x (here x denotes the vector (1, x, x^2)^T)

• β^T x is called the linear predictor;
• the log function links the linear predictor to E[Y | x];
• this is called a Poisson regression model with a log link.
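For comparison, the same log-link model can be fit by maximum likelihood with R's glm; a minimal sketch, assuming vectors y and age hold the offspring counts and ages:

## Non-Bayesian sanity check: Poisson regression with a log link.
## Assumes vectors y (offspring counts) and age are available.
fit <- glm(y ~ age + I(age^2), family = poisson(link = "log"))
coef(fit)   # MLEs of beta1, beta2, beta3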
Bayesian inference for Poisson regression

Simple conjugate priors for Poisson regression are not available.

Solution #1: Grid-based approximation.

Given a prior for β (say β ∼ N(0, 100 × I)):

1. Construct a grid of β-values:
   β1 ∈ {β1^(1), . . . , β1^(100)}
   β2 ∈ {β2^(1), . . . , β2^(100)}
   β3 ∈ {β3^(1), . . . , β3^(100)}
2. For each β = (β1^(i), β2^(j), β3^(k)), compute p_{i,j,k} = p(y | X, β) × p(β).
3. Use p_{i,j,k} / Σ_{i,j,k} p_{i,j,k} as an approximation to p(β | y, X).
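In R, the grid approximation might look like the following sketch; the grid ranges are illustrative assumptions chosen to cover the posterior mass, and y and X are the data (counts and the n × 3 design matrix with columns 1, age, age^2):

## Sketch of the grid-based approximation (slow: 10^6 evaluations).
g1 <- seq(-2, 2, length = 100)        # grid for beta1 (assumed range)
g2 <- seq(-1, 2, length = 100)        # grid for beta2 (assumed range)
g3 <- seq(-0.6, 0.4, length = 100)    # grid for beta3 (assumed range)
grid <- as.matrix(expand.grid(g1, g2, g3))   # 10^6 candidate beta values
lp <- apply(grid, 1, function(b)             # log p(y|X,b) + log p(b)
  sum(dpois(y, exp(X %*% b), log = TRUE)) +
  sum(dnorm(b, 0, 10, log = TRUE)))          # prior beta ~ N(0, 100 I)
p <- exp(lp - max(lp)) ; p <- p / sum(p)     # normalized posterior weights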
Grid-based approximation

[Figure: grid approximations to the marginal posterior densities p(β2 | y) and p(β3 | y), and a contour plot of the joint posterior of (β2, β3).]

This required calculation of p(y | X, β) × p(β) at 1 million grid points.
Two more predictors would require 10 billion grid points.
Metropolis algorithm

Objective: Construct a large collection of θ-values, {θ^(1), . . . , θ^(S)}, whose empirical distribution approximates p(θ | y).

Why?

mean(θ^(1), . . . , θ^(S)) ≈ E[θ | y]
sd(θ^(1), . . . , θ^(S)) ≈ SD[θ | y]

et cetera.
Metropolis algorithm

How? Roughly speaking, for any two different values θ_a and θ_b we need

#{θ^(s)'s in the collection equal to θ_a} / #{θ^(s)'s in the collection equal to θ_b} ≈ p(θ_a | y) / p(θ_b | y).

Task: Given {θ^(1), . . . , θ^(s)}, appropriately choose a new θ^(s+1) to add.

Idea: Let θ* be "near" θ^(s). Should θ* be added to the sample?

• If p(θ* | y) > p(θ^(s) | y), then yes:
  • we want more θ*'s than θ^(s)'s in the sample;
  • we already have θ^(s) in the sample;
  • so include θ* to balance things out.
• If p(θ* | y) < p(θ^(s) | y), then maybe.
Metropolis algorithm

r = p(θ* | y) / p(θ^(s) | y)
  = [ p(y | θ*) p(θ*) / p(y) ] × [ p(y) / ( p(y | θ^(s)) p(θ^(s)) ) ]
  = p(y | θ*) p(θ*) / [ p(y | θ^(s)) p(θ^(s)) ]

Note that the unknown marginal probability p(y) cancels, so r can be computed.

Having computed r, how should we proceed?

If r > 1:
Intuition: Since θ^(s) is already in our set, we should include θ*, as it has a higher probability than θ^(s).
Procedure: Accept θ* into our set, i.e. set θ^(s+1) = θ*.

If r < 1:
Intuition: The relative frequency of θ-values in our set equal to θ* compared to those equal to θ^(s) should be p(θ* | y) / p(θ^(s) | y) = r. This means that for every instance of θ^(s), we should have only a "fraction" of an instance of θ*.
Procedure: Set θ^(s+1) equal to either θ* or θ^(s), with probability r and 1 − r respectively.
Metropolis algorithm

Given θ^(s), generate a value θ^(s+1) as follows:

1. Sample θ* ∼ J(θ | θ^(s));
2. Compute the acceptance ratio

   r = p(θ* | y) / p(θ^(s) | y) = p(y | θ*) p(θ*) / [ p(y | θ^(s)) p(θ^(s)) ].

3. Let θ^(s+1) = θ* with probability min(r, 1), and θ^(s+1) = θ^(s) with probability 1 − min(r, 1).

Notes:

• J(θ* | θ^(s)) is a symmetric proposal distribution: J(θ_a | θ_b) = J(θ_b | θ_a).
• Step 3 can be implemented by simulating u ∼ uniform(0, 1) and setting θ^(s+1) = θ* if u < r, and θ^(s+1) = θ^(s) otherwise.
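In code, one update can be written as a short function; a minimal sketch, assuming a user-supplied helper log_post(theta) that returns log p(y | θ) + log p(θ) up to an additive constant, and a symmetric normal random-walk proposal (working on the log scale, as in the examples below):

## One Metropolis update with a symmetric normal random-walk proposal.
## log_post() is an assumed helper returning log p(y|theta) + log p(theta)
## up to a constant; delta is the proposal standard deviation.
metropolis_step <- function(theta, log_post, delta) {
  theta.star <- rnorm(1, theta, delta)              # 1. propose
  log.r <- log_post(theta.star) - log_post(theta)   # 2. log acceptance ratio
  if (log(runif(1)) < log.r) theta.star else theta  # 3. accept or stay
}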
Simple normal example

θ ∼ normal(µ, τ^2)
{y_1, . . . , y_n | θ} ∼ i.i.d. normal(θ, σ^2)

{θ | y_1, . . . , y_n} ∼ normal(µ_n, τ_n^2)

µ_n = ȳ (n/σ^2) / (n/σ^2 + 1/τ^2) + µ (1/τ^2) / (n/σ^2 + 1/τ^2)
τ_n^2 = 1 / (n/σ^2 + 1/τ^2).

Let's compare this known result to the Metropolis algorithm.
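The closed-form posterior parameters are easy to compute for checking; a sketch, using the data and the parameter names (y, s2, t2, mu) defined in the code further below:

## Exact posterior normal(mu.n, t2.n) for the conjugate normal model.
## Assumes y, s2 (sigma^2), t2 (tau^2), mu as in the Metropolis code below.
n <- length(y)
t2.n <- 1 / (n / s2 + 1 / t2)                  # posterior variance tau_n^2
mu.n <- t2.n * (mean(y) * n / s2 + mu / t2)    # posterior mean mu_n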
Simple normal example

The acceptance ratio is

r = p(θ* | y) / p(θ^(s) | y)
  = [ Π_{i=1}^n dnorm(y_i, θ*, σ) / Π_{i=1}^n dnorm(y_i, θ^(s), σ) ] × [ dnorm(θ*, µ, τ) / dnorm(θ^(s), µ, τ) ].

Problem: Computing r directly can be numerically unstable.

Solution: Take logs:

log r = Σ_{i=1}^n [ log dnorm(y_i, θ*, σ) − log dnorm(y_i, θ^(s), σ) ]
        + log dnorm(θ*, µ, τ) − log dnorm(θ^(s), µ, τ).

Accept the proposal if log u < log r, where u is a sample from the uniform distribution on (0, 1).
#### ---- Parameters and data
s2 <- 1 ; t2 <- 10 ; mu <- 5               # sigma^2, tau^2, prior mean
y <- c(9.37, 10.18, 9.16, 11.60, 10.33)    # data
theta <- 0                                 # starting value
delta <- 2                                 # proposal variance
S <- 10000 ; THETA <- NULL ; set.seed(1)
#### ----

#### ---- Metropolis algorithm
for (s in 1:S) {

  theta.star <- rnorm(1, theta, sqrt(delta))       # propose

  log.r <- ( sum(dnorm(y, theta.star, sqrt(s2), log = TRUE)) +
                 dnorm(theta.star, mu, sqrt(t2), log = TRUE) ) -
           ( sum(dnorm(y, theta, sqrt(s2), log = TRUE)) +
                 dnorm(theta, mu, sqrt(t2), log = TRUE) )    # log acceptance ratio

  if (log(runif(1)) < log.r) { theta <- theta.star }         # accept or stay

  THETA <- c(THETA, theta)                                   # save current state
}
#### ----
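The Monte Carlo approximations can then be checked against the exact values computed earlier (mu.n and t2.n as above):

mean(THETA) ; sd(THETA)    # Metropolis approximations to E[theta|y], SD[theta|y]
mu.n ; sqrt(t2.n)          # exact posterior mean and standard deviation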
[Figure: trace plot of θ over the 10,000 iterations, and the empirical density of the sampled θ-values.]
Tuning parameters

How do we choose δ?

Rerun the algorithm with δ ∈ {1/32, 2, 64}.

[Figure: trace plots of θ over the first 500 iterations for each of the three values of δ.]
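A sketch of this rerun, wrapping the sampler above in a function (the function and its arguments are an assumption, not shown on the original slides) and also recording the acceptance rate for each δ; y, s2, t2, mu are as defined earlier:

## Rerun the normal-model sampler for several proposal variances delta,
## tracking the acceptance rate. Assumes y, s2, t2, mu in the workspace.
run_mh <- function(delta, S = 500, theta = 0) {
  THETA <- numeric(S) ; acs <- 0
  for (s in 1:S) {
    theta.star <- rnorm(1, theta, sqrt(delta))
    log.r <- ( sum(dnorm(y, theta.star, sqrt(s2), log = TRUE)) +
               dnorm(theta.star, mu, sqrt(t2), log = TRUE) ) -
             ( sum(dnorm(y, theta, sqrt(s2), log = TRUE)) +
               dnorm(theta, mu, sqrt(t2), log = TRUE) )
    if (log(runif(1)) < log.r) { theta <- theta.star ; acs <- acs + 1 }
    THETA[s] <- theta
  }
  list(theta = THETA, acc.rate = acs / S)
}
sapply(c(1/32, 2, 64), function(d) run_mh(d)$acc.rate)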
Things to discuss

Choices:

• starting values: use ad hoc parameter estimates (e.g. MLE or OLS);
• proposal distribution: aim for an acceptance rate between 20% and 50%.

Assessment:

• convergence: inspect trace plots for stationarity; discard the initial "burn-in" iterations;
• autocorrelation: compute and report the effective sample size (see the sketch below).
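A sketch of these diagnostics for the normal example, assuming the vector THETA from the run above and an illustrative burn-in of 1000 iterations:

## Simple convergence and autocorrelation diagnostics.
library(coda)                       # for effectiveSize()
plot(THETA, type = "l")             # trace plot: check for stationarity
THETA.keep <- THETA[-(1:1000)]      # discard burn-in (illustrative choice)
acf(THETA.keep)                     # autocorrelation function
effectiveSize(THETA.keep)           # effective sample size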
Song sparrow example

library(mvtnorm)                       # for rmvnorm() proposals below
y <- yX.sparrow[,1] ; X <- yX.sparrow[,-1]
n <- length(y) ; p <- dim(X)[2]

pmn.beta <- rep(0, p)                  # prior expectation
psd.beta <- rep(10, p)                 # prior sd

var.prop <- var(log(y + 1/2)) * solve(t(X) %*% X)  # proposal variance
beta <- rep(0, p)                      # starting value
S <- 10000                             # number of iterations
BETA <- matrix(0, nrow = S, ncol = p)  # saved beta values
acs <- 0                               # acceptances
set.seed(1)                            # initialize RNG
Song sparrow example

for (s in 1:S) {

  # propose a new beta
  beta.p <- c(rmvnorm(1, beta, var.prop))

  # compute the log acceptance ratio
  lhr <- sum(dpois(y, exp(X %*% beta.p), log = TRUE)) -
         sum(dpois(y, exp(X %*% beta), log = TRUE)) +
         sum(dnorm(beta.p, pmn.beta, psd.beta, log = TRUE)) -
         sum(dnorm(beta, pmn.beta, psd.beta, log = TRUE))

  # accept or reject
  if (log(runif(1)) < lhr) { beta <- beta.p ; acs <- acs + 1 }

  BETA[s,] <- beta
}
apply(BETA,2,coda::effectiveSize)
## [1] 818.4049 778.4707 726.3633
[Figure: trace plot of β3 over the 10,000 iterations; autocorrelation functions of the β3 samples, unthinned and thinned to every 10th iteration; posterior expected number of offspring as a function of age (1-6).]
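The final panel can be reconstructed from the saved samples; a sketch, assuming the matrix BETA from the sampler above:

## Posterior summaries of E[Y|x] = exp(beta1 + beta2*age + beta3*age^2).
ages <- 1:6
Xpred <- cbind(1, ages, ages^2)                  # prediction design matrix
EY <- exp(BETA %*% t(Xpred))                     # S x 6 draws of E[Y|x]
qE <- apply(EY, 2, quantile, probs = c(0.025, 0.5, 0.975))
plot(ages, qE[2,], type = "l", ylim = range(qE),
     xlab = "age", ylab = "expected number of offspring")
lines(ages, qE[1,], lty = 2) ; lines(ages, qE[3,], lty = 2)  # 95% bands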