Bayesian computation
- prehistory (1763–1960): Conjugate priors
- 1960s: Numerical quadrature – Newton-Cotes methods, Gaussian quadrature, etc.
- 1970s: Expectation-Maximization ("EM") algorithm – iterative mode-finder
- 1980s: Asymptotic methods – Laplace's method, saddlepoint approximations
- 1980s: Noniterative Monte Carlo methods – direct posterior sampling and indirect methods (importance sampling, rejection, etc.)
- 1990s: Markov chain Monte Carlo (MCMC) – Gibbs sampler, Metropolis-Hastings algorithm

⇒ MCMC methods are broadly applicable, but require care in parametrization and convergence diagnosis!

Chapter 5: Bayesian Computation – p. 1/42
Asymptotic methods
- When n is large, f(x|θ) will be quite peaked relative to p(θ), and so p(θ|x) will be approximately normal.
- "Bayesian Central Limit Theorem": Suppose $X_1, \ldots, X_n \stackrel{iid}{\sim} f(x_i|\theta)$, and that the prior p(θ) and the likelihood f(x|θ) are positive and twice differentiable near $\hat\theta^p$, the posterior mode of θ. Then for large n,

  $p(\theta|\mathbf{x}) \,\dot\sim\, N\!\left(\hat\theta^p,\, [I^p(\mathbf{x})]^{-1}\right),$

  where $I^p(\mathbf{x})$ is the "generalized" observed Fisher information matrix for θ, i.e., minus the Hessian of the log posterior evaluated at the mode:

  $I^p_{ij}(\mathbf{x}) = -\left[\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log\big(f(\mathbf{x}|\theta)\,p(\theta)\big)\right]_{\theta=\hat\theta^p}.$

Chapter 5: Bayesian Computation – p. 2/42
Example 5.1: Hamburger patties again
- Recall X|θ ∼ Bin(n, θ) and θ ∼ Beta(α, β).
- Under a flat prior on θ, we have

  $\ell(\theta) = \log f(x|\theta)p(\theta) \propto x\log\theta + (n-x)\log(1-\theta).$

- Taking the derivative of ℓ(θ) and equating to zero, we obtain $\hat\theta^p = \hat\theta = x/n$, the familiar binomial proportion.
- The second derivative is

  $\frac{\partial^2 \ell(\theta)}{\partial\theta^2} = -\frac{x}{\theta^2} - \frac{n-x}{(1-\theta)^2},$

  so that

  $\left.\frac{\partial^2 \ell(\theta)}{\partial\theta^2}\right|_{\theta=\hat\theta} = -\frac{x}{\hat\theta^2} - \frac{n-x}{(1-\hat\theta)^2} = -\frac{n}{\hat\theta} - \frac{n}{1-\hat\theta}.$

Chapter 5: Bayesian Computation – p. 3/42
Example 5.1: Hamburger patties again
- Thus

  $[I^p(x)]^{-1} = \left(\frac{n}{\hat\theta} + \frac{n}{1-\hat\theta}\right)^{-1} = \left(\frac{n}{\hat\theta(1-\hat\theta)}\right)^{-1} = \frac{\hat\theta(1-\hat\theta)}{n},$

  which is the usual frequentist expression for $\widehat{Var}(\hat\theta)$. Thus the Bayesian CLT gives

  $p(\theta|x) \,\dot\sim\, N\!\left(\hat\theta,\; \frac{\hat\theta(1-\hat\theta)}{n}\right).$

- Notice that a frequentist might instead use MLE asymptotics to write

  $\hat\theta \,|\, \theta \,\dot\sim\, N\!\left(\theta,\; \frac{\hat\theta(1-\hat\theta)}{n}\right),$

  leading to identical inferences for θ, but for different reasons and with different interpretations!

Chapter 5: Bayesian Computation – p. 4/42
Example 5.1: Hamburger patties again
- Comparison of this normal approximation to the exact posterior, a Beta(14, 4) distribution (recall n = 16):

[Figure: posterior density of θ on (0, 1); exact (beta) vs. approximate (normal) curves]

- Similar modes, but very different tail behavior: 95% credible sets are (.57, .93) for the exact posterior, but (.62, 1.0) for the normal approximation.

Chapter 5: Bayesian Computation – p. 5/42
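The interval comparison above is easy to check numerically. The sketch below uses only the Python standard library; approximating the exact Beta(14, 4) quantiles by sorting Monte Carlo draws (rather than a quantile function) is my own device, not the text's, but both intervals should land near the values quoted on the slide.

```python
import random, math

random.seed(42)
n, x = 16, 13          # flat prior + 13 successes in 16 trials -> Beta(14, 4) posterior
theta_hat = x / n

# Normal (Bayesian CLT) 95% interval: theta_hat +/- 1.96 * sqrt(theta_hat(1-theta_hat)/n)
half = 1.96 * math.sqrt(theta_hat * (1 - theta_hat) / n)
normal_ci = (theta_hat - half, theta_hat + half)

# "Exact" interval via Monte Carlo draws from the Beta(14, 4) posterior
draws = sorted(random.betavariate(14, 4) for _ in range(100_000))
exact_ci = (draws[2_500], draws[97_500])

print("normal approx:", tuple(round(v, 2) for v in normal_ci))
print("exact (beta): ", tuple(round(v, 2) for v in exact_ci))
```

Note that the normal interval spills past 1, even though θ is a probability: exactly the poor tail behavior the figure shows.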
Higher order approximations
- The Bayesian CLT is a first order approximation, since

  $E(g(\theta)\,|\,\mathbf{x}) = g(\hat\theta)\left[1 + O(1/n)\right].$

- Second order approximations (i.e., to order O(1/n²)), again requiring only mode and Hessian calculations, are available via Laplace's Method (C&L, Sec. 5.2.2).
- Advantages of asymptotic methods:
  - deterministic, noniterative algorithm
  - substitutes differentiation for integration
  - facilitates studies of Bayesian robustness
- Disadvantages of asymptotic methods:
  - requires a well-parametrized, unimodal posterior
  - θ must be of at most moderate dimension
  - n must be large, but is beyond our control

Chapter 5: Bayesian Computation – p. 6/42
Noniterative Monte Carlo Methods
- Recall Monte Carlo integration: Suppose θ ∼ h(θ), and we seek γ ≡ E[f(θ)] = ∫ f(θ)h(θ)dθ. Then if $\theta_j \stackrel{iid}{\sim} h(\theta)$,

  $\hat\gamma = \frac{1}{N}\sum_{j=1}^{N} f(\theta_j) \;\xrightarrow{\,w.p.\ 1\,}\; E[f(\theta)] \text{ as } N\to\infty \quad (\text{SLLN}).$

- Since γ̂ is itself a sample mean of independent observations, Var(γ̂) = Var[f(θ)]/N. This means

  $\widehat{se}(\hat\gamma) = \sqrt{\frac{1}{N(N-1)}\sum_{j=1}^{N}\left[f(\theta_j)-\hat\gamma\right]^2}.$

- ⇒ γ̂ ± 2 ŝe(γ̂) provides a 95% (frequentist!) CI for γ.
- Note the dependence on N, rather than n!

Chapter 5: Bayesian Computation – p. 7/42
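The estimate-plus-standard-error recipe above can be sketched in a few lines. The choices h = N(0, 1) and f(θ) = θ² (so that γ = E[θ²] = 1) are toy assumptions of mine, not from the slides:

```python
import random, math

random.seed(1)
N = 100_000
# Assumed toy setup: h = N(0,1), f(theta) = theta^2, so the true gamma = 1
samples = [random.gauss(0.0, 1.0) for _ in range(N)]
f_vals = [th * th for th in samples]

# gamma_hat = (1/N) sum f(theta_j)
gamma_hat = sum(f_vals) / N
# se_hat = sqrt( (1/(N(N-1))) sum [f(theta_j) - gamma_hat]^2 )
se_hat = math.sqrt(sum((f - gamma_hat) ** 2 for f in f_vals) / (N * (N - 1)))

ci = (gamma_hat - 2 * se_hat, gamma_hat + 2 * se_hat)  # approximate 95% frequentist CI
print(gamma_hat, se_hat, ci)
```

The standard error shrinks like $1/\sqrt{N}$, and N (unlike n) is under the analyst's control.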
Example 5.2: Direct bivariate sampling
- If $Y_i \stackrel{iid}{\sim} N(\mu, \sigma^2)$, i = 1, …, n, and p(μ, σ) = 1/σ, then

  $\mu\,|\,\sigma^2,\mathbf{y} \sim N(\bar y,\ \sigma^2/n), \quad\text{and}$

  $\sigma^2\,|\,\mathbf{y} \sim K\chi^{-2}_{n-1} = K\cdot IG\!\left(\frac{n-1}{2},\ 2\right),$

  where $K = \sum_{i=1}^{n}(y_i - \bar y)^2$.
- So draw {(μ_j, σ²_j), j = 1, …, N} from p(μ, σ²|y) as: $\sigma^2_j \sim K\chi^{-2}_{n-1}$; then $\mu_j \sim N(\bar y, \sigma^2_j/n)$, j = 1, …, N.
- To estimate the posterior mean: $\hat E(\mu|\mathbf{y}) = \frac{1}{N}\sum_{j=1}^{N}\mu_j$.
- Transformations are also easy: to estimate the coefficient of variation, γ = σ/μ, define γ_j = σ_j/μ_j, j = 1, …, N; summarize with moments or histograms!

Chapter 5: Bayesian Computation – p. 8/42
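The two-stage draw above is a one-screen program. The data vector below is an invented illustration (not from the text); a $\chi^2_{n-1}$ draw is obtained as a Gamma((n−1)/2, scale 2) variate from the standard library:

```python
import random, math

random.seed(2)
y = [4.9, 5.2, 4.7, 5.5, 5.0, 5.3, 4.8, 5.1]  # assumed toy data
n = len(y)
ybar = sum(y) / n
K = sum((yi - ybar) ** 2 for yi in y)

N = 20_000
mu_draws, cv_draws = [], []
for _ in range(N):
    # sigma2 | y ~ K * inv-chi^2_{n-1}: draw chi^2_{n-1} as Gamma((n-1)/2, scale 2), invert, scale
    sigma2 = K / random.gammavariate((n - 1) / 2, 2)
    # mu | sigma2, y ~ N(ybar, sigma2/n)
    mu = random.gauss(ybar, math.sqrt(sigma2 / n))
    mu_draws.append(mu)
    cv_draws.append(math.sqrt(sigma2) / mu)  # transformation: coefficient of variation

print("E(mu|y) ~", sum(mu_draws) / N)
print("E(sigma/mu|y) ~", sum(cv_draws) / N)
```

The coefficient-of-variation summary illustrates the slide's last point: once joint draws exist, any transformation is just a per-draw computation.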
Indirect Methods
- Importance sampling: Suppose we wish to approximate

  $E(f(\theta)|\mathbf{y}) = \frac{\int f(\theta)L(\theta)p(\theta)\,d\theta}{\int L(\theta)p(\theta)\,d\theta},$

  and we can roughly approximate the normalized likelihood times prior, cL(θ)p(θ), by some density g(θ) from which we can easily sample – say, a multivariate t.
- Then defining the weight function w(θ) = L(θ)p(θ)/g(θ),

  $E(f(\theta)|\mathbf{y}) = \frac{\int f(\theta)w(\theta)g(\theta)\,d\theta}{\int w(\theta)g(\theta)\,d\theta} \approx \frac{\frac{1}{N}\sum_{j=1}^{N} f(\theta_j)w(\theta_j)}{\frac{1}{N}\sum_{j=1}^{N} w(\theta_j)},$

  where $\theta_j \stackrel{iid}{\sim} g(\theta)$. Here, g(θ) is called the importance function; a good match to cL(θ)p(θ) will produce roughly equal weights, hence a good approximation.

Chapter 5: Bayesian Computation – p. 9/42
Rejection sampling
- Here, instead of trying to approximate the posterior

  $h(\theta) = \frac{L(\theta)p(\theta)}{\int L(\theta)p(\theta)\,d\theta},$

  we try to "blanket" it: suppose there exists a constant M > 0 and a smooth density g(θ), called the envelope function, such that L(θ)p(θ) < Mg(θ) for all θ.
- The algorithm proceeds as follows:
  (i) Generate θ_j ∼ g(θ).
  (ii) Generate U ∼ Uniform(0, 1).
  (iii) If MUg(θ_j) < L(θ_j)p(θ_j), accept θ_j; else reject θ_j.
  (iv) Return to step (i) and repeat until the desired sample {θ_j, j = 1, …, N} is obtained. Its members will be random variates from h(θ).

Chapter 5: Bayesian Computation – p. 10/42
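Steps (i)–(iv) can be sketched directly. The target and envelope here are my own toy choices, not from the slides: L(θ)p(θ) = θ(1−θ) on (0, 1) (an unnormalized Beta(2, 2)), with envelope g = Uniform(0, 1) and M = 0.3, which works since θ(1−θ) ≤ 1/4 < 0.3·g(θ) everywhere:

```python
import random

random.seed(4)
M = 0.3  # envelope constant: M*g(theta) = 0.3 > 1/4 >= L(theta)p(theta) on (0,1)

def Lp(th):
    # assumed unnormalized target: Beta(2,2) kernel
    return th * (1.0 - th)

accepted = []
while len(accepted) < 20_000:
    th = random.random()           # (i)   theta_j ~ g = Uniform(0,1)
    u = random.random()            # (ii)  U ~ Uniform(0,1)
    if M * u * 1.0 < Lp(th):       # (iii) accept if M*U*g(theta_j) < L(theta_j)p(theta_j)
        accepted.append(th)        # (iv)  repeat until N draws accepted

mean = sum(accepted) / len(accepted)
print("mean of accepted draws ~", mean)  # Beta(2,2) has mean 1/2
```

A tighter M (closer to 1/4 here) would waste fewer proposals: the acceptance rate is exactly the area under L·p divided by M.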
Rejection Sampling: informal "proof"

[Figure: the envelope Mg(θ) drawn above the unnormalized target Lπ(θ), with a histogram bar centered at a point a]

- Consider the θ_j samples in the histogram bar centered at a: the rejection step "slices off" the top portion of the bar. Repeat for all a: the accepted θ_j mimic the lower curve!
- Need to choose M as small as possible (efficiency), and watch for "envelope violations"!

Chapter 5: Bayesian Computation – p. 11/42
Markov chain Monte Carlo methods
- Iterative MC methods, useful when it is difficult or impossible to find a feasible importance or envelope density.
- Given two unknowns x (think parameters) and y (think missing data), we can often write

  $p(x) = \int p(x|y)p(y)\,dy \quad\text{and}\quad p(y) = \int p(y|x)p(x)\,dx,$

  where p(x|y) and p(y|x) are known; we seek p(x).
- Analytical solution via substitution:

  $p(x) = \int p(x|y)\int p(y|x')p(x')\,dx'\,dy = \int h(x,x')p(x')\,dx',$

  where $h(x,x') = \int p(x|y)p(y|x')\,dy$.

Chapter 5: Bayesian Computation – p. 12/42
Substitution sampling
- This determines a fixed point system

  $p_{i+1}(x) = \int h(x,x')\,p_i(x')\,dx',$

  which converges under mild conditions.
- Tanner and Wong (1987): Can also use a sampling-based approach: Draw X⁽⁰⁾ ∼ p₀(x), then Y⁽¹⁾ ∼ p(y|x⁽⁰⁾), and finally X⁽¹⁾ ∼ p(x|y⁽¹⁾). Then X⁽¹⁾ has marginal distribution

  $p_1(x) = \int p(x|y)p_1(y)\,dy = \int h(x,x')p_0(x')\,dx'.$

- Repeating this process produces pairs (X⁽ⁱ⁾, Y⁽ⁱ⁾) such that $X^{(i)} \xrightarrow{d} X \sim p(x)$ and $Y^{(i)} \xrightarrow{d} Y \sim p(y)$.

Chapter 5: Bayesian Computation – p. 13/42
Substitution sampling
- That is, for i sufficiently large, X⁽ⁱ⁾ may be thought of as a sample from the true marginal density p(x). Tanner and Wong refer to this (bivariate) algorithm as data augmentation.
- The luxury of avoiding the integration above has come at the price of obtaining not the marginal density p(x) itself, but only a sample from this density.
- Can we extend this idea to more than 2 dimensions? ...

Chapter 5: Bayesian Computation – p. 14/42
Gibbs sampling
- Suppose the joint distribution of θ = (θ₁, …, θ_K) is uniquely determined by the full conditional distributions, {p_i(θ_i | θ_{j≠i}), i = 1, …, K}.
- Given an arbitrary set of starting values $\{\theta_1^{(0)}, \ldots, \theta_K^{(0)}\}$:

  Draw $\theta_1^{(1)} \sim p_1(\theta_1\,|\,\theta_2^{(0)}, \ldots, \theta_K^{(0)})$,
  Draw $\theta_2^{(1)} \sim p_2(\theta_2\,|\,\theta_1^{(1)}, \theta_3^{(0)}, \ldots, \theta_K^{(0)})$,
  ⋮
  Draw $\theta_K^{(1)} \sim p_K(\theta_K\,|\,\theta_1^{(1)}, \ldots, \theta_{K-1}^{(1)})$.

- Under mild conditions,

  $(\theta_1^{(t)}, \ldots, \theta_K^{(t)}) \xrightarrow{d} (\theta_1, \ldots, \theta_K) \sim p \quad\text{as } t\to\infty.$

Chapter 5: Bayesian Computation – p. 15/42
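The scan above can be sketched for K = 2 with a toy target I am assuming (not from the slides): a bivariate normal with unit variances and correlation ρ, whose full conditionals are the well-known θ₁|θ₂ ∼ N(ρθ₂, 1−ρ²) and symmetrically for θ₂:

```python
import random, math

random.seed(5)
rho = 0.8                       # assumed toy target: bivariate N(0, [[1, rho], [rho, 1]])
sd = math.sqrt(1 - rho * rho)   # conditional standard deviation

T, t0 = 5_000, 500              # total iterations and burn-in
th1, th2 = 10.0, -10.0          # deliberately poor starting values
draws = []
for t in range(T):
    th1 = random.gauss(rho * th2, sd)  # draw theta1 ~ p1(theta1 | theta2)
    th2 = random.gauss(rho * th1, sd)  # draw theta2 ~ p2(theta2 | theta1)
    if t >= t0:
        draws.append((th1, th2))

m1 = sum(d[0] for d in draws) / len(draws)
corr = sum(d[0] * d[1] for d in draws) / len(draws)  # margins have mean 0, variance 1
print("E(theta1) ~", m1, "  corr ~", corr)
```

Despite the bad starting point, the chain forgets it within a few sweeps; the post-burn-in draws recover both the marginal mean and the correlation.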
Gibbs sampling (cont'd)
- For t sufficiently large (say, bigger than t₀), $\{\theta^{(t)}\}_{t=t_0+1}^{T}$ is a (correlated) sample from the true posterior.
- We might therefore use a sample mean to estimate the posterior mean, i.e.,

  $\hat E(\theta_i|\mathbf{y}) = \frac{1}{T-t_0}\sum_{t=t_0+1}^{T} \theta_i^{(t)}.$

- The time from t = 0 to t = t₀ is commonly known as the burn-in period; one can safely adapt (change) an MCMC algorithm during this preconvergence period, since these samples will be discarded anyway.

Chapter 5: Bayesian Computation – p. 16/42
Gibbs sampling (cont'd)
- In practice, we may actually run m parallel Gibbs sampling chains, instead of only 1, for some modest m (say, m = 5). Discarding the burn-in period, we obtain

  $\hat E(\theta_i|\mathbf{y}) = \frac{1}{m(T-t_0)}\sum_{j=1}^{m}\sum_{t=t_0+1}^{T} \theta_{i,j}^{(t)},$

  where now the j subscript indicates chain number.
- A density estimate $\hat p(\theta_i|\mathbf{y})$ may be obtained by smoothing the histogram of the $\{\theta_{i,j}^{(t)}\}$, or as

  $\hat p(\theta_i|\mathbf{y}) = \frac{1}{m(T-t_0)}\sum_{j=1}^{m}\sum_{t=t_0+1}^{T} p(\theta_i\,|\,\theta_{k\neq i,j}^{(t)}, \mathbf{y}) \;\approx\; \int p(\theta_i|\theta_{k\neq i},\mathbf{y})\,p(\theta_{k\neq i}|\mathbf{y})\,d\theta_{k\neq i}.$

Chapter 5: Bayesian Computation – p. 17/42
Example 2.5 (revisited)
- Consider the model

  $Y_i\,|\,\theta_i \overset{ind}{\sim} \text{Poisson}(\theta_i s_i), \quad \theta_i \overset{ind}{\sim} G(\alpha, \beta), \quad \beta \sim IG(c, d), \quad i = 1,\ldots,k,$

  where α, c, d, and the s_i are known. Thus

  $f(y_i|\theta_i) = \frac{e^{-\theta_i s_i}(\theta_i s_i)^{y_i}}{y_i!}, \quad y_i \geq 0,\ \theta_i > 0,$

  $g(\theta_i|\beta) = \frac{\theta_i^{\alpha-1} e^{-\theta_i/\beta}}{\Gamma(\alpha)\beta^{\alpha}}, \quad \alpha > 0,\ \beta > 0,$

  $h(\beta) = \frac{e^{-1/(\beta d)}}{\Gamma(c)\,d^c\,\beta^{c+1}}, \quad c > 0,\ d > 0.$

- Note g is conjugate for f, and h is conjugate for g!

Chapter 5: Bayesian Computation – p. 18/42
Example 2.5 (revisited)
- To implement the Gibbs sampler, we require the full conditional distributions of β and the θ_i.
- By Bayes' Rule, each of these is proportional to the complete Bayesian model specification,

  $\left[\prod_{i=1}^{k} f(y_i|\theta_i)\,g(\theta_i|\beta)\right] h(\beta).$

- Thus we can find full conditional distributions by dropping irrelevant terms from this expression, and normalizing!
- Expressions given on the next slide...

Chapter 5: Bayesian Computation – p. 19/42
Example 2.5 (revisited)

$p(\theta_i\,|\,\theta_{j\neq i},\beta,\mathbf{y}) \propto f(y_i|\theta_i)\,g(\theta_i|\beta) \propto \theta_i^{y_i+\alpha-1}\,e^{-\theta_i(s_i+1/\beta)} \propto G\!\left(\theta_i \,\big|\, y_i+\alpha,\ (s_i+1/\beta)^{-1}\right),$ and

$p(\beta\,|\,\{\theta_i\},\mathbf{y}) \propto \left[\prod_{i=1}^{k} g(\theta_i|\beta)\right] h(\beta) \propto \left[\prod_{i=1}^{k} \frac{e^{-\theta_i/\beta}}{\beta^{\alpha}}\right]\frac{e^{-1/(\beta d)}}{\beta^{c+1}} \propto \frac{e^{-\frac{1}{\beta}\left(\sum_{i=1}^{k}\theta_i + \frac{1}{d}\right)}}{\beta^{k\alpha+c+1}} \propto IG\!\left(\beta \,\Big|\, k\alpha+c,\ \Big(\sum_{i=1}^{k}\theta_i + 1/d\Big)^{-1}\right).$

Chapter 5: Bayesian Computation – p. 20/42
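With both full conditionals in standard families, the Gibbs sampler is a short loop. The data vector, exposures s_i, and hyperparameter values below are illustrative assumptions of mine, not the text's; the inverse gamma draw for β is obtained by drawing a gamma variate and inverting it:

```python
import random

random.seed(6)
# Assumed toy data and hyperparameters: k units with known exposures s_i, known alpha, c, d
y = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]
s = [94.3, 15.7, 62.9, 126.0, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5]
k = len(y)
alpha, c, d = 1.0, 0.1, 1.0

T, t0 = 5_000, 1_000
beta = 1.0                       # starting value
theta_sum = [0.0] * k
beta_sum = 0.0
for t in range(T):
    # theta_i | beta, y ~ G(y_i + alpha, (s_i + 1/beta)^{-1})  (gammavariate takes shape, scale)
    theta = [random.gammavariate(y[i] + alpha, 1.0 / (s[i] + 1.0 / beta)) for i in range(k)]
    # beta | {theta_i}, y ~ IG(k*alpha + c, (sum theta_i + 1/d)^{-1}): draw a gamma, invert it
    beta = 1.0 / random.gammavariate(k * alpha + c, 1.0 / (sum(theta) + 1.0 / d))
    if t >= t0:
        beta_sum += beta
        for i in range(k):
            theta_sum[i] += theta[i]

kept = T - t0
print("E(beta|y) ~", beta_sum / kept)
print("E(theta_1|y) ~", theta_sum[0] / kept)
```

Every update is a direct draw from a named distribution, which is exactly the conjugacy payoff the slides describe.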
Example 2.5 (revisited)
- Thus the $\{\theta_i^{(t)}\}$ and $\beta^{(t)}$ may be sampled directly!
- If α were also unknown, things are harder, since

  $p(\alpha\,|\,\{\theta_i\},\beta,\mathbf{y}) \propto \left[\prod_{i=1}^{k} g(\theta_i|\alpha,\beta)\right] h(\alpha)$

  is not proportional to any standard family. So resort to:
  - adaptive rejection sampling (ARS): provided p(α|{θ_i}, β, y) is log-concave, or
  - Metropolis-Hastings sampling – IOU for now!
- Note: This is the order the WinBUGS software uses when deriving full conditionals!
- This is the standard "hybrid approach": use Gibbs overall, with "substeps" for awkward full conditionals.

Chapter 5: Bayesian Computation – p. 21/42
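To preview what such a substep looks like, here is a hedged sketch of a random-walk Metropolis update for α that could sit inside the Gibbs loop. Everything beyond the target expression is an assumption of mine: a G(a0, scale b0) hyperprior h(α), a Gaussian random walk on log α with step size tau (which requires the Jacobian correction log(prop/α) in the acceptance ratio), and toy values for the current state:

```python
import random, math

random.seed(7)
a0, b0, tau = 1.0, 1.0, 0.5  # assumed hyperprior G(a0, b0) and proposal step size

def log_target(alpha, thetas, beta):
    # log of [prod_i g(theta_i | alpha, beta)] h(alpha), up to an additive constant
    k = len(thetas)
    lp = (alpha - 1.0) * sum(math.log(th) for th in thetas)
    lp -= k * (math.lgamma(alpha) + alpha * math.log(beta))
    lp += (a0 - 1.0) * math.log(alpha) - alpha / b0   # log h(alpha)
    return lp

def metropolis_alpha(alpha, thetas, beta):
    prop = alpha * math.exp(random.gauss(0.0, tau))   # random walk on log(alpha)
    log_ratio = (log_target(prop, thetas, beta) - log_target(alpha, thetas, beta)
                 + math.log(prop) - math.log(alpha))  # Jacobian for the log-scale walk
    return prop if math.log(random.random()) < log_ratio else alpha

# one substep repeated at an assumed fixed state, just to exercise the update:
thetas, beta, alpha = [0.8, 1.3, 0.6, 1.1], 1.0, 1.0
for _ in range(1000):
    alpha = metropolis_alpha(alpha, thetas, beta)
print("current alpha:", alpha)
```

In a real hybrid sampler, `metropolis_alpha` would be called once per Gibbs scan, between the direct θ and β draws.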
Example 5.6: Rat data
- Consider the longitudinal data model

  $Y_{ij} \overset{ind}{\sim} N\!\left(\alpha_i + \beta_i x_{ij},\ \sigma^2\right),$

  where Y_ij is the weight of the ith rat at measurement point j, while x_ij denotes its age in days, for i = 1, …, k = 30, and j = 1, …, n_i = 5 for all i (see text p.147 for actual data).
- Adopt the random effects model

  $\theta_i \equiv \begin{pmatrix}\alpha_i\\ \beta_i\end{pmatrix} \overset{iid}{\sim} N\!\left(\theta_0 \equiv \begin{pmatrix}\alpha_0\\ \beta_0\end{pmatrix},\ \Sigma\right), \quad i = 1,\ldots,k,$

  which is conjugate with the likelihood (see general normal linear model in Example 2.7, text p.34).

Chapter 5: Bayesian Computation – p. 22/42
Example 5.6: Rat data
- Priors: Conjugate forms are again available, namely

  $\sigma^2 \sim IG(a, b), \quad \theta_0 \sim N(\eta, C), \quad\text{and}\quad \Sigma^{-1} \sim W\!\left((\rho R)^{-1},\ \rho\right),$

  where W denotes the Wishart (multivariate gamma) distribution; see Appendix A.2.2, text p.328.
- We assume the hyperparameters (a, b, η, C, ρ, and R) are all known, so there are 30(2) + 3 + 3 = 66 unknown parameters in the model.
- Yet the Gibbs sampler is relatively straightforward to implement here, thanks to the conjugacy at each stage in the hierarchy.

Chapter 5: Bayesian Computation – p. 23/42
Example 5.6: Rat data
- Full conditional derivation: Consider for example θ_i. Since we are allowed to think of the remaining parameters as fixed, by Example 2.7 (Lindley and Smith, 1972) we have

  $\theta_i\,|\,\mathbf{y},\theta_0,\Sigma^{-1},\sigma^2 \sim N(D_i d_i,\ D_i), \quad\text{where}$

  $D_i^{-1} = \sigma^{-2}X_i^T X_i + \Sigma^{-1} \quad\text{and}\quad d_i = \sigma^{-2}X_i^T \mathbf{y}_i + \Sigma^{-1}\theta_0,$

  for $\mathbf{y}_i = (y_{i1}, \ldots, y_{in_i})^T$ and $X_i = \begin{pmatrix}1 & x_{i1}\\ \vdots & \vdots\\ 1 & x_{in_i}\end{pmatrix}$.

- Similarly, the full conditionals for θ_0, Σ⁻¹, and σ² emerge in closed form as normal, Wishart, and inverse gamma, respectively!

Chapter 5: Bayesian Computation – p. 24/42
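A single θ_i draw from this full conditional is just small-matrix algebra, which for the 2×2 case can be done by hand. All numerical values below (one rat's ages and weights, σ², Σ⁻¹, θ₀) are invented placeholders, not the text's data; the bivariate normal draw uses a manual Cholesky factor of D_i:

```python
import random, math

random.seed(8)
# Assumed toy values for one rat i (not the text's data):
x = [8.0, 15.0, 22.0, 29.0, 36.0]         # ages in days
y = [151.0, 199.0, 246.0, 283.0, 320.0]   # weights
sigma2 = 36.0                             # current sigma^2
Sinv = [[0.01, 0.0], [0.0, 0.5]]          # current prior precision Sigma^{-1}
theta0 = [100.0, 6.0]                     # current prior mean (alpha0, beta0)

# D_i^{-1} = sigma^{-2} X'X + Sigma^{-1}, with rows of X equal to (1, x_j)
m = len(x)
XtX = [[m, sum(x)], [sum(x), sum(v * v for v in x)]]
Dinv = [[XtX[r][c] / sigma2 + Sinv[r][c] for c in range(2)] for r in range(2)]

# d_i = sigma^{-2} X'y + Sigma^{-1} theta0
Xty = [sum(y), sum(xv * yv for xv, yv in zip(x, y))]
d = [Xty[r] / sigma2 + sum(Sinv[r][c] * theta0[c] for c in range(2)) for r in range(2)]

# invert the 2x2 precision, form the mean D_i d_i, then sample via Cholesky of D_i
det = Dinv[0][0] * Dinv[1][1] - Dinv[0][1] * Dinv[1][0]
D = [[Dinv[1][1] / det, -Dinv[0][1] / det], [-Dinv[1][0] / det, Dinv[0][0] / det]]
mean = [D[0][0] * d[0] + D[0][1] * d[1], D[1][0] * d[0] + D[1][1] * d[1]]
l11 = math.sqrt(D[0][0]); l21 = D[0][1] / l11; l22 = math.sqrt(D[1][1] - l21 * l21)
z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
theta_i = (mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2)
print("draw of (alpha_i, beta_i):", theta_i)
```

The posterior mean D_i d_i is a precision-weighted compromise between the rat's own least-squares line and the population line θ₀, which is the shrinkage behavior the hierarchy is designed to produce.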
Example 5.6: Rat data
- Using vague hyperpriors, run 3 initially overdispersed parallel sampling chains for 500 iterations each:

[Figure: trace plots of mu and beta0 (500 values per trace, 3 chains) alongside kernel density estimates for mu and beta0 (1200 values each)]

- The output from all three chains over iterations 101–500 is used in the posterior kernel density estimates (column 2).

Chapter 5: Bayesian Computation – p. 25/42
Example 5.6: Rat data
Using vague hyperpriors, run 3 initially overdispersed parallel sampling chains for 500 iterations each:

[Figure: trace plots of mu and beta0 over iterations 0–500 (500 values per trace), each paired with a posterior kernel density estimate (1200 values)]

The output from all three chains over iterations 101–500 is used in the posterior kernel density estimates (col 2)
The average rat weighs about 106 grams at birth, andgains about 6.2 grams per day.
Chapter 5: Bayesian Computation – p. 25/42
Metropolis algorithm
What happens if the full conditional p(θ_i | θ_{j≠i}, y) is not available in closed form? Typically, p(θ_i | θ_{j≠i}, y) will be available up to a proportionality constant, since it is proportional to the portion of the Bayesian model (likelihood times prior) that involves θ_i.

Suppose the true joint posterior for θ has unnormalized density p(θ).

Choose a candidate density q(θ* | θ^(t−1)) that is a valid density function for every possible value of the conditioning variable θ^(t−1), and satisfies

    q(θ* | θ^(t−1)) = q(θ^(t−1) | θ*) ,

i.e., q is symmetric in its arguments.
Chapter 5: Bayesian Computation – p. 26/42
Metropolis algorithm (cont'd)
Given a starting value θ^(0) at iteration t = 0, the algorithm proceeds as follows:

Metropolis Algorithm: For (t ∈ 1 : T), repeat:

1. Draw θ* from q(· | θ^(t−1))

2. Compute the ratio
   r = p(θ*)/p(θ^(t−1)) = exp[ log p(θ*) − log p(θ^(t−1)) ]

3. If r ≥ 1, set θ^(t) = θ*;
   if r < 1, set θ^(t) = θ* with probability r, and θ^(t) = θ^(t−1) with probability 1 − r.

Then a draw θ^(t) converges in distribution to a draw from the true posterior density p(θ|y).

Note: When used as a substep in a larger (e.g., Gibbs) algorithm, we often use T = 1 (convergence still OK).
Chapter 5: Bayesian Computation – p. 27/42
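The three steps above can be sketched in a few lines of Python (this sketch is not part of the original slides; the N(3, 2²) target and the tuning value σ = 2.4 are purely illustrative choices):

```python
import numpy as np

def metropolis(log_p, theta0, sigma, T=20000, seed=0):
    """Random-walk Metropolis for a univariate target with unnormalized
    log-density log_p, using the symmetric candidate
    q(theta* | theta^(t-1)) = N(theta^(t-1), sigma^2)."""
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_p(theta0)
    draws = np.empty(T)
    accepted = 0
    for t in range(T):
        theta_star = theta + sigma * rng.standard_normal()  # step 1: draw candidate
        lp_star = log_p(theta_star)
        # steps 2-3: r = p(theta*)/p(theta^(t-1)) on the log scale;
        # accept with probability min(1, r), i.e., when log U < log r
        if np.log(rng.uniform()) < lp_star - lp:
            theta, lp = theta_star, lp_star
            accepted += 1
        draws[t] = theta
    return draws, accepted / T

# Toy target: unnormalized N(3, 2^2) log-density
draws, acc = metropolis(lambda th: -0.5 * ((th - 3.0) / 2.0) ** 2,
                        theta0=0.0, sigma=2.4)
```

Working on the log scale avoids overflow and underflow when the unnormalized density is very small, which is why step 2 on the slide gives the exp[log p(θ*) − log p(θ^(t−1))] form.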
Metropolis algorithm (cont'd)
How to choose the candidate density? The usual approach (after θ has been transformed to have support ℜ^k, if necessary) is to set

    q(θ* | θ^(t−1)) = N(θ* | θ^(t−1), Σ) .

In one dimension, MCMC "folklore" suggests choosing Σ to provide an observed acceptance ratio near 50%.

Hastings (1970) showed we can drop the requirement that q be symmetric, provided we use

    r = [ p(θ*) q(θ^(t−1) | θ*) ] / [ p(θ^(t−1)) q(θ* | θ^(t−1)) ]

– useful for asymmetric target densities!
– this form is called the Metropolis-Hastings algorithm
Chapter 5: Bayesian Computation – p. 28/42
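As a small illustration of the Hastings correction (not from the slides), consider a log-normal random-walk proposal for a positive parameter; the Gamma(3, 1) target below is an arbitrary example, and σ = 0.5 is an illustrative tuning value:

```python
import numpy as np

def mh_lognormal(log_p, theta0, sigma=0.5, T=20000, seed=1):
    """Metropolis-Hastings with an asymmetric log-normal random walk on a
    positive parameter: theta* = theta * exp(sigma * Z), Z ~ N(0, 1).

    Here q(theta*|theta) != q(theta|theta*); the Hastings ratio
    q(theta^(t-1)|theta*) / q(theta*|theta^(t-1)) works out to
    theta*/theta^(t-1), included in log_r below."""
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_p(theta0)
    draws = np.empty(T)
    for t in range(T):
        theta_star = theta * np.exp(sigma * rng.standard_normal())
        lp_star = log_p(theta_star)
        # Hastings correction: + log(theta*) - log(theta)
        log_r = lp_star - lp + np.log(theta_star) - np.log(theta)
        if np.log(rng.uniform()) < log_r:
            theta, lp = theta_star, lp_star
        draws[t] = theta
    return draws

# Toy target: unnormalized Gamma(3, 1), log p(theta) = 2 log(theta) - theta
draws = mh_lognormal(lambda th: 2.0 * np.log(th) - th, theta0=1.0)
```

Omitting the q-ratio here would bias the chain; the multiplicative proposal favors moves toward 0, and the θ*/θ^(t−1) factor exactly cancels that asymmetry.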
Convergence assessment
When is it safe to stop and summarize MCMC output?

We would like to ensure that ∫ |p_t(θ) − p(θ)| dθ < ǫ, but all we can hope to see is ∫ |p_t(θ) − p_{t+k}(θ)| dθ!

Controversy: Does the eventual mixing of "initially overdispersed" parallel sampling chains provide worthwhile information on convergence?

While one can never "prove" convergence of an MCMC algorithm using only a finite realization from the chain, poor mixing of parallel chains can help discover extreme forms of nonconvergence.

Still, it's tricky: a slowly converging sampler may be indistinguishable from one that will never converge (e.g., due to nonidentifiability)!
Chapter 5: Bayesian Computation – p. 29/42
Convergence diagnostics
Various summaries of MCMC output, such as

sample autocorrelations in one or more chains:
close to 0 indicates near-independence, and so the chain should more quickly traverse the entire parameter space :)
close to 1 indicates the sampler is "stuck" :(

Gelman/Rubin shrink factor,

    √R = sqrt( [ (N−1)/N + ((m+1)/(mN)) · (B/W) ] · df/(df−2) )  →  1  as N → ∞ ,

where B/N is the variance between the means from the m parallel chains, W is the average of the m within-chain variances, and df is the degrees of freedom of an approximating t density to the posterior.
Chapter 5: Bayesian Computation – p. 30/42
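A minimal sketch of the shrink factor computation (illustrative, and it omits the df/(df − 2) correction for the approximating t density; the simulated "chains" are stand-ins for real MCMC output):

```python
import numpy as np

def gelman_rubin(chains):
    """Shrink factor sqrt(R) from an (m, N) array of parallel chains,
    omitting the df/(df-2) t-density correction.

    B/N is the variance between the m chain means; W is the average
    of the m within-chain variances."""
    m, N = chains.shape
    B = N * chains.mean(axis=1).var(ddof=1)   # between-chain component
    W = chains.var(axis=1, ddof=1).mean()     # within-chain component
    R = (N - 1) / N + (m + 1) / (m * N) * B / W
    return np.sqrt(R)

rng = np.random.default_rng(0)
# Three well-mixed chains drawn from the same N(0,1) target
good = rng.standard_normal((3, 500))
# Three "stuck" chains centered at different modes
bad = rng.standard_normal((3, 500)) + np.array([[-5.0], [0.0], [5.0]])
```

Values of √R near 1 are consistent with the chains having mixed; values well above 1 indicate the between-chain variability still dominates.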
Convergence diagnosis strategy
Run a few (3 to 5) parallel chains, with starting points believed to be overdispersed

    say, covering ±3 prior standard deviations from the prior mean

Overlay the resulting sample traces for a representative subset of the parameters

    say, most of the fixed effects, some of the variance components, and a few well-chosen random effects

Annotate each plot with lag 1 sample autocorrelations and perhaps Gelman and Rubin diagnostics

Investigate bivariate plots and crosscorrelations among parameters suspected of being confounded, just as one might do regarding collinearity in linear regression.
Chapter 5: Bayesian Computation – p. 31/42
Variance estimation
How good is our MCMC estimate once we get it?

Suppose we have a single long chain of (post-convergence) MCMC samples {λ^(t)}, t = 1, ..., N. Let

    Ê(λ|y) = λ̄_N = (1/N) Σ_{t=1}^N λ^(t) .

Then by the CLT, under iid sampling we could take

    V̂ar_iid(λ̄_N) = s²_λ / N = [ 1 / (N(N−1)) ] Σ_{t=1}^N (λ^(t) − λ̄_N)² .

But this is likely an underestimate due to positive autocorrelation in the MCMC samples.
Chapter 5: Bayesian Computation – p. 32/42
Variance estimation (cont'd)
To avoid wasteful parallel sampling or "thinning," compute the effective sample size,

    ESS = N / κ(λ) ,

where κ(λ) = 1 + 2 Σ_{k=1}^∞ ρ_k(λ) is the autocorrelation time, and we cut off the sum when ρ_k(λ) < ǫ.

Then
    V̂ar_ESS(λ̄_N) = s²_λ / ESS(λ) .

Note: κ(λ) ≥ 1, so ESS(λ) ≤ N, and so we have that V̂ar_ESS(λ̄_N) ≥ V̂ar_iid(λ̄_N), in concert with intuition.
Chapter 5: Bayesian Computation – p. 33/42
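The ESS formula can be sketched as follows; the cutoff ǫ = 0.05 and the AR(1) test chain (which mimics a slowly mixing sampler) are illustrative choices, not from the text:

```python
import numpy as np

def effective_sample_size(x, eps=0.05):
    """ESS = N / kappa(lambda), kappa = 1 + 2 * sum_k rho_k(lambda),
    truncating the sum at the first lag where the sample
    autocorrelation drops below eps."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    kappa = 1.0
    for k in range(1, N):
        rho_k = np.dot(xc[:-k], xc[k:]) / denom   # lag-k sample autocorrelation
        if rho_k < eps:
            break
        kappa += 2.0 * rho_k
    return N / kappa

rng = np.random.default_rng(0)
iid = rng.standard_normal(5000)       # "perfect" sampler: ESS near N
ar = np.empty(5000)                   # AR(1) with lag-1 correlation 0.9
ar[0] = 0.0
for t in range(1, 5000):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
```

For the AR(1) chain the theoretical autocorrelation time is (1 + 0.9)/(1 − 0.9) = 19, so ESS should be roughly N/19 despite the chain containing N draws.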
Variance estimation (cont'd)
Another alternative: Batching: Divide the run into m successive batches of length k with batch means b_1, ..., b_m. Then λ̄_N = b̄ = (1/m) Σ_{i=1}^m b_i, and

    V̂ar_batch(λ̄_N) = [ 1 / (m(m−1)) ] Σ_{i=1}^m (b_i − λ̄_N)² ,

provided that k is large enough so that the correlation between batches is negligible.

For any V̂ used to approximate Var(λ̄_N), a 95% CI for E(λ|y) is then given by

    λ̄_N ± z_{.025} √V̂ .
Chapter 5: Bayesian Computation – p. 34/42
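A sketch of the batch-means estimate (the AR(1) "chain" and m = 40 batches are illustrative stand-ins for real MCMC output):

```python
import numpy as np

def batch_means_var(lam, m=40):
    """Batch-means estimate of Var(lambda_bar_N): split the chain into m
    successive batches of length k, then apply the usual sample-variance
    formula to the batch means."""
    k = len(lam) // m                              # batch length
    b = lam[: m * k].reshape(m, k).mean(axis=1)    # batch means b_1..b_m
    lam_bar = b.mean()
    V = np.sum((b - lam_bar) ** 2) / (m * (m - 1))
    return lam_bar, V

rng = np.random.default_rng(0)
# Positively autocorrelated AR(1) "chain" with stationary mean 0
lam = np.empty(20000)
lam[0] = 0.0
for t in range(1, len(lam)):
    lam[t] = 0.8 * lam[t - 1] + rng.standard_normal()

lam_bar, V = batch_means_var(lam)
ci = (lam_bar - 1.96 * np.sqrt(V), lam_bar + 1.96 * np.sqrt(V))  # 95% CI
```

Because the chain is positively autocorrelated, V should come out noticeably larger than the naive iid estimate s²_λ/N, in line with the underestimation warning on the previous slide.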
Recent Developments
Overrelaxation: Try to speed convergence by inducing negative autocorrelation within the chains

Neal (1998): Generate {θ_{i,k}}, k = 1, ..., K, independently from the full conditional p(θ_i | θ_{j≠i}, y). Ordering these along with the old value, we have

    θ_{i,0} ≤ θ_{i,1} ≤ ··· ≤ θ_{i,r} ≡ θ_i^(t−1) ≤ ··· ≤ θ_{i,K} ,

so that r is the index of the old value. Then take

    θ_i^(t) = θ_{i,K−r} .

Note that K = 1 produces Gibbs sampling, while large K produces progressively more overrelaxation.

Generation of the K random variables can be avoided if the full conditional cdf and inverse cdf are available.
Chapter 5: Bayesian Computation – p. 35/42
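Neal's ordered-overrelaxation update can be sketched as below; the N(0, 1) full conditional and K = 20 are illustrative stand-ins (in a real Gibbs sampler the conditional would depend on the other parameters):

```python
import numpy as np

def overrelaxed_update(theta_old, draw_cond, K, rng):
    """One ordered-overrelaxation step (Neal 1998): draw K values from the
    full conditional, sort them together with the old value, find the old
    value's index r among the K+1 ordered values, and return the value at
    the 'mirror' index K - r."""
    vals = np.sort(np.append(draw_cond(K, rng), theta_old))
    r = int(np.searchsorted(vals, theta_old))   # index of the old value
    return vals[K - r]

rng = np.random.default_rng(0)
draw_cond = lambda K, rng: rng.standard_normal(K)   # stand-in full conditional

chain = np.empty(5000)
chain[0] = 0.0
for t in range(1, len(chain)):
    chain[t] = overrelaxed_update(chain[t - 1], draw_cond, K=20, rng=rng)
```

Because the new value sits at the opposite rank from the old one, successive draws are strongly negatively correlated while the N(0, 1) marginal is preserved; this is the negative within-chain autocorrelation the slide describes.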
Blocking and Structured MCMC
A general approach to accelerating MCMC convergence via multivariate updating of blocks of correlated parameters.

For hierarchical linear models expressible as

    [ y ]   [ X_1   0  ] [ θ_1 ]   [ ǫ ]
    [ 0 ] = [ H_1  H_2 ] [ θ_2 ] + [ δ ]
    [ M ]   [ G_1  G_2 ]           [ ξ ]

or more compactly as

    Y = Xθ + E ,

where X and Y are known, θ is unknown, and E is an error term having Cov(E) = Γ, where Γ is block diagonal:

    Cov(E) = Γ = Diag( Cov(ǫ), Cov(δ), Cov(ξ) ) .
Chapter 5: Bayesian Computation – p. 36/42
Blocking and Structured MCMC
Under Gaussian errors, p(θ | Y, X, Γ) is given by

    N( (X′Γ⁻¹X)⁻¹ X′Γ⁻¹Y , (X′Γ⁻¹X)⁻¹ ) .

With conjugate (inverse gamma or Wishart) priors for the variance components, p(Γ | Y, X, θ) is also available, producing a giant two-block Gibbs sampler.

May want to use the above density only as a candidate distribution in a Hastings independence chain algorithm (to avoid frequent inversion of X′Γ⁻¹X).

For hierarchical nonlinear models, replace the data y with "low shrinkage" estimates ŷ = θ̂_1 (say, MLEs), and proceed with the Hastings implementation!
Chapter 5: Bayesian Computation – p. 37/42
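The block draw of θ from its Gaussian full conditional can be sketched as follows (the design matrix, data, and Γ below are made up for illustration, not the rat example; with Γ = I the posterior mean reduces to ordinary least squares):

```python
import numpy as np

def draw_theta_block(X, Y, Gamma, rng):
    """One multivariate block draw of theta from
    N( (X' G^-1 X)^-1 X' G^-1 Y , (X' G^-1 X)^-1 ),
    the Gaussian full conditional above (Gamma written as G here)."""
    Gi = np.linalg.inv(Gamma)
    cov = np.linalg.inv(X.T @ Gi @ X)        # (X' Gamma^-1 X)^-1
    mean = cov @ (X.T @ Gi @ Y)              # GLS-type posterior mean
    L = np.linalg.cholesky(cov)              # sample via Cholesky factor
    return mean + L @ rng.standard_normal(len(mean)), mean, cov

# Tiny made-up system
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [0.0, 1.0]])
Y = np.array([1.0, 2.0, 3.0, 1.0])
Gamma = np.eye(4)
rng = np.random.default_rng(0)
theta_draw, mean, cov = draw_theta_block(X, Y, Gamma, rng)
```

In a full sampler this draw would alternate with a draw of Γ from its (inverse gamma/Wishart) full conditional, giving the two-block Gibbs sampler described above.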
Auxiliary variables and slice sampling
Ease and/or accelerate sampling from p(θ) by adding an auxiliary (or latent) variable u ∼ p(u|θ).

Simple slice sampler: Take u|θ ∼ Unif(0, p(θ)), so that p(θ, u) ∝ I_{0 ≤ u ≤ p(θ)}(θ, u). So Gibbs = 2 uniform updates:

    u|θ ∼ Unif(0, p(θ)), and
    θ|u ∼ Unif{θ : p(θ) ≥ u}

The 2nd update (over the "slice" defined by u) requires p(θ) to be invertible, either analytically or numerically.

Product slice sampler: if p(θ) ∝ π(θ) Π_{i=1}^n L_i(θ) with L_i(θ) > 0, then introduce u = (u_1, ..., u_n), where u_i|θ ∼ Unif(0, L_i(θ)) for all i.

Slice samplers typically have excellent convergence properties, and are increasingly used as alternatives to Metropolis steps!
Chapter 5: Bayesian Computation – p. 38/42
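A minimal simple slice sampler: one common numerical way to carry out the second (slice) update without inverting p(θ) analytically is Neal's "stepping out and shrinking" procedure, sketched here with an illustrative N(1, 0.5²) target (the step width w = 1 is also an arbitrary choice):

```python
import numpy as np

def slice_sampler(p, x0, T=20000, w=1.0, seed=0):
    """Simple slice sampler for a univariate unnormalized density p,
    alternating u | theta ~ Unif(0, p(theta)) with theta | u uniform
    on the slice {theta : p(theta) >= u}."""
    rng = np.random.default_rng(seed)
    x = x0
    out = np.empty(T)
    for t in range(T):
        u = rng.uniform(0, p(x))          # vertical level defining the slice
        # step out an interval [l, r] that contains the slice
        l = x - w * rng.uniform()
        r = l + w
        while p(l) > u:
            l -= w
        while p(r) > u:
            r += w
        # draw uniformly, shrinking the interval until inside the slice
        while True:
            x_new = rng.uniform(l, r)
            if p(x_new) >= u:
                x = x_new
                break
            if x_new < x:
                l = x_new
            else:
                r = x_new
        out[t] = x
    return out

# Toy target: unnormalized N(1, 0.5^2) density
draws = slice_sampler(lambda th: np.exp(-0.5 * ((th - 1.0) / 0.5) ** 2), x0=0.0)
```

Note the sampler needs only pointwise evaluations of the unnormalized p(θ), with no tuning of an acceptance rate, which is part of why slice steps are attractive substitutes for Metropolis steps.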
Reversible Jump MCMC
Method for handling models of varying dimension, or Bayesian model choice more generally.

Suppose model M = j has parameter vector θ_j of dimension n_j, j = 1, ..., K. The algorithm is:

1. Let the current state of the Markov chain be (j, θ_j), where θ_j is of dimension n_j.

2. Propose a new model j′ with probability h(j, j′).

3. Generate u from a proposal density q(u | θ_j, j, j′).

4. Set (θ′_{j′}, u′) = g_{j,j′}(θ_j, u), where g_{j,j′} is a deterministic function that is 1-1 and onto. This is a "dimension matching" function, specified so that

    n_j + dim(u) = n_{j′} + dim(u′) .
Chapter 5: Bayesian Computation – p. 39/42
Reversible Jump MCMC
5. Accept the proposed move (from j to j′) with probability α_{j→j′}, which is the minimum of 1 and

    [ f(y | θ′_{j′}, M = j′) p(θ′_{j′} | M = j′) π_{j′} h(j′, j) q(u′ | θ′_{j′}, j′, j) ]
    / [ f(y | θ_j, M = j) p(θ_j | M = j) π_j h(j, j′) q(u | θ_j, j, j′) ]
    × | ∂g(θ_j, u) / ∂(θ_j, u) | .

At convergence, we obtain {M^(g)}, g = 1, ..., G, from p(M|y), so

    p̂(M = j | y) = (number of M^(g) equal to j) / (total number of M^(g)),  j = 1, ..., K ,

as well as the Bayes factor between models j and j′,

    B̂F_{jj′} = [ p̂(M = j | y) / p̂(M = j′ | y) ] / [ p(M = j) / p(M = j′) ] .
Chapter 5: Bayesian Computation – p. 40/42
RJMCMC Example 1
Suppose K = 2 models, for which θ_1 ∈ ℜ and θ_2 ∈ ℜ², and θ_1 is a subvector of θ_2. Then when moving from j = 1 to j′ = 2, we may draw u ∼ q(u) and set

    θ′_2 = (θ_1, u) .

That is, the dimension matching g is the identity function, and so the Jacobian in step 5 is equal to 1.
Chapter 5: Bayesian Computation – p. 41/42
RJMCMC Example 2
Consider choosing between a time series model having a constant mean level θ_1, and a changepoint model having level θ_{2,1} before time t and θ_{2,2} afterward.

When moving from Model 2 to 1, set

    θ′_1 = (θ_{2,1} + θ_{2,2}) / 2 .

To ensure reversibility of this move, when going from Model 1 to 2 we might sample u ∼ q(u) and set

    θ′_{2,1} = θ_1 − u  and  θ′_{2,2} = θ_1 + u ,

since this is a 1-1 and onto function corresponding to the deterministic "down move" we selected above.

RJMCMC is tricky (and not in WinBUGS) but extremely general!
Chapter 5: Bayesian Computation – p. 42/42
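A quick numerical check of the dimension matching in Example 2 (the pairing of moves follows the text; the numeric values are arbitrary): the "up move" g(θ_1, u) = (θ_1 − u, θ_1 + u) has Jacobian matrix [[1, −1], [1, 1]], whose absolute determinant, 2, is the factor entering the acceptance ratio in step 5.

```python
import numpy as np

def up_move(theta1, u):
    """Model 1 -> Model 2: g(theta1, u) = (theta1 - u, theta1 + u)."""
    return theta1 - u, theta1 + u

def down_move(theta21, theta22):
    """Inverse map, Model 2 -> Model 1: recovers (theta1, u)."""
    return (theta21 + theta22) / 2.0, (theta22 - theta21) / 2.0

# Jacobian of g with respect to (theta1, u)
J = np.array([[1.0, -1.0],
              [1.0,  1.0]])
jac = abs(np.linalg.det(J))   # |det| = 2 appears in the acceptance ratio
```

The round trip up_move followed by down_move returns the original (θ_1, u), confirming that g is 1-1 and onto as reversibility requires.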