Bayesian computation

Prehistory (1763 – 1960): Conjugate priors

1960's: Numerical quadrature – Newton-Cotes methods, Gaussian quadrature, etc.

1970's: Expectation-Maximization ("EM") algorithm – iterative mode-finder

1980's: Asymptotic methods – Laplace's method, saddlepoint approximations

1980's: Noniterative Monte Carlo methods – Direct posterior sampling and indirect methods (importance sampling, rejection, etc.)

1990's: Markov chain Monte Carlo (MCMC) – Gibbs sampler, Metropolis-Hastings algorithm

⇒ MCMC methods are broadly applicable, but require care in parametrization and convergence diagnosis!

Asymptotic methods

When n is large, f(x|θ) will be quite peaked relative to p(θ), and so p(θ|x) will be approximately normal.

"Bayesian Central Limit Theorem": Suppose X_1, …, X_n ~iid f(x_i|θ), and that the prior p(θ) and the likelihood f(x|θ) are positive and twice differentiable near θ̂^p, the posterior mode of θ. Then for large n,

p(θ|x) ·∼ N( θ̂^p, [I^p(x)]^{-1} ),

where "·∼" means "approximately distributed as", and I^p(x) is the "generalized" observed Fisher information matrix for θ, i.e., minus the Hessian of the log posterior evaluated at the mode:

I^p_{ij}(x) = − [ ∂²/(∂θ_i ∂θ_j) log( f(x|θ) p(θ) ) ]_{θ = θ̂^p}.

Example 5.1: Hamburger patties again

Recall X|θ ∼ Bin(n, θ) and θ ∼ Beta(α, β).

Under a flat prior on θ, we have

l(θ) = log f(x|θ)p(θ) ∝ x log θ + (n − x) log(1 − θ).

Taking the derivative of l(θ) and equating to zero, we obtain θ̂^p = θ̂ = x/n, the familiar binomial proportion.

The second derivative is

∂²l(θ)/∂θ² = −x/θ² − (n − x)/(1 − θ)²,

so that

∂²l(θ)/∂θ² |_{θ=θ̂} = −x/θ̂² − (n − x)/(1 − θ̂)² = −n/θ̂ − n/(1 − θ̂).

Example 5.1: Hamburger patties again

Thus

[I^p(x)]^{-1} = ( n/θ̂ + n/(1 − θ̂) )^{-1} = ( n/(θ̂(1 − θ̂)) )^{-1} = θ̂(1 − θ̂)/n,

which is the usual frequentist expression for V̂ar(θ̂). Thus the Bayesian CLT gives

p(θ|x) ·∼ N( θ̂, θ̂(1 − θ̂)/n ).

Notice that a frequentist might instead use MLE asymptotics to write

θ̂ | θ ·∼ N( θ, θ̂(1 − θ̂)/n ),

leading to identical inferences for θ, but for different reasons and with different interpretations!

Example 5.1: Hamburger patties again

Comparison of this normal approximation to the exact posterior, a Beta(14, 4) distribution (recall n = 16):

[Figure: posterior density of θ over (0, 1): exact (beta) vs. approximate (normal).]

Similar modes, but very different tail behavior: 95% credible sets are (.57, .93) for the exact posterior, but (.62, 1.0) for the normal approximation.
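This interval comparison is easy to reproduce numerically. Below is a minimal sketch in Python (assuming scipy is available; the data values x = 13, n = 16 are inferred from the Beta(14, 4) posterior under the flat prior):

```python
# Reproduce the Example 5.1 comparison: exact Beta(14, 4) posterior
# vs. its Bayesian CLT (normal) approximation.
import numpy as np
from scipy import stats

x, n = 13, 16                       # flat Beta(1,1) prior => Beta(x+1, n-x+1) posterior
theta_hat = x / n                   # posterior mode = MLE = 0.8125
se = np.sqrt(theta_hat * (1 - theta_hat) / n)

exact = stats.beta(x + 1, n - x + 1)     # Beta(14, 4)
approx = stats.norm(theta_hat, se)       # N(theta_hat, theta_hat(1-theta_hat)/n)

print(exact.ppf([0.025, 0.975]))    # ~ (0.57, 0.93)
print(approx.ppf([0.025, 0.975]))   # ~ (0.62, 1.00): the normal tail spills past 1!
```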

Higher order approximations

The Bayesian CLT is a first order approximation, since

E(g(θ)|x) = g(θ̂) [1 + O(1/n)].

Second order approximations (i.e., to order O(1/n²)), again requiring only mode and Hessian calculations, are available via Laplace's Method (C&L, Sec. 5.2.2).

Advantages of asymptotic methods:
– deterministic, noniterative algorithm
– substitutes differentiation for integration
– facilitates studies of Bayesian robustness

Disadvantages of asymptotic methods:
– requires a well-parametrized, unimodal posterior
– θ must be of at most moderate dimension
– n must be large, but is beyond our control

Noniterative Monte Carlo Methods

Recall Monte Carlo integration: Suppose θ ∼ h(θ), and we seek γ ≡ E[f(θ)] = ∫ f(θ)h(θ)dθ. Then if θ_j ~iid h(θ),

γ̂ = (1/N) Σ_{j=1}^N f(θ_j) → E[f(θ)] with probability 1 as N → ∞ (SLLN).

Since γ̂ is itself a sample mean of independent observations, Var(γ̂) = Var[f(θ)]/N. This means

ŝe(γ̂) = √[ (1/(N(N−1))) Σ_{j=1}^N ( f(θ_j) − γ̂ )² ].

⇒ γ̂ ± 2 ŝe(γ̂) provides a 95% (frequentist!) CI for γ.

Note the dependence on the Monte Carlo sample size N, rather than on the data sample size n!
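A minimal Python sketch of this idea (my own toy integrand, not from the slides): estimate γ = E[θ²] for θ ∼ N(0, 1), whose true value is 1, together with its Monte Carlo standard error.

```python
# Monte Carlo integration with a standard error: gamma = E[f(theta)]
# for theta ~ N(0, 1) and f(theta) = theta^2 (true value 1).
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
theta = rng.normal(size=N)          # draws theta_j ~ h
f = theta**2                        # f(theta_j)

gamma_hat = f.mean()
se = np.sqrt(np.sum((f - gamma_hat)**2) / (N * (N - 1)))
print(gamma_hat, (gamma_hat - 2*se, gamma_hat + 2*se))   # ~95% CI for gamma
```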

Example 5.2: Direct bivariate sampling

If Y_i ~iid N(μ, σ²), i = 1, …, n, and p(μ, σ) = 1/σ, then

μ | σ², y ∼ N(ȳ, σ²/n), and

σ² | y ∼ K χ⁻²_{n−1} = K · IG( (n−1)/2, 2 ),

where K = Σ_{i=1}^n (y_i − ȳ)².

So draw {(μ_j, σ²_j), j = 1, …, N} from p(μ, σ²|y) as:

σ²_j ∼ K χ⁻²_{n−1}; then μ_j ∼ N(ȳ, σ²_j/n), j = 1, …, N.

To estimate the posterior mean: Ê(μ|y) = (1/N) Σ_{j=1}^N μ_j.

Transformations are also easy: To estimate the coefficient of variation, γ = σ/μ, define γ_j = σ_j/μ_j, j = 1, …, N; summarize with moments or histograms!
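A minimal Python sketch of this direct sampler, using synthetic data (all numerical values below are my own assumptions):

```python
# Direct bivariate sampling from p(mu, sigma^2 | y):
# sigma2_j ~ K * inv-chi2_{n-1}, then mu_j | sigma2_j ~ N(ybar, sigma2_j / n).
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=10.0, scale=2.0, size=25)    # toy data
n, ybar = len(y), y.mean()
K = np.sum((y - ybar)**2)

N = 5000
sigma2 = K / rng.chisquare(df=n - 1, size=N)    # sigma^2_j ~ K chi^{-2}_{n-1}
mu = rng.normal(ybar, np.sqrt(sigma2 / n))      # mu_j | sigma^2_j
cv = np.sqrt(sigma2) / mu                       # gamma_j = sigma_j / mu_j

print(mu.mean())                                # estimate of E(mu | y)
print(np.quantile(cv, [0.025, 0.975]))          # 95% credible set for sigma/mu
```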

Indirect Methods

Importance sampling: Suppose we wish to approximate

E(f(θ)|y) = ∫ f(θ)L(θ)p(θ)dθ / ∫ L(θ)p(θ)dθ,

and we can roughly approximate the normalized likelihood times prior, cL(θ)p(θ), by some density g(θ) from which we can easily sample – say, a multivariate t.

Then, defining the weight function w(θ) = L(θ)p(θ)/g(θ),

E(f(θ)|y) = ∫ f(θ)w(θ)g(θ)dθ / ∫ w(θ)g(θ)dθ ≈ [ (1/N) Σ_{j=1}^N f(θ_j)w(θ_j) ] / [ (1/N) Σ_{j=1}^N w(θ_j) ],

where θ_j ~iid g(θ). Here, g(θ) is called the importance function; a good match to cL(θ)p(θ) will produce roughly equal weights, hence a good approximation.

Rejection sampling

Here, instead of trying to approximate the posterior

h(θ) = L(θ)p(θ) / ∫ L(θ)p(θ)dθ,

we try to "blanket" it: suppose there exists a constant M > 0 and a smooth density g(θ), called the envelope function, such that L(θ)p(θ) < M g(θ) for all θ.

The algorithm proceeds as follows (a code sketch appears after these steps):
(i) Generate θ_j ∼ g(θ).
(ii) Generate U ∼ Uniform(0, 1).
(iii) If MU g(θ_j) < L(θ_j)p(θ_j), accept θ_j; else reject θ_j.
(iv) Return to step (i) and repeat, until the desired sample {θ_j, j = 1, …, N} is obtained. Its members will be random variates from h(θ).

Rejection Sampling: informal "proof"

[Figure: histogram of draws from the envelope Mg, plotted over the target curve; a histogram bar centered at a point a rises above the target.]

Consider the θ_j samples in the histogram bar centered at a: the rejection step "slices off" the top portion of the bar. Repeat for all a: the accepted θ_j mimic the lower curve!

Need to choose M as small as possible (efficiency), and watch for "envelope violations"!

Markov chain Monte Carlo methods

– iterative MC methods, useful when it is difficult or impossible to find a feasible importance or envelope density.

Given two unknowns x (think parameters) and y (think missing data), we can often write

p(x) = ∫ p(x|y)p(y)dy and p(y) = ∫ p(y|x)p(x)dx,

where p(x|y) and p(y|x) are known; we seek p(x).

Analytical solution via substitution:

p(x) = ∫ p(x|y) [ ∫ p(y|x′)p(x′)dx′ ] dy = ∫ h(x, x′)p(x′)dx′,

where h(x, x′) = ∫ p(x|y)p(y|x′)dy.

Substitution sampling

This determines a fixed point system

p_{i+1}(x) = ∫ h(x, x′) p_i(x′) dx′,

which converges under mild conditions.

Tanner and Wong (1987): Can also use a sampling-based approach: Draw X^{(0)} ∼ p₀(x), then Y^{(1)} ∼ p(y|x^{(0)}), and finally X^{(1)} ∼ p(x|y^{(1)}). Then X^{(1)} has marginal distribution

p₁(x) = ∫ p(x|y)p₁(y)dy = ∫ h(x, x′)p₀(x′)dx′.

Repeating this process produces pairs (X^{(i)}, Y^{(i)}) such that X^{(i)} →_d X ∼ p(x) and Y^{(i)} →_d Y ∼ p(y).

Substitution sampling

That is, for i sufficiently large, X^{(i)} may be thought of as a sample from the true marginal density p(x). Tanner and Wong refer to this (bivariate) algorithm as data augmentation.

The luxury of avoiding the integration above has come at the price of obtaining not the marginal density p(x) itself, but only a sample from this density.

Can we extend this idea to more than 2 dimensions? ...

Gibbs sampling

Suppose the joint distribution of θ = (θ_1, …, θ_K) is uniquely determined by the full conditional distributions, {p_i(θ_i | θ_{j≠i}), i = 1, …, K}.

Given an arbitrary set of starting values {θ_1^{(0)}, …, θ_K^{(0)}}:

Draw θ_1^{(1)} ∼ p_1(θ_1 | θ_2^{(0)}, …, θ_K^{(0)}),
Draw θ_2^{(1)} ∼ p_2(θ_2 | θ_1^{(1)}, θ_3^{(0)}, …, θ_K^{(0)}),
⋮
Draw θ_K^{(1)} ∼ p_K(θ_K | θ_1^{(1)}, …, θ_{K−1}^{(1)}).

Under mild conditions,

(θ_1^{(t)}, …, θ_K^{(t)}) →_d (θ_1, …, θ_K) ∼ p as t → ∞.
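A minimal Python sketch of this cycle for the standard bivariate normal toy problem (my own example, not from the slides), where both full conditionals are univariate normals:

```python
# Gibbs sampling for (X, Y) bivariate normal with correlation rho:
# X | Y = y ~ N(rho*y, 1 - rho^2) and Y | X = x ~ N(rho*x, 1 - rho^2).
import numpy as np

rng = np.random.default_rng(4)
rho, T = 0.8, 5000
x = np.empty(T)
y = np.empty(T)
x[0], y[0] = -5.0, 5.0                 # deliberately overdispersed start

sd = np.sqrt(1 - rho**2)
for t in range(1, T):
    x[t] = rng.normal(rho * y[t-1], sd)    # draw x(t) | y(t-1)
    y[t] = rng.normal(rho * x[t], sd)      # draw y(t) | x(t)

burn = 500                             # discard burn-in
print(np.corrcoef(x[burn:], y[burn:])[0, 1])   # ~ 0.8
```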

Gibbs sampling (cont'd)

For t sufficiently large (say, bigger than t₀), {θ^{(t)}}_{t=t₀+1}^T is a (correlated) sample from the true posterior.

We might therefore use a sample mean to estimate the posterior mean, i.e.,

Ê(θ_i|y) = (1/(T − t₀)) Σ_{t=t₀+1}^T θ_i^{(t)}.

The time from t = 0 to t = t₀ is commonly known as the burn-in period; one can safely adapt (change) an MCMC algorithm during this preconvergence period, since these samples will be discarded anyway.

Gibbs sampling (cont'd)

In practice, we may actually run m parallel Gibbs sampling chains, instead of only 1, for some modest m (say, m = 5). Discarding the burn-in period, we obtain

Ê(θ_i|y) = (1/(m(T − t₀))) Σ_{j=1}^m Σ_{t=t₀+1}^T θ_{i,j}^{(t)},

where now the j subscript indicates chain number.

A density estimate p̂(θ_i|y) may be obtained by smoothing the histogram of the {θ_{i,j}^{(t)}}, or, since p(θ_i|y) = ∫ p(θ_i | θ_{k≠i}, y) p(θ_{k≠i}|y) dθ_{k≠i}, as the mixture of full conditionals

p̂(θ_i|y) = (1/(m(T − t₀))) Σ_{j=1}^m Σ_{t=t₀+1}^T p(θ_i | θ_{k≠i,j}^{(t)}, y).

Example 2.5 (revisited)

Consider the model

Y_i | θ_i ~ind Poisson(θ_i s_i), θ_i ~ind G(α, β), β ∼ IG(c, d), i = 1, …, k,

where α, c, d, and the s_i are known. Thus

f(y_i|θ_i) = e^{−θ_i s_i} (θ_i s_i)^{y_i} / y_i!,  y_i ≥ 0, θ_i > 0,

g(θ_i|β) = θ_i^{α−1} e^{−θ_i/β} / (Γ(α)β^α),  α > 0, β > 0,

h(β) = e^{−1/(βd)} / (Γ(c)d^c β^{c+1}),  c > 0, d > 0.

Note g is conjugate for f, and h is conjugate for g!

Example 2.5 (revisited)

To implement the Gibbs sampler, we require the full conditional distributions of β and the θ_i.

By Bayes' Rule, each of these is proportional to the complete Bayesian model specification,

[ ∏_{i=1}^k f(y_i|θ_i) g(θ_i|β) ] h(β).

Thus we can find full conditional distributions by dropping irrelevant terms from this expression, and normalizing!

Expressions are given on the next slide...

Example 2.5 (revisited)

p(θ_i | θ_{j≠i}, β, y) ∝ f(y_i|θ_i) g(θ_i|β)
  ∝ θ_i^{y_i + α − 1} e^{−θ_i(s_i + 1/β)}
  ∝ G( θ_i | y_i + α, (s_i + 1/β)^{−1} ), and

p(β | {θ_i}, y) ∝ [ ∏_{i=1}^k g(θ_i|β) ] h(β) ∝ [ ∏_{i=1}^k e^{−θ_i/β} / β^α ] e^{−1/(βd)} / β^{c+1}
  ∝ e^{−(1/β)(Σ_{i=1}^k θ_i + 1/d)} / β^{kα + c + 1}
  ∝ IG( β | kα + c, (Σ_{i=1}^k θ_i + 1/d)^{−1} ).
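Since both full conditionals are standard families, the whole sampler is only a few lines of code. A minimal Python sketch follows (the counts y_i, exposures s_i, and hyperparameters α, c, d are all made-up values; the G and IG parametrizations match the densities above):

```python
# Gibbs sampler for Example 2.5:
#   theta_i | beta, y ~ G(y_i + alpha, (s_i + 1/beta)^{-1})   (scale param.)
#   beta | {theta_i}, y ~ IG(k*alpha + c, (sum_i theta_i + 1/d)^{-1}),
# where beta ~ IG(c', d') means 1/beta ~ Gamma(c', scale d').
import numpy as np

rng = np.random.default_rng(5)
y = np.array([5.0, 1.0, 5.0, 14.0, 3.0])          # hypothetical counts
s = np.array([94.3, 15.7, 62.9, 126.0, 5.24])     # hypothetical exposures
k = len(y)
alpha, c, d = 1.0, 0.1, 1.0                       # hypothetical hyperparameters

T, burn = 5000, 1000
beta = 1.0
theta_draws = np.empty((T, k))
beta_draws = np.empty(T)
for t in range(T):
    theta = rng.gamma(y + alpha, 1.0 / (s + 1.0 / beta))
    beta = 1.0 / rng.gamma(k * alpha + c, 1.0 / (theta.sum() + 1.0 / d))
    theta_draws[t], beta_draws[t] = theta, beta

print(theta_draws[burn:].mean(axis=0), beta_draws[burn:].mean())
```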

Example 2.5 (revisited)

Thus the {θ_i^{(t)}} and β^{(t)} may be sampled directly!

If α were also unknown, things would be harder, since

p(α | {θ_i}, β, y) ∝ [ ∏_{i=1}^k g(θ_i|α, β) ] h(α)

is not proportional to any standard family. So we resort to:

adaptive rejection sampling (ARS), provided p(α | {θ_i}, β, y) is log-concave, or

Metropolis-Hastings sampling – IOU for now!

Note: This is the order the WinBUGS software uses when deriving full conditionals!

This is the standard "hybrid approach": Use Gibbs overall, with "substeps" for awkward full conditionals.

Example 5.6: Rat data

Consider the longitudinal data model

Y_ij ~ind N( α_i + β_i x_ij, σ² ),

where Y_ij is the weight of the ith rat at measurement point j, while x_ij denotes its age in days, for i = 1, …, k = 30, and j = 1, …, n_i = 5 for all i (see text p.147 for the actual data).

Adopt the random effects model

θ_i ≡ (α_i, β_i)ᵀ ~iid N( θ₀ ≡ (α₀, β₀)ᵀ, Σ ), i = 1, …, k,

which is conjugate with the likelihood (see the general normal linear model in Example 2.7, text p.34).

Example 5.6: Rat data

Priors: Conjugate forms are again available, namely

σ² ∼ IG(a, b),
θ₀ ∼ N(η, C), and
Σ⁻¹ ∼ W( (ρR)⁻¹, ρ ),

where W denotes the Wishart (multivariate gamma) distribution; see Appendix A.2.2, text p.328.

We assume the hyperparameters (a, b, η, C, ρ, and R) are all known, so there are 30(2) + 3 + 3 = 66 unknown parameters in the model.

Yet the Gibbs sampler is relatively straightforward to implement here, thanks to the conjugacy at each stage in the hierarchy.

Example 5.6: Rat data

Full conditional derivation: Consider for example θ_i. Since we are allowed to think of the remaining parameters as fixed, by Example 2.7 (Lindley and Smith, 1972) we have

θ_i | y, θ₀, Σ⁻¹, σ² ∼ N( D_i d_i, D_i ), where

D_i⁻¹ = σ⁻² X_iᵀ X_i + Σ⁻¹ and d_i = σ⁻² X_iᵀ y_i + Σ⁻¹ θ₀,

for y_i = (y_{i1}, …, y_{i n_i})ᵀ and X_i the n_i × 2 design matrix whose jth row is (1, x_ij).

Similarly, the full conditionals for θ₀, Σ⁻¹, and σ² emerge in closed form as normal, Wishart, and inverse gamma distributions, respectively!

Example 5.6: Rat data

Using vague hyperpriors, run 3 initially overdispersed parallel sampling chains for 500 iterations each:

[Figure: trace plots (500 values per trace) and posterior kernel density estimates (1200 values) for mu and beta0.]

The output from all three chains over iterations 101–500 is used in the posterior kernel density estimates (column 2).

The average rat weighs about 106 grams at birth, and gains about 6.2 grams per day.

Metropolis algorithm

What happens if the full conditional p(θ_i | θ_{j≠i}, y) is not available in closed form? Typically, p(θ_i | θ_{j≠i}, y) will be available up to a proportionality constant, since it is proportional to the portion of the Bayesian model (likelihood times prior) that involves θ_i.

Suppose the true joint posterior for θ has unnormalized density p(θ).

Choose a candidate density q(θ* | θ^{(t−1)}) that is a valid density function for every possible value of the conditioning variable θ^{(t−1)}, and satisfies

q(θ* | θ^{(t−1)}) = q(θ^{(t−1)} | θ*),

i.e., q is symmetric in its arguments.

Metropolis algorithm (cont'd)

Given a starting value θ^{(0)} at iteration t = 0, the algorithm proceeds as follows (a code sketch appears after the steps):

Metropolis Algorithm: For (t ∈ 1 : T), repeat:

1. Draw θ* from q(· | θ^{(t−1)})
2. Compute the ratio r = p(θ*)/p(θ^{(t−1)}) = exp[ log p(θ*) − log p(θ^{(t−1)}) ]
3. If r ≥ 1, set θ^{(t)} = θ*; if r < 1, set θ^{(t)} = θ* with probability r, and θ^{(t)} = θ^{(t−1)} with probability 1 − r.

Then a draw θ^{(t)} converges in distribution to a draw from the true posterior density p(θ|y).

Note: When used as a substep in a larger (e.g., Gibbs) algorithm, we often use T = 1 (convergence still OK).
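A minimal Python sketch of steps 1–3 with a symmetric Gaussian random-walk candidate (the target, a N(3, 1) density known only up to a constant, is my own toy example):

```python
# Metropolis algorithm with symmetric candidate q(.|theta) = N(theta, tau^2).
import numpy as np

def log_p(theta):                     # log of the unnormalized target density
    return -0.5 * (theta - 3.0)**2    # toy target: N(3, 1) up to a constant

rng = np.random.default_rng(7)
T, tau = 10_000, 2.0
theta = np.empty(T + 1)
theta[0] = 0.0                        # starting value theta^(0)
accepts = 0
for t in range(1, T + 1):
    prop = rng.normal(theta[t-1], tau)          # step 1: draw theta*
    log_r = log_p(prop) - log_p(theta[t-1])     # step 2: ratio r, on the log scale
    if np.log(rng.uniform()) < log_r:           # step 3: accept w.p. min(1, r)
        theta[t], accepts = prop, accepts + 1
    else:
        theta[t] = theta[t-1]                   # reject: repeat the old value

print(theta[1000:].mean(), accepts / T)         # ~3.0, observed acceptance rate
```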

Metropolis algorithm (cont’d)

How to choose the candidate density? The usual approach (after θ has been transformed to have support ℜ^k, if necessary) is to set

q(θ∗|θ(t−1)) = N(θ∗|θ(t−1), Σ).

In one dimension, MCMC “folklore” suggests choosing Σ to provide an observed acceptance ratio near 50%.

Hastings (1970) showed we can drop the requirement that q be symmetric, provided we use

r = [p(θ∗) q(θ(t−1)|θ∗)] / [p(θ(t−1)) q(θ∗|θ(t−1))]

– useful when an asymmetric candidate density is natural, e.g., for a skewed target!
– this form is called the Metropolis-Hastings algorithm
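As an illustration of the Hastings correction, the sketch below uses a lognormal candidate (asymmetric, and natural for a positive parameter); log_p and the scale s are hypothetical placeholders. Note that the candidate density q now appears in the ratio.

```python
import numpy as np

def mh_step(theta, log_p, rng, s=0.5):
    """One Metropolis-Hastings update with an asymmetric lognormal
    candidate q(theta*|theta) = LogNormal(log theta, s^2), theta > 0."""
    prop = float(np.exp(rng.normal(np.log(theta), s)))
    # log q(x|y) for the lognormal candidate, up to additive constants
    log_q = lambda x, y: -np.log(x) - (np.log(x) - np.log(y)) ** 2 / (2 * s ** 2)
    log_r = (log_p(prop) + log_q(theta, prop)) - (log_p(theta) + log_q(prop, theta))
    return prop if np.log(rng.uniform()) < log_r else theta
```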

Convergence assessment

When is it safe to stop and summarize MCMC output?

We would like to ensure that ∫ |p_t(θ) − p(θ)| dθ < ε, but all we can hope to see is ∫ |p_t(θ) − p_{t+k}(θ)| dθ!

Controversy: Does the eventual mixing of “initially overdispersed” parallel sampling chains provide worthwhile information on convergence?

While one can never “prove” convergence of an MCMC algorithm using only a finite realization from the chain, poor mixing of parallel chains can help discover extreme forms of nonconvergence.

Still, it’s tricky: a slowly converging sampler may be indistinguishable from one that will never converge (e.g., due to nonidentifiability)!

Convergence diagnostics

Various summaries of MCMC output can serve as diagnostics, such as:

sample autocorrelations in one or more chains: values close to 0 indicate near-independence, so the chain should more quickly traverse the entire parameter space :) ; values close to 1 indicate the sampler is “stuck” :(

the Gelman/Rubin shrink factor,

√R̂ = √{ [ (N−1)/N + ((m+1)/(mN)) (B/W) ] · df/(df−2) } → 1 as N → ∞,

where B/N is the variance between the means from the m parallel chains, W is the average of the m within-chain variances, and df is the degrees of freedom of an approximating t density to the posterior.
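A minimal sketch of the shrink factor’s core computation, omitting the df/(df−2) t-density correction in the display above; the (m, N) array layout for the chains is an assumption.

```python
import numpy as np

def gelman_rubin(chains):
    """Basic sqrt(R-hat) from an (m, N) array of post-burn-in draws,
    one chain per row; omits the df/(df-2) correction shown above."""
    m, N = chains.shape
    B = N * chains.mean(axis=1).var(ddof=1)   # B/N = variance of the m chain means
    W = chains.var(axis=1, ddof=1).mean()     # average of the m within-chain variances
    V = (N - 1) / N * W + (m + 1) / (m * N) * B
    return np.sqrt(V / W)                     # -> 1 as the chains mix
```

Values near 1 suggest the m chains have mixed; values well above 1 indicate the between-chain spread still dominates.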

Convergence diagnosis strategy

Run a few (3 to 5) parallel chains, with starting points believed to be overdispersed

say, covering ±3 prior standard deviations from the prior mean

Overlay the resulting sample traces for a representative subset of the parameters

(say, most of the fixed effects, some of the variance components, and a few well-chosen random effects)

Annotate each plot with lag-1 sample autocorrelations and perhaps Gelman and Rubin diagnostics (see the sketch after this list)

Investigate bivariate plots and crosscorrelations among parameters suspected of being confounded, just as one might do regarding collinearity in linear regression.
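One possible trace-overlay helper (matplotlib; the (m, N) chains array and the parameter name are placeholders, not from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

def trace_overlay(chains, name):
    """Overlay the sample traces of m parallel chains for one parameter,
    annotating with each chain's lag-1 sample autocorrelation."""
    fig, ax = plt.subplots()
    for c in chains:
        ax.plot(c, lw=0.5)
    rhos = [np.corrcoef(c[:-1], c[1:])[0, 1] for c in chains]
    ax.set_title(name + ": lag-1 acf = " + ", ".join(f"{r:.2f}" for r in rhos))
    ax.set_xlabel("iteration t")
    return fig
```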

Variance estimation

How good is our MCMC estimate once we get it?

Suppose we have a single long chain of (post-convergence) MCMC samples {λ(t)}, t = 1, …, N. Let

Ê(λ|y) = λ̄_N = (1/N) ∑_{t=1}^{N} λ(t).

Then by the CLT, under iid sampling we could take

V̂ar_iid(λ̄_N) = s²_λ / N = [1/(N(N−1))] ∑_{t=1}^{N} (λ(t) − λ̄_N)².

But this is likely an underestimate due to positive autocorrelation in the MCMC samples.

Variance estimation (cont’d)

To avoid wasteful parallel sampling or “thinning,” compute the effective sample size,

ESS = N/κ(λ),

where κ(λ) = 1 + 2 ∑_{k=1}^{∞} ρ_k(λ) is the autocorrelation time, and we cut off the sum when ρ_k(λ) < ε.

Then

V̂ar_ESS(λ̄_N) = s²_λ / ESS(λ).

Note: κ(λ) ≥ 1, so ESS(λ) ≤ N, and so we have that V̂ar_ESS(λ̄_N) ≥ V̂ar_iid(λ̄_N), in concert with intuition.
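A direct transcription of this estimator, cutting the sum at the first lag where ρ_k < ε as above (the default ε = 0.05 is an arbitrary illustrative choice):

```python
import numpy as np

def ess(x, eps=0.05):
    """Effective sample size N / kappa, where
    kappa = 1 + 2 * sum_k rho_k, truncated once rho_k < eps."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    kappa = 1.0
    for k in range(1, N // 2):
        rho_k = np.corrcoef(x[:-k], x[k:])[0, 1]  # lag-k sample autocorrelation
        if rho_k < eps:
            break
        kappa += 2.0 * rho_k
    return N / kappa

# var_ess = x.var(ddof=1) / ess(x)   # s^2_lambda / ESS(lambda)
```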

Variance estimation (cont’d)

Another alternative is batching: divide the run into m successive batches of length k, with batch means b₁, …, b_m. Then λ̄_N = b̄ = (1/m) ∑_{i=1}^{m} b_i, and

V̂ar_batch(λ̄_N) = [1/(m(m−1))] ∑_{i=1}^{m} (b_i − λ̄_N)²,

provided that k is large enough that the correlation between batches is negligible.

For any V̂ used to approximate Var(λ̄_N), a 95% CI for E(λ|y) is then given by

λ̄_N ± z_{.025} √V̂.
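A batch-means sketch under the same assumptions (m = 30 is an arbitrary illustrative choice; leftover draws beyond m·k are dropped):

```python
import numpy as np

def batch_means_ci(draws, m=30, z=1.96):
    """Batch-means variance estimate and 95% CI for E(lambda|y);
    assumes the batch length len(draws)//m is large enough that
    successive batch means are nearly uncorrelated."""
    k = len(draws) // m
    b = np.asarray(draws[: m * k]).reshape(m, k).mean(axis=1)  # b_1, ..., b_m
    lam_bar = b.mean()
    V = b.var(ddof=1) / m        # = (1/(m(m-1))) * sum_i (b_i - lam_bar)^2
    half = z * np.sqrt(V)
    return lam_bar - half, lam_bar + half
```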

Recent Developments

Overrelaxation: try to speed convergence by inducing negative autocorrelation within the chains.

Neal (1998): Generate θ_{i,1}, …, θ_{i,K} independently from the full conditional p(θ_i | θ_{j≠i}, y). Ordering these along with the old value, we have

θ_{i,0} ≤ θ_{i,1} ≤ ⋯ ≤ θ_{i,r} ≡ θ_i^{(t−1)} ≤ ⋯ ≤ θ_{i,K},

so that r is the index of the old value. Then take

θ_i^{(t)} = θ_{i,K−r}.

Note that K = 1 produces Gibbs sampling, while large K produces progressively more overrelaxation.

Generation of the K random variables can be avoided if the full conditional cdf and inverse cdf are available.
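A sketch of one such ordered-overrelaxation update; the sample_full_cond callable is a placeholder for a draw from p(θ_i | θ_{j≠i}, y):

```python
import numpy as np

def overrelaxed_update(theta_old, sample_full_cond, K=11, rng=None):
    """Neal's ordered overrelaxation for one component: draw K values
    from the full conditional, sort them together with the old value,
    and reflect the old value's rank r to rank K - r. K = 1 gives Gibbs."""
    rng = rng if rng is not None else np.random.default_rng()
    vals = np.sort(np.append([sample_full_cond(rng) for _ in range(K)], theta_old))
    r = int(np.searchsorted(vals, theta_old))  # rank of old value (ties a.s. absent)
    return vals[K - r]                         # theta_{i, K-r}
```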

Blocking and Structured MCMC

A general approach to accelerating MCMC convergence via multivariate updating of blocks of correlated parameters.

For hierarchical linear models expressible as

[ y ]   [ X1  0  ]          [ ε ]
[ 0 ] = [ H1  H2 ] [ θ1 ] + [ δ ]
[ M ]   [ G1  G2 ] [ θ2 ]   [ ξ ]

or more compactly as

Y = Xθ + E,

where X and Y are known, θ is unknown, and E is an error term having block-diagonal covariance

Cov(E) = Γ = Diag(Cov(ε), Cov(δ), Cov(ξ)).

Blocking and Structured MCMC

Under Gaussian errors, p(θ | Y, X, Γ) is given by

N( (X′Γ⁻¹X)⁻¹ X′Γ⁻¹Y , (X′Γ⁻¹X)⁻¹ ).

With conjugate (inverse gamma or Wishart) priors for the variance components, p(Γ | Y, X, θ) is also available in closed form, producing a giant two-block Gibbs sampler.

We may want to use the above density only as a candidate distribution in a Hastings independence chain algorithm (to avoid frequent inversion of X′Γ⁻¹X).

For hierarchical nonlinear models, replace the data y with “low shrinkage” estimates ŷ = θ̂₁ (say, MLEs), and proceed with the Hastings implementation!
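A sketch of the blocked θ-update under Gaussian errors, drawing from the normal density above via a Cholesky factor of the full-conditional precision rather than forming (X′Γ⁻¹X)⁻¹ explicitly (array names are placeholders):

```python
import numpy as np

def draw_theta_block(X, Y, Gamma_inv, rng):
    """One multivariate draw theta ~ N((X'G X)^-1 X'G Y, (X'G X)^-1),
    writing G = Gamma^{-1}; solves against the Cholesky factor of the
    precision A = X'G X instead of inverting it."""
    A = X.T @ Gamma_inv @ X
    b = X.T @ Gamma_inv @ Y
    mean = np.linalg.solve(A, b)
    L = np.linalg.cholesky(A)                 # A = L L'
    z = rng.standard_normal(len(b))
    return mean + np.linalg.solve(L.T, z)     # L'^{-1} z ~ N(0, A^{-1})
```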

Auxiliary variables and slice sampling

Ease and/or accelerate sampling from p(θ) by adding an auxiliary (or latent) variable u ∼ p(u|θ).

Simple slice sampler: take u|θ ∼ Unif(0, p(θ)), so that p(θ, u) ∝ I{0 ≤ u ≤ p(θ)}. Gibbs sampling then reduces to two uniform updates:

u|θ ∼ Unif(0, p(θ)), and
θ|u ∼ Unif{θ : p(θ) ≥ u}

The second update (over the “slice” defined by u) requires p(θ) to be invertible, either analytically or numerically.

Product slice sampler: if p(θ) ∝ π(θ) ∏_{i=1}^{n} L_i(θ) with L_i(θ) > 0, introduce u = (u₁, …, u_n), where u_i|θ ∼ Unif(0, L_i(θ)) for all i.

Slice samplers typically have excellent convergence properties, and are increasingly used as alternatives to Metropolis steps!
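When p(θ) cannot be inverted, the slice {θ : p(θ) ≥ u} can instead be located numerically; the sketch below uses Neal’s (2003) stepping-out and shrinkage procedure rather than the inversion mentioned above (the width w and the example target are illustrative choices):

```python
import numpy as np

def slice_sample(p, theta0, T=5000, w=1.0, seed=0):
    """Simple univariate slice sampler for an unnormalized density p.
    Locates the slice {theta : p(theta) >= u} by stepping out,
    then draws theta | u by uniform shrinkage."""
    rng = np.random.default_rng(seed)
    theta, draws = float(theta0), np.empty(T)
    for t in range(T):
        u = rng.uniform(0.0, p(theta))   # u | theta ~ Unif(0, p(theta))
        lo = theta - w * rng.uniform()   # random initial interval around theta
        hi = lo + w
        while p(lo) > u:                 # step out until the ends leave the slice
            lo -= w
        while p(hi) > u:
            hi += w
        while True:                      # shrink until a point lands in the slice
            prop = rng.uniform(lo, hi)
            if p(prop) >= u:
                theta = prop
                break
            if prop < theta:
                lo = prop
            else:
                hi = prop
        draws[t] = theta
    return draws

# Illustrative target: unnormalized N(0, 1) density
# draws = slice_sample(lambda th: np.exp(-0.5 * th**2), theta0=0.0)
```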

Reversible Jump MCMC

Method for handling models of varying dimension, or Bayesian model choice more generally.

Suppose model M = j has parameter vector θ_j of dimension n_j, j = 1, …, K. The algorithm is:

1. Let the current state of the Markov chain be (j, θ_j), where θ_j is of dimension n_j.

2. Propose a new model j′ with probability h(j, j′).

3. Generate u from a proposal density q(u|θ_j, j, j′).

4. Set (θ′_{j′}, u′) = g_{j,j′}(θ_j, u), where g_{j,j′} is a deterministic function that is 1-1 and onto. This is a “dimension matching” function, specified so that

n_j + dim(u) = n_{j′} + dim(u′).

Reversible Jump MCMC

5. Accept the proposed move (from j to j′) with probability α_{j→j′}, which is the minimum of 1 and

[ f(y|θ′_{j′}, M = j′) p(θ′_{j′}|M = j′) π_{j′} h(j′, j) q(u′|θ′_{j′}, j′, j) ] / [ f(y|θ_j, M = j) p(θ_j|M = j) π_j h(j, j′) q(u|θ_j, j, j′) ] × |∂g(θ_j, u)/∂(θ_j, u)|.

At convergence, we obtain draws {M^(g)}, g = 1, …, G, from p(M|y), so

p̂(M = j|y) = (number of M^(g) equal to j)/(total number of M^(g)), j = 1, …, K,

as well as the Bayes factor between models j and j′,

B̂F_{jj′} = [ p̂(M = j|y)/p̂(M = j′|y) ] / [ p(M = j)/p(M = j′) ].

RJMCMC Example 1

Suppose we have K = 2 models, for which θ₁ ∈ ℜ and θ₂ ∈ ℜ², and θ₁ is a subvector of θ₂. Then when moving from j = 1 to j′ = 2, we may draw u ∼ q(u) and set

θ′₂ = (θ₁, u).

That is, the dimension-matching g is the identity function, and so the Jacobian in step 5 is equal to 1.

RJMCMC Example 2

Consider choosing between a time series model having a constant mean level θ₁, and a changepoint model having level θ₂,₁ before time t and θ₂,₂ afterward.

When moving from Model 2 to Model 1, set

θ′₁ = (θ₂,₁ + θ₂,₂)/2.

To ensure reversibility of this move, when going from Model 1 to Model 2 we might sample u ∼ q(u) and set

θ′₂,₁ = θ₁ − u and θ′₂,₂ = θ₁ + u,

since this is a 1-1 and onto function corresponding to the deterministic “down move” we selected above.

RJMCMC is tricky to implement (and not available in WinBUGS), but extremely general!
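As a sketch of the Model 1 → Model 2 “up move” in this example: since (θ′₂,₁, θ′₂,₂) = (θ₁ − u, θ₁ + u), the Jacobian in step 5 is |det [[1, −1], [1, 1]]| = 2. The pieces log_post1, log_post2, q_draw, q_logpdf, and the move probabilities h12, h21 below are hypothetical user-supplied placeholders.

```python
import numpy as np

def up_move(theta1, rng, q_draw, q_logpdf, log_post1, log_post2, h12=0.5, h21=0.5):
    """Propose a jump from the constant-level model (Model 1) to the
    changepoint model (Model 2) via theta2 = (theta1 - u, theta1 + u).
    log_post{1,2} return log[f(y|theta, M) p(theta|M) pi_M] for each model."""
    u = q_draw(rng)
    theta2 = np.array([theta1 - u, theta1 + u])  # dimension matching: 1 + 1 = 2 + 0
    log_r = (log_post2(theta2) + np.log(h21)     # numerator: new model, reverse-move prob
             - log_post1(theta1) - np.log(h12)   # denominator: current model, move prob
             - q_logpdf(u)                       # q(u | theta1, 1, 2); u' is empty here
             + np.log(2.0))                      # Jacobian of (theta1, u) -> theta2
    if np.log(rng.uniform()) < log_r:
        return 2, theta2                         # accept: jump to Model 2
    return 1, theta1                             # reject: stay in Model 1
```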