Bayesian computation
- prehistory (1763–1960): Conjugate priors
- 1960s: Numerical quadrature – Newton-Cotes methods, Gaussian quadrature, etc.
- 1970s: Expectation-Maximization ("EM") algorithm – iterative mode-finder
- 1980s: Asymptotic methods – Laplace's method, saddlepoint approximations
- 1980s: Noniterative Monte Carlo methods – direct posterior sampling and indirect methods (importance sampling, rejection, etc.)
- 1990s: Markov chain Monte Carlo (MCMC) – Gibbs sampler, Metropolis-Hastings algorithm

⇒ MCMC methods are broadly applicable, but require care in parametrization and convergence diagnosis!

Chapter 5: Bayesian Computation – p. 1/42
Asymptotic methods
- When n is large, f(x|θ) will be quite peaked relative to p(θ), and so p(θ|x) will be approximately normal.
- "Bayesian Central Limit Theorem": Suppose $X_1, \ldots, X_n \stackrel{iid}{\sim} f(x_i|\theta)$, and that the prior p(θ) and the likelihood f(x|θ) are positive and twice differentiable near $\hat\theta^p$, the posterior mode of θ. Then for large n,

  $p(\theta|\mathbf{x}) \,\dot\sim\, N\!\left(\hat\theta^p,\, [I^p(\mathbf{x})]^{-1}\right),$

  where $I^p(\mathbf{x})$ is the "generalized" observed Fisher information matrix for θ, i.e., minus the Hessian of the log posterior evaluated at the mode:

  $I^p_{ij}(\mathbf{x}) = -\left[\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log\big(f(\mathbf{x}|\theta)\,p(\theta)\big)\right]_{\theta=\hat\theta^p}.$

Chapter 5: Bayesian Computation – p. 2/42
Example 5.1: Hamburger patties again
- Recall X|θ ∼ Bin(n, θ) and θ ∼ Beta(α, β).
- Under a flat prior on θ, we have

  $\ell(\theta) = \log f(x|\theta)p(\theta) \propto x\log\theta + (n-x)\log(1-\theta).$

- Taking the derivative of ℓ(θ) and equating to zero, we obtain $\hat\theta^p = \hat\theta = x/n$, the familiar binomial proportion.
- The second derivative is

  $\frac{\partial^2 \ell(\theta)}{\partial\theta^2} = -\frac{x}{\theta^2} - \frac{n-x}{(1-\theta)^2},$

  so that

  $\left.\frac{\partial^2 \ell(\theta)}{\partial\theta^2}\right|_{\theta=\hat\theta} = -\frac{x}{\hat\theta^2} - \frac{n-x}{(1-\hat\theta)^2} = -\frac{n}{\hat\theta} - \frac{n}{1-\hat\theta}.$

Chapter 5: Bayesian Computation – p. 3/42
Example 5.1: Hamburger patties again
- Thus

  $[I^p(x)]^{-1} = \left(\frac{n}{\hat\theta} + \frac{n}{1-\hat\theta}\right)^{-1} = \left(\frac{n}{\hat\theta(1-\hat\theta)}\right)^{-1} = \frac{\hat\theta(1-\hat\theta)}{n},$

  which is the usual frequentist expression for $\widehat{Var}(\hat\theta)$. Thus the Bayesian CLT gives

  $p(\theta|x) \,\dot\sim\, N\!\left(\hat\theta,\; \frac{\hat\theta(1-\hat\theta)}{n}\right).$

- Notice that a frequentist might instead use MLE asymptotics to write

  $\hat\theta \,|\, \theta \,\dot\sim\, N\!\left(\theta,\; \frac{\hat\theta(1-\hat\theta)}{n}\right),$

  leading to identical inferences for θ, but for different reasons and with different interpretations!

Chapter 5: Bayesian Computation – p. 4/42
Example 5.1: Hamburger patties again
- Comparison of this normal approximation to the exact posterior, a Beta(14, 4) distribution (recall n = 16):

[Figure: posterior density of θ on (0, 1); exact (beta) vs. approximate (normal) curves]

- Similar modes, but very different tail behavior: 95% credible sets are (.57, .93) for the exact posterior, but (.62, 1.0) for the normal approximation.

Chapter 5: Bayesian Computation – p. 5/42
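The interval comparison above is easy to check numerically. The sketch below uses only the Python standard library; approximating the exact Beta(14, 4) quantiles by sorting Monte Carlo draws (rather than a quantile function) is my own device, not the text's, but both intervals should land near the values quoted on the slide.

```python
import random, math

random.seed(42)
n, x = 16, 13          # flat prior + 13 successes in 16 trials -> Beta(14, 4) posterior
theta_hat = x / n

# Normal (Bayesian CLT) 95% interval: theta_hat +/- 1.96 * sqrt(theta_hat(1-theta_hat)/n)
half = 1.96 * math.sqrt(theta_hat * (1 - theta_hat) / n)
normal_ci = (theta_hat - half, theta_hat + half)

# "Exact" interval via Monte Carlo draws from the Beta(14, 4) posterior
draws = sorted(random.betavariate(14, 4) for _ in range(100_000))
exact_ci = (draws[2_500], draws[97_500])

print("normal approx:", tuple(round(v, 2) for v in normal_ci))
print("exact (beta): ", tuple(round(v, 2) for v in exact_ci))
```

Note that the normal interval spills past 1, even though θ is a probability: exactly the poor tail behavior the figure shows.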
Higher order approximations
- The Bayesian CLT is a first order approximation, since

  $E(g(\theta)\,|\,\mathbf{x}) = g(\hat\theta)\left[1 + O(1/n)\right].$

- Second order approximations (i.e., to order O(1/n²)), again requiring only mode and Hessian calculations, are available via Laplace's Method (C&L, Sec. 5.2.2).
- Advantages of asymptotic methods:
  - deterministic, noniterative algorithm
  - substitutes differentiation for integration
  - facilitates studies of Bayesian robustness
- Disadvantages of asymptotic methods:
  - requires a well-parametrized, unimodal posterior
  - θ must be of at most moderate dimension
  - n must be large, but is beyond our control

Chapter 5: Bayesian Computation – p. 6/42
Noniterative Monte Carlo Methods
- Recall Monte Carlo integration: Suppose θ ∼ h(θ), and we seek γ ≡ E[f(θ)] = ∫ f(θ)h(θ)dθ. Then if $\theta_j \stackrel{iid}{\sim} h(\theta)$,

  $\hat\gamma = \frac{1}{N}\sum_{j=1}^{N} f(\theta_j) \;\xrightarrow{\,w.p.\ 1\,}\; E[f(\theta)] \text{ as } N\to\infty \quad (\text{SLLN}).$

- Since γ̂ is itself a sample mean of independent observations, Var(γ̂) = Var[f(θ)]/N. This means

  $\widehat{se}(\hat\gamma) = \sqrt{\frac{1}{N(N-1)}\sum_{j=1}^{N}\left[f(\theta_j)-\hat\gamma\right]^2}.$

- ⇒ γ̂ ± 2 ŝe(γ̂) provides a 95% (frequentist!) CI for γ.
- Note the dependence on N, rather than n!

Chapter 5: Bayesian Computation – p. 7/42
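The estimate-plus-standard-error recipe above can be sketched in a few lines. The choices h = N(0, 1) and f(θ) = θ² (so that γ = E[θ²] = 1) are toy assumptions of mine, not from the slides:

```python
import random, math

random.seed(1)
N = 100_000
# Assumed toy setup: h = N(0,1), f(theta) = theta^2, so the true gamma = 1
samples = [random.gauss(0.0, 1.0) for _ in range(N)]
f_vals = [th * th for th in samples]

# gamma_hat = (1/N) sum f(theta_j)
gamma_hat = sum(f_vals) / N
# se_hat = sqrt( (1/(N(N-1))) sum [f(theta_j) - gamma_hat]^2 )
se_hat = math.sqrt(sum((f - gamma_hat) ** 2 for f in f_vals) / (N * (N - 1)))

ci = (gamma_hat - 2 * se_hat, gamma_hat + 2 * se_hat)  # approximate 95% frequentist CI
print(gamma_hat, se_hat, ci)
```

The standard error shrinks like $1/\sqrt{N}$, and N (unlike n) is under the analyst's control.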
Example 5.2: Direct bivariate sampling
- If $Y_i \stackrel{iid}{\sim} N(\mu, \sigma^2)$, i = 1, …, n, and p(μ, σ) = 1/σ, then

  $\mu\,|\,\sigma^2,\mathbf{y} \sim N(\bar y,\ \sigma^2/n), \quad\text{and}$

  $\sigma^2\,|\,\mathbf{y} \sim K\chi^{-2}_{n-1} = K\cdot IG\!\left(\frac{n-1}{2},\ 2\right),$

  where $K = \sum_{i=1}^{n}(y_i - \bar y)^2$.
- So draw {(μ_j, σ²_j), j = 1, …, N} from p(μ, σ²|y) as: $\sigma^2_j \sim K\chi^{-2}_{n-1}$; then $\mu_j \sim N(\bar y, \sigma^2_j/n)$, j = 1, …, N.
- To estimate the posterior mean: $\hat E(\mu|\mathbf{y}) = \frac{1}{N}\sum_{j=1}^{N}\mu_j$.
- Transformations are also easy: to estimate the coefficient of variation, γ = σ/μ, define γ_j = σ_j/μ_j, j = 1, …, N; summarize with moments or histograms!

Chapter 5: Bayesian Computation – p. 8/42
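The two-stage draw above is a one-screen program. The data vector below is an invented illustration (not from the text); a $\chi^2_{n-1}$ draw is obtained as a Gamma((n−1)/2, scale 2) variate from the standard library:

```python
import random, math

random.seed(2)
y = [4.9, 5.2, 4.7, 5.5, 5.0, 5.3, 4.8, 5.1]  # assumed toy data
n = len(y)
ybar = sum(y) / n
K = sum((yi - ybar) ** 2 for yi in y)

N = 20_000
mu_draws, cv_draws = [], []
for _ in range(N):
    # sigma2 | y ~ K * inv-chi^2_{n-1}: draw chi^2_{n-1} as Gamma((n-1)/2, scale 2), invert, scale
    sigma2 = K / random.gammavariate((n - 1) / 2, 2)
    # mu | sigma2, y ~ N(ybar, sigma2/n)
    mu = random.gauss(ybar, math.sqrt(sigma2 / n))
    mu_draws.append(mu)
    cv_draws.append(math.sqrt(sigma2) / mu)  # transformation: coefficient of variation

print("E(mu|y) ~", sum(mu_draws) / N)
print("E(sigma/mu|y) ~", sum(cv_draws) / N)
```

The coefficient-of-variation summary illustrates the slide's last point: once joint draws exist, any transformation is just a per-draw computation.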
Indirect Methods
- Importance sampling: Suppose we wish to approximate

  $E(f(\theta)|\mathbf{y}) = \frac{\int f(\theta)L(\theta)p(\theta)\,d\theta}{\int L(\theta)p(\theta)\,d\theta},$

  and we can roughly approximate the normalized likelihood times prior, cL(θ)p(θ), by some density g(θ) from which we can easily sample – say, a multivariate t.
- Then defining the weight function w(θ) = L(θ)p(θ)/g(θ),

  $E(f(\theta)|\mathbf{y}) = \frac{\int f(\theta)w(\theta)g(\theta)\,d\theta}{\int w(\theta)g(\theta)\,d\theta} \approx \frac{\frac{1}{N}\sum_{j=1}^{N} f(\theta_j)w(\theta_j)}{\frac{1}{N}\sum_{j=1}^{N} w(\theta_j)},$

  where $\theta_j \stackrel{iid}{\sim} g(\theta)$. Here, g(θ) is called the importance function; a good match to cL(θ)p(θ) will produce roughly equal weights, hence a good approximation.

Chapter 5: Bayesian Computation – p. 9/42
Rejection sampling
- Here, instead of trying to approximate the posterior

  $h(\theta) = \frac{L(\theta)p(\theta)}{\int L(\theta)p(\theta)\,d\theta},$

  we try to "blanket" it: suppose there exists a constant M > 0 and a smooth density g(θ), called the envelope function, such that L(θ)p(θ) < Mg(θ) for all θ.
- The algorithm proceeds as follows:
  (i) Generate θ_j ∼ g(θ).
  (ii) Generate U ∼ Uniform(0, 1).
  (iii) If MUg(θ_j) < L(θ_j)p(θ_j), accept θ_j; else reject θ_j.
  (iv) Return to step (i) and repeat until the desired sample {θ_j, j = 1, …, N} is obtained. Its members will be random variates from h(θ).

Chapter 5: Bayesian Computation – p. 10/42
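Steps (i)–(iv) can be sketched directly. The target and envelope here are my own toy choices, not from the slides: L(θ)p(θ) = θ(1−θ) on (0, 1) (an unnormalized Beta(2, 2)), with envelope g = Uniform(0, 1) and M = 0.3, which works since θ(1−θ) ≤ 1/4 < 0.3·g(θ) everywhere:

```python
import random

random.seed(4)
M = 0.3  # envelope constant: M*g(theta) = 0.3 > 1/4 >= L(theta)p(theta) on (0,1)

def Lp(th):
    # assumed unnormalized target: Beta(2,2) kernel
    return th * (1.0 - th)

accepted = []
while len(accepted) < 20_000:
    th = random.random()           # (i)   theta_j ~ g = Uniform(0,1)
    u = random.random()            # (ii)  U ~ Uniform(0,1)
    if M * u * 1.0 < Lp(th):       # (iii) accept if M*U*g(theta_j) < L(theta_j)p(theta_j)
        accepted.append(th)        # (iv)  repeat until N draws accepted

mean = sum(accepted) / len(accepted)
print("mean of accepted draws ~", mean)  # Beta(2,2) has mean 1/2
```

A tighter M (closer to 1/4 here) would waste fewer proposals: the acceptance rate is exactly the area under L·p divided by M.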
Rejection Sampling: informal "proof"

[Figure: the envelope Mg(θ) drawn above the unnormalized target Lπ(θ), with a histogram bar centered at a point a]

- Consider the θ_j samples in the histogram bar centered at a: the rejection step "slices off" the top portion of the bar. Repeat for all a: the accepted θ_j mimic the lower curve!
- Need to choose M as small as possible (efficiency), and watch for "envelope violations"!

Chapter 5: Bayesian Computation – p. 11/42
Markov chain Monte Carlo methods
- Iterative MC methods, useful when it is difficult or impossible to find a feasible importance or envelope density.
- Given two unknowns x (think parameters) and y (think missing data), we can often write

  $p(x) = \int p(x|y)p(y)\,dy \quad\text{and}\quad p(y) = \int p(y|x)p(x)\,dx,$

  where p(x|y) and p(y|x) are known; we seek p(x).
- Analytical solution via substitution:

  $p(x) = \int p(x|y)\int p(y|x')p(x')\,dx'\,dy = \int h(x,x')p(x')\,dx',$

  where $h(x,x') = \int p(x|y)p(y|x')\,dy$.

Chapter 5: Bayesian Computation – p. 12/42
Substitution sampling
- This determines a fixed point system

  $p_{i+1}(x) = \int h(x,x')\,p_i(x')\,dx',$

  which converges under mild conditions.
- Tanner and Wong (1987): Can also use a sampling-based approach: Draw X⁽⁰⁾ ∼ p₀(x), then Y⁽¹⁾ ∼ p(y|x⁽⁰⁾), and finally X⁽¹⁾ ∼ p(x|y⁽¹⁾). Then X⁽¹⁾ has marginal distribution

  $p_1(x) = \int p(x|y)p_1(y)\,dy = \int h(x,x')p_0(x')\,dx'.$

- Repeating this process produces pairs (X⁽ⁱ⁾, Y⁽ⁱ⁾) such that $X^{(i)} \xrightarrow{d} X \sim p(x)$ and $Y^{(i)} \xrightarrow{d} Y \sim p(y)$.

Chapter 5: Bayesian Computation – p. 13/42
Substitution sampling
- That is, for i sufficiently large, X⁽ⁱ⁾ may be thought of as a sample from the true marginal density p(x). Tanner and Wong refer to this (bivariate) algorithm as data augmentation.
- The luxury of avoiding the integration above has come at the price of obtaining not the marginal density p(x) itself, but only a sample from this density.
- Can we extend this idea to more than 2 dimensions? ...

Chapter 5: Bayesian Computation – p. 14/42
Gibbs sampling
- Suppose the joint distribution of θ = (θ₁, …, θ_K) is uniquely determined by the full conditional distributions, {p_i(θ_i | θ_{j≠i}), i = 1, …, K}.
- Given an arbitrary set of starting values $\{\theta_1^{(0)}, \ldots, \theta_K^{(0)}\}$:

  Draw $\theta_1^{(1)} \sim p_1(\theta_1\,|\,\theta_2^{(0)}, \ldots, \theta_K^{(0)})$,
  Draw $\theta_2^{(1)} \sim p_2(\theta_2\,|\,\theta_1^{(1)}, \theta_3^{(0)}, \ldots, \theta_K^{(0)})$,
  ⋮
  Draw $\theta_K^{(1)} \sim p_K(\theta_K\,|\,\theta_1^{(1)}, \ldots, \theta_{K-1}^{(1)})$.

- Under mild conditions,

  $(\theta_1^{(t)}, \ldots, \theta_K^{(t)}) \xrightarrow{d} (\theta_1, \ldots, \theta_K) \sim p \quad\text{as } t\to\infty.$

Chapter 5: Bayesian Computation – p. 15/42
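The scan above can be sketched for K = 2 with a toy target I am assuming (not from the slides): a bivariate normal with unit variances and correlation ρ, whose full conditionals are the well-known θ₁|θ₂ ∼ N(ρθ₂, 1−ρ²) and symmetrically for θ₂:

```python
import random, math

random.seed(5)
rho = 0.8                       # assumed toy target: bivariate N(0, [[1, rho], [rho, 1]])
sd = math.sqrt(1 - rho * rho)   # conditional standard deviation

T, t0 = 5_000, 500              # total iterations and burn-in
th1, th2 = 10.0, -10.0          # deliberately poor starting values
draws = []
for t in range(T):
    th1 = random.gauss(rho * th2, sd)  # draw theta1 ~ p1(theta1 | theta2)
    th2 = random.gauss(rho * th1, sd)  # draw theta2 ~ p2(theta2 | theta1)
    if t >= t0:
        draws.append((th1, th2))

m1 = sum(d[0] for d in draws) / len(draws)
corr = sum(d[0] * d[1] for d in draws) / len(draws)  # margins have mean 0, variance 1
print("E(theta1) ~", m1, "  corr ~", corr)
```

Despite the bad starting point, the chain forgets it within a few sweeps; the post-burn-in draws recover both the marginal mean and the correlation.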
Gibbs sampling (cont'd)
- For t sufficiently large (say, bigger than t₀), $\{\theta^{(t)}\}_{t=t_0+1}^{T}$ is a (correlated) sample from the true posterior.
- We might therefore use a sample mean to estimate the posterior mean, i.e.,

  $\hat E(\theta_i|\mathbf{y}) = \frac{1}{T-t_0}\sum_{t=t_0+1}^{T} \theta_i^{(t)}.$

- The time from t = 0 to t = t₀ is commonly known as the burn-in period; one can safely adapt (change) an MCMC algorithm during this preconvergence period, since these samples will be discarded anyway.

Chapter 5: Bayesian Computation – p. 16/42
Gibbs sampling (cont'd)
- In practice, we may actually run m parallel Gibbs sampling chains, instead of only 1, for some modest m (say, m = 5). Discarding the burn-in period, we obtain

  $\hat E(\theta_i|\mathbf{y}) = \frac{1}{m(T-t_0)}\sum_{j=1}^{m}\sum_{t=t_0+1}^{T} \theta_{i,j}^{(t)},$

  where now the j subscript indicates chain number.
- A density estimate $\hat p(\theta_i|\mathbf{y})$ may be obtained by smoothing the histogram of the $\{\theta_{i,j}^{(t)}\}$, or as

  $\hat p(\theta_i|\mathbf{y}) = \frac{1}{m(T-t_0)}\sum_{j=1}^{m}\sum_{t=t_0+1}^{T} p(\theta_i\,|\,\theta_{k\neq i,j}^{(t)}, \mathbf{y}) \;\approx\; \int p(\theta_i|\theta_{k\neq i},\mathbf{y})\,p(\theta_{k\neq i}|\mathbf{y})\,d\theta_{k\neq i}.$

Chapter 5: Bayesian Computation – p. 17/42
Example 2.5 (revisited)
- Consider the model

  $Y_i\,|\,\theta_i \overset{ind}{\sim} \text{Poisson}(\theta_i s_i), \quad \theta_i \overset{ind}{\sim} G(\alpha, \beta), \quad \beta \sim IG(c, d), \quad i = 1,\ldots,k,$

  where α, c, d, and the s_i are known. Thus

  $f(y_i|\theta_i) = \frac{e^{-\theta_i s_i}(\theta_i s_i)^{y_i}}{y_i!}, \quad y_i \geq 0,\ \theta_i > 0,$

  $g(\theta_i|\beta) = \frac{\theta_i^{\alpha-1} e^{-\theta_i/\beta}}{\Gamma(\alpha)\beta^{\alpha}}, \quad \alpha > 0,\ \beta > 0,$

  $h(\beta) = \frac{e^{-1/(\beta d)}}{\Gamma(c)\,d^c\,\beta^{c+1}}, \quad c > 0,\ d > 0.$

- Note g is conjugate for f, and h is conjugate for g!

Chapter 5: Bayesian Computation – p. 18/42
Example 2.5 (revisited)
- To implement the Gibbs sampler, we require the full conditional distributions of β and the θ_i.
- By Bayes' Rule, each of these is proportional to the complete Bayesian model specification,

  $\left[\prod_{i=1}^{k} f(y_i|\theta_i)\,g(\theta_i|\beta)\right] h(\beta).$

- Thus we can find full conditional distributions by dropping irrelevant terms from this expression, and normalizing!
- Expressions given on the next slide...

Chapter 5: Bayesian Computation – p. 19/42
Example 2.5 (revisited)

$p(\theta_i\,|\,\theta_{j\neq i},\beta,\mathbf{y}) \propto f(y_i|\theta_i)\,g(\theta_i|\beta) \propto \theta_i^{y_i+\alpha-1}\,e^{-\theta_i(s_i+1/\beta)} \propto G\!\left(\theta_i \,\big|\, y_i+\alpha,\ (s_i+1/\beta)^{-1}\right),$ and

$p(\beta\,|\,\{\theta_i\},\mathbf{y}) \propto \left[\prod_{i=1}^{k} g(\theta_i|\beta)\right] h(\beta) \propto \left[\prod_{i=1}^{k} \frac{e^{-\theta_i/\beta}}{\beta^{\alpha}}\right]\frac{e^{-1/(\beta d)}}{\beta^{c+1}} \propto \frac{e^{-\frac{1}{\beta}\left(\sum_{i=1}^{k}\theta_i + \frac{1}{d}\right)}}{\beta^{k\alpha+c+1}} \propto IG\!\left(\beta \,\Big|\, k\alpha+c,\ \Big(\sum_{i=1}^{k}\theta_i + 1/d\Big)^{-1}\right).$

Chapter 5: Bayesian Computation – p. 20/42
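With both full conditionals in standard families, the Gibbs sampler is a short loop. The data vector, exposures s_i, and hyperparameter values below are illustrative assumptions of mine, not the text's; the inverse gamma draw for β is obtained by drawing a gamma variate and inverting it:

```python
import random

random.seed(6)
# Assumed toy data and hyperparameters: k units with known exposures s_i, known alpha, c, d
y = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]
s = [94.3, 15.7, 62.9, 126.0, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5]
k = len(y)
alpha, c, d = 1.0, 0.1, 1.0

T, t0 = 5_000, 1_000
beta = 1.0                       # starting value
theta_sum = [0.0] * k
beta_sum = 0.0
for t in range(T):
    # theta_i | beta, y ~ G(y_i + alpha, (s_i + 1/beta)^{-1})  (gammavariate takes shape, scale)
    theta = [random.gammavariate(y[i] + alpha, 1.0 / (s[i] + 1.0 / beta)) for i in range(k)]
    # beta | {theta_i}, y ~ IG(k*alpha + c, (sum theta_i + 1/d)^{-1}): draw a gamma, invert it
    beta = 1.0 / random.gammavariate(k * alpha + c, 1.0 / (sum(theta) + 1.0 / d))
    if t >= t0:
        beta_sum += beta
        for i in range(k):
            theta_sum[i] += theta[i]

kept = T - t0
print("E(beta|y) ~", beta_sum / kept)
print("E(theta_1|y) ~", theta_sum[0] / kept)
```

Every update is a direct draw from a named distribution, which is exactly the conjugacy payoff the slides describe.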
Example 2.5 (revisited)
- Thus the $\{\theta_i^{(t)}\}$ and $\beta^{(t)}$ may be sampled directly!
- If α were also unknown, things are harder, since

  $p(\alpha\,|\,\{\theta_i\},\beta,\mathbf{y}) \propto \left[\prod_{i=1}^{k} g(\theta_i|\alpha,\beta)\right] h(\alpha)$

  is not proportional to any standard family. So resort to:
  - adaptive rejection sampling (ARS): provided p(α|{θ_i}, β, y) is log-concave, or
  - Metropolis-Hastings sampling – IOU for now!
- Note: This is the order the WinBUGS software uses when deriving full conditionals!
- This is the standard "hybrid approach": use Gibbs overall, with "substeps" for awkward full conditionals.

Chapter 5: Bayesian Computation – p. 21/42
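To preview what such a substep looks like, here is a hedged sketch of a random-walk Metropolis update for α that could sit inside the Gibbs loop. Everything beyond the target expression is an assumption of mine: a G(a0, scale b0) hyperprior h(α), a Gaussian random walk on log α with step size tau (which requires the Jacobian correction log(prop/α) in the acceptance ratio), and toy values for the current state:

```python
import random, math

random.seed(7)
a0, b0, tau = 1.0, 1.0, 0.5  # assumed hyperprior G(a0, b0) and proposal step size

def log_target(alpha, thetas, beta):
    # log of [prod_i g(theta_i | alpha, beta)] h(alpha), up to an additive constant
    k = len(thetas)
    lp = (alpha - 1.0) * sum(math.log(th) for th in thetas)
    lp -= k * (math.lgamma(alpha) + alpha * math.log(beta))
    lp += (a0 - 1.0) * math.log(alpha) - alpha / b0   # log h(alpha)
    return lp

def metropolis_alpha(alpha, thetas, beta):
    prop = alpha * math.exp(random.gauss(0.0, tau))   # random walk on log(alpha)
    log_ratio = (log_target(prop, thetas, beta) - log_target(alpha, thetas, beta)
                 + math.log(prop) - math.log(alpha))  # Jacobian for the log-scale walk
    return prop if math.log(random.random()) < log_ratio else alpha

# one substep repeated at an assumed fixed state, just to exercise the update:
thetas, beta, alpha = [0.8, 1.3, 0.6, 1.1], 1.0, 1.0
for _ in range(1000):
    alpha = metropolis_alpha(alpha, thetas, beta)
print("current alpha:", alpha)
```

In a real hybrid sampler, `metropolis_alpha` would be called once per Gibbs scan, between the direct θ and β draws.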
Example 5.6: Rat data
- Consider the longitudinal data model

  $Y_{ij} \overset{ind}{\sim} N\!\left(\alpha_i + \beta_i x_{ij},\ \sigma^2\right),$

  where Y_ij is the weight of the ith rat at measurement point j, while x_ij denotes its age in days, for i = 1, …, k = 30, and j = 1, …, n_i = 5 for all i (see text p.147 for actual data).
- Adopt the random effects model

  $\theta_i \equiv \begin{pmatrix}\alpha_i\\ \beta_i\end{pmatrix} \overset{iid}{\sim} N\!\left(\theta_0 \equiv \begin{pmatrix}\alpha_0\\ \beta_0\end{pmatrix},\ \Sigma\right), \quad i = 1,\ldots,k,$

  which is conjugate with the likelihood (see general normal linear model in Example 2.7, text p.34).

Chapter 5: Bayesian Computation – p. 22/42
Example 5.6: Rat data
- Priors: Conjugate forms are again available, namely

  $\sigma^2 \sim IG(a, b), \quad \theta_0 \sim N(\eta, C), \quad\text{and}\quad \Sigma^{-1} \sim W\!\left((\rho R)^{-1},\ \rho\right),$

  where W denotes the Wishart (multivariate gamma) distribution; see Appendix A.2.2, text p.328.
- We assume the hyperparameters (a, b, η, C, ρ, and R) are all known, so there are 30(2) + 3 + 3 = 66 unknown parameters in the model.
- Yet the Gibbs sampler is relatively straightforward to implement here, thanks to the conjugacy at each stage in the hierarchy.

Chapter 5: Bayesian Computation – p. 23/42
Example 5.6: Rat data
- Full conditional derivation: Consider for example θ_i. Since we are allowed to think of the remaining parameters as fixed, by Example 2.7 (Lindley and Smith, 1972) we have

  $\theta_i\,|\,\mathbf{y},\theta_0,\Sigma^{-1},\sigma^2 \sim N(D_i d_i,\ D_i), \quad\text{where}$

  $D_i^{-1} = \sigma^{-2}X_i^T X_i + \Sigma^{-1} \quad\text{and}\quad d_i = \sigma^{-2}X_i^T \mathbf{y}_i + \Sigma^{-1}\theta_0,$

  for $\mathbf{y}_i = (y_{i1}, \ldots, y_{in_i})^T$ and $X_i = \begin{pmatrix}1 & x_{i1}\\ \vdots & \vdots\\ 1 & x_{in_i}\end{pmatrix}$.

- Similarly, the full conditionals for θ_0, Σ⁻¹, and σ² emerge in closed form as normal, Wishart, and inverse gamma, respectively!

Chapter 5: Bayesian Computation – p. 24/42
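A single θ_i draw from this full conditional is just small-matrix algebra, which for the 2×2 case can be done by hand. All numerical values below (one rat's ages and weights, σ², Σ⁻¹, θ₀) are invented placeholders, not the text's data; the bivariate normal draw uses a manual Cholesky factor of D_i:

```python
import random, math

random.seed(8)
# Assumed toy values for one rat i (not the text's data):
x = [8.0, 15.0, 22.0, 29.0, 36.0]         # ages in days
y = [151.0, 199.0, 246.0, 283.0, 320.0]   # weights
sigma2 = 36.0                             # current sigma^2
Sinv = [[0.01, 0.0], [0.0, 0.5]]          # current prior precision Sigma^{-1}
theta0 = [100.0, 6.0]                     # current prior mean (alpha0, beta0)

# D_i^{-1} = sigma^{-2} X'X + Sigma^{-1}, with rows of X equal to (1, x_j)
m = len(x)
XtX = [[m, sum(x)], [sum(x), sum(v * v for v in x)]]
Dinv = [[XtX[r][c] / sigma2 + Sinv[r][c] for c in range(2)] for r in range(2)]

# d_i = sigma^{-2} X'y + Sigma^{-1} theta0
Xty = [sum(y), sum(xv * yv for xv, yv in zip(x, y))]
d = [Xty[r] / sigma2 + sum(Sinv[r][c] * theta0[c] for c in range(2)) for r in range(2)]

# invert the 2x2 precision, form the mean D_i d_i, then sample via Cholesky of D_i
det = Dinv[0][0] * Dinv[1][1] - Dinv[0][1] * Dinv[1][0]
D = [[Dinv[1][1] / det, -Dinv[0][1] / det], [-Dinv[1][0] / det, Dinv[0][0] / det]]
mean = [D[0][0] * d[0] + D[0][1] * d[1], D[1][0] * d[0] + D[1][1] * d[1]]
l11 = math.sqrt(D[0][0]); l21 = D[0][1] / l11; l22 = math.sqrt(D[1][1] - l21 * l21)
z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
theta_i = (mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2)
print("draw of (alpha_i, beta_i):", theta_i)
```

The posterior mean D_i d_i is a precision-weighted compromise between the rat's own least-squares line and the population line θ₀, which is the shrinkage behavior the hierarchy is designed to produce.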
Example 5.6: Rat data
- Using vague hyperpriors, run 3 initially overdispersed parallel sampling chains for 500 iterations each:

[Figure: trace plots of mu and beta0 (500 values per trace, 3 chains) alongside kernel density estimates for mu and beta0 (1200 values each)]

- The output from all three chains over iterations 101–500 is used in the posterior kernel density estimates (column 2).

Chapter 5: Bayesian Computation – p. 25/42
Example 5.6: Rat data
Using vague hyperpriors, run 3 initially overdispersed parallel sampling chains for 500 iterations each:

[Figure: trace plots of mu and beta0 over iterations 0–500 (500 values per trace), each paired with a posterior kernel density estimate (1200 values)]

The output from all three chains over iterations 101–500 is used in the posterior kernel density estimates (col 2)
The average rat weighs about 106 grams at birth, andgains about 6.2 grams per day.
Chapter 5: Bayesian Computation – p. 25/42
Metropolis algorithm
What happens if the full conditional p(θ_i | θ_{j≠i}, y) is not available in closed form? Typically, p(θ_i | θ_{j≠i}, y) will be available up to a proportionality constant, since it is proportional to the portion of the Bayesian model (likelihood times prior) that involves θ_i.

Suppose the true joint posterior for θ has unnormalized density p(θ).

Choose a candidate density q(θ* | θ^(t−1)) that is a valid density function for every possible value of the conditioning variable θ^(t−1), and satisfies

    q(θ* | θ^(t−1)) = q(θ^(t−1) | θ*) ,

i.e., q is symmetric in its arguments.
Chapter 5: Bayesian Computation – p. 26/42
Metropolis algorithm (cont'd)
Given a starting value θ^(0) at iteration t = 0, the algorithm proceeds as follows:

Metropolis Algorithm: For (t ∈ 1 : T), repeat:

1. Draw θ* from q(· | θ^(t−1))

2. Compute the ratio
   r = p(θ*)/p(θ^(t−1)) = exp[ log p(θ*) − log p(θ^(t−1)) ]

3. If r ≥ 1, set θ^(t) = θ*;
   if r < 1, set θ^(t) = θ* with probability r, and θ^(t) = θ^(t−1) with probability 1 − r.

Then a draw θ^(t) converges in distribution to a draw from the true posterior density p(θ|y).

Note: When used as a substep in a larger (e.g., Gibbs) algorithm, we often use T = 1 (convergence still OK).
Chapter 5: Bayesian Computation – p. 27/42
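The three steps above can be sketched in a few lines of Python (this sketch is not part of the original slides; the N(3, 2²) target and the tuning value σ = 2.4 are purely illustrative choices):

```python
import numpy as np

def metropolis(log_p, theta0, sigma, T=20000, seed=0):
    """Random-walk Metropolis for a univariate target with unnormalized
    log-density log_p, using the symmetric candidate
    q(theta* | theta^(t-1)) = N(theta^(t-1), sigma^2)."""
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_p(theta0)
    draws = np.empty(T)
    accepted = 0
    for t in range(T):
        theta_star = theta + sigma * rng.standard_normal()  # step 1: draw candidate
        lp_star = log_p(theta_star)
        # steps 2-3: r = p(theta*)/p(theta^(t-1)) on the log scale;
        # accept with probability min(1, r), i.e., when log U < log r
        if np.log(rng.uniform()) < lp_star - lp:
            theta, lp = theta_star, lp_star
            accepted += 1
        draws[t] = theta
    return draws, accepted / T

# Toy target: unnormalized N(3, 2^2) log-density
draws, acc = metropolis(lambda th: -0.5 * ((th - 3.0) / 2.0) ** 2,
                        theta0=0.0, sigma=2.4)
```

Working on the log scale avoids overflow and underflow when the unnormalized density is very small, which is why step 2 on the slide gives the exp[log p(θ*) − log p(θ^(t−1))] form.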
Metropolis algorithm (cont'd)
How to choose the candidate density? The usual approach (after θ has been transformed to have support ℜ^k, if necessary) is to set

    q(θ* | θ^(t−1)) = N(θ* | θ^(t−1), Σ) .

In one dimension, MCMC "folklore" suggests choosing Σ to provide an observed acceptance ratio near 50%.

Hastings (1970) showed we can drop the requirement that q be symmetric, provided we use

    r = [ p(θ*) q(θ^(t−1) | θ*) ] / [ p(θ^(t−1)) q(θ* | θ^(t−1)) ]

– useful for asymmetric target densities!
– this form is called the Metropolis-Hastings algorithm
Chapter 5: Bayesian Computation – p. 28/42
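As a small illustration of the Hastings correction (not from the slides), consider a log-normal random-walk proposal for a positive parameter; the Gamma(3, 1) target below is an arbitrary example, and σ = 0.5 is an illustrative tuning value:

```python
import numpy as np

def mh_lognormal(log_p, theta0, sigma=0.5, T=20000, seed=1):
    """Metropolis-Hastings with an asymmetric log-normal random walk on a
    positive parameter: theta* = theta * exp(sigma * Z), Z ~ N(0, 1).

    Here q(theta*|theta) != q(theta|theta*); the Hastings ratio
    q(theta^(t-1)|theta*) / q(theta*|theta^(t-1)) works out to
    theta*/theta^(t-1), included in log_r below."""
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_p(theta0)
    draws = np.empty(T)
    for t in range(T):
        theta_star = theta * np.exp(sigma * rng.standard_normal())
        lp_star = log_p(theta_star)
        # Hastings correction: + log(theta*) - log(theta)
        log_r = lp_star - lp + np.log(theta_star) - np.log(theta)
        if np.log(rng.uniform()) < log_r:
            theta, lp = theta_star, lp_star
        draws[t] = theta
    return draws

# Toy target: unnormalized Gamma(3, 1), log p(theta) = 2 log(theta) - theta
draws = mh_lognormal(lambda th: 2.0 * np.log(th) - th, theta0=1.0)
```

Omitting the q-ratio here would bias the chain; the multiplicative proposal favors moves toward 0, and the θ*/θ^(t−1) factor exactly cancels that asymmetry.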
Convergence assessment
When is it safe to stop and summarize MCMC output?

We would like to ensure that ∫ |p_t(θ) − p(θ)| dθ < ǫ, but all we can hope to see is ∫ |p_t(θ) − p_{t+k}(θ)| dθ!

Controversy: Does the eventual mixing of "initially overdispersed" parallel sampling chains provide worthwhile information on convergence?

While one can never "prove" convergence of an MCMC algorithm using only a finite realization from the chain, poor mixing of parallel chains can help discover extreme forms of nonconvergence.

Still, it's tricky: a slowly converging sampler may be indistinguishable from one that will never converge (e.g., due to nonidentifiability)!
Chapter 5: Bayesian Computation – p. 29/42
Convergence diagnostics
Various summaries of MCMC output, such as

sample autocorrelations in one or more chains:
close to 0 indicates near-independence, and so the chain should more quickly traverse the entire parameter space :)
close to 1 indicates the sampler is "stuck" :(

Gelman/Rubin shrink factor,

    √R = sqrt( [ (N−1)/N + ((m+1)/(mN)) · (B/W) ] · df/(df−2) )  →  1  as N → ∞ ,

where B/N is the variance between the means from the m parallel chains, W is the average of the m within-chain variances, and df is the degrees of freedom of an approximating t density to the posterior.
Chapter 5: Bayesian Computation – p. 30/42
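A minimal sketch of the shrink factor computation (illustrative, and it omits the df/(df − 2) correction for the approximating t density; the simulated "chains" are stand-ins for real MCMC output):

```python
import numpy as np

def gelman_rubin(chains):
    """Shrink factor sqrt(R) from an (m, N) array of parallel chains,
    omitting the df/(df-2) t-density correction.

    B/N is the variance between the m chain means; W is the average
    of the m within-chain variances."""
    m, N = chains.shape
    B = N * chains.mean(axis=1).var(ddof=1)   # between-chain component
    W = chains.var(axis=1, ddof=1).mean()     # within-chain component
    R = (N - 1) / N + (m + 1) / (m * N) * B / W
    return np.sqrt(R)

rng = np.random.default_rng(0)
# Three well-mixed chains drawn from the same N(0,1) target
good = rng.standard_normal((3, 500))
# Three "stuck" chains centered at different modes
bad = rng.standard_normal((3, 500)) + np.array([[-5.0], [0.0], [5.0]])
```

Values of √R near 1 are consistent with the chains having mixed; values well above 1 indicate the between-chain variability still dominates.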
Convergence diagnosis strategy
Run a few (3 to 5) parallel chains, with starting points believed to be overdispersed

    say, covering ±3 prior standard deviations from the prior mean

Overlay the resulting sample traces for a representative subset of the parameters

    say, most of the fixed effects, some of the variance components, and a few well-chosen random effects

Annotate each plot with lag 1 sample autocorrelations and perhaps Gelman and Rubin diagnostics

Investigate bivariate plots and crosscorrelations among parameters suspected of being confounded, just as one might do regarding collinearity in linear regression.
Chapter 5: Bayesian Computation – p. 31/42
Variance estimation
How good is our MCMC estimate once we get it?

Suppose we have a single long chain of (post-convergence) MCMC samples {λ^(t)}, t = 1, ..., N. Let

    Ê(λ|y) = λ̄_N = (1/N) Σ_{t=1}^N λ^(t) .

Then by the CLT, under iid sampling we could take

    V̂ar_iid(λ̄_N) = s²_λ / N = [ 1 / (N(N−1)) ] Σ_{t=1}^N (λ^(t) − λ̄_N)² .

But this is likely an underestimate due to positive autocorrelation in the MCMC samples.
Chapter 5: Bayesian Computation – p. 32/42
Variance estimation (cont'd)
To avoid wasteful parallel sampling or "thinning," compute the effective sample size,

    ESS = N / κ(λ) ,

where κ(λ) = 1 + 2 Σ_{k=1}^∞ ρ_k(λ) is the autocorrelation time, and we cut off the sum when ρ_k(λ) < ǫ.

Then
    V̂ar_ESS(λ̄_N) = s²_λ / ESS(λ) .

Note: κ(λ) ≥ 1, so ESS(λ) ≤ N, and so we have that V̂ar_ESS(λ̄_N) ≥ V̂ar_iid(λ̄_N), in concert with intuition.
Chapter 5: Bayesian Computation – p. 33/42
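The ESS formula can be sketched as follows; the cutoff ǫ = 0.05 and the AR(1) test chain (which mimics a slowly mixing sampler) are illustrative choices, not from the text:

```python
import numpy as np

def effective_sample_size(x, eps=0.05):
    """ESS = N / kappa(lambda), kappa = 1 + 2 * sum_k rho_k(lambda),
    truncating the sum at the first lag where the sample
    autocorrelation drops below eps."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    kappa = 1.0
    for k in range(1, N):
        rho_k = np.dot(xc[:-k], xc[k:]) / denom   # lag-k sample autocorrelation
        if rho_k < eps:
            break
        kappa += 2.0 * rho_k
    return N / kappa

rng = np.random.default_rng(0)
iid = rng.standard_normal(5000)       # "perfect" sampler: ESS near N
ar = np.empty(5000)                   # AR(1) with lag-1 correlation 0.9
ar[0] = 0.0
for t in range(1, 5000):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
```

For the AR(1) chain the theoretical autocorrelation time is (1 + 0.9)/(1 − 0.9) = 19, so ESS should be roughly N/19 despite the chain containing N draws.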
Variance estimation (cont'd)
Another alternative: Batching: Divide the run into m successive batches of length k with batch means b_1, ..., b_m. Then λ̄_N = b̄ = (1/m) Σ_{i=1}^m b_i, and

    V̂ar_batch(λ̄_N) = [ 1 / (m(m−1)) ] Σ_{i=1}^m (b_i − λ̄_N)² ,

provided that k is large enough so that the correlation between batches is negligible.

For any V̂ used to approximate Var(λ̄_N), a 95% CI for E(λ|y) is then given by

    λ̄_N ± z_{.025} √V̂ .
Chapter 5: Bayesian Computation – p. 34/42
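A sketch of the batch-means estimate (the AR(1) "chain" and m = 40 batches are illustrative stand-ins for real MCMC output):

```python
import numpy as np

def batch_means_var(lam, m=40):
    """Batch-means estimate of Var(lambda_bar_N): split the chain into m
    successive batches of length k, then apply the usual sample-variance
    formula to the batch means."""
    k = len(lam) // m                              # batch length
    b = lam[: m * k].reshape(m, k).mean(axis=1)    # batch means b_1..b_m
    lam_bar = b.mean()
    V = np.sum((b - lam_bar) ** 2) / (m * (m - 1))
    return lam_bar, V

rng = np.random.default_rng(0)
# Positively autocorrelated AR(1) "chain" with stationary mean 0
lam = np.empty(20000)
lam[0] = 0.0
for t in range(1, len(lam)):
    lam[t] = 0.8 * lam[t - 1] + rng.standard_normal()

lam_bar, V = batch_means_var(lam)
ci = (lam_bar - 1.96 * np.sqrt(V), lam_bar + 1.96 * np.sqrt(V))  # 95% CI
```

Because the chain is positively autocorrelated, V should come out noticeably larger than the naive iid estimate s²_λ/N, in line with the underestimation warning on the previous slide.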
Recent Developments
Overrelaxation: Try to speed convergence by inducing negative autocorrelation within the chains

Neal (1998): Generate {θ_{i,k}}, k = 1, ..., K, independently from the full conditional p(θ_i | θ_{j≠i}, y). Ordering these along with the old value, we have

    θ_{i,0} ≤ θ_{i,1} ≤ ··· ≤ θ_{i,r} ≡ θ_i^(t−1) ≤ ··· ≤ θ_{i,K} ,

so that r is the index of the old value. Then take

    θ_i^(t) = θ_{i,K−r} .

Note that K = 1 produces Gibbs sampling, while large K produces progressively more overrelaxation.

Generation of the K random variables can be avoided if the full conditional cdf and inverse cdf are available.
Chapter 5: Bayesian Computation – p. 35/42
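Neal's ordered-overrelaxation update can be sketched as below; the N(0, 1) full conditional and K = 20 are illustrative stand-ins (in a real Gibbs sampler the conditional would depend on the other parameters):

```python
import numpy as np

def overrelaxed_update(theta_old, draw_cond, K, rng):
    """One ordered-overrelaxation step (Neal 1998): draw K values from the
    full conditional, sort them together with the old value, find the old
    value's index r among the K+1 ordered values, and return the value at
    the 'mirror' index K - r."""
    vals = np.sort(np.append(draw_cond(K, rng), theta_old))
    r = int(np.searchsorted(vals, theta_old))   # index of the old value
    return vals[K - r]

rng = np.random.default_rng(0)
draw_cond = lambda K, rng: rng.standard_normal(K)   # stand-in full conditional

chain = np.empty(5000)
chain[0] = 0.0
for t in range(1, len(chain)):
    chain[t] = overrelaxed_update(chain[t - 1], draw_cond, K=20, rng=rng)
```

Because the new value sits at the opposite rank from the old one, successive draws are strongly negatively correlated while the N(0, 1) marginal is preserved; this is the negative within-chain autocorrelation the slide describes.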
Blocking and Structured MCMC
A general approach to accelerating MCMC convergence via multivariate updating of blocks of correlated parameters.

For hierarchical linear models expressible as

    [ y ]   [ X_1   0  ] [ θ_1 ]   [ ǫ ]
    [ 0 ] = [ H_1  H_2 ] [ θ_2 ] + [ δ ]
    [ M ]   [ G_1  G_2 ]           [ ξ ]

or more compactly as

    Y = Xθ + E ,

where X and Y are known, θ is unknown, and E is an error term having Cov(E) = Γ, where Γ is block diagonal:

    Cov(E) = Γ = Diag( Cov(ǫ), Cov(δ), Cov(ξ) ) .
Chapter 5: Bayesian Computation – p. 36/42
Blocking and Structured MCMC
Under Gaussian errors, p(θ | Y, X, Γ) is given by

    N( (X′Γ⁻¹X)⁻¹ X′Γ⁻¹Y , (X′Γ⁻¹X)⁻¹ ) .

With conjugate (inverse gamma or Wishart) priors for the variance components, p(Γ | Y, X, θ) is also available, producing a giant two-block Gibbs sampler.

May want to use the above density only as a candidate distribution in a Hastings independence chain algorithm (to avoid frequent inversion of X′Γ⁻¹X).

For hierarchical nonlinear models, replace the data y with "low shrinkage" estimates ŷ = θ̂_1 (say, MLEs), and proceed with the Hastings implementation!
Chapter 5: Bayesian Computation – p. 37/42
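The block draw of θ from its Gaussian full conditional can be sketched as follows (the design matrix, data, and Γ below are made up for illustration, not the rat example; with Γ = I the posterior mean reduces to ordinary least squares):

```python
import numpy as np

def draw_theta_block(X, Y, Gamma, rng):
    """One multivariate block draw of theta from
    N( (X' G^-1 X)^-1 X' G^-1 Y , (X' G^-1 X)^-1 ),
    the Gaussian full conditional above (Gamma written as G here)."""
    Gi = np.linalg.inv(Gamma)
    cov = np.linalg.inv(X.T @ Gi @ X)        # (X' Gamma^-1 X)^-1
    mean = cov @ (X.T @ Gi @ Y)              # GLS-type posterior mean
    L = np.linalg.cholesky(cov)              # sample via Cholesky factor
    return mean + L @ rng.standard_normal(len(mean)), mean, cov

# Tiny made-up system
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [0.0, 1.0]])
Y = np.array([1.0, 2.0, 3.0, 1.0])
Gamma = np.eye(4)
rng = np.random.default_rng(0)
theta_draw, mean, cov = draw_theta_block(X, Y, Gamma, rng)
```

In a full sampler this draw would alternate with a draw of Γ from its (inverse gamma/Wishart) full conditional, giving the two-block Gibbs sampler described above.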
Auxiliary variables and slice sampling
Ease and/or accelerate sampling from p(θ) by adding an auxiliary (or latent) variable u ∼ p(u|θ).

Simple slice sampler: Take u|θ ∼ Unif(0, p(θ)), so that p(θ, u) ∝ I_{0 ≤ u ≤ p(θ)}(θ, u). So Gibbs = 2 uniform updates:

    u|θ ∼ Unif(0, p(θ)), and
    θ|u ∼ Unif{θ : p(θ) ≥ u}

The 2nd update (over the "slice" defined by u) requires p(θ) to be invertible, either analytically or numerically.

Product slice sampler: if p(θ) ∝ π(θ) Π_{i=1}^n L_i(θ) with L_i(θ) > 0, then introduce u = (u_1, ..., u_n), where u_i|θ ∼ Unif(0, L_i(θ)) for all i.

Slice samplers typically have excellent convergence properties, and are increasingly used as alternatives to Metropolis steps!
Chapter 5: Bayesian Computation – p. 38/42
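A minimal simple slice sampler: one common numerical way to carry out the second (slice) update without inverting p(θ) analytically is Neal's "stepping out and shrinking" procedure, sketched here with an illustrative N(1, 0.5²) target (the step width w = 1 is also an arbitrary choice):

```python
import numpy as np

def slice_sampler(p, x0, T=20000, w=1.0, seed=0):
    """Simple slice sampler for a univariate unnormalized density p,
    alternating u | theta ~ Unif(0, p(theta)) with theta | u uniform
    on the slice {theta : p(theta) >= u}."""
    rng = np.random.default_rng(seed)
    x = x0
    out = np.empty(T)
    for t in range(T):
        u = rng.uniform(0, p(x))          # vertical level defining the slice
        # step out an interval [l, r] that contains the slice
        l = x - w * rng.uniform()
        r = l + w
        while p(l) > u:
            l -= w
        while p(r) > u:
            r += w
        # draw uniformly, shrinking the interval until inside the slice
        while True:
            x_new = rng.uniform(l, r)
            if p(x_new) >= u:
                x = x_new
                break
            if x_new < x:
                l = x_new
            else:
                r = x_new
        out[t] = x
    return out

# Toy target: unnormalized N(1, 0.5^2) density
draws = slice_sampler(lambda th: np.exp(-0.5 * ((th - 1.0) / 0.5) ** 2), x0=0.0)
```

Note the sampler needs only pointwise evaluations of the unnormalized p(θ), with no tuning of an acceptance rate, which is part of why slice steps are attractive substitutes for Metropolis steps.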
Reversible Jump MCMC
Method for handling models of varying dimension, or Bayesian model choice more generally.

Suppose model M = j has parameter vector θ_j of dimension n_j, j = 1, ..., K. The algorithm is:

1. Let the current state of the Markov chain be (j, θ_j), where θ_j is of dimension n_j.

2. Propose a new model j′ with probability h(j, j′).

3. Generate u from a proposal density q(u | θ_j, j, j′).

4. Set (θ′_{j′}, u′) = g_{j,j′}(θ_j, u), where g_{j,j′} is a deterministic function that is 1-1 and onto. This is a "dimension matching" function, specified so that

    n_j + dim(u) = n_{j′} + dim(u′) .
Chapter 5: Bayesian Computation – p. 39/42
Reversible Jump MCMC
5. Accept the proposed move (from j to j′) with probability α_{j→j′}, which is the minimum of 1 and

    [ f(y | θ′_{j′}, M = j′) p(θ′_{j′} | M = j′) π_{j′} h(j′, j) q(u′ | θ′_{j′}, j′, j) ]
    / [ f(y | θ_j, M = j) p(θ_j | M = j) π_j h(j, j′) q(u | θ_j, j, j′) ]
    × | ∂g(θ_j, u) / ∂(θ_j, u) | .

At convergence, we obtain {M^(g)}, g = 1, ..., G, from p(M|y), so

    p̂(M = j | y) = (number of M^(g) equal to j) / (total number of M^(g)),  j = 1, ..., K ,

as well as the Bayes factor between models j and j′,

    B̂F_{jj′} = [ p̂(M = j | y) / p̂(M = j′ | y) ] / [ p(M = j) / p(M = j′) ] .
Chapter 5: Bayesian Computation – p. 40/42
RJMCMC Example 1
Suppose K = 2 models, for which θ_1 ∈ ℜ and θ_2 ∈ ℜ², and θ_1 is a subvector of θ_2. Then when moving from j = 1 to j′ = 2, we may draw u ∼ q(u) and set

    θ′_2 = (θ_1, u) .

That is, the dimension matching g is the identity function, and so the Jacobian in step 5 is equal to 1.
Chapter 5: Bayesian Computation – p. 41/42
RJMCMC Example 2
Consider choosing between a time series model having a constant mean level θ_1, and a changepoint model having level θ_{2,1} before time t and θ_{2,2} afterward.

When moving from Model 2 to 1, set

    θ′_1 = (θ_{2,1} + θ_{2,2}) / 2 .

To ensure reversibility of this move, when going from Model 1 to 2 we might sample u ∼ q(u) and set

    θ′_{2,1} = θ_1 − u  and  θ′_{2,2} = θ_1 + u ,

since this is a 1-1 and onto function corresponding to the deterministic "down move" we selected above.

RJMCMC is tricky (and not in WinBUGS) but extremely general!
Chapter 5: Bayesian Computation – p. 42/42
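A quick numerical check of the dimension matching in Example 2 (the pairing of moves follows the text; the numeric values are arbitrary): the "up move" g(θ_1, u) = (θ_1 − u, θ_1 + u) has Jacobian matrix [[1, −1], [1, 1]], whose absolute determinant, 2, is the factor entering the acceptance ratio in step 5.

```python
import numpy as np

def up_move(theta1, u):
    """Model 1 -> Model 2: g(theta1, u) = (theta1 - u, theta1 + u)."""
    return theta1 - u, theta1 + u

def down_move(theta21, theta22):
    """Inverse map, Model 2 -> Model 1: recovers (theta1, u)."""
    return (theta21 + theta22) / 2.0, (theta22 - theta21) / 2.0

# Jacobian of g with respect to (theta1, u)
J = np.array([[1.0, -1.0],
              [1.0,  1.0]])
jac = abs(np.linalg.det(J))   # |det| = 2 appears in the acceptance ratio
```

The round trip up_move followed by down_move returns the original (θ_1, u), confirming that g is 1-1 and onto as reversibility requires.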