
On some computational methods for Bayesian model choice

Christian P. Robert

CREST-INSEE and Université Paris Dauphine
http://xianblog.wordpress.com

JSM 2009, Washington D.C., August 2, 2009
Joint works with M. Beaumont, N. Chopin, J.-M. Cornuet, J.-M. Marin and D. Wraith


Outline

1 Evidence

2 Importance sampling solutions

3 Nested sampling

4 ABC model choice


Evidence

Model choice and model comparison

Choice between models

Several models available for the same observation vector x

Mi : x ∼ fi(x|θi), i ∈ I

where the family I can be finite or infinite


Bayesian model choice
Probabilise the entire model/parameter space:

allocate probabilities pi to all models Mi

define priors πi(θi) for each parameter space Θi

compute
\[
\pi(M_i \mid x) = \frac{p_i \int_{\Theta_i} f_i(x \mid \theta_i)\,\pi_i(\theta_i)\,\mathrm{d}\theta_i}{\sum_j p_j \int_{\Theta_j} f_j(x \mid \theta_j)\,\pi_j(\theta_j)\,\mathrm{d}\theta_j}
\]

take largest π(Mi|x) to determine the "best" model, or use the averaged predictive
\[
\sum_j \pi(M_j \mid x) \int_{\Theta_j} f_j(x' \mid \theta_j)\,\pi_j(\theta_j \mid x)\,\mathrm{d}\theta_j
\]


Bayes factor

Definition (Bayes factors)

Central quantity
\[
B_{ij}(x) = \frac{\int_{\Theta_i} f_i(x \mid \theta_i)\,\pi_i(\theta_i)\,\mathrm{d}\theta_i}{\int_{\Theta_j} f_j(x \mid \theta_j)\,\pi_j(\theta_j)\,\mathrm{d}\theta_j} = \frac{m_i(x)}{m_j(x)}
\]

[Jeffreys, 1939]


Evidence

All these problems end up with a similar quantity, the evidence
\[
Z_k = \int_{\Theta_k} \pi_k(\theta_k)\,L_k(\theta_k)\,\mathrm{d}\theta_k, \qquad k = 1, \ldots,
\]

aka the marginal likelihood.


Importance sampling solutions

Importance sampling revisited

1 Evidence

2 Importance sampling solutions: Bridge sampling; Harmonic means; One bridge back

3 Nested sampling

4 ABC model choice


Bridge sampling

Bayes factor approximation

When approximating the Bayes factor

\[
B_{01} = \frac{\int_{\Theta_0} f_0(x \mid \theta_0)\,\pi_0(\theta_0)\,\mathrm{d}\theta_0}{\int_{\Theta_1} f_1(x \mid \theta_1)\,\pi_1(\theta_1)\,\mathrm{d}\theta_1}
\]
use of importance functions ϖ0 and ϖ1 and
\[
B_{01} = \frac{n_0^{-1} \sum_{i=1}^{n_0} f_0(x \mid \theta_0^i)\,\pi_0(\theta_0^i)/\varpi_0(\theta_0^i)}{n_1^{-1} \sum_{i=1}^{n_1} f_1(x \mid \theta_1^i)\,\pi_1(\theta_1^i)/\varpi_1(\theta_1^i)}
\]
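A minimal Python sketch of this estimator (hypothetical names; the toy normal check at the end is an added example, not from the talk), assuming the importance functions ϖ0 and ϖ1 are available as scipy frozen distributions:

import numpy as np
from scipy import stats
from scipy.special import logsumexp

def is_bayes_factor(log_lik0, log_prior0, log_lik1, log_prior1,
                    varpi0, varpi1, n0=10_000, n1=10_000, seed=None):
    """Importance sampling approximation of B01: each marginal likelihood is
    estimated by an average of f_k(x|theta) pi_k(theta) / varpi_k(theta)."""
    rng = np.random.default_rng(seed)
    th0 = varpi0.rvs(size=n0, random_state=rng)
    th1 = varpi1.rvs(size=n1, random_state=rng)
    lw0 = log_lik0(th0) + log_prior0(th0) - varpi0.logpdf(th0)
    lw1 = log_lik1(th1) + log_prior1(th1) - varpi1.logpdf(th1)
    log_m0 = logsumexp(lw0) - np.log(n0)   # log of the model-0 average
    log_m1 = logsumexp(lw1) - np.log(n1)   # log of the model-1 average
    return np.exp(log_m0 - log_m1)

# toy check: x ~ N(theta, 1), with theta ~ N(0, 1) under M0 and theta ~ N(1, 1) under M1
x = 0.5
B01 = is_bayes_factor(
    log_lik0=lambda th: stats.norm.logpdf(x, loc=th),
    log_prior0=lambda th: stats.norm.logpdf(th, 0, 1),
    log_lik1=lambda th: stats.norm.logpdf(x, loc=th),
    log_prior1=lambda th: stats.norm.logpdf(th, 1, 1),
    varpi0=stats.norm(0, 1.5), varpi1=stats.norm(1, 1.5))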


Bridge sampling identity

Generic representation

\[
B_{12} = \frac{\int \pi_2(\theta \mid x)\,\alpha(\theta)\,\pi_1(\theta \mid x)\,\mathrm{d}\theta}{\int \pi_1(\theta \mid x)\,\alpha(\theta)\,\pi_2(\theta \mid x)\,\mathrm{d}\theta} \qquad \forall\,\alpha(\cdot)
\]
\[
\approx \frac{\dfrac{1}{n_1} \sum_{i=1}^{n_1} \pi_2(\theta_{1i} \mid x)\,\alpha(\theta_{1i})}{\dfrac{1}{n_2} \sum_{i=1}^{n_2} \pi_1(\theta_{2i} \mid x)\,\alpha(\theta_{2i})} \qquad \theta_{ji} \sim \pi_j(\theta \mid x)
\]

[Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]


Optimal bridge sampling

The optimal choice of auxiliary function is

\[
\alpha^\star = \frac{n_1 + n_2}{n_1\,\pi_1(\theta \mid x) + n_2\,\pi_2(\theta \mid x)}
\]
leading to
\[
B_{12} \approx \frac{\dfrac{1}{n_1} \sum_{i=1}^{n_1} \dfrac{\pi_2(\theta_{1i} \mid x)}{n_1\,\pi_1(\theta_{1i} \mid x) + n_2\,\pi_2(\theta_{1i} \mid x)}}{\dfrac{1}{n_2} \sum_{i=1}^{n_2} \dfrac{\pi_1(\theta_{2i} \mid x)}{n_1\,\pi_1(\theta_{2i} \mid x) + n_2\,\pi_2(\theta_{2i} \mid x)}}
\]

Back later!
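Since α⋆ involves the unknown normalising constants, the estimator is in practice solved for iteratively (the "drag" noted on the next slide). A Python sketch of one standard Meng & Wong-type iteration, assuming vectorised log unnormalised posterior densities and samples from both posteriors (all names hypothetical):

import numpy as np

def optimal_bridge_B12(logq1, logq2, sample1, sample2, n_iter=200):
    """Iterative optimal-bridge estimate of B12 = Z1/Z2.
    logq1, logq2: log unnormalised posteriors; sample_k: draws from posterior k."""
    n1, n2 = len(sample1), len(sample2)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    l1 = np.exp(logq1(sample1) - logq2(sample1))   # q1/q2 at draws from posterior 1
    l2 = np.exp(logq1(sample2) - logq2(sample2))   # q1/q2 at draws from posterior 2
    r = 1.0                                        # current guess for B12 = Z1/Z2
    for _ in range(n_iter):
        num = np.mean(l2 / (s1 * l2 + s2 * r))     # ~ average of q1 * alpha over posterior 2
        den = np.mean(1.0 / (s1 * l1 + s2 * r))    # ~ average of q2 * alpha over posterior 1
        r = num / den
    return r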


Optimal bridge sampling (2)

Reason:

\[
\frac{\mathrm{Var}(B_{12})}{B_{12}^2} \approx \frac{1}{n_1 n_2} \left\{ \frac{\int \pi_1(\theta)\,\pi_2(\theta)\,[n_1\pi_1(\theta) + n_2\pi_2(\theta)]\,\alpha(\theta)^2\,\mathrm{d}\theta}{\left(\int \pi_1(\theta)\,\pi_2(\theta)\,\alpha(\theta)\,\mathrm{d}\theta\right)^2} - 1 \right\}
\]

(δ method)
Drag: dependence on the unknown normalising constants, solved iteratively


Harmonic means

Approximating Zk from a posterior sample

Use of the [harmonic mean] identity

\[
\mathbb{E}^{\pi_k}\!\left[\left.\frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)}\,\right| x\right] = \int \frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)}\,\frac{\pi_k(\theta_k)\,L_k(\theta_k)}{Z_k}\,\mathrm{d}\theta_k = \frac{1}{Z_k}
\]
no matter what the proposal ϕ(·) is.
[Gelfand & Dey, 1994; Bartolucci et al., 2006]

Direct exploitation of the MCMC output
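A Python sketch of this estimator (Gelfand & Dey form) applied to a posterior sample, with hypothetical names; taking log_phi equal to the log prior recovers the infamous harmonic mean estimator discussed next:

import numpy as np
from scipy.special import logsumexp

def gelfand_dey_logZ(log_prior, log_lik, log_phi, posterior_sample):
    """Estimate log Z_k from a posterior sample via
    E[ phi(theta) / {pi_k(theta) L_k(theta)} | x ] = 1/Z_k.
    log_phi should be a normalised, light-tailed density."""
    th = np.asarray(posterior_sample)
    log_ratio = log_phi(th) - (log_prior(th) + log_lik(th))
    # 1/Z_k is estimated by the average ratio, so log Z_k = -log(mean ratio)
    return -(logsumexp(log_ratio) - np.log(len(th)))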


RJ-MCMC interpretation

If ϕ is the density of an alternative to model πk(θk)Lk(θk|x), the acceptance probability for a move from model πk(θk)Lk(θk|x) to model ϕ(θk) is based upon
\[
\frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k \mid x)}
\]
(under a proper choice of prior weight)

Unbiased estimator of the odds ratio
[Bartolucci, Scaccia, and Mira, 2006]


An infamous example

When ϕ(θk) = πk(θk), the harmonic mean approximation to B21 is
\[
B_{21} = \frac{\dfrac{1}{n_1} \sum_{i=1}^{n_1} 1/L_1(\theta_{1i} \mid x)}{\dfrac{1}{n_2} \sum_{i=1}^{n_2} 1/L_2(\theta_{2i} \mid x)}, \qquad \theta_{ji} \sim \pi_j(\theta \mid x)
\]

[Newton & Raftery, 1994; Raftery et al., 2008]
Infamous: most often leads to an infinite variance!!!

[Radford Neal’s blog, 2008]


“The Worst Monte Carlo Method Ever”

“The good news is that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it’s easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.”

[Radford Neal’s blog, Aug. 23, 2008]


Comparison with regular importance sampling

Harmonic mean: Constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation
\[
Z_{1k} = 1 \Big/ \frac{1}{T} \sum_{t=1}^{T} \frac{\varphi(\theta_k^{(t)})}{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}
\]
to have a finite variance. E.g., use finite support kernels for ϕ


HPD supports

In practice, possibility to use the HPD region defined by an MCMC sample as
\[
H_\alpha = \{\theta;\ \pi(\theta \mid x) \ge q_\alpha[\pi(\cdot \mid x)]\}
\]
where qα[π(·|x)] is the empirical α quantile of the observed π(θi|x)'s

© Construction of ϕ as uniformly supported by Hα or an ellipsoid approximation whose volume can be computed
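A Python sketch of this construction for a d-dimensional MCMC sample (hypothetical names, a sketch only): points above the empirical α quantile of the log posterior are enclosed in an ellipsoid fitted from their mean and covariance, and the ellipsoid volume normalises the uniform ϕ:

import numpy as np
from scipy.special import gammaln

def uniform_phi_on_hpd_ellipsoid(sample, log_post, alpha=0.1):
    """Uniform density phi supported by an ellipsoid approximating
    H_alpha = {theta : pi(theta|x) >= q_alpha}.
    sample: (T, d) MCMC draws; log_post: their log posterior values."""
    sample = np.asarray(sample)
    keep = sample[log_post >= np.quantile(log_post, alpha)]   # points inside H_alpha
    mu = keep.mean(axis=0)
    Sigma = np.cov(keep, rowvar=False)
    Sinv = np.linalg.inv(Sigma)
    d2 = np.einsum('ij,jk,ik->i', keep - mu, Sinv, keep - mu)
    c = d2.max()                      # scale so the ellipsoid contains all retained points
    d = sample.shape[1]
    # log volume of {z : (z - mu)^T Sigma^{-1} (z - mu) <= c}
    log_vol = (0.5 * d * np.log(np.pi * c) - gammaln(d / 2 + 1)
               + 0.5 * np.linalg.slogdet(Sigma)[1])
    def log_phi(theta):
        z = np.atleast_2d(theta) - mu
        inside = np.einsum('ij,jk,ik->i', z, Sinv, z) <= c
        return np.where(inside, -log_vol, -np.inf)
    return log_phi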


HPD supports: illustration
Gibbs sample of 10³ parameters (θ, σ²) produced for the normal model, x1, . . . , xn ∼ N(θ, σ²) with x̄ = 0, s² = 1 and n = 10, under Jeffreys' prior, leads to a straightforward approximation to the 10% HPD region, replaced by an ellipsoid.


The same Gibbs sample leads to a straightforward approximation to the 50% HPD region, again replaced by an ellipsoid.


Comparison with regular importance sampling (cont’d)

Compare Z1k with a standard importance sampling approximation

\[
Z_{2k} = \frac{1}{T} \sum_{t=1}^{T} \frac{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}{\varphi(\theta_k^{(t)})}
\]
where the θ_k^(t)'s are generated from the density ϕ(·) (with fatter tails, like t's)


One bridge back

Approximating Zk using a mixture representation

Bridge sampling recall

Design a specific mixture for simulation [importance sampling] purposes, with density
\[
\varphi_k(\theta_k) \propto \omega\,\pi_k(\theta_k)\,L_k(\theta_k) + \varphi(\theta_k),
\]
where ϕ(·) is arbitrary (but normalised)
Note: ω is not a probability weight


Approximating Z using a mixture representation (cont’d)

Corresponding MCMC (=Gibbs) sampler

At iteration t

1 Take δ(t) = 1 with probability
\[
\omega\,\pi_k(\theta_k^{(t-1)})\,L_k(\theta_k^{(t-1)}) \Big/ \left( \omega\,\pi_k(\theta_k^{(t-1)})\,L_k(\theta_k^{(t-1)}) + \varphi(\theta_k^{(t-1)}) \right)
\]
and δ(t) = 2 otherwise;

2 If δ(t) = 1, generate θ_k^(t) ∼ MCMC(θ_k^(t−1), θ_k), where MCMC(θ_k, θ′_k) denotes an arbitrary MCMC kernel associated with the posterior π_k(θ_k|x) ∝ π_k(θ_k) L_k(θ_k);

3 If δ(t) = 2, generate θ_k^(t) ∼ ϕ(θ_k) independently


Evidence approximation by mixtures

Rao-Blackwellised estimate

\[
\xi = \frac{1}{T} \sum_{t=1}^{T} \frac{\omega\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}{\omega\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})}\,,
\]
converges to ωZk/{ωZk + 1}
Deduce Z3k from ωZ3k/{ωZ3k + 1} = ξ, i.e.
\[
Z_{3k} = \frac{\sum_{t=1}^{T} \omega\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) \Big/ \left\{ \omega\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)}) \right\}}{\sum_{t=1}^{T} \varphi(\theta_k^{(t)}) \Big/ \left\{ \omega\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)}) \right\}}
\]

[Bridge sampler redux]
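A Python sketch combining the two previous slides (hypothetical names, a sketch rather than the talk's implementation): it assumes an MCMC kernel mcmc_step targeting the posterior, a normalised ϕ available both as a sampler and as a log density (e.g. the uniform-on-ellipsoid above), and the unnormalised log posterior; the final division by ω turns the Rao-Blackwellised ratio into an estimate of Zk itself:

import numpy as np

def mixture_evidence(log_post_unnorm, log_phi, sample_phi, mcmc_step,
                     theta0, omega=1.0, T=10_000, seed=None):
    """Z3k-type estimate based on the mixture omega*pi_k*L_k + phi."""
    rng = np.random.default_rng(seed)

    def post_weight(th):
        # omega*pi_k*L_k / (omega*pi_k*L_k + phi), computed on the log scale
        la = np.log(omega) + log_post_unnorm(th)
        return 1.0 / (1.0 + np.exp(log_phi(th) - la))

    theta = np.asarray(theta0, dtype=float)
    num = den = 0.0
    for _ in range(T):
        if rng.random() < post_weight(theta):
            theta = mcmc_step(theta, rng)     # delta = 1: MCMC move targeting pi_k(.|x)
        else:
            theta = sample_phi(rng)           # delta = 2: independent draw from phi
        p = post_weight(theta)                # Rao-Blackwellised weight at the new value
        num += p
        den += 1.0 - p
    # num/den estimates omega*Z_k, hence the division by omega
    return num / den / omega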


Mixture illustration

For the normal example, use of the same uniform ϕ over the ellipse approximating the same HPD region Hα and ω equal to one-tenth of the true evidence. Gibbs samples of size 10⁴


Mixture illustration (cont'd)
For the normal example, use of the same uniform ϕ over the ellipse approximating the same HPD region Hα and ω equal to one-tenth of the true evidence. Gibbs samples of size 10⁴ and 100 Monte Carlo replicas


Nested sampling

1 Evidence

2 Importance sampling solutions

3 Nested sampling: Purpose; Implementation; Error rates; Constraints

4 ABC model choice


Purpose

Nested sampling: Goal

Skilling's (2007) technique using the one-dimensional representation:
\[
Z = \mathbb{E}^{\pi}[L(\theta)] = \int_0^1 \varphi(x)\,\mathrm{d}x
\]
with
\[
\varphi^{-1}(l) = P^{\pi}(L(\theta) > l).
\]
Note: ϕ(·) is intractable in most cases.


Implementation

Nested sampling: First approximation

Approximate Z by a Riemann sum:

\[
\widehat{Z} = \sum_{i=1}^{N} (x_{i-1} - x_i)\,\varphi(x_i)
\]
where the xi's are either:

deterministic: x_i = e^{-i/N}

or random:
\[
x_0 = 1, \qquad x_{i+1} = t_i\,x_i, \qquad t_i \sim \mathcal{B}e(N, 1)
\]
so that E[log xi] = −i/N.


Nested sampling: Alternative representation

Another representation is

\[
\widehat{Z} = \sum_{i=0}^{N-1} \{\varphi(x_{i+1}) - \varphi(x_i)\}\, x_i
\]
which is a special case of
\[
\widehat{Z} = \sum_{i=0}^{N-1} \{L(\theta^{(i+1)}) - L(\theta^{(i)})\}\,\pi(\{\theta;\ L(\theta) > L(\theta^{(i)})\})
\]
where · · · L(θ^(i+1)) < L(θ^(i)) · · ·
[Lebesgue version of Riemann's sum]


Nested sampling: Second approximation
Estimate the (intractable) ϕ(xi) by ϕi:

Nested sampling

Start with N values θ1, . . . , θN sampled from π
At iteration i,

1 Take ϕi = L(θk), where θk is the point with smallest likelihood in the pool of θi's

2 Replace θk with a sample from the prior constrained to L(θ) > ϕi: the current N points are sampled from the prior constrained to L(θ) > ϕi.

Note that
\[
\pi(\{\theta;\ L(\theta) > L(\theta^{(i+1)})\}) \big/ \pi(\{\theta;\ L(\theta) > L(\theta^{(i)})\}) \approx 1 - \frac{1}{N}
\]
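A Python sketch of the resulting loop, using the deterministic schedule xi = e^(−i/N) and the ε-truncation stopping rule introduced on the next slide; the routine sample_prior_constrained, which draws from the prior restricted to L(θ) > level, is assumed available and is exactly the difficult step discussed under Constraints below (all names hypothetical):

import numpy as np

def nested_sampling(log_lik, sample_prior, sample_prior_constrained,
                    N=500, eps=1e-6, seed=None):
    """Nested sampling estimate of log Z = log E_pi[L(theta)]."""
    rng = np.random.default_rng(seed)
    theta = sample_prior(rng, N)                    # N prior draws
    loglik = np.array([log_lik(t) for t in theta])
    logZ = -np.inf
    x_prev = 1.0
    j = int(np.ceil(-np.log(eps) * N))              # stop once e^{-j/N} < eps
    for i in range(1, j + 1):
        k = int(np.argmin(loglik))                  # point with smallest likelihood
        x_i = np.exp(-i / N)                        # deterministic prior-volume schedule
        # contribution (x_{i-1} - x_i) * phi_i with phi_i = L(theta_k)
        logZ = np.logaddexp(logZ, np.log(x_prev - x_i) + loglik[k])
        x_prev = x_i
        # replace the discarded point by a prior draw constrained to L > phi_i
        theta[k] = sample_prior_constrained(rng, loglik[k])
        loglik[k] = log_lik(theta[k])
    return logZ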


Nested sampling: Third approximation

Iterate the above steps until a given stopping iteration j is reached: e.g.,

observe very small changes in the approximation Ẑ;

reach the maximal value of L(θ) when the likelihood is bounded and its maximum is known;

truncate the integral Z at level ε, i.e. replace
\[
\int_0^1 \varphi(x)\,\mathrm{d}x \quad\text{with}\quad \int_\varepsilon^1 \varphi(x)\,\mathrm{d}x
\]


Error rates

Approximation error

\[
\text{Error} = \widehat{Z} - Z = \sum_{i=1}^{j} (x_{i-1} - x_i)\,\varphi_i - \int_0^1 \varphi(x)\,\mathrm{d}x
\]
\[
= -\int_0^\varepsilon \varphi(x)\,\mathrm{d}x \quad \text{(Truncation Error)}
\]
\[
+ \left[ \sum_{i=1}^{j} (x_{i-1} - x_i)\,\varphi(x_i) - \int_\varepsilon^1 \varphi(x)\,\mathrm{d}x \right] \quad \text{(Quadrature Error)}
\]
\[
+ \left[ \sum_{i=1}^{j} (x_{i-1} - x_i)\,\{\varphi_i - \varphi(x_i)\} \right] \quad \text{(Stochastic Error)}
\]

[Dominated by Monte Carlo]


A CLT for the Stochastic Error

The (dominating) stochastic error is O_P(N^{-1/2}):
\[
N^{1/2}\,\{\widehat{Z} - Z\} \xrightarrow{D} \mathcal{N}(0, V)
\]
with
\[
V = -\int_{s,t \in [\varepsilon,1]} s\,\varphi'(s)\,t\,\varphi'(t)\,\log(s \vee t)\,\mathrm{d}s\,\mathrm{d}t.
\]

[Proof based on Donsker’s theorem]


What of log Z?

If the interest lies within log Z, Slutsky's transform of the CLT:
\[
N^{1/2}\,\{\log \widehat{Z} - \log Z\} \xrightarrow{D} \mathcal{N}\!\left(0,\ V/Z^2\right)
\]
Note: the number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at the first iteration j such that e^{−j/N} < ε, then
\[
j = N \lceil -\log \varepsilon \rceil.
\]


Constraints

Sampling from constr’d priors

Exact simulation from the constrained prior is intractable in most cases

Skilling (2007) proposes to use MCMC, but this introduces a bias (stopping rule)

If implementable, then a slice sampler can be devised at the same cost [or less]


Banana illustration

Case of a banana target made of a twisted 2D normal: x2 → x2 + βx1² − 100β
[Haario, Saksman, Tamminen, 1999]
β = .03, Σ = diag(100, 1)


Banana illustration (2)
Use of nested sampling with N = 1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) sampler based on 10 iterations with 5000 points per iteration and a final sample of 50,000 points, using nine Student's t components with 9 df
[Wraith, Kilbinger, Benabed et al., 2009, Phys. Rev. D]
Evidence estimation, E[X1] estimation, and E[X2] estimation


ABC model choice

ABC method(s)

Approximate Bayesian Computation

Bayesian setting: target is π(θ)f(x|θ)
When the likelihood f(x|θ) is not in closed form, likelihood-free rejection technique:

ABC algorithm

For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating
\[
\theta' \sim \pi(\theta), \qquad x \sim f(x \mid \theta'),
\]
until the auxiliary variable x is equal to the observed value, x = y.

[Pritchard et al., 1999]
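A Python sketch of this rejection scheme, written directly with a tolerance argument (hypothetical names): with eps = 0 and discrete data it is the exact Pritchard et al. (1999) algorithm, while continuous data require the positive tolerance of the next slide:

import numpy as np

def abc_rejection(y, prior_sampler, simulator, summary, dist, eps,
                  n_keep=1000, seed=None):
    """Keep theta' ~ pi whenever x ~ f(.|theta') satisfies rho(S(x), S(y)) <= eps."""
    rng = np.random.default_rng(seed)
    s_obs = summary(y)
    accepted = []
    while len(accepted) < n_keep:
        theta = prior_sampler(rng)
        x = simulator(theta, rng)
        if dist(summary(x), s_obs) <= eps:
            accepted.append(theta)
    return np.array(accepted)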


A as approximative

When y is a continuous random variable, the equality x = y is replaced with a tolerance condition,
\[
\varrho(x, y) \le \varepsilon
\]
where ϱ is a distance between summary statistics
Output distributed from
\[
\pi(\theta)\,P_\theta\{\varrho(x, y) < \varepsilon\} \propto \pi(\theta \mid \varrho(x, y) < \varepsilon)
\]


Dynamic area


ABC improvements

Simulating from the prior is often poor in efficiency
Either modify the proposal distribution on θ to increase the density of x's within the vicinity of y...

[Marjoram et al., 2003; Bortot et al., 2007; Sisson et al., 2007]

...or by viewing the problem as a conditional density estimation and by developing techniques to allow for larger ε...

[Beaumont et al., 2002]

...or even by including ε in the inferential framework
[Ratmann et al., 2009]


ABC-MCMC

Markov chain (θ(t)) created via the transition function
\[
\theta^{(t+1)} =
\begin{cases}
\theta' \sim K(\theta' \mid \theta^{(t)}) & \text{if } x \sim f(x \mid \theta') \text{ is such that } x = y \\
& \text{and } u \sim \mathcal{U}(0,1) \le \dfrac{\pi(\theta')\,K(\theta^{(t)} \mid \theta')}{\pi(\theta^{(t)})\,K(\theta' \mid \theta^{(t)})}, \\
\theta^{(t)} & \text{otherwise,}
\end{cases}
\]
has the posterior π(θ|y) as stationary distribution
[Marjoram et al., 2003]
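A Python sketch of this kernel in its tolerance version ϱ(S(x), S(y)) ≤ ε (hypothetical names; the convention used is proposal_logpdf(to, frm) = log K(to|frm)):

import numpy as np

def abc_mcmc(y, log_prior, proposal_sampler, proposal_logpdf, simulator,
             summary, dist, eps, theta0, T=10_000, seed=None):
    """Likelihood-free MCMC: a move is accepted only if the simulated data fall
    within the tolerance AND the prior x proposal Metropolis-Hastings check passes."""
    rng = np.random.default_rng(seed)
    s_obs = summary(y)
    theta = np.asarray(theta0, dtype=float)
    chain = np.empty((T,) + theta.shape)
    for t in range(T):
        prop = proposal_sampler(theta, rng)                 # theta' ~ K(.|theta)
        x = simulator(prop, rng)
        if dist(summary(x), s_obs) <= eps:
            log_ratio = (log_prior(prop) + proposal_logpdf(theta, prop)
                         - log_prior(theta) - proposal_logpdf(prop, theta))
            if np.log(rng.random()) <= log_ratio:
                theta = prop
        chain[t] = theta
    return chain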


ABC-PRC

Another sequential version producing a sequence of Markov transition kernels Kt and of samples (θ_1^(t), . . . , θ_N^(t)) (1 ≤ t ≤ T)

ABC-PRC Algorithm

1 Pick θ⋆ at random among the previous θ_i^(t−1)'s, with probabilities ω_i^(t−1) (1 ≤ i ≤ N).

2 Generate
\[
\theta_i^{(t)} \sim K_t(\theta \mid \theta^\star), \qquad x \sim f(x \mid \theta_i^{(t)}),
\]

3 Check that ϱ(x, y) < ε, otherwise start again.

[Sisson et al., 2007]


ABC-PRC weight

Probability ω_i^(t) computed as
\[
\omega_i^{(t)} \propto \pi(\theta_i^{(t)})\,L_{t-1}(\theta^\star \mid \theta_i^{(t)})\,\{\pi(\theta^\star)\,K_t(\theta_i^{(t)} \mid \theta^\star)\}^{-1},
\]
where L_{t−1} is an arbitrary transition kernel.
In case
\[
L_{t-1}(\theta' \mid \theta) = K_t(\theta \mid \theta'),
\]
all weights are equal under a uniform prior.
Inspired from Del Moral et al. (2006), who use backward kernels L_{t−1} in SMC to achieve unbiasedness, with a recent (2008) ABC-SMC version


ABC-PRC bias

Lack of unbiasedness of the method

Joint density of the accepted pair (θ(t−1), θ(t)) proportional to
\[
\pi(\theta^{(t-1)} \mid y)\,K_t(\theta^{(t)} \mid \theta^{(t-1)})\,f(y \mid \theta^{(t)}),
\]
For an arbitrary function h(θ), E[ωt h(θ(t))] proportional to
\[
\iint h(\theta^{(t)})\,\frac{\pi(\theta^{(t)})\,L_{t-1}(\theta^{(t-1)} \mid \theta^{(t)})}{\pi(\theta^{(t-1)})\,K_t(\theta^{(t)} \mid \theta^{(t-1)})}\,\pi(\theta^{(t-1)} \mid y)\,K_t(\theta^{(t)} \mid \theta^{(t-1)})\,f(y \mid \theta^{(t)})\,\mathrm{d}\theta^{(t-1)}\,\mathrm{d}\theta^{(t)}
\]
\[
\propto \iint h(\theta^{(t)})\,\frac{\pi(\theta^{(t)})\,L_{t-1}(\theta^{(t-1)} \mid \theta^{(t)})}{\pi(\theta^{(t-1)})\,K_t(\theta^{(t)} \mid \theta^{(t-1)})}\,\pi(\theta^{(t-1)})\,f(y \mid \theta^{(t-1)})\,K_t(\theta^{(t)} \mid \theta^{(t-1)})\,f(y \mid \theta^{(t)})\,\mathrm{d}\theta^{(t-1)}\,\mathrm{d}\theta^{(t)}
\]
\[
\propto \int h(\theta^{(t)})\,\pi(\theta^{(t)} \mid y) \left\{ \int L_{t-1}(\theta^{(t-1)} \mid \theta^{(t)})\,f(y \mid \theta^{(t-1)})\,\mathrm{d}\theta^{(t-1)} \right\} \mathrm{d}\theta^{(t)}.
\]


A mixture example (0)

Toy model of Sisson et al. (2007): if
\[
\theta \sim \mathcal{U}(-10, 10), \qquad x \mid \theta \sim 0.5\,\mathcal{N}(\theta, 1) + 0.5\,\mathcal{N}(\theta, 1/100),
\]
then the posterior distribution associated with y = 0 is the normal mixture
\[
\theta \mid y = 0 \sim 0.5\,\mathcal{N}(0, 1) + 0.5\,\mathcal{N}(0, 1/100)
\]
restricted to [−10, 10].
Furthermore, the true target is available as
\[
\pi(\theta \mid |x| < \varepsilon) \propto \Phi(\varepsilon - \theta) - \Phi(-\varepsilon - \theta) + \Phi(10(\varepsilon - \theta)) - \Phi(-10(\varepsilon + \theta)).
\]


A mixture example (1)


Comparison of τ = 0.15 and τ = 1/0.15 in Kt


ABC-PMC

A PMC version

Use of the same kernel idea as ABC-PRC but with IS correction
Generate a sample at iteration t by
\[
\pi_t(\theta^{(t)}) \propto \sum_{j=1}^{N} \omega_j^{(t-1)}\,K_t(\theta^{(t)} \mid \theta_j^{(t-1)})
\]
modulo acceptance of the associated xt, and use an importance weight associated with an accepted simulation θ_i^(t)
\[
\omega_i^{(t)} \propto \pi(\theta_i^{(t)}) \big/ \pi_t(\theta_i^{(t)}).
\]
© Still likelihood free
[Beaumont et al., 2009]
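A Python sketch of this scheme for a scalar parameter (hypothetical names): it assumes a decreasing sequence of tolerances eps_seq and a Gaussian kernel Kt with fixed standard deviation tau, as in the mixture example below where τ = 0.15 or 1/0.15:

import numpy as np
from scipy import stats

def abc_pmc(y, log_prior, prior_sampler, simulator, summary, dist,
            eps_seq, N=1000, tau=0.15, seed=None):
    """ABC-PMC: move particles with a Gaussian kernel of sd tau and
    re-weight by prior / mixture importance density."""
    rng = np.random.default_rng(seed)
    s_obs = summary(y)

    def accepted(theta, eps):
        return dist(summary(simulator(theta, rng)), s_obs) <= eps

    # t = 0: plain ABC rejection from the prior
    particles = np.empty(N)
    for i in range(N):
        th = prior_sampler(rng)
        while not accepted(th, eps_seq[0]):
            th = prior_sampler(rng)
        particles[i] = th
    weights = np.full(N, 1.0 / N)

    for eps in eps_seq[1:]:
        new_particles = np.empty(N)
        for i in range(N):
            while True:
                base = rng.choice(particles, p=weights)   # resample a previous particle
                th = base + tau * rng.standard_normal()   # move with K_t
                if accepted(th, eps):
                    break
            new_particles[i] = th
        # importance weights: prior over the mixture sum_j w_j K_t(.|theta_j)
        mix = np.array([np.sum(weights * stats.norm.pdf(th, loc=particles, scale=tau))
                        for th in new_particles])
        logw = np.array([log_prior(th) for th in new_particles]) - np.log(mix)
        weights = np.exp(logw - logw.max())
        weights /= weights.sum()
        particles = new_particles
    return particles, weights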


A mixture example (2)

Recovery of the target, whether using a fixed standard deviation of τ = 0.15 or τ = 1/0.15.


ABC for model choice in GRFs

Gibbs random fields

Gibbs distribution

The rv y = (y1, . . . , yn) is a Gibbs random field associated with the graph G if
\[
f(y) = \frac{1}{Z} \exp\left\{ -\sum_{c \in \mathcal{C}} V_c(y_c) \right\},
\]
where Z is the normalising constant, C is the set of cliques of G, and Vc is any function, also called potential
U(y) = ∑_{c∈C} Vc(yc) is the energy function

© Z is usually unavailable in closed form


Potts model

Vc(y) is of the form
\[
V_c(y) = \theta S(y) = \theta \sum_{l \sim i} \delta_{y_l = y_i}
\]
where l∼i denotes a neighbourhood structure

In most realistic settings, the summation
\[
Z_\theta = \sum_{x \in \mathcal{X}} \exp\{\theta^{\mathrm{T}} S(x)\}
\]
involves too many terms to be manageable and numerical approximations cannot always be trusted

[Cucala, Marin, CPR & Titterington, 2009]


Bayesian Model Choice

Comparing a model with potential S0 taking values in R^p0 versus a model with potential S1 taking values in R^p1 can be done through the Bayes factor corresponding to the priors π0 and π1 on each parameter space
\[
B_{m_0/m_1}(x) = \frac{\int \exp\{\theta_0^{\mathrm{T}} S_0(x)\}/Z_{\theta_0,0}\;\pi_0(\mathrm{d}\theta_0)}{\int \exp\{\theta_1^{\mathrm{T}} S_1(x)\}/Z_{\theta_1,1}\;\pi_1(\mathrm{d}\theta_1)}
\]


Neighbourhood relations

Choice to be made between M neighbourhood relations
\[
i \overset{m}{\sim} i' \qquad (0 \le m \le M - 1)
\]
with
\[
S_m(x) = \sum_{i \overset{m}{\sim} i'} \mathbb{I}_{\{x_i = x_{i'}\}}
\]
driven by the posterior probabilities of the models.


Model index

Formalisation via a model index M that appears as a new parameter, with prior distribution π(M = m) and π(θ|M = m) = πm(θm)
Computational target:
\[
\mathbb{P}(M = m \mid x) \propto \int_{\Theta_m} f_m(x \mid \theta_m)\,\pi_m(\theta_m)\,\mathrm{d}\theta_m\;\pi(M = m),
\]


Sufficient statistics
By definition, if S(x) is a sufficient statistic for the joint parameters (M, θ0, . . . , θM−1),
\[
\mathbb{P}(M = m \mid x) = \mathbb{P}(M = m \mid S(x)).
\]
For each model m, own sufficient statistic Sm(·), and S(·) = (S0(·), . . . , SM−1(·)) also sufficient.
For Gibbs random fields,
\[
x \mid M = m \sim f_m(x \mid \theta_m) = f^1_m(x \mid S(x))\,f^2_m(S(x) \mid \theta_m) = \frac{1}{n(S(x))}\,f^2_m(S(x) \mid \theta_m)
\]
where
\[
n(S(x)) = \sharp\,\{\tilde{x} \in \mathcal{X} : S(\tilde{x}) = S(x)\}
\]
© S(x) is therefore also sufficient for the joint parameters
[Specific to Gibbs random fields!]


ABC model choice algorithm

ABC-MC

Generate m∗ from the prior π(M = m).

Generate θ∗m∗ from the prior πm∗(·).

Generate x∗ from the model fm∗(·|θ∗m∗).

Compute the distance ρ(S(x0), S(x∗)).

Accept (θ∗m∗ ,m∗) if ρ(S(x0), S(x∗)) < ε.

[Cornuet, Grelaud, Marin & Robert, 2009]

Note: when ε = 0 the algorithm is exact
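A Python sketch of ABC-MC (hypothetical names): per-model prior samplers and simulators are assumed, and the function returns both the estimated posterior probabilities and the regularised Bayes factor of the next slide:

import numpy as np

def abc_model_choice(x0, model_priors, param_priors, simulators, summary, dist,
                     eps, n_sim=100_000, seed=None):
    """Sample (m*, theta*, x*) jointly and keep m* whenever rho(S(x0), S(x*)) < eps;
    model posterior probabilities are estimated by acceptance frequencies."""
    rng = np.random.default_rng(seed)
    s_obs = summary(x0)
    M = len(model_priors)
    counts = np.zeros(M)
    for _ in range(n_sim):
        m = rng.choice(M, p=model_priors)      # m* ~ pi(M = m)
        theta = param_priors[m](rng)           # theta* ~ pi_m
        x = simulators[m](theta, rng)          # x* ~ f_m(.|theta*)
        if dist(summary(x), s_obs) < eps:
            counts[m] += 1
    post = counts / counts.sum()               # P(M = m | x0) estimates
    # Bayes factor of model 0 vs model 1, with the +1 regularisation of the next slide
    bf01 = ((1 + counts[0]) / (1 + counts[1])) * (model_priors[1] / model_priors[0])
    return post, bf01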


ABC approximation to the Bayes factor

Frequency ratio:

\[
BF_{m_0/m_1}(x^0) = \frac{\mathbb{P}(M = m_0 \mid x^0)}{\mathbb{P}(M = m_1 \mid x^0)} \times \frac{\pi(M = m_1)}{\pi(M = m_0)} = \frac{\sharp\{m^{i*} = m_0\}}{\sharp\{m^{i*} = m_1\}} \times \frac{\pi(M = m_1)}{\pi(M = m_0)},
\]
replaced with
\[
BF_{m_0/m_1}(x^0) = \frac{1 + \sharp\{m^{i*} = m_0\}}{1 + \sharp\{m^{i*} = m_1\}} \times \frac{\pi(M = m_1)}{\pi(M = m_0)}
\]

to avoid indeterminacy (also Bayes estimate).


Illustrations

Toy example

iid Bernoulli model versus two-state first-order Markov chain, i.e.
\[
f_0(x \mid \theta_0) = \exp\left(\theta_0 \sum_{i=1}^{n} \mathbb{I}_{\{x_i = 1\}}\right) \Big/ \{1 + \exp(\theta_0)\}^n,
\]
versus
\[
f_1(x \mid \theta_1) = \frac{1}{2} \exp\left(\theta_1 \sum_{i=2}^{n} \mathbb{I}_{\{x_i = x_{i-1}\}}\right) \Big/ \{1 + \exp(\theta_1)\}^{n-1},
\]
with priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) (inspired by "phase transition" boundaries).
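A Python sketch of the two competing samplers together with the sufficient statistics S0, S1 and their concatenation S, which can be plugged into the ABC-MC sketch above (hypothetical names):

import numpy as np

def simulate_bernoulli(theta0, n, rng):
    """Model 0: iid Bernoulli with P(x_i = 1) = e^theta0 / (1 + e^theta0)."""
    p = np.exp(theta0) / (1.0 + np.exp(theta0))
    return (rng.random(n) < p).astype(int)

def simulate_markov(theta1, n, rng):
    """Model 1: two-state Markov chain, uniform start,
    P(x_i = x_{i-1}) = e^theta1 / (1 + e^theta1)."""
    p_stay = np.exp(theta1) / (1.0 + np.exp(theta1))
    x = np.empty(n, dtype=int)
    x[0] = rng.integers(2)
    for i in range(1, n):
        x[i] = x[i - 1] if rng.random() < p_stay else 1 - x[i - 1]
    return x

def S0(x):
    return x.sum()                      # sufficient statistic of the Bernoulli model

def S1(x):
    return np.sum(x[1:] == x[:-1])      # sufficient statistic of the Markov model

def S(x):
    return np.array([S0(x), S1(x)])     # concatenation, sufficient for model choice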


Toy example (2)

(left) Comparison of the true BF_{m0/m1}(x^0) with its ABC approximation (in logs) over 2,000 simulations and 4·10⁶ proposals from the prior. (right) Same when using a tolerance ε corresponding to the 1% quantile on the distances.


Protein folding

Superposition of the native structure (grey) with the ST1 structure (red), the ST2 structure (orange), the ST3 structure (green), and the DT structure (blue).


Protein folding (2)

              % seq. Id.   TM-score   FROST score
1i5nA (ST1)       32         0.86        75.3
1ls1A1 (ST2)       5         0.42         8.9
1jr8A (ST3)        4         0.24         8.9
1s7oA (DT)        10         0.08         7.8

Characteristics of dataset. % seq. Id.: percentage of identity with the query sequence. TM-score: similarity between predicted and native structure (uncertainty between 0.17 and 0.4). FROST score: quality of alignment of the query onto the candidate structure (uncertainty between 7 and 9).


Protein folding (3)

                 NS/ST1   NS/ST2   NS/ST3   NS/DT
BF                1.34     1.22     2.42     2.76
P(M = NS|x0)      0.573    0.551    0.708    0.734

Estimates of the Bayes factors between model NS and models ST1, ST2, ST3, and DT, and corresponding posterior probabilities of model NS, based on an ABC-MC algorithm using 1.2·10⁶ simulations and a tolerance ε equal to the 1% quantile of the distances.