Bayesian inference on mixtures

Page 1: Bayesian inference on mixtures

IV Workshop Bayesian Nonparametrics, Roma, 12 June 2004 1

Bayesian Inference on Mixtures

Christian P. Robert

Université Paris Dauphine

Joint work with

JEAN-MICHEL MARIN, KERRIE MENGERSEN AND JUDITH ROUSSEAU

Page 2: Bayesian inference on mixtures

IV Workshop Bayesian Nonparametrics, Roma, 12 June 2004 2

What’s new?!

• Density approximation & consistency

• Scarcity phenomenon

• Label switching & Bayesian inference

• Nonconvergence of the Gibbs sampler & population Monte Carlo

• Comparison of RJMCMC with B&D

Page 3: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k 3

1 Mixtures

Convex combination of “usual” densities (e.g., exponential family)

$$\sum_{i=1}^{k} p_i\, f(x\mid\theta_i)\,, \qquad \sum_{i=1}^{k} p_i = 1\,, \quad k > 1\,,$$

Page 4: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k 4

[Figure: Normal mixture densities for K = 2, 5, 25, 50]

Page 5: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k 5

Likelihood

$$L(\theta, p \mid \mathbf{x}) = \prod_{i=1}^{n} \sum_{j=1}^{k} p_j\, f(x_i \mid \theta_j)$$

Computable in O(nk) time
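To make the O(nk) cost concrete, here is a minimal numpy/scipy sketch (not from the slides; the function name and the normal-component choice are mine) that evaluates the mixture log-likelihood from a single n × k matrix:

```python
import numpy as np
from scipy.stats import norm

def mixture_loglik(x, p, mu, sigma):
    """Log-likelihood of a k-component normal mixture, computed in O(nk).

    x: (n,) observations; p, mu, sigma: (k,) weights, means, standard deviations.
    """
    # (n, k) matrix whose (i, j) entry is p_j f(x_i | theta_j)
    comp = p[None, :] * norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    return np.log(comp.sum(axis=1)).sum()

# toy check on simulated data
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 70), rng.normal(2.5, 1.0, 30)])
print(mixture_loglik(x, np.array([0.7, 0.3]), np.array([0.0, 2.5]), np.ones(2)))
```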

Page 6: Bayesian inference on mixtures

Intro:Misg/Inference/Algorithms/Beyond fixed k 6

Missing data representation

Demarginalisation

$$\sum_{i=1}^{k} p_i\, f(x\mid\theta_i) = \int f(x\mid\theta, z)\, f(z\mid p)\, dz$$

where

$$X\mid Z = z \sim f(x\mid\theta_z)\,, \qquad Z \sim \mathcal{M}_k(1;\, p_1, \ldots, p_k)$$

Missing “data” $z_1, \ldots, z_n$ that may or may not be meaningful

[Auxiliary variables]
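A minimal sketch (my own illustration, assuming normal components) of this demarginalisation used generatively: draw the latent allocation Z ∼ M_k(1; p_1, ..., p_k) first, then X | Z = z from the corresponding component:

```python
import numpy as np

rng = np.random.default_rng(0)

def rmixture(n, p, mu, sigma):
    """Simulate n points from a normal mixture through the latent allocations z."""
    z = rng.choice(len(p), size=n, p=p)          # Z ~ M_k(1; p_1, ..., p_k)
    x = rng.normal(loc=mu[z], scale=sigma[z])    # X | Z = z ~ N(mu_z, sigma_z^2)
    return x, z

x, z = rmixture(500, p=np.array([0.7, 0.3]),
                mu=np.array([0.0, 2.5]), sigma=np.array([1.0, 1.0]))
```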

Page 7: Bayesian inference on mixtures

Intro:Misg/Inference/Algorithms/Beyond fixed k 7

Nonparametric re-interpretation

Approximation of unknown distributions

E.g., Nadaraya–Watson kernel

$$k_n(x \mid \mathbf{x}) = \frac{1}{n h_n} \sum_{i=1}^{n} \varphi\left(x;\, x_i, h_n\right)$$

Page 8: Bayesian inference on mixtures

Intro:Misg/Inference/Algorithms/Beyond fixed k 8

Bernstein polynomials

Bounded continuous densities on [0, 1] approximated by Beta mixtures

$$\sum_{(\alpha_k,\beta_k)\in\mathbb{N}_+^2} p_k\, \mathrm{Be}(\alpha_k, \beta_k)\,, \qquad \alpha_k, \beta_k \in \mathbb{N}^*$$

[Consistency]

Associated predictive is then

$$f_n(x \mid \mathbf{x}) = \sum_{k=1}^{\infty} \sum_{j=1}^{k} \mathbb{E}^{\pi}\left[\omega_{kj} \mid \mathbf{x}\right] \mathrm{Be}(j,\, k+1-j)\; \mathbb{P}(K = k \mid \mathbf{x})\,.$$

[Petrone and Wasserman, 2002]
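As an illustration of the Bernstein idea (my own sketch, not Petrone and Wasserman's estimator: the weights are plain cdf increments G(j/k) − G((j−1)/k) rather than the posterior expectations E^π[ω_kj | x] above):

```python
import numpy as np
from scipy.stats import beta

def bernstein_density(x, k, cdf):
    """Order-k Bernstein approximation sum_j [cdf(j/k) - cdf((j-1)/k)] Be(j, k-j+1)(x)."""
    j = np.arange(1, k + 1)
    w = cdf(j / k) - cdf((j - 1) / k)          # mixture weights over the k Beta kernels
    return sum(wj * beta.pdf(x, jj, k - jj + 1) for wj, jj in zip(w, j))

# approximating a Be(2, 5) density on [0, 1] with a 30-component Beta mixture
x = np.linspace(0.0, 1.0, 201)
approx = bernstein_density(x, k=30, cdf=lambda u: beta.cdf(u, 2, 5))
```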

Page 9: Bayesian inference on mixtures

Intro:Misg/Inference/Algorithms/Beyond fixed k 9

[Figure: Realisations from the Bernstein prior]

Page 10: Bayesian inference on mixtures

Intro:Misg/Inference/Algorithms/Beyond fixed k 10

[Figure: Realisations from a more general prior]

Page 11: Bayesian inference on mixtures

Intro:Constancy/Inference/Algorithms/Beyond fixed k 11

Density estimation

[CPR & Rousseau, 2000–04]

Reparameterisation of a Beta mixture

$$p_0\, \mathcal{U}(0, 1) + (1 - p_0) \sum_{k=1}^{K} p_k\, \mathrm{Be}\!\left(\alpha_k \varepsilon_k,\, \alpha_k(1 - \varepsilon_k)\right)\,, \qquad \sum_{k \ge 1} p_k = 1\,,$$

with density $f_\psi$

Can approximate most distributions g on [0, 1]

Assumptions

– g is piecewise continuous on {x : g(x) < M} for all M's

– $\int g(x) \log g(x)\, dx < \infty$

Page 12: Bayesian inference on mixtures

Intro:Constancy/Inference/Algorithms/Beyond fixed k 12

Prior distributions

– π(K) has a light tail

$$\mathbb{P}(K \ge t\, n/\log n) \le \exp(-rn)$$

– p0 ∼ Be(a0, b0), a0 < 1, b0 > 1

– pk ∝ ωk and ωk ∼ Be(1, k)

– location-scale “hole” prior

$$(\alpha_k, \varepsilon_k) \sim \left\{1 - \exp\left[-\left\{\beta_1(\alpha_k - 2)^{c_3} + \beta_2(\varepsilon_k - .5)^{c_4}\right\}\right]\right\} \exp\left[-\tau_0\, \alpha_k^{c_0}/2 - \tau_1\big/\left\{\alpha_k^{2c_1}\, \varepsilon_k^{c_1} (1 - \varepsilon_k)^{c_1}\right\}\right],$$

Page 13: Bayesian inference on mixtures

Intro:Constancy/Inference/Algorithms/Beyond fixed k 13

Consistency results

Hellinger neighbourhood

$$A_\varepsilon(f_0) = \{f :\, d(f, f_0) \le \varepsilon\}$$

Then, for all ε > 0,

$$\pi\left[A_\varepsilon(g) \mid x_{1:n}\right] \to 1, \quad \text{as } n \to \infty, \quad g\text{-a.s.}$$

and

$$\mathbb{E}^{\pi}\left[d(g, f_\psi) \mid x_{1:n}\right] \to 0, \quad g\text{-a.s.}$$

Extension to general parametric distributions by the cdf transform $F_\theta(x)$

Page 14: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k 14

2 [B] Inference

Difficulties:

• identifiability

• label switching

• loss function

• ordering constraints

• prior determination

Page 15: Bayesian inference on mixtures

Intro/Inference:Identifiability/Algorithms/Beyond fixed k 15

Central (non)identifiability issue

$\sum_{j=1}^{k} p_j f(y \mid \theta_j)$ is invariant to relabelling of the components

Consequence

$$\left((p_j, \theta_j)\right)_{1 \le j \le k}$$

only known up to a permutation $\tau \in S_k$

Page 16: Bayesian inference on mixtures

Intro/Inference:Identifiability/Algorithms/Beyond fixed k 16

Example 1. Two component normal mixture

$$p\, \mathcal{N}(\mu_1, 1) + (1 - p)\, \mathcal{N}(\mu_2, 1)$$

where p ≠ 0.5 is known

The parameters µ1 and µ2 are identifiable

Page 17: Bayesian inference on mixtures

Intro/Inference:Identifiability/Algorithms/Beyond fixed k 17

Bimodal likelihood [500 observations and (µ1, µ2, p) = (0, 2.5, 0.7)]

[Figure: likelihood surface over (µ1, µ2)]

Page 18: Bayesian inference on mixtures

Intro/Inference:Identifiability/Algorithms/Beyond fixed k 18

Influence of p on the modes

[Figure: likelihood surfaces over (µ1, µ2) for p = 0.5, 0.6, 0.75, 0.85]

Page 19: Bayesian inference on mixtures

Intro/Inference:Com’ics/Algorithms/Beyond fixed k 19

Combinatorics

For a normal mixture,

$$p\,\varphi(x;\, \mu_1, \sigma_1) + (1 - p)\,\varphi(x;\, \mu_2, \sigma_2)$$

under the pseudo-conjugate priors (i = 1, 2)

$$\mu_i \mid \sigma_i \sim \mathcal{N}(\zeta_i,\, \sigma_i^2/\lambda_i)\,, \qquad \sigma_i^{-2} \sim \mathcal{G}a(\nu_i/2,\, s_i^2/2)\,, \qquad p \sim \mathrm{Be}(\alpha, \beta)\,,$$

the posterior is

$$\pi(\theta, p \mid \mathbf{x}) \propto \prod_{j=1}^{n} \left\{p\,\varphi(x_j;\, \mu_1, \sigma_1) + (1 - p)\,\varphi(x_j;\, \mu_2, \sigma_2)\right\} \pi(\theta, p)\,.$$

Computation: complexity O(2^n)

Page 20: Bayesian inference on mixtures

Intro/Inference:Com’ics/Algorithms/Beyond fixed k 20

Missing variables (2)

Auxiliary variables z = (z1, . . . , zn) ∈ Z associated with observations

x = (x1, . . . , xn)

For $(n_1, \ldots, n_k)$, where $n_1 + \ldots + n_k = n$,

$$\mathcal{Z}_j = \left\{ z :\; \sum_{i=1}^{n} \mathbb{I}_{z_i = 1} = n_1, \ldots, \sum_{i=1}^{n} \mathbb{I}_{z_i = k} = n_k \right\}$$

which consists of all allocations with the given partition sizes $(n_1, \ldots, n_k)$ (with $j$ indexing these size vectors in lexicographic order).

Page 21: Bayesian inference on mixtures

Intro/Inference:Com’ics/Algorithms/Beyond fixed k 21

Number of nonnegative integer solutions of this decomposition of n:

$$r = \binom{n + k - 1}{n}\,.$$

Partition

$$\mathcal{Z} = \bigcup_{i=1}^{r} \mathcal{Z}_i$$

[Number of partition sets of order $O(n^{k-1})$]

Page 22: Bayesian inference on mixtures

Intro/Inference:Com’ics/Algorithms/Beyond fixed k 22

Posterior decomposition

$$\pi(\theta, p \mid \mathbf{x}) = \sum_{i=1}^{r} \sum_{z \in \mathcal{Z}_i} \omega(z)\, \pi(\theta, p \mid \mathbf{x}, z)$$

with $\omega(z)$ the posterior probability of allocation $z$.

Corresponding representation of the posterior expectation of $(\theta, p)$:

$$\sum_{i=1}^{r} \sum_{z \in \mathcal{Z}_i} \omega(z)\, \mathbb{E}^{\pi}\left[\theta, p \mid \mathbf{x}, z\right]$$

Page 23: Bayesian inference on mixtures

Intro/Inference:Com’ics/Algorithms/Beyond fixed k 23

Very sensible from an inferential point of view: this representation

1. considers each possible allocation z of the dataset,

2. allocates a posterior probability ω(z) to this allocation, and

3. constructs a posterior distribution for the parameters conditional on this allocation.

All possible allocations: complexity O(k^n)

Page 24: Bayesian inference on mixtures

Intro/Inference:Com’ics/Algorithms/Beyond fixed k 24

Posterior

For a given permutation/allocation $(k_t)$, conditional posterior distribution

$$\pi(\theta \mid (k_t)) = \mathcal{N}\!\left(\xi_1(k_t),\, \frac{\sigma_1^2}{n_1 + \ell}\right) \times \mathcal{IG}\!\left(\frac{\nu_1 + \ell}{2},\, \frac{s_1(k_t)}{2}\right) \times \mathcal{N}\!\left(\xi_2(k_t),\, \frac{\sigma_2^2}{n_2 + n - \ell}\right) \times \mathcal{IG}\!\left(\frac{\nu_2 + n - \ell}{2},\, \frac{s_2(k_t)}{2}\right) \times \mathrm{Be}(\alpha + \ell,\, \beta + n - \ell)$$

Page 25: Bayesian inference on mixtures

Intro/Inference:Com’ics/Algorithms/Beyond fixed k 25

where

$$\bar{x}_1(k_t) = \frac{1}{\ell} \sum_{t=1}^{\ell} x_{k_t}\,, \qquad \hat{s}_1(k_t) = \sum_{t=1}^{\ell} \left(x_{k_t} - \bar{x}_1(k_t)\right)^2,$$

$$\bar{x}_2(k_t) = \frac{1}{n - \ell} \sum_{t=\ell+1}^{n} x_{k_t}\,, \qquad \hat{s}_2(k_t) = \sum_{t=\ell+1}^{n} \left(x_{k_t} - \bar{x}_2(k_t)\right)^2$$

and

$$\xi_1(k_t) = \frac{n_1 \xi_1 + \ell\, \bar{x}_1(k_t)}{n_1 + \ell}\,, \qquad \xi_2(k_t) = \frac{n_2 \xi_2 + (n - \ell)\, \bar{x}_2(k_t)}{n_2 + n - \ell}\,,$$

$$s_1(k_t) = s_1^2 + \hat{s}_1(k_t) + \frac{n_1 \ell}{n_1 + \ell}\left(\xi_1 - \bar{x}_1(k_t)\right)^2,$$

$$s_2(k_t) = s_2^2 + \hat{s}_2(k_t) + \frac{n_2 (n - \ell)}{n_2 + n - \ell}\left(\xi_2 - \bar{x}_2(k_t)\right)^2,$$

the posterior updates of the hyperparameters

Page 26: Bayesian inference on mixtures

Intro/Inference:Scarcity/Algorithms/Beyond fixed k 26

Scarcity

Frustrating barrier:

Almost all posterior probabilities ω (z) are zero

Example 2. Galaxy dataset with k = 4 components: the set of allocations with partition sizes (n1, n2, n3, n4) = (7, 34, 38, 3) has probability 0.59, the one with (n1, n2, n3, n4) = (7, 30, 27, 18) has probability 0.32, and no other size group gets a probability above 0.01.

Page 27: Bayesian inference on mixtures

Intro/Inference:Scarcity/Algorithms/Beyond fixed k 27

Example 3. Normal mean mixture

For the same normal prior on both means,

$$\mu_1, \mu_2 \sim \mathcal{N}(0, 10)$$

the posterior weight associated with a $z$ such that

$$\sum_{i=1}^{n} \mathbb{I}_{z_i = 1} = l$$

is

$$\omega(z) \propto \sqrt{(l + 1/4)(n - l + 1/4)}\;\; p^l (1 - p)^{n - l}\,.$$

Thus the posterior distribution of $z$ depends only on $l$, and the distribution of the partition size is close to a binomial $\mathcal{B}(n, p)$ distribution.

Page 28: Bayesian inference on mixtures

Intro/Inference:Scarcity/Algorithms/Beyond fixed k 28

For two different normal priors on the means,

$$\mu_1 \sim \mathcal{N}(0, 4)\,, \qquad \mu_2 \sim \mathcal{N}(2, 4)\,,$$

the posterior weight of $z$ is

$$\omega(z) \propto \sqrt{(l + 1/4)(n - l + 1/4)}\;\; p^l (1 - p)^{n - l} \times \exp\left\{-\left[(l + 1/4)\, s_1(z) + l\, \{\bar{x}_1(z)\}^2/4\right]/2\right\} \times \exp\left\{-\left[(n - l + 1/4)\, s_2(z) + (n - l)\, \{\bar{x}_2(z) - 2\}^2/4\right]/2\right\}$$

where

$$\bar{x}_1(z) = \frac{1}{l} \sum_{i=1}^{n} \mathbb{I}_{z_i = 1}\, x_i\,, \qquad \bar{x}_2(z) = \frac{1}{n - l} \sum_{i=1}^{n} \mathbb{I}_{z_i = 2}\, x_i\,,$$

$$s_1(z) = \sum_{i=1}^{n} \mathbb{I}_{z_i = 1}\left(x_i - \bar{x}_1(z)\right)^2, \qquad s_2(z) = \sum_{i=1}^{n} \mathbb{I}_{z_i = 2}\left(x_i - \bar{x}_2(z)\right)^2.$$

Page 29: Bayesian inference on mixtures

Intro/Inference:Scarcity/Algorithms/Beyond fixed k 29

Exact computation of the weights of all partition sizes l is impossible

Monte Carlo experiment by drawing z’s at random.

Example 4. A sample of 45 points simulated with p = 0.7, µ1 = 0 and µ2 = 2.5 leads to l = 23 as the most likely partition size, with a weight approximated by 0.962. For l = 27, the weight is approximated by 4.56 × 10^{-11}.
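A hedged sketch of such a Monte Carlo experiment (my own code: it computes log ω(z) from the exact marginal likelihood of each group under the two normal priors µ1 ∼ N(0, 4), µ2 ∼ N(2, 4) of the previous slide, with unit observation variance, rather than from the displayed formula):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_marginal(y, delta, tau2):
    """log m(y) for y_i | mu ~ N(mu, 1) iid and mu ~ N(delta, tau2), mu integrated out."""
    m = len(y)
    if m == 0:
        return 0.0
    ybar, ss = y.mean(), ((y - y.mean()) ** 2).sum()
    return (-0.5 * m * np.log(2 * np.pi) - 0.5 * np.log(1 + m * tau2)
            - 0.5 * (ss + m * (ybar - delta) ** 2 / (1 + m * tau2)))

def log_weight(z, x, p, priors=((0.0, 4.0), (2.0, 4.0))):
    """log omega(z), up to an additive constant, for binary allocations z."""
    l = int((z == 0).sum())
    return (l * np.log(p) + (len(x) - l) * np.log(1 - p)
            + log_marginal(x[z == 0], *priors[0])
            + log_marginal(x[z == 1], *priors[1]))

# data simulated as in Example 4: n = 45, p = 0.7, mu1 = 0, mu2 = 2.5
n, p = 45, 0.7
comp = (rng.random(n) > p).astype(int)
x = np.where(comp == 1, rng.normal(2.5, 1.0, n), rng.normal(0.0, 1.0, n))

# crude Monte Carlo: draw allocations at random, track the best log-weight per size l
zs = rng.integers(0, 2, size=(100_000, n))
logw = np.array([log_weight(z, x, p) for z in zs])
ls = (zs == 0).sum(axis=1)
best = {l: logw[ls == l].max() for l in np.unique(ls)}
```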

Page 30: Bayesian inference on mixtures

Intro/Inference:Scarcity/Algorithms/Beyond fixed k 30

[Figure: histograms of log ω(kt) for randomly drawn allocations with l = 23 and l = 29]

Page 31: Bayesian inference on mixtures

Intro/Inference:Scarcity/Algorithms/Beyond fixed k 31

Ten highest log-weights ω(z) (up to an additive constant)

[Figure: ten highest log-weights plotted against l]

Page 32: Bayesian inference on mixtures

Intro/Inference:Scarcity/Algorithms/Beyond fixed k 32

Most likely allocation z for a simulated dataset of 45 observations


Page 33: Bayesian inference on mixtures

Intro/Inference:Scarcity/Algorithms/Beyond fixed k 33

Caution! We simulated 450,000 permutations, to be compared with a total of 2^45 permutations!

Page 34: Bayesian inference on mixtures

Intro/Inference: Priors/Algorithms/Beyond fixed k 34

Prior selection

Basic difficulty: if exchangeable prior used on

θ = (θ1, . . . , θk)

all marginals on the θi’s are identical

Posterior expectation of θ1 identical to posterior expectation of θ2!

Page 35: Bayesian inference on mixtures

Intro/Inference: Priors/Algorithms/Beyond fixed k 35

Identifiability constraints

Prior restriction by identifiability constraint on the mixture parameters, for instance

by ordering the means [or the variances or the weights]

Not so innocuous!

• truncation unrelated to the topology of the posterior distribution

• may induce a posterior expectation in a low probability region

• modifies the prior modelling

[Figure: posterior histograms of the ordered components θ(1), θ(10), θ(19)]

Page 36: Bayesian inference on mixtures

Intro/Inference: Priors/Algorithms/Beyond fixed k 36

• with many components, ordering in terms of one type of parameter is unrealistic

• poor estimation (posterior mean)

[Figure: histograms of p, theta and tau under Gibbs sampling, random walk, Langevin and tempered random walk samplers]

• poor exploration (MCMC)

Page 37: Bayesian inference on mixtures

Intro/Inference: Priors/Algorithms/Beyond fixed k 37

Improper priors??

Independent improper priors,

$$\pi(\theta) = \prod_{i=1}^{k} \pi_i(\theta_i)\,,$$

cannot be used since, if

$$\int \pi_i(\theta_i)\, d\theta_i = \infty$$

then, for every n,

$$\int \pi(\theta, p \mid \mathbf{x})\, d\theta\, dp = \infty$$

Still, some improper priors can be used when the impropriety is on a common

(location/scale) parameter

[CPR & Titterington, 1998]

Page 38: Bayesian inference on mixtures

Intro/Inference: Loss/Algorithms/Beyond fixed k 38

Loss functions

Once a sample can be produced from the unconstrained posterior distribution, an

ordering constraint can be imposed ex post

[Stephens, 1997]

Good for MCMC exploration

Page 39: Bayesian inference on mixtures

Intro/Inference: Loss/Algorithms/Beyond fixed k 39

Again, difficult assessment of the true effect of the ordering constraints...

order   p1     p2     p3     θ1     θ2     θ3     σ1     σ2     σ3
p       0.231  0.311  0.458  0.321  -0.55  2.28   0.41   0.471  0.303
θ       0.297  0.246  0.457  -1.1   0.83   2.33   0.357  0.543  0.284
σ       0.375  0.331  0.294  1.59   0.083  0.379  0.266  0.34   0.579
true    0.22   0.43   0.35   1.1    2.4    -0.95  0.3    0.2    0.5


Page 40: Bayesian inference on mixtures

Intro/Inference: Loss/Algorithms/Beyond fixed k 40

Pivotal quantity

For a permutation τ ∈ Sk, corresponding permutation of the parameter

$$\tau(\theta, p) = \left\{(\theta_{\tau(1)}, \ldots, \theta_{\tau(k)}),\, (p_{\tau(1)}, \ldots, p_{\tau(k)})\right\}$$

does not modify the value of the likelihood (& posterior under exchangeability).

Label switching phenomenon

Page 41: Bayesian inference on mixtures

Intro/Inference: Loss/Algorithms/Beyond fixed k 41

Reordering scheme:

Based on a simulated sample of size M,

(i) compute the pivot $(\theta, p)^{(i^*)}$ such that

$$i^* = \arg\max_{i=1,\ldots,M} \pi\left((\theta, p)^{(i)} \mid \mathbf{x}\right)$$

Monte Carlo approximation of the MAP estimator of $(\theta, p)$.

(ii) For $i \in \{1, \ldots, M\}$:

1. Compute

$$\tau_i = \arg\min_{\tau \in S_k} d\left(\tau\left((\theta, p)^{(i)}\right),\, (\theta, p)^{(i^*)}\right)$$

2. Set $(\theta, p)^{(i)} = \tau_i\left((\theta, p)^{(i)}\right)$.

Page 42: Bayesian inference on mixtures

Intro/Inference: Loss/Algorithms/Beyond fixed k 42

Step (ii) chooses the reordering closest to the MAP estimator.

After reordering, the Monte Carlo posterior expectation is

$$\sum_{j=1}^{M} (\theta_i)^{(j)} \big/ M\,.$$
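A brute-force sketch of this reordering (my own code; it enumerates S_k, so it is only meant for small k, and uses the squared Euclidean distance on the stacked (θ, p) vector as d):

```python
import numpy as np
from itertools import permutations

def reorder_to_pivot(theta, p, log_post):
    """Relabel each draw with the permutation closest to the pivot (MAP proxy).

    theta, p : (M, k) arrays of simulated component parameters and weights
    log_post : (M,) values of log pi((theta, p)^(i) | x), up to a constant
    """
    i_star = int(np.argmax(log_post))                       # step (i): the pivot draw
    pivot = np.concatenate([theta[i_star], p[i_star]])
    perms = [list(t) for t in permutations(range(theta.shape[1]))]
    for i in range(theta.shape[0]):                         # step (ii): relabel draw i
        dists = [np.sum((np.concatenate([theta[i, tau], p[i, tau]]) - pivot) ** 2)
                 for tau in perms]
        tau = perms[int(np.argmin(dists))]
        theta[i], p[i] = theta[i, tau], p[i, tau]
    return theta, p
```

Posterior expectations such as the average above are then computed on the reordered output.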

Page 43: Bayesian inference on mixtures

Intro/Inference: Loss/Algorithms/Beyond fixed k 43

Probabilistic alternative

[Jasra, Holmes & Stephens, 2004]

Also put a prior on permutations σ ∈ Sk

Defines a specific model M based on a preliminary estimate (e.g., by relabelling)

Computes

θj =1N

n∑t=1

σ∈Sk

θ(t)σ(j)p(σ|θ(t), M)

Page 44: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k 44

3 Computations

Page 45: Bayesian inference on mixtures

Intro/Inference/Algorithms: Gibbs/Beyond fixed k 45

3.1 Gibbs sampling

Same idea as the EM algorithm: take advantage of the missing data representation

General Gibbs sampling for mixture models

0. Initialization: choose p(0) and θ(0) arbitrarily

1. Step t. For t = 1, . . .

1.1 Generate $z_i^{(t)}$ ($i = 1, \ldots, n$) from ($j = 1, \ldots, k$)

$$\mathbb{P}\left(z_i^{(t)} = j \mid p_j^{(t-1)}, \theta_j^{(t-1)}, x_i\right) \propto p_j^{(t-1)}\, f\left(x_i \mid \theta_j^{(t-1)}\right)$$

1.2 Generate p(t) from π(p|z(t)),

1.3 Generate θ(t) from π(θ|z(t), x).

Page 46: Bayesian inference on mixtures

Intro/Inference/Algorithms: Gibbs/Beyond fixed k 46

Trapping states

Gibbs sampling may lead to trapping states, concentrated local modes that require

an enormous number of iterations to escape from, e.g., components with a small

number of allocated observations and very small variance

[Diebolt & CPR, 1990]

Also, most MCMC samplers fail to reproduce the permutation invariance of the

posterior distribution, that is, do not visit the k! replications of a given mode.

[Celeux, Hurn & CPR, 2000]

Page 47: Bayesian inference on mixtures

Intro/Inference/Algorithms: Gibbs/Beyond fixed k 47

Example 5. Mean normal mixture

0. Initialization. Choose $\mu_1^{(0)}$ and $\mu_2^{(0)}$,

1. Step t. For t = 1, . . .

1.1 Generate $z_i^{(t)}$ ($i = 1, \ldots, n$) from

$$\mathbb{P}\left(z_i^{(t)} = 1\right) = 1 - \mathbb{P}\left(z_i^{(t)} = 2\right) \propto p \exp\left(-\frac{1}{2}\left(x_i - \mu_1^{(t-1)}\right)^2\right)$$

1.2 Compute

$$n_j^{(t)} = \sum_{i=1}^{n} \mathbb{I}_{z_i^{(t)} = j} \qquad \text{and} \qquad (s_j^x)^{(t)} = \sum_{i=1}^{n} \mathbb{I}_{z_i^{(t)} = j}\, x_i$$

1.3 Generate $\mu_j^{(t)}$ ($j = 1, 2$) from

$$\mathcal{N}\left(\frac{\lambda\delta + (s_j^x)^{(t)}}{\lambda + n_j^{(t)}},\; \frac{1}{\lambda + n_j^{(t)}}\right).$$
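A runnable sketch of this two-mean Gibbs sampler (my own code; p, δ and λ are fixed inputs matching the full conditional above):

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_mean_mixture(x, p=0.7, delta=0.0, lam=0.1, T=10_000):
    """Gibbs sampler for p N(mu1, 1) + (1-p) N(mu2, 1) with known p and
    independent mu_j ~ N(delta, 1/lam) priors."""
    n = len(x)
    mu = np.array([x.min(), x.max()])                 # arbitrary initialisation
    chain = np.empty((T, 2))
    for t in range(T):
        # 1.1 allocations z_i
        w1 = p * np.exp(-0.5 * (x - mu[0]) ** 2)
        w2 = (1.0 - p) * np.exp(-0.5 * (x - mu[1]) ** 2)
        z = (rng.random(n) * (w1 + w2) > w1).astype(int)   # 0 -> comp. 1, 1 -> comp. 2
        # 1.2 sufficient statistics n_j and s_j^x
        nj = np.array([(z == 0).sum(), (z == 1).sum()])
        sx = np.array([x[z == 0].sum(), x[z == 1].sum()])
        # 1.3 means from their full conditionals
        mu = rng.normal((lam * delta + sx) / (lam + nj), 1.0 / np.sqrt(lam + nj))
        chain[t] = mu
    return chain

x = np.concatenate([rng.normal(0.0, 1.0, 350), rng.normal(2.5, 1.0, 150)])
chain = gibbs_mean_mixture(x)
```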

Page 48: Bayesian inference on mixtures

Intro/Inference/Algorithms: Gibbs/Beyond fixed k 48

[Figure: Gibbs sample in the (µ1, µ2) space]

Page 49: Bayesian inference on mixtures

Intro/Inference/Algorithms: Gibbs/Beyond fixed k 49

But...

[Figure: Gibbs sample in the (µ1, µ2) space]

Page 50: Bayesian inference on mixtures

Intro/Inference/Algorithms: HM/Beyond fixed k 50

3.2 Metropolis–Hastings

Missing data structure is not necessary for MCMC implementation: the mixture

likelihood is available in closed form and computable in O(kn) time:

Page 51: Bayesian inference on mixtures

Intro/Inference/Algorithms: HM/Beyond fixed k 51

Step t. For t = 1, . . .

1.1 Generate $(\tilde\theta, \tilde p)$ from $q\left(\theta, p \mid \theta^{(t-1)}, p^{(t-1)}\right)$,

1.2 Compute

$$r = \frac{f(\mathbf{x} \mid \tilde\theta, \tilde p)\, \pi(\tilde\theta, \tilde p)\, q\left(\theta^{(t-1)}, p^{(t-1)} \mid \tilde\theta, \tilde p\right)}{f\left(\mathbf{x} \mid \theta^{(t-1)}, p^{(t-1)}\right) \pi\left(\theta^{(t-1)}, p^{(t-1)}\right) q\left(\tilde\theta, \tilde p \mid \theta^{(t-1)}, p^{(t-1)}\right)}\,,$$

1.3 Generate $u \sim \mathcal{U}_{[0,1]}$:

if $u \le r$ then $(\theta^{(t)}, p^{(t)}) = (\tilde\theta, \tilde p)$,

else $(\theta^{(t)}, p^{(t)}) = (\theta^{(t-1)}, p^{(t-1)})$.

Page 52: Bayesian inference on mixtures

Intro/Inference/Algorithms: HM/Beyond fixed k 52

Proposal

Use of random walk inefficient for constrained parameters like the weights and the

variances.

Reparameterisation:

For the weights p, overparameterise the model as

$$p_j = w_j \Big/ \sum_{l=1}^{k} w_l\,, \qquad w_j > 0$$

[Cappe, Ryden & CPR]

The wj ’s are not identifiable, but this is not a problem.

Proposed move on the $w_j$'s is

$$\log(w_j) = \log\left(w_j^{(t-1)}\right) + u_j\,, \qquad u_j \sim \mathcal{N}(0, \zeta^2)$$
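A small sketch of this move (my own code; ζ is a tuning parameter):

```python
import numpy as np

rng = np.random.default_rng(3)

def propose_weights(w, zeta=0.3):
    """Random-walk move on the unnormalised weights: log w_j' = log w_j + u_j,
    u_j ~ N(0, zeta^2); returns both w' and the induced p' = w'/sum(w')."""
    w_new = np.exp(np.log(w) + rng.normal(0.0, zeta, size=len(w)))
    return w_new, w_new / w_new.sum()
```

Since the walk is symmetric in log w, the proposal ratio q(w^(t-1) | w')/q(w' | w^(t-1)) contributes a factor ∏_j w'_j / w_j^(t-1) to r when the target density is written in the w parameterisation.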

Page 53: Bayesian inference on mixtures

Intro/Inference/Algorithms: HM/Beyond fixed k 53

Example 6. Mean normal mixture

Gaussian random walk proposal

$$\tilde\mu_1 \sim \mathcal{N}\left(\mu_1^{(t-1)}, \zeta^2\right) \qquad \text{and} \qquad \tilde\mu_2 \sim \mathcal{N}\left(\mu_2^{(t-1)}, \zeta^2\right)$$

associated with

Page 54: Bayesian inference on mixtures

Intro/Inference/Algorithms: HM/Beyond fixed k 54

0. Initialization. Choose $\mu_1^{(0)}$ and $\mu_2^{(0)}$

1. Step t. For t = 1, . . .

1.1 Generate $\tilde\mu_j$ ($j = 1, 2$) from $\mathcal{N}\left(\mu_j^{(t-1)}, \zeta^2\right)$,

1.2 Compute

$$r = \frac{f(\mathbf{x} \mid \tilde\mu_1, \tilde\mu_2)\, \pi(\tilde\mu_1, \tilde\mu_2)}{f\left(\mathbf{x} \mid \mu_1^{(t-1)}, \mu_2^{(t-1)}\right) \pi\left(\mu_1^{(t-1)}, \mu_2^{(t-1)}\right)}\,,$$

1.3 Generate $u \sim \mathcal{U}_{[0,1]}$:

if $u \le r$ then $\left(\mu_1^{(t)}, \mu_2^{(t)}\right) = (\tilde\mu_1, \tilde\mu_2)$,

else $\left(\mu_1^{(t)}, \mu_2^{(t)}\right) = \left(\mu_1^{(t-1)}, \mu_2^{(t-1)}\right)$.
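A runnable sketch of this random-walk sampler (my own code, assuming N(0, 10) priors on both means as in Example 3; any other prior with an available density would do):

```python
import numpy as np

rng = np.random.default_rng(4)

def log_posterior(mu, x, p=0.7, prior_var=10.0):
    """log pi(mu1, mu2 | x) up to a constant: mixture likelihood times normal priors."""
    lik = p * np.exp(-0.5 * (x - mu[0]) ** 2) + (1 - p) * np.exp(-0.5 * (x - mu[1]) ** 2)
    return np.log(lik).sum() - 0.5 * (mu ** 2).sum() / prior_var

def rw_metropolis(x, T=10_000, zeta=1.0):
    """Random-walk Metropolis-Hastings on (mu1, mu2) for the mean mixture."""
    mu = np.array([x.min(), x.max()])
    lp = log_posterior(mu, x)
    chain = np.empty((T, 2))
    for t in range(T):
        prop = mu + rng.normal(0.0, zeta, size=2)        # symmetric Gaussian proposal
        lp_prop = log_posterior(prop, x)
        if np.log(rng.random()) <= lp_prop - lp:         # accept with probability min(r, 1)
            mu, lp = prop, lp_prop
        chain[t] = mu
    return chain

x = np.concatenate([rng.normal(0.0, 1.0, 350), rng.normal(2.5, 1.0, 150)])
chain = rw_metropolis(x)
```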

Page 55: Bayesian inference on mixtures

Intro/Inference/Algorithms: HM/Beyond fixed k 55

[Figure: Metropolis–Hastings sample in the (µ1, µ2) space]

Page 56: Bayesian inference on mixtures

Intro/Inference/Algorithms: PMC/Beyond fixed k 56

3.3 Population Monte Carlo

Idea: Apply dynamic importance sampling to simulate a sequence of iid samples

$$x^{(t)} = \left(x_1^{(t)}, \ldots, x_n^{(t)}\right) \overset{\text{iid}}{\approx} \pi(x)$$

where t is a simulation iteration index (at sample level)

Page 57: Bayesian inference on mixtures

Intro/Inference/Algorithms: PMC/Beyond fixed k 57

Dependent importance sampling

The importance distribution of the sample x(t)

$$q_t\left(x^{(t)} \mid x^{(t-1)}\right)$$

can depend on the previous sample $x^{(t-1)}$ in any possible way as long as the marginal distributions

$$q_{it}(x) = \int q_t\left(x^{(t)}\right)\, dx^{(t)}_{-i}$$

can be expressed to build the importance weights

$$\varrho_{it} = \frac{\pi\left(x_i^{(t)}\right)}{q_{it}\left(x_i^{(t)}\right)}$$

Page 58: Bayesian inference on mixtures

Intro/Inference/Algorithms: PMC/Beyond fixed k 58

Special case

$$q_t\left(x^{(t)} \mid x^{(t-1)}\right) = \prod_{i=1}^{n} q_{it}\left(x_i^{(t)} \mid x^{(t-1)}\right)$$

[Independent proposals]

In that case,

$$\mathrm{var}\left(I_t\right) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{var}\left(\varrho_i^{(t)}\, h\left(x_i^{(t)}\right)\right).$$

Page 59: Bayesian inference on mixtures

Intro/Inference/Algorithms: PMC/Beyond fixed k 59

Population Monte Carlo (PMC)

Use the previous sample ($x^{(t)}$), marginally distributed from π:

$$\mathbb{E}\left[\varrho_{it}\, h\left(X_i^{(t)}\right)\right] = \mathbb{E}\left[\int \frac{\pi\left(x_i^{(t)}\right)}{q_{it}\left(x_i^{(t)}\right)}\, h\left(x_i^{(t)}\right)\, q_{it}\left(x_i^{(t)}\right)\, dx_i^{(t)}\right] = \mathbb{E}\left[\mathbb{E}^{\pi}[h(X)]\right]$$

to improve on the approximation of π

Page 60: Bayesian inference on mixtures

Intro/Inference/Algorithms: PMC/Beyond fixed k 60

Resampling

Over iterations (in t), weights may degenerate:

e.g.,

$$\varrho_1 \simeq 1 \qquad \text{while} \qquad \varrho_2, \ldots, \varrho_n \text{ negligible}$$

Use instead Rubin's (1987) systematic resampling: at each iteration, resample the $x_i^{(t)}$'s according to their weights $\varrho_i^{(t)}$ and reset the weights to 1 (preserves "unbiasedness"/increases variance)

Page 61: Bayesian inference on mixtures

Intro/Inference/Algorithms: PMC/Beyond fixed k 61

PMC for mixtures

Proposal distributions $q_{it}$ that simulate $\left(\theta_{(t)}^{(i)}, p_{(t)}^{(i)}\right)$, with associated importance weight

$$\rho_{(t)}^{(i)} = \frac{f\left(\mathbf{x} \mid \theta_{(t)}^{(i)}, p_{(t)}^{(i)}\right) \pi\left(\theta_{(t)}^{(i)}, p_{(t)}^{(i)}\right)}{q_{it}\left(\theta_{(t)}^{(i)}, p_{(t)}^{(i)}\right)}\,, \qquad i = 1, \ldots, M$$

Approximations of the form

$$\sum_{i=1}^{M} \frac{\rho_{(t)}^{(i)}}{\sum_{l=1}^{M} \rho_{(t)}^{(l)}}\; h\left(\theta_{(t)}^{(i)}, p_{(t)}^{(i)}\right)$$

give (almost) unbiased estimators of $\mathbb{E}^{\pi_x}[h(\theta, p)]$,

Page 62: Bayesian inference on mixtures

Intro/Inference/Algorithms: PMC/Beyond fixed k 62

0. Initialization. Choose $\theta_{(0)}^{(1)}, \ldots, \theta_{(0)}^{(M)}$ and $p_{(0)}^{(1)}, \ldots, p_{(0)}^{(M)}$

1. Step t. For t = 1, . . . , T

1.1 For i = 1, . . . , M

1.1.1 Generate $\left(\theta_{(t)}^{(i)}, p_{(t)}^{(i)}\right)$ from $q_{it}(\theta, p)$,

1.1.2 Compute

$$\rho^{(i)} = f\left(\mathbf{x} \mid \theta_{(t)}^{(i)}, p_{(t)}^{(i)}\right) \pi\left(\theta_{(t)}^{(i)}, p_{(t)}^{(i)}\right) \Big/ q_{it}\left(\theta_{(t)}^{(i)}, p_{(t)}^{(i)}\right),$$

1.2 Compute $\omega^{(i)} = \rho^{(i)} \Big/ \sum_{l=1}^{M} \rho^{(l)}$,

1.3 Resample M values with replacement from the $\left(\theta_{(t)}^{(i)}, p_{(t)}^{(i)}\right)$'s using the weights $\omega^{(i)}$

Page 63: Bayesian inference on mixtures

Intro/Inference/Algorithms: PMC/Beyond fixed k 63

Example 7. Mean normal mixture

Implementation without the Gibbs augmentation step, using normal random walk

proposals based on the previous sample of (µ1, µ2)’s as in Metropolis–Hastings.

Selection of a “proper” scale:

bypassed by the adaptivity of the PMC algorithm

Several proposals associated with a range of variances vk, k = 1, . . . , K .

At each step, new variances can be selected proportionally to the performance of the scales $v_k$ on the previous iterations, for instance proportionally to their non-degeneracy rates

Page 64: Bayesian inference on mixtures

Intro/Inference/Algorithms: PMC/Beyond fixed k 64

Step t. For t = 1, . . . , T

1.1 For i = 1, . . . , M

1.1.1 Generate k from $\mathcal{M}(1; r_1, \ldots, r_K)$,

1.1.2 Generate $(\mu_j)_{(t)}^{(i)}$ ($j = 1, 2$) from $\mathcal{N}\left((\mu_j)_{(t-1)}^{(i)},\, v_k\right)$

1.1.3 Compute

$$\rho^{(i)} = \frac{f\left(\mathbf{x} \mid (\mu_1)_{(t)}^{(i)}, (\mu_2)_{(t)}^{(i)}\right) \pi\left((\mu_1)_{(t)}^{(i)}, (\mu_2)_{(t)}^{(i)}\right)}{\displaystyle\sum_{l=1}^{K} \prod_{j=1}^{2} \varphi\left((\mu_j)_{(t)}^{(i)};\, (\mu_j)_{(t-1)}^{(i)}, v_l\right)}\,,$$

1.2 Compute $\omega^{(i)} = \rho^{(i)} \Big/ \sum_{l=1}^{M} \rho^{(l)}$,

1.3 Resample the $\left((\mu_1)_{(t)}^{(i)}, (\mu_2)_{(t)}^{(i)}\right)$'s using the weights $\omega^{(i)}$

1.4 Update the $r_l$'s: $r_l$ is proportional to the number of resampled $\left((\mu_1)_{(t)}^{(i)}, (\mu_2)_{(t)}^{(i)}\right)$'s generated with variance $v_l$.
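A sketch of this multiple-scale PMC scheme (my own code; I use the r_l-weighted scale mixture as the proposal density in the importance weight, and a small smoothing constant when updating the r_l's to keep every scale alive, both of which are choices not spelled out on the slide):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

def pmc_mean_mixture(x, p=0.7, M=1_000, T=20, scales=(0.01, 0.1, 1.0, 5.0),
                     prior_var=10.0):
    """Population Monte Carlo for (mu1, mu2) with adaptive random-walk scales."""
    K = len(scales)
    r = np.full(K, 1.0 / K)                            # scale probabilities r_1, ..., r_K
    mus = rng.normal(0.0, np.sqrt(prior_var), size=(M, 2))     # initial population

    def log_post(m):
        lik = p * np.exp(-0.5 * (x[None, :] - m[:, [0]]) ** 2) \
            + (1 - p) * np.exp(-0.5 * (x[None, :] - m[:, [1]]) ** 2)
        return np.log(lik).sum(axis=1) - 0.5 * (m ** 2).sum(axis=1) / prior_var

    for t in range(T):
        k = rng.choice(K, size=M, p=r)                 # 1.1.1 scale index per particle
        step = rng.normal(0.0, np.sqrt(np.asarray(scales)[k])[:, None], size=(M, 2))
        prop = mus + step                              # 1.1.2 random-walk proposal
        dens = np.zeros(M)                             # scale-mixture proposal density
        for l, v in enumerate(scales):
            dens += r[l] * norm.pdf(prop, loc=mus, scale=np.sqrt(v)).prod(axis=1)
        logw = log_post(prop) - np.log(dens)           # 1.1.3 importance weights
        w = np.exp(logw - logw.max())
        w /= w.sum()                                   # 1.2 normalisation
        idx = rng.choice(M, size=M, p=w)               # 1.3 multinomial resampling
        mus = prop[idx]
        counts = np.bincount(k[idx], minlength=K)      # 1.4 survival count per scale
        r = (counts + 1.0) / (counts + 1.0).sum()
    return mus

x = np.concatenate([rng.normal(0.0, 1.0, 350), rng.normal(2.5, 1.0, 150)])
population = pmc_mean_mixture(x)
```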

Page 65: Bayesian inference on mixtures

Intro/Inference/Algorithms: PMC/Beyond fixed k 65

[Figure: PMC sample in the (µ1, µ2) space]

Page 66: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k 66

4 Unknown number of components

When the number k of components is unknown, there are several models M_k, with corresponding parameter sets Θ_k, in competition.

Page 67: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: RJ 67

Reversible jump MCMC

Reversibility constraint put on dimension-changing moves that bridge the sets Θk /

the models Mk

[Green, 1995]

Local reversibility for each pair (k1, k2) of possible values of k: supplement Θk1

and Θk2 with adequate artificial spaces in order to create a bijection between them:

Page 68: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: RJ 68

Basic steps

Choice of probabilities $\pi_{ij}$, with $\sum_j \pi_{ij} = 1$, of jumping to model $M_{k_j}$ while in model $M_{k_i}$

$\theta^{(k_1)}$ is completed by a simulation $u_1 \sim g_1(u_1)$ into $(\theta^{(k_1)}, u_1)$ and $\theta^{(k_2)}$ by $u_2 \sim g_2(u_2)$ into $(\theta^{(k_2)}, u_2)$:

$$\left(\theta^{(k_2)}, u_2\right) = T_{k_1 \to k_2}\left(\theta^{(k_1)}, u_1\right),$$

Page 69: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: RJ 69

Green reversible jump algorithm

0. At iteration t, if $x^{(t)} = (m, \theta^{(m)})$,

1. Select model $M_n$ with probability $\pi_{mn}$,

2. Generate $u_{mn} \sim \varphi_{mn}(u)$,

3. Set $(\theta^{(n)}, v_{nm}) = T_{m \to n}(\theta^{(m)}, u_{mn})$,

4. Take $x^{(t+1)} = (n, \theta^{(n)})$ with probability

$$\min\left( \frac{\pi(n, \theta^{(n)})}{\pi(m, \theta^{(m)})}\; \frac{\pi_{nm}\, \varphi_{nm}(v_{nm})}{\pi_{mn}\, \varphi_{mn}(u_{mn})}\; \left| \frac{\partial T_{m \to n}(\theta^{(m)}, u_{mn})}{\partial (\theta^{(m)}, u_{mn})} \right|,\; 1 \right),$$

and take $x^{(t+1)} = x^{(t)}$ otherwise.

Page 70: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: RJ 70

Example 8. For a normal mixture

$$M_k :\quad \sum_{j=1}^{k} p_{jk}\, \mathcal{N}\left(\mu_{jk}, \sigma_{jk}^2\right),$$

restriction to moves from Mk to neighbouring models Mk+1 and Mk−1.

[Richardson & Green, 1997]

Page 71: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: RJ 71

Birth and death steps

birth adds a new normal component generated from the prior

death removes one of the k components at random.

Birth acceptance probability

$$\min\left( \frac{\pi_{(k+1)k}}{\pi_{k(k+1)}}\; \frac{(k+1)!}{k!}\; \frac{\pi_{k+1}(\theta_{k+1})}{\pi_k(\theta_k)\, (k+1)\, \varphi_{k(k+1)}\left(u_{k(k+1)}\right)},\; 1 \right) = \min\left( \frac{\pi_{(k+1)k}}{\pi_{k(k+1)}}\; \frac{\varrho(k+1)}{\varrho(k)}\; \frac{\ell_{k+1}(\theta_{k+1})\, (1 - p_{k+1})^{k-1}}{\ell_k(\theta_k)},\; 1 \right),$$

where $\varrho(k)$ is the prior probability of model $M_k$

Page 72: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: RJ 72

Proposal that can work well in some settings, but can also be inefficient (i.e. high

rejection rate), if the prior is vague.

Alternative: devise more local jumps between models,

(i). split

$$p_{jk} = p_{j(k+1)} + p_{(j+1)(k+1)}$$

$$p_{jk}\, \mu_{jk} = p_{j(k+1)}\, \mu_{j(k+1)} + p_{(j+1)(k+1)}\, \mu_{(j+1)(k+1)}$$

$$p_{jk}\, \sigma_{jk}^2 = p_{j(k+1)}\, \sigma_{j(k+1)}^2 + p_{(j+1)(k+1)}\, \sigma_{(j+1)(k+1)}^2$$

(ii). merge (reverse)

Page 73: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: RJ 73

Histogram and raw plot of 100,000 k's produced by RJMCMC

[Figure: histogram of k and raw plot of k against iteration]

Page 74: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: RJ 74

Normalised enzyme dataset


Page 75: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 75

Birth and Death processes

Use of an alternative methodology based on a Birth–&-Death (point) process

Idea: Create a Markov chain in continuous time, i.e. a Markov jump process,

moving between models Mk, by births (to increase the dimension), deaths (to

decrease the dimension), and other moves

[Preston, 1976; Ripley, 1977; Stephens, 1999]

Page 76: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 76

Time till next modification (jump) is exponentially distributed with rate depending on

current state

Remember: if $\xi_1, \ldots, \xi_v$ are exponentially distributed, $\xi_i \sim \mathcal{E}xp(\lambda_i)$, then

$$\min_i\, \xi_i \sim \mathcal{E}xp\left(\sum_i \lambda_i\right)$$

Difference with MH-MCMC : Whenever a jump occurs, the corresponding move is

always accepted. Acceptance probabilities replaced with holding times.
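A tiny sketch of how one jump of such a process is simulated (my own code), using the min-of-exponentials property above: the holding time is Exp(Σ_i λ_i) and clock i wins with probability λ_i/Σ_i λ_i:

```python
import numpy as np

rng = np.random.default_rng(6)

def next_jump(rates):
    """One step of a continuous-time jump process driven by competing exponential clocks."""
    rates = np.asarray(rates, dtype=float)
    total = rates.sum()
    holding_time = rng.exponential(1.0 / total)        # min_i Exp(lambda_i) ~ Exp(sum_i lambda_i)
    event = rng.choice(len(rates), p=rates / total)    # index of the clock that fires first
    return holding_time, event

# e.g. a birth rate lambda_0 = 1 competing with the death rates of three components
t_hold, event = next_jump([1.0, 0.4, 0.2, 0.1])
```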

Page 77: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 77

Balance condition

Sufficient to have detailed balance

$$L(\theta)\, \pi(\theta)\, q(\theta, \theta') = L(\theta')\, \pi(\theta')\, q(\theta', \theta) \qquad \text{for all } \theta, \theta'$$

for the distribution proportional to $L(\theta)\pi(\theta)$ to be stationary.

Here $q(\theta, \theta')$ is the rate of moving from state θ to θ'.

Possibility to add split/merge and fixed-k processes if balance condition satisfied.

[Cappe, Ryden & CPR, 2002]

Page 78: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 78

Case of mixtures

Representation as a (marked) point process

$$\Phi = \left\{\{p_j, (\mu_j, \sigma_j)\}\right\}_j$$

Birth rate $\lambda_0$ (constant) and proposal from the prior

Death rate $\delta_j(\Phi)$ for removal of component j

Overall death rate

$$\sum_{j=1}^{k} \delta_j(\Phi) = \delta(\Phi)$$

Balance condition

$$(k+1)\; d\left(\Phi \cup \{p, (\mu, \sigma)\}\right)\, L\left(\Phi \cup \{p, (\mu, \sigma)\}\right) = \lambda_0\, L(\Phi)\, \frac{\pi(k)}{\pi(k+1)}$$

with

$$d\left(\Phi \setminus \{p_j, (\mu_j, \sigma_j)\}\right) = \delta_j(\Phi)$$

Page 79: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 79

Stephens's original algorithm:

For v = 0, 1, · · · , V:

t ← v

Run till t > v + 1

1. Compute

$$\delta_j(\Phi) = \frac{L(\Phi \setminus \Phi_j)}{L(\Phi)}\; \frac{\lambda_0}{\lambda_1}$$

2. $\delta(\Phi) \leftarrow \sum_{j=1}^{k} \delta_j(\Phi)$, $\quad \xi \leftarrow \lambda_0 + \delta(\Phi)$, $\quad u \sim \mathcal{U}([0, 1])$

3. $t \leftarrow t - \log(u)/\xi$

Page 80: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 80

4. With probability δ(Φ)/ξ:

Remove component j with probability $\delta_j(\Phi)/\delta(\Phi)$

$k \leftarrow k - 1$

$p_\ell \leftarrow p_\ell/(1 - p_j)$   (ℓ ≠ j)

Otherwise,

Add component j from the prior $\pi(\mu_j, \sigma_j)$, $p_j \sim \mathrm{Be}(\gamma, k\gamma)$

$p_\ell \leftarrow p_\ell (1 - p_j)$   (ℓ ≠ j)

$k \leftarrow k + 1$

5. Run I MCMC(k, β, p)

Page 81: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 81

Rescaling time

In discrete-time RJMCMC, let the time unit be 1/N and put

$$\beta_k = \lambda_k/N \qquad \text{and} \qquad \delta_k = 1 - \lambda_k/N$$

As N → ∞, each birth proposal will be accepted and, with k components, births occur according to a Poisson process with rate $\lambda_k$

Page 82: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 82

while component (w, φ) dies with rate

$$\lim_{N \to \infty} N\, \delta_{k+1} \times \frac{1}{k+1} \times \min(A^{-1}, 1) = \lim_{N \to \infty} \frac{N}{k+1} \times \text{likelihood ratio}^{-1} \times \frac{\beta_k}{\delta_{k+1}} \times \frac{b(w, \varphi)}{(1 - w)^{k-1}} = \text{likelihood ratio}^{-1} \times \frac{\lambda_k}{k+1} \times \frac{b(w, \varphi)}{(1 - w)^{k-1}}\,.$$

Hence

“RJMCMC→BDMCMC”

Page 83: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 83

Even closer to RJMCMC

Exponential (random) sampling is not necessary, nor is continuous time!

Estimator of

$$I = \int g(\theta)\, \pi(\theta)\, d\theta$$

by

$$\hat{I} = \frac{1}{N} \sum_{i=1}^{N} g(\theta(\tau_i))$$

where $\{\theta(t)\}$ is the continuous-time MCMC process and $\tau_1, \ldots, \tau_N$ are the sampling instants.

Page 84: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 84

New notations:

1. $T_n$: time of the n-th jump of $\{\theta(t)\}$, with $T_0 = 0$

2. $\{\theta_n\}$: jump chain of states visited by $\{\theta(t)\}$

3. $\lambda(\theta)$: total rate of $\{\theta(t)\}$ leaving state θ

Then the holding time $T_n - T_{n-1}$ of $\{\theta(t)\}$ in its n-th state $\theta_n$ is an exponential rv with rate $\lambda(\theta_n)$

Page 85: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 85

Rao–Blackwellisation

If the sampling interval goes to 0, the limiting case is

$$\hat{I}_\infty = \frac{1}{T_N} \sum_{n=1}^{N} g(\theta_{n-1})\, (T_n - T_{n-1})$$

Rao–Blackwellisation argument: replace $\hat{I}_\infty$ with

$$\hat{I} = \frac{1}{T_N} \sum_{n=1}^{N} \frac{g(\theta_{n-1})}{\lambda(\theta_{n-1})} = \frac{1}{T_N} \sum_{n=1}^{N} \mathbb{E}\left[T_n - T_{n-1} \mid \theta_{n-1}\right] g(\theta_{n-1})\,.$$

Conclusion: Only simulate jumps and store average holding times!

Completely remove continuous time feature
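A minimal sketch of this Rao–Blackwellised average over the jump chain (my own code; it normalises by the sum of the expected holding times 1/λ(θ_{n−1}) instead of the realised T_N, in line with dropping the continuous-time feature):

```python
import numpy as np

def rao_blackwell_estimate(g_values, rates):
    """Average of g over the jump chain, each state weighted by its expected holding time.

    g_values : g evaluated at the successive states theta_0, ..., theta_{N-1}
    rates    : total leaving rates lambda(theta_0), ..., lambda(theta_{N-1})
    """
    g_values = np.asarray(g_values, dtype=float)
    expected_hold = 1.0 / np.asarray(rates, dtype=float)   # E[T_n - T_{n-1} | theta_{n-1}]
    return (g_values * expected_hold).sum() / expected_hold.sum()
```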

Page 86: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 86

Example 9. Galaxy dataset

Comparison of RJMCMC and CTMCMC in the Galaxy dataset

[Cappe & al., 2002]

Experiment:

• Same proposals (same C code)

• Moves proposed in equal proportions by both samplers (setting the probability

PF of proposing a fixed k move in RJMCMC equal to the rate ηF at which

fixed k moves are proposed in CTMCMC, and likewise PB = ηB for the birth

moves)

• Rao–Blackwellisation

• Number of jumps (number of visited configurations) in CTMCMC equal to the number of iterations of RJMCMC

Page 87: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 87

Results:

• If one algorithm performs poorly, so does the other. (For RJMCMC this shows up as small A's, i.e. birth proposals are rarely accepted, while for BDMCMC it shows up as large δ's: new components are indeed born but die again quickly.)

• No significant difference between samplers for birth and death only

• CTMCMC slightly better than RJMCMC with split-and-combine moves

• Marginal advantage in accuracy for split-and-combine addition

• For split-and-combine moves, computation time associated with one step of

continuous time simulation is about 5 times longer than for reversible jump

simulation.

Page 88: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 88

Box plot for the estimated posterior on k obtained from 200 independent runs:

RJMCMC (top) and BDMCMC (bottom). The number of iterations varies from 5 000

(left), to 50 000 (middle) and 500 000 (right).


Page 89: Bayesian inference on mixtures

Intro/Inference/Algorithms/Beyond fixed k: b&d 89

Same for the estimated posterior on k obtained from 500 independent runs: top, RJMCMC; bottom, CTMCMC. The number of iterations varies from 5 000 (left plots) to 50 000 (right plots).
