
A Short Review on Bayesian Analysis

• Binomial, Multinomial, Normal, Beta, Dirichlet

• Posterior mean, MAP, credible interval, posterior distribution

• Gibbs sampling

Stat 542 (F. Liang) 1


Frequentist vs Bayesian

Statistical methods such as hypothesis testing (p-values), maximum likelihood estimates (MLE), and confidence intervals (CI) are known as frequentist methods. In the frequentist framework,

• probabilities refer to long-run frequencies and are objective quantities;

• parameters are fixed but unknown constants;

• statistical procedures should have well-defined long-run frequency properties (e.g., a 95% CI).

Stat 542 (F. Liang) 2


There is another approach to inference known as Bayesian inference. In the Bayesian framework,

• probabilities reflect (subjective) personal belief;

• unknown parameters are treated as random variables;

• we make inferences about a parameter θ by producing a probability distribution for θ.

Stat 542 (F. Liang) 3


Bayesian Analysis

Bayesian inference is carried out in the following way:

1. Choose a statistical model $p(x \mid \theta)$, i.e., the likelihood, the same as in the frequentist approach;

2. Choose a prior distribution $\pi(\theta)$;

3. Calculate the posterior distribution $\pi(\theta \mid x)$:

$$\pi(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{\int p(x \mid \theta)\,\pi(\theta)\,d\theta}.$$

Stat 542 (F. Liang) 4


Alternatively, we can write the posterior as

$$\pi(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{\int p(x \mid \theta)\,\pi(\theta)\,d\theta} \;\propto\; p(x \mid \theta)\,\pi(\theta),$$

where we drop the scale factor $\bigl[\int p(x \mid \theta)\,\pi(\theta)\,d\theta\bigr]^{-1}$ since it is a constant not depending on θ.

For example, if $\pi(\theta \mid x) \propto e^{-\theta^2 + 2\theta}$, then we know that $\theta \mid x \sim N(1, 1/2)$.
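The last claim follows by completing the square in the exponent and matching the normal kernel:

$$e^{-\theta^2 + 2\theta} = e^{-(\theta - 1)^2 + 1} \;\propto\; \exp\Bigl\{-\frac{(\theta - 1)^2}{2 \cdot \tfrac{1}{2}}\Bigr\},$$

which is the kernel of the $N(1, 1/2)$ density.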

Stat 542 (F. Liang) 5


Now any inference on the parameter θ can be obtained from the posterior distribution $\pi(\theta \mid x)$. For example, if one wants

• a point estimate of θ, we can report the posterior mean $E(\theta \mid x)$, the posterior median, or the posterior mode $\arg\max_\theta \pi(\theta \mid x)$;

• an interval estimate of θ, we can report the 95% credible interval, which is a region with 0.95 posterior probability. So a 95% credible interval (1.2, 3.5) means that

$$P\bigl(\theta \in (1.2, 3.5)\bigr) = 0.95,$$

where P corresponds to the posterior distribution over θ.

In contrast, for a 95% confidence interval (1.2, 3.5), we CANNOT say that "(1.2, 3.5) covers the true θ with probability 95%."

Stat 542 (F. Liang) 6


• The mode estimate is often referred to as the MAP (maximum a posteriori) estimate, which is the solution of

$$\max_\theta \log \pi(\theta \mid x) = \max_\theta \bigl[\log p(x \mid \theta) + \log \pi(\theta)\bigr].$$

Note that MAP is also the solution of

$$\min_\theta \bigl[-\log p(x \mid \theta) - \log \pi(\theta)\bigr].$$

So MAP is related to the regularization approach, with the penalty term being the negative log of the prior.

Stat 542 (F. Liang) 7


• Your first resistance to Bayesian inference may be the prior choice. Where does one find priors?

• Priors, like the likelihood, are part of your assumptions: the prior is one's initial guess of the parameter; after observing the data, which carry information about the parameter, one updates the prior to the posterior. Priors matter and do not matter: they influence the posterior, but their influence fades as more data are observed.

• Next I'll introduce some default prior choices. Of course, the sensitivity to the prior choice (how different priors affect the final result) should always be examined in practice.

Stat 542 (F. Liang) 8


A Bernoulli Example

• Suppose X = (X1, . . . , Xn) denotes the outcomes from n coin tosses, where 1 means a head and 0 means a tail. They are iid samples from a Bernoulli distribution with parameter θ, where θ is the probability of getting a head.

• Without any information about the coin, we can put a uniform prior on θ, that is,

$$\pi(\theta) = 1, \qquad 0 \le \theta \le 1.$$

Stat 542 (F. Liang) 9


Next we calculate the posterior distribution of θ given X:

$$\pi(\theta \mid X) \;\propto\; \pi(\theta)\prod_{i=1}^{n} p(X_i \mid \theta) \;\propto\; \theta^{s}(1-\theta)^{n-s}, \qquad s = \sum_i X_i,$$

which implies that $\theta \mid X \sim \mathrm{Beta}(s+1,\, n+1-s)$. The corresponding posterior mean can be used as a point estimate for θ,

$$\hat{\theta} = \frac{s+1}{n+2}. \tag{1}$$
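A minimal numerical sketch of this update (assuming NumPy and SciPy are available; the coin tosses below are made up for illustration):

```python
import numpy as np
from scipy import stats

x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])   # hypothetical coin tosses (1 = head)
n, s = len(x), x.sum()

# Uniform prior Beta(1, 1)  =>  posterior Beta(s + 1, n - s + 1)
posterior = stats.beta(s + 1, n - s + 1)

post_mean = (s + 1) / (n + 2)        # posterior mean, equation (1)
mle = s / n                          # observed frequency of heads
ci_95 = posterior.interval(0.95)     # 95% equal-tailed credible interval

print(f"posterior mean = {post_mean:.3f}, MLE = {mle:.3f}, 95% credible interval = {ci_95}")
```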

Stat 542 (F. Liang) 10


$$\text{Post-mean} = \frac{s+1}{n+2}, \qquad \text{MLE} = \frac{s}{n}.$$

The MLE is the observed frequency of heads among n experiments; the Bayes estimator is the frequency of heads among (n + 2) experiments, in which there are two "prior" experiments, one a head and the other a tail.

Without the data, one just looks at the prior experiments, and a reasonable guess for θ is 1/2. After observing the data, the final estimate (1) is some number between 1/2 and s/n, a compromise between the prior information and the MLE. Note that when n gets large, the prior gets "washed out".

Stat 542 (F. Liang) 11


$$\text{Post-mean} = \frac{s+1}{n+2}, \qquad \text{MLE} = \frac{s}{n}.$$

The extra counts (one for head and one for tail) are often called the pseudo-counts. Having pseudo-counts is appealing in cases where θ is likely to take extreme values close to 1 or 0. For example, to estimate θ for a rare event, it is likely to observe Xi = 0 for all i = 1, . . . , n, but it may be dangerous to conclude $\hat{\theta} = 0$.

Stat 542 (F. Liang) 12


Beta distributions are often used as a prior on θ; in fact, the uniform is a special case of the Beta. Suppose the prior on θ is Beta(α, β),

$$\pi(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)},$$

then

$$\theta \mid X \sim \mathrm{Beta}(s + \alpha,\, n - s + \beta). \tag{2}$$

We call Beta distributions the conjugate family for Bernoulli models since both the prior and the posterior distributions belong to the same family.

The posterior mean of (2) is equal to $(\alpha + s)/(n + \alpha + \beta)$. So a Beta prior can be viewed as having (α + β) prior experiments in which we have α heads and β tails.

Stat 542 (F. Liang) 13


A Multinomial Example

• Suppose you randomly draw a card from an ordinary deck of playing cards, and then put it back in the deck. Repeat this exercise five times (i.e., sampling with replacement).

• Let (N1, N2, N3, N4) denote the numbers of spades, hearts, diamonds, and clubs among the five cards. We say (N1, N2, N3, N4) follows a multinomial distribution, with probability distribution function

$$P(N_1 = n_1, \dots, N_4 = n_4 \mid n, \theta_1, \dots, \theta_4) = \frac{n!}{n_1!\,n_2!\,n_3!\,n_4!}\,\theta_1^{n_1}\cdots\theta_4^{n_4},$$

where $n_1 + \cdots + n_4 = n = 5$ and $\theta_1 = \cdots = \theta_4 = 1/4$.

Stat 542 (F. Liang) 14


• In the Bernoulli example, we conduct n independent trials and each trial results in one of two possible outcomes, e.g., head or tail, with probabilities θ and (1 − θ), respectively.

• In the multinomial example, we conduct n independent trials and each trial results in one of k possible outcomes, e.g., k = 4 in the aforementioned card example, with probabilities θ1, . . . , θk, respectively.

Stat 542 (F. Liang) 15


• For multinomial distributions, the parameter of interest is θ = (θ1, . . . , θk), which lies in a simplex of $\mathbb{R}^k$,

$$S = \Bigl\{\theta = (\theta_1, \dots, \theta_k) : \sum_i \theta_i = 1,\ \theta_i \ge 0\Bigr\}.$$

• A Dirichlet distribution on S, Dir(α1, . . . , αk), is an extension of the Beta distribution, with density function

$$p(\theta \mid \alpha) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_k)} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}.$$

The Dirichlet distributions are conjugate priors for Multinomial models.
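A small sketch of the conjugate update: a Dir(α) prior combined with multinomial counts n gives a Dir(α + n) posterior (the counts and prior values below are hypothetical):

```python
import numpy as np
from scipy import stats

alpha = np.array([1.0, 1.0, 1.0, 1.0])     # Dirichlet prior: one pseudo-count per suit
counts = np.array([2, 1, 0, 2])            # hypothetical counts of spades/hearts/diamonds/clubs

alpha_post = alpha + counts                # conjugacy: posterior is Dir(alpha + counts)
post_mean = alpha_post / alpha_post.sum()  # posterior mean of (theta_1, ..., theta_4)

theta_draws = stats.dirichlet(alpha_post).rvs(size=1000)   # posterior samples
print(post_mean, theta_draws.mean(axis=0))
```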

Stat 542 (F. Liang) 16


Dirichlet Distributions

Examples of Dirichlet distributions over p = (p1, p2, p3), which can be plotted in 2D since p3 = 1 − p1 − p2:


A Normal Example

Assume $X_1, \dots, X_n \sim N(\theta, \sigma^2)$. The parameters here are (θ, σ²), and we would like to get the posterior distribution $\pi(\theta, \sigma^2 \mid X)$. If we use the Gibbs sampler, which will be introduced later, we only need to know the conditional posterior distributions $\pi(\theta \mid \sigma^2, X)$ and $\pi(\sigma^2 \mid \theta, X)$.

Stat 542 (F. Liang) 17


For the location parameter θ, the conjugate prior is normal. If

$$\bar{X} \mid \theta, \sigma^2 \sim N(\theta, \sigma^2/n), \qquad \theta \sim N(\mu_0, \tau_0^2),$$

then $\theta \mid \sigma^2, X \sim N(\mu, \tau^2)$, where

$$\mu = w\bar{X} + (1 - w)\mu_0, \qquad w = \frac{1/(\sigma^2/n)}{1/(\sigma^2/n) + 1/\tau_0^2}, \qquad \tau^2 = \Bigl(\frac{1}{\sigma^2/n} + \frac{1}{\tau_0^2}\Bigr)^{-1}.$$
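A quick numerical check of this update (a sketch only; the data and the hyperparameters μ0, τ0², and the assumed-known σ² are made up):

```python
import numpy as np

x = np.array([2.1, 1.7, 2.5, 1.9, 2.2])   # hypothetical observations
n, xbar = len(x), x.mean()

sigma2 = 1.0               # sigma^2 treated as known for this conditional
mu0, tau0_2 = 0.0, 10.0    # prior: theta ~ N(mu0, tau0^2)

w = (1 / (sigma2 / n)) / (1 / (sigma2 / n) + 1 / tau0_2)   # weight on the sample mean
mu_post = w * xbar + (1 - w) * mu0
tau2_post = 1 / (1 / (sigma2 / n) + 1 / tau0_2)

print(f"theta | sigma^2, X ~ N({mu_post:.3f}, {tau2_post:.3f})")
```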

Stat 542 (F. Liang) 18


For the scale parameter σ², the conjugate prior is the Inverse Gamma, that is, the prior on 1/σ² is Gamma. Suppose $\pi(\sigma^2) = \mathrm{InvGa}(\alpha, \beta)$; then

$$\pi(\sigma^2 \mid \mu, X) \;\propto\; \Bigl(\frac{1}{\sigma^2}\Bigr)^{n/2} \exp\Bigl\{-\frac{\sum_i (X_i - \mu)^2}{2\sigma^2}\Bigr\} \cdot \Bigl(\frac{1}{\sigma^2}\Bigr)^{\alpha+1} e^{-\beta/\sigma^2}
\;\sim\; \mathrm{InvGa}\Bigl(\frac{n}{2} + \alpha,\; \frac{\sum_i (X_i - \mu)^2}{2} + \beta\Bigr).$$

In practice, we often specify $\pi(\sigma^2)$ using Inv-$\chi^2$ distributions, which are special cases of InvGa:

$$\mathrm{Inv}\chi^2(v_0, s_0^2) = \mathrm{InvGa}\Bigl(\frac{v_0}{2},\; \frac{v_0}{2}s_0^2\Bigr).$$

With prior $\mathrm{Inv}\chi^2(v_0, s_0^2)$, the posterior distribution is also $\mathrm{Inv}\chi^2(v_n, s_n^2)$, where

$$v_n = v_0 + n, \qquad v_n s_n^2 = v_0 s_0^2 + \sum_i (X_i - \mu)^2.$$

The prior acts like v0 pseudo-samples, each contributing s0² to the residual sum of squares.
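A sketch of drawing σ² from this conditional with SciPy (`invgamma`'s `scale` argument plays the role of β here; the data and hyperparameters are hypothetical):

```python
import numpy as np
from scipy import stats

x = np.array([2.1, 1.7, 2.5, 1.9, 2.2])   # hypothetical observations
mu = 2.0                                   # current value of the mean (e.g., from a Gibbs step)
alpha, beta = 2.0, 1.0                     # prior: sigma^2 ~ InvGa(alpha, beta)

a_post = len(x) / 2 + alpha
b_post = np.sum((x - mu) ** 2) / 2 + beta

sigma2_draw = stats.invgamma(a_post, scale=b_post).rvs()   # one draw from InvGa(a_post, b_post)
print(sigma2_draw)
```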

Stat 542 (F. Liang) 19


Gibbs Sampling for Posterior Inference

Suppose the random variables X and Y have a joint probability density function p(x, y).

Sometimes it is not easy to simulate directly from the joint distribution. Instead, suppose it is possible to simulate from the individual conditional distributions $p_{X|Y}(x \mid y)$ and $p_{Y|X}(y \mid x)$.

Stat 542 (F. Liang) 20


Then a Gibbs sampler draws (X1, Y1), . . . , (XT, YT) as follows:

1. Initialization: let (X0, Y0) be some starting values; set n = 0.

2. Draw $X_{n+1} \sim p_{X|Y}(x \mid Y_n)$.

3. Draw $Y_{n+1} \sim p_{Y|X}(y \mid X_{n+1})$.

4. Set n ← n + 1 and go to step 2. (A sketch for the Normal example is given below.)
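Plugging the two conditionals from the Normal example into this scheme, a minimal sketch (the data and hyperparameters are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=50)      # hypothetical data
n, xbar = len(x), x.mean()

mu0, tau0_2 = 0.0, 100.0               # prior: theta   ~ N(mu0, tau0^2)
alpha, beta = 2.0, 1.0                 # prior: sigma^2 ~ InvGa(alpha, beta)

theta, sigma2 = 0.0, 1.0               # starting values
draws = []
for t in range(6000):
    # draw theta | sigma^2, X  (normal conditional)
    prec = n / sigma2 + 1 / tau0_2
    theta = rng.normal((n / sigma2 * xbar + mu0 / tau0_2) / prec, np.sqrt(1 / prec))
    # draw sigma^2 | theta, X  (inverse gamma conditional, via 1 / Gamma)
    a_post = n / 2 + alpha
    b_post = np.sum((x - theta) ** 2) / 2 + beta
    sigma2 = 1 / rng.gamma(a_post, 1 / b_post)
    draws.append((theta, sigma2))

burned = np.array(draws[1000:])        # drop burn-in samples
print(burned.mean(axis=0))             # posterior means of (theta, sigma^2)
```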

Stat 542 (F. Liang) 21


Gibbs samplers are MCMC algorithms, and they produce samples from the desired distribution after a so-called burn-in period. So in practice, we always drop the samples from the initial steps (say, the first 1000 or 5000 steps) and start saving samples after that.

Stat 542 (F. Liang) 22


Suppose we have multiple parameters θ = (θ1, . . . , θK). Then we can draw the posterior samples of θ using a multi-stage Gibbs sampler.

At each stage, we draw θi from the conditional distribution $\pi(\theta_i \mid \theta_{[-i]}, \text{Data})$, where $\theta_{[-i]}$ denotes the (K − 1) parameters other than θi.

Why Gibbs samplers? In many cases the joint posterior $\pi(\theta \mid \text{Data})$ is not of closed form, while all those conditionals are.

Stat 542 (F. Liang) 23


Revisit the Gaussian Mixture Model

• EM for MAP

• Collapsed Gibbs sampling

• Chinese restaurant process, nonparametric clustering

Stat 542 (F. Liang) 24


A Gaussian Mixture Model

Suppose the data x1, x2, . . . , xn are iid from

$$w\, N(\mu_1, \sigma_1^2) + (1 - w)\, N(\mu_2, \sigma_2^2).$$

For each xi, we introduce a latent variable Zi indicating which component xi is generated from, with

$$P(Z_i = 1) = w, \qquad P(Z_i = 2) = 1 - w.$$

The parameters of interest are $\theta = (w, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$, and their prior distributions are specified as follows:

$$w \sim \mathrm{Be}(1, 1), \qquad \mu_1, \mu_2 \sim N(0, \tau^2), \qquad \sigma_1^2, \sigma_2^2 \sim \mathrm{InvGa}(\alpha, \beta).$$

Stat 542 (F. Liang) 25


EM for MAP

The MAP estimate is defined to be

$$\hat{\theta} = \arg\max_\theta P(x \mid \theta)\,\pi(\theta) = \arg\max_\theta \Bigl[\sum_z P(x, z \mid \theta)\Bigr]\pi(\theta).$$

We can use the EM algorithm. Writing $\phi_{\mu,\sigma^2}$ for the $N(\mu, \sigma^2)$ density,

$$\log\bigl[p(x, Z \mid \theta)\,\pi(\theta)\bigr] = \sum_i 1(Z_i = 1)\bigl[\log \phi_{\mu_1,\sigma_1^2}(x_i) + \log w\bigr] + \sum_i 1(Z_i = 2)\bigl[\log \phi_{\mu_2,\sigma_2^2}(x_i) + \log(1 - w)\bigr]$$
$$\qquad\qquad + \log\pi(w) + \log\pi(\mu_1) + \log\pi(\mu_2) + \log\pi(\sigma_1^2) + \log\pi(\sigma_2^2).$$

Stat 542 (F. Liang) 26


Recall the EM algorithm for MLE:

• At the E-step, we replace 1(Zi = 1) and 1(Zi = 2) by their expectations, i.e., the probabilities of Zi = 1 or 2 conditioning on the data x and the current estimate of the parameter θ0:

$$\gamma_i = P(Z_i = 1 \mid x_i, \theta_0) = \frac{w\,\phi_{\mu_1,\sigma_1^2}(x_i)}{w\,\phi_{\mu_1,\sigma_1^2}(x_i) + (1 - w)\,\phi_{\mu_2,\sigma_2^2}(x_i)};$$

• at the M-step, we update θ;

• and we iterate between the E and M steps until convergence.

Stat 542 (F. Liang) 27


For MAP, the E-step is the same; the M-step is slightly different:

• Without the Beta prior, we would update w by $\gamma_+/n$, where $\gamma_+ = \sum_i \gamma_i$. But with the Beta prior on w, we need to add the pseudo-counts.

• Similarly, without the prior, we would update μ1 by

$$\frac{1}{\gamma_+}\sum_i \gamma_i x_i,$$

but with the prior, we update μ1 by a weighted average of the value above and the prior mean for μ1.
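A compact sketch of MAP EM under these priors (the hyperparameters and data are hypothetical; the closed-form σ² update shown follows from the InvGa prior in the same spirit as the other updates):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1.5, 100)])  # hypothetical data
n = len(x)

a_w, b_w = 2.0, 2.0        # Beta prior on w: pseudo-counts a_w - 1 and b_w - 1
tau2 = 100.0               # prior: mu_k ~ N(0, tau2)
alpha, beta = 2.0, 1.0     # prior: sigma_k^2 ~ InvGa(alpha, beta)

w, mu, sig2 = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])  # initial values
for it in range(200):
    # E-step: gamma_i = P(Z_i = 1 | x_i, current parameters)
    p1 = w * norm.pdf(x, mu[0], np.sqrt(sig2[0]))
    p2 = (1 - w) * norm.pdf(x, mu[1], np.sqrt(sig2[1]))
    g = p1 / (p1 + p2)
    # M-step with the priors folded in (MAP updates)
    g_sum = np.array([g.sum(), (1 - g).sum()])
    w = (g_sum[0] + a_w - 1) / (n + a_w + b_w - 2)      # pseudo-counts from the Beta prior
    for k, gk in enumerate([g, 1 - g]):
        # weighted average of the data term and the prior mean 0
        mu[k] = (gk @ x / sig2[k]) / (g_sum[k] / sig2[k] + 1 / tau2)
        sig2[k] = (gk @ (x - mu[k]) ** 2 + 2 * beta) / (g_sum[k] + 2 * alpha + 2)

print(w, mu, sig2)
```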

Stat 542 (F. Liang) 28


Gibbs Sampling from the Posterior Distribution

$$\pi(\theta, Z \mid x) \;\propto\; \prod_{i=1}^{n} \bigl[p(Z_i \mid \theta)\, p(x_i \mid Z_i, \theta)\bigr] \times \pi(w)\,\pi(\mu_1)\,\pi(\mu_2)\,\pi(\sigma_1^2)\,\pi(\sigma_2^2).$$

In a Gibbs sampler, we iteratively sample each element of

$$(w, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, Z_1, \dots, Z_n)$$

conditioning on the other elements being fixed.

Stat 542 (F. Liang) 29


For example,

• How to sample Zi? Bernoulli:

$$P(Z_i = 1 \mid x, Z_{[-i]}, \text{others}) \;\propto\; P(Z_i = 1 \mid \theta) \times P(x_i \mid Z_i = 1, \theta) = w\, p(x_i \mid \mu_1, \sigma_1^2).$$

• How to sample μ1? Normal:

$$\pi(\mu_1 \mid x, Z, \text{others}) \;\propto\; \Bigl[\prod_{i: Z_i = 1} P(X_i \mid \mu_1, \sigma_1^2)\Bigr] \times \pi(\mu_1).$$

Stat 542 (F. Liang) 30


For a general Gaussian mixture model with K components, we put the prior on the mixing weights as

$$w = (w_1, \dots, w_K) \sim \mathrm{Dir}\Bigl(\frac{\alpha}{K}, \dots, \frac{\alpha}{K}\Bigr),$$

and again, normal priors on the μk's and InvGa priors on the σk²'s.

The Gibbs sampler iterates the following steps (one sweep is sketched below):

1. Draw Zi from a Multinomial distribution for i = 1, . . . , n;

2. Draw w from a Dirichlet;

3. Draw μ1, . . . , μK from Normals;

4. Draw σ1², . . . , σK² from InvGa.
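A condensed sketch of one sweep of this sampler (the data and hyperparameters are hypothetical; the conditionals for each μk and σk² reuse the conjugate updates from the Normal example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1.5, 100)])  # hypothetical data
n, K = len(x), 3
alpha, tau2, a0, b0 = 1.0, 100.0, 2.0, 1.0    # Dir(alpha/K, ...), N(0, tau2), InvGa(a0, b0)

z = rng.integers(K, size=n)                   # initial cluster labels
w, mu, sig2 = np.full(K, 1 / K), rng.normal(size=K), np.ones(K)

for sweep in range(2000):
    # 1. draw each Z_i from a multinomial over the K components
    lik = np.column_stack([w[k] * stats.norm.pdf(x, mu[k], np.sqrt(sig2[k])) for k in range(K)])
    prob = lik / lik.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=prob[i]) for i in range(n)])
    counts = np.bincount(z, minlength=K)
    # 2. draw w from its Dirichlet conditional
    w = rng.dirichlet(alpha / K + counts)
    # 3.-4. draw mu_k and sigma_k^2 from their conjugate conditionals
    for k in range(K):
        xk = x[z == k]
        prec = len(xk) / sig2[k] + 1 / tau2
        mu[k] = rng.normal((xk.sum() / sig2[k]) / prec, np.sqrt(1 / prec))
        sig2[k] = 1 / rng.gamma(a0 + len(xk) / 2, 1 / (b0 + ((xk - mu[k]) ** 2).sum() / 2))
```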

Stat 542 (F. Liang) 31


Collapsed Gibbs sampling

The mixing weights w are used in sampling the Zi's:

$$P(Z_i = k \mid x, z_{[-i]}, \theta) \;\propto\; P(x_i \mid Z_i = k, \theta)\, P(Z_i = k \mid z_{[-i]}, \theta),$$

where the second term is equal to $P(Z_i = k \mid z_{[-i]}, w) = P(Z_i = k \mid w)$, i.e., conditioning on w, the Zi's are independent.

Stat 542 (F. Liang) 32


Let's eliminate w from the parameter list, i.e., integrate over w. So we need to compute

$$P(Z_i = k \mid z_{[-i]}) = \frac{P(z_1, \dots, z_{i-1}, Z_i = k, z_{i+1}, \dots, z_n)}{P(z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_n)}.$$

Note that the Zi's are exchangeable (the meaning of this will be made clear in class). So it suffices to compute the sampling distribution for the last observation.

Stat 542 (F. Liang) 33


$$P(Z_n = k \mid z_1, \dots, z_{n-1}) = \frac{P(z_1, \dots, z_{n-1}, Z_n = k)}{P(z_1, \dots, z_{n-1})}$$
$$= \frac{\int w_1^{n_1 + \alpha/K - 1} \cdots w_k^{n_k + 1 + \alpha/K - 1} \cdots w_K^{n_K + \alpha/K - 1}\, dw}{\int w_1^{n_1 + \alpha/K - 1} \cdots w_k^{n_k + \alpha/K - 1} \cdots w_K^{n_K + \alpha/K - 1}\, dw}$$
$$= \frac{\Gamma(n - 1 + \alpha)}{\Gamma(n - 1 + \alpha + 1)} \cdot \frac{\Gamma(n_k + \alpha/K + 1)}{\Gamma(n_k + \alpha/K)} = \frac{n_k + \alpha/K}{n - 1 + \alpha},$$

where $n_k = \#\{j : z_j = k,\ j = 1, \dots, n-1\}$.

Stat 542 (F. Liang) 34


The collapsed Gibbs sampler iterates the following steps (the label update is sketched below):

1. Draw Zi from a Multinomial(γi1, . . . , γiK) for i = 1, . . . , n, where

$$\gamma_{ik} \;\propto\; \frac{n_k^{(i)} + \alpha/K}{n - 1 + \alpha} \times P(x_i \mid \mu_k, \sigma_k^2),$$

and $n_k^{(i)}$ is the number of observations other than i currently assigned to cluster k;

2. there is no need to sample w from the Dirichlet;

3. Draw μ1, . . . , μK from Normals;

4. Draw σ1², . . . , σK² from InvGa.
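A sketch of the collapsed label update only (the μk, σk² draws are unchanged from the previous sampler; function and variable names are hypothetical):

```python
import numpy as np
from scipy import stats

def collapsed_z_sweep(x, z, mu, sig2, alpha, K, rng):
    """One sweep of collapsed label updates: the weights w have been integrated out."""
    n = len(x)
    counts = np.bincount(z, minlength=K)
    for i in range(n):
        counts[z[i]] -= 1                          # n_k^{(i)}: counts excluding observation i
        prior = (counts + alpha / K) / (n - 1 + alpha)
        lik = stats.norm.pdf(x[i], mu, np.sqrt(sig2))
        prob = prior * lik
        z[i] = rng.choice(K, p=prob / prob.sum())
        counts[z[i]] += 1
    return z
```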

Stat 542 (F. Liang) 35


Infinitely Many Clusters

K = a large value (> n)?

At the t-th iteration of the algorithm, there must be some empty clusters. Let's re-label the clusters: 1 to K* are the non-empty clusters (of course, the value of K* changes from iteration to iteration), and the remaining ones are empty.

• Update $\{\mu_k, \sigma_k^2\}_{k=1}^{K^*}$ from the posterior (Normal/InvGa);

• Update $\{\mu_k, \sigma_k^2\}_{k=K^*+1}^{K}$ from the prior (Normal/InvGa).

Stat 542 (F. Liang) 36


• Update Zi from a Multinomial(γi1, . . . , γiK), where

$$\gamma_{ik} \;\propto\; \frac{n_k^{(i)} + \alpha/K}{n - 1 + \alpha} \times P(x_i \mid \mu_k, \sigma_k^2).$$

Zi may start a new cluster, i.e., Zi > K*.

It doesn't matter which value Zi takes from (K* + 1, . . . , K), since we can always label this new cluster as the (K* + 1)-th cluster, and then immediately update $(\mu_{K^*+1}, \sigma_{K^*+1}^2)$ based on the corresponding posterior (conditioning on xi).

Stat 542 (F. Liang) 37


What's the chance that Zi starts a new cluster?

$$\sum_{k=K^*+1}^{K} \gamma_{ik} = \frac{\alpha/K}{n - 1 + \alpha} \sum_{k=K^*+1}^{K} P(x_i \mid \mu_k, \sigma_k^2)$$
$$= \frac{K - K^*}{K} \cdot \frac{\alpha}{n - 1 + \alpha}\,\Bigl[\frac{1}{K - K^*}\sum_{k=K^*+1}^{K} P(x_i \mid \mu_k, \sigma_k^2)\Bigr]$$
$$\longrightarrow\; \frac{\alpha}{n - 1 + \alpha} \iint P(x_i \mid \mu, \sigma^2)\,\pi(\mu, \sigma^2)\, d\mu\, d\sigma^2$$

as $K \to \infty$, since the empty-cluster parameters are drawn from the prior, where

$$\iint P(x_i \mid \mu, \sigma^2)\,\pi(\mu, \sigma^2)\, d\mu\, d\sigma^2 = m(x_i)$$

is the integrated (with respect to our prior π) likelihood for a sample xi.

Stat 542 (F. Liang) 38


Clustering with Chinese Restaurant Process (CRP)

A mixture model:

$$X_i \mid Z_i = k \;\sim\; P(\cdot \mid \mu_k, \sigma_k^2), \qquad i = 1, \dots, n.$$

Prior on Z1, . . . , Zn and the $(\mu_k, \sigma_k^2)$'s (known as the Chinese Restaurant Process):

• Z1 = 1 and $(\mu_1, \sigma_1^2) \sim \pi(\cdot)$;

• for i ≥ 1, suppose Z1, . . . , Zi form m clusters with sizes n1, . . . , nm; then

$$P(Z_{i+1} = k \mid Z_1, \dots, Z_i) = \frac{n_k}{i + \alpha}, \quad k = 1, \dots, m;$$
$$P(Z_{i+1} = m + 1 \mid Z_1, \dots, Z_i) = \frac{\alpha}{i + \alpha}, \qquad (\mu_{m+1}, \sigma_{m+1}^2) \sim \pi(\cdot).$$

Stat 542 (F. Liang) 39


Alternatively, you can describe the prior first and then the likelihood, which gives a clear idea of how the data are generated (a code sketch follows these steps):

• Set Z1 = 1, generate $(\mu_1, \sigma_1^2) \sim \pi(\cdot)$ and $X_1 \sim P(\cdot \mid \mu_1, \sigma_1^2)$.

• Loop over i = 1, . . . , n − 1: suppose the previous i samples form m clusters with cluster-specific parameters $(\mu_k, \sigma_k^2)_{k=1}^{m}$; then

$$P(Z_{i+1} = k \mid Z_1, \dots, Z_i) = \frac{n_k}{i + \alpha}, \quad k = 1, \dots, m;$$
$$P(Z_{i+1} = m + 1 \mid Z_1, \dots, Z_i) = \frac{\alpha}{i + \alpha}.$$

If $Z_{i+1} = m + 1$, generate $(\mu_{m+1}, \sigma_{m+1}^2) \sim \pi(\cdot)$. Then generate $X_{i+1} \sim P(\cdot \mid \mu_{Z_{i+1}}, \sigma_{Z_{i+1}}^2)$.
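A generative sketch of this process, with π(·) taken to be the Normal/InvGa prior used earlier (all hyperparameter values are hypothetical):

```python
import numpy as np

def crp_generate(n, alpha, rng, tau2=4.0, a0=2.0, b0=1.0):
    """Generate cluster labels, cluster parameters, and data from the CRP mixture prior."""
    z, mus, sig2s, x = [0], [], [], []
    sig2s.append(1 / rng.gamma(a0, 1 / b0))             # (mu_1, sigma_1^2) ~ pi(.)
    mus.append(rng.normal(0, np.sqrt(tau2)))
    x.append(rng.normal(mus[0], np.sqrt(sig2s[0])))     # X_1 ~ P(. | mu_1, sigma_1^2)
    for i in range(1, n):
        counts = np.bincount(z)                         # cluster sizes n_1, ..., n_m
        probs = np.append(counts, alpha) / (i + alpha)  # existing clusters vs. a new cluster
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                            # a new cluster is started
            sig2s.append(1 / rng.gamma(a0, 1 / b0))
            mus.append(rng.normal(0, np.sqrt(tau2)))
        z.append(k)
        x.append(rng.normal(mus[k], np.sqrt(sig2s[k])))
    return np.array(x), np.array(z)

x, z = crp_generate(n=200, alpha=1.0, rng=np.random.default_rng(3))
print(len(np.unique(z)), "clusters formed")
```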

Stat 542 (F. Liang) 40


Advantages

• We do not need to specify K.

• K is treated as a random variable, and its (posterior) distribution is learned from the data.

• Can model unseen data: for any new sample X*, there is always a positive chance that it can start a new cluster.

Stat 542 (F. Liang) 41


Exchangeability of Zi’s

In the CRP, the labels Zi are generated sequentially, but in fact they are exchangeable (up to a permutation of the cluster labels; labels should start from 1, 2, . . .):

$$P(11122) = P(12121) = P(12221) = \frac{\alpha\,(2!)(1!)}{(1 + \alpha)(2 + \alpha)\cdots(4 + \alpha)}.$$

In general, suppose z1, . . . , zn form m clusters with sizes n1, . . . , nm; then

$$P(z_1, \dots, z_n) = \frac{\alpha^{m-1}\,(n_1 - 1)! \cdots (n_m - 1)!}{\prod_{i=2}^{n}(i - 1 + \alpha)}.$$

So the order of the zi's doesn't matter; what matters is the partition of the n samples implied by the zi's.

Stat 542 (F. Liang) 42


Posterior Sampling for Clustering with CRP

Same as the Gibbs sampler we have derived for K → ∞. At the t-th iteration, repeat the following:

• Suppose $Z_{[-i]}$ forms K clusters, labeled from 1 to K, of sizes $n_k^{(i)}$. Sample Zi from a Multinomial with

$$P(Z_i = k) \;\propto\; \frac{n_k^{(i)}}{n - 1 + \alpha} \times P(x_i \mid \mu_k, \sigma_k^2), \quad k = 1, \dots, K,$$
$$P(Z_i = K + 1) \;\propto\; \frac{\alpha}{n - 1 + \alpha}\, m(x_i),$$

where $m(x_i) = \iint P(x_i \mid \mu, \sigma^2)\,\pi(\mu, \sigma^2)\, d\mu\, d\sigma^2$.

• Update $\{\mu_k, \sigma_k^2\}$ from the posterior (Normal/InvGa).

Stat 542 (F. Liang) 43


• The exchangeability of the Zi's plays an important role in the algorithm. Where do we use this property?

• The marginal likelihood m(·) is easy to compute if the prior π(μ, σ²) is conjugate; otherwise, we need to figure out a way to compute m(·).

• For other MCMC algorithms, check Neal (2000); for Variational Bayes, check Blei and Jordan (2004).

• The "ugly" side: the labeling issue.

Stat 542 (F. Liang) 44


Non-parametric Bayesian (NB) Models

• The finite mixture model:

$$X_i \mid Z_i = k \;\sim\; P_{\theta_k^*}, \qquad P(Z_i = k) = w_k,$$
$$\theta_k^* \overset{iid}{\sim} G_0, \qquad w \sim \mathrm{Dir}(\alpha/K, \dots, \alpha/K).$$

• Alternatively,

$$X_i \mid \theta_i \sim P_{\theta_i}, \qquad \theta_i \mid G \sim G,$$
$$G(\cdot) = \sum_{k=1}^{K} w_k\, \delta_{\theta_k^*}(\cdot), \qquad w \sim \mathrm{Dir}\Bigl(\frac{\alpha}{K}, \cdots\Bigr), \qquad \theta_k^* \sim G_0.$$

The prior on G is a K-element discrete distribution. In the NB approach, we'll drop this restriction.

Stat 542 (F. Liang) 45


• An NB approach for clustering:

$$X_i \mid \theta_i \sim P_{\theta_i}, \qquad \theta_i \mid G \sim G, \qquad G \sim \mathrm{DP}(\alpha, G_0),$$

where DP(α, G0) denotes a Dirichlet Process with scale (precision) parameter α and base measure G0.

Stat 542 (F. Liang) 46


Dirichlet Process (DP)

$$\theta_i \mid G \;\overset{iid}{\sim}\; G, \qquad G \sim \mathrm{DP}(\alpha, G_0).$$

• Define the DP as a distribution over distributions (Ferguson, 1973).

• Describe the DP as a stick-breaking process (Sethuraman, 1994).

• If we integrate over G (with respect to the DP), the resulting prior on $(\theta_1, \dots, \theta_n)$,

$$\pi(\theta_1, \dots, \theta_n) = \int \prod_{i=1}^{n} G(\theta_i)\, dP(G),$$

where P denotes the DP(α, G0) prior on G, is the Chinese restaurant process (CRP).



Dirichlet Process

$G \sim \mathrm{DP}(\cdot \mid G_0, \alpha)$. OK, but what does it look like?

Samples from a DP are discrete with probability one:

$$G(\theta) = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta_k}(\theta),$$

where $\delta_{\theta_k}(\cdot)$ is a Dirac delta at θk, and $\theta_k \sim G_0(\cdot)$.

Note: E(G) = G0. As α → ∞, G looks more like G0.


Dirichlet Processes: Stick Breaking Representation

$G \sim \mathrm{DP}(\cdot \mid G_0, \alpha)$. Samples G from a DP can be represented as follows:

$$G(\cdot) = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta_k}(\cdot),$$

where $\theta_k \sim G_0(\cdot)$, $\sum_{k=1}^{\infty} \pi_k = 1$,

$$\pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j), \qquad \beta_k \sim \mathrm{Beta}(\cdot \mid 1, \alpha).$$

[Figure: densities of Beta(1, 1), Beta(1, 10), and Beta(1, 0.5) for the stick-breaking proportions β.]

(Sethuraman, 1994)
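A truncated stick-breaking sketch (the truncation level is only an approximation device, not part of the definition; the base measure below is a standard normal chosen for illustration):

```python
import numpy as np

def stick_breaking_dp(alpha, base_sampler, truncation, rng):
    """Draw a truncated approximation to G ~ DP(alpha, G0)."""
    betas = rng.beta(1.0, alpha, size=truncation)         # beta_k ~ Beta(1, alpha)
    remaining = np.cumprod(np.append(1.0, 1.0 - betas[:-1]))
    weights = betas * remaining                           # pi_k = beta_k * prod_{j<k}(1 - beta_j)
    atoms = base_sampler(truncation)                      # theta_k ~ G0
    return weights, atoms

rng = np.random.default_rng(4)
w, theta = stick_breaking_dp(alpha=2.0, base_sampler=lambda m: rng.normal(0, 1, m),
                             truncation=500, rng=rng)
print(w.sum())   # close to 1 for a large truncation level
```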


Chinese Restaurant Process

[Figure: customers θ1, θ2, . . . seated at restaurant tables, each table k labeled by its parameter φk.]

Generating from a CRP: customer 1 enters the restaurant and sits at table 1; set $\theta_1 = \phi_1$ where $\phi_1 \sim G_0$, K = 1, n = 1, n1 = 1.

For n = 2, . . . :

• customer n sits at table k with probability $\frac{n_k}{n - 1 + \alpha}$ for k = 1, . . . , K, and at table K + 1 with probability $\frac{\alpha}{n - 1 + \alpha}$ (a new table);

• if a new table was chosen, draw $\phi_{K+1} \sim G_0$ and set K ← K + 1;

• set θn to the φk of the table k that customer n sat at; set nk ← nk + 1.

The resulting conditional distribution over θn:

$$\theta_n \mid \theta_1, \dots, \theta_{n-1}, G_0, \alpha \;\sim\; \frac{\alpha}{n - 1 + \alpha}\, G_0(\cdot) + \sum_{k=1}^{K} \frac{n_k}{n - 1 + \alpha}\, \delta_{\phi_k}(\cdot).$$