
A Short Review on Bayesian Analysis

• Binomial, Multinomial, Normal, Beta, Dirichlet

• Posterior mean, MAP, credible interval, posterior distribution

• Gibbs sampling

Stat 542 (F. Liang) 1


Frequentist vs Bayesian

Statistical methods such as hypothesis testing (p-values), maximum likelihood estimates (MLE), and confidence intervals (CI) are known as frequentist methods. In the frequentist framework,

• probabilities refer to long-run frequencies and are objective quantities;

• parameters are fixed but unknown constants;

• statistical procedures should have well-defined long-run frequency properties (e.g., a 95% CI).

Stat 542 (F. Liang) 2


There is another approach to inference known as Bayesian inference. In the Bayesian framework,

• probabilities reflect (subjective) personal belief;

• unknown parameters are treated as random variables;

• we make inferences about a parameter θ by producing a probability distribution for θ.

Stat 542 (F. Liang) 3


Bayesian Analysis

Bayesian inference is carried out in the following way:

1. Choose a statistical model $p(x \mid \theta)$, i.e., the likelihood, the same as in the frequentist approach;

2. Choose a prior distribution $\pi(\theta)$;

3. Calculate the posterior distribution $\pi(\theta \mid x)$:

$$\pi(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{\int p(x \mid \theta)\,\pi(\theta)\,d\theta}.$$

Stat 542 (F. Liang) 4


Alternatively, we can write the posterior as

$$\pi(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{\int p(x \mid \theta)\,\pi(\theta)\,d\theta} \;\propto\; p(x \mid \theta)\,\pi(\theta),$$

where we drop the scale factor $\bigl[\int p(x \mid \theta)\,\pi(\theta)\,d\theta\bigr]^{-1}$ since it is a constant not depending on θ.

For example, if $\pi(\theta \mid x) \propto e^{-\theta^2 + 2\theta}$, then we know that $\theta \mid x \sim N(1, 1/2)$.
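The last claim follows by completing the square in the exponent and matching the normal kernel:

$$e^{-\theta^2 + 2\theta} = e^{-(\theta - 1)^2 + 1} \;\propto\; \exp\Bigl\{-\frac{(\theta - 1)^2}{2 \cdot \tfrac{1}{2}}\Bigr\},$$

which is the kernel of the $N(1, 1/2)$ density.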

Stat 542 (F. Liang) 5


Now any inference on the parameter θ can be obtained from the posterior distribution $\pi(\theta \mid x)$. For example, if one wants

• a point estimate of θ, we can report the posterior mean $E(\theta \mid x)$, the posterior median, or the posterior mode $\arg\max_\theta \pi(\theta \mid x)$;

• an interval estimate of θ, we can report the 95% credible interval, which is a region with 0.95 posterior probability. So a 95% credible interval (1.2, 3.5) means that

$$P\bigl(\theta \in (1.2, 3.5)\bigr) = 0.95,$$

where P corresponds to the posterior distribution over θ.

In contrast, for a 95% confidence interval (1.2, 3.5), we CANNOT say that "(1.2, 3.5) covers the true θ with probability 95%."

Stat 542 (F. Liang) 6


• The mode estimate is often referred to as the MAP (maximum a posteriori) estimate, which is the solution of

$$\max_\theta \log \pi(\theta \mid x) = \max_\theta \bigl[\log p(x \mid \theta) + \log \pi(\theta)\bigr].$$

Note that MAP is also the solution of

$$\min_\theta \bigl[-\log p(x \mid \theta) - \log \pi(\theta)\bigr].$$

So MAP is related to the regularization approach, with the penalty term being the negative log of the prior.

Stat 542 (F. Liang) 7


• Your first resistance to Bayesian inference may be the prior choice. Where does one find priors?

• Priors, like the likelihood, are part of your assumptions: the prior is one's initial guess of the parameter; after observing the data, which carry information about the parameter, one updates the prior to the posterior. Priors matter and do not matter: they influence the posterior, but their influence fades as more data are observed.

• Next I'll introduce some default prior choices. Of course, the sensitivity to the prior choice (how different priors affect the final result) should always be examined in practice.

Stat 542 (F. Liang) 8


A Bernoulli Example

• Suppose X = (X1, . . . , Xn) denotes the outcomes from n coin tosses, where 1 means a head and 0 means a tail. They are iid samples from a Bernoulli distribution with parameter θ, where θ is the probability of getting a head.

• Without any information about the coin, we can put a uniform prior on θ, that is,

$$\pi(\theta) = 1, \qquad 0 \le \theta \le 1.$$

Stat 542 (F. Liang) 9


Next we calculate the posterior distribution of θ given X:

$$\pi(\theta \mid X) \;\propto\; \pi(\theta)\prod_{i=1}^{n} p(X_i \mid \theta) \;\propto\; \theta^{s}(1-\theta)^{n-s}, \qquad s = \sum_i X_i,$$

which implies that $\theta \mid X \sim \mathrm{Beta}(s+1,\, n+1-s)$. The corresponding posterior mean can be used as a point estimate for θ,

$$\hat{\theta} = \frac{s+1}{n+2}. \tag{1}$$
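A minimal numerical sketch of this update (assuming NumPy and SciPy are available; the coin tosses below are made up for illustration):

```python
import numpy as np
from scipy import stats

x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])   # hypothetical coin tosses (1 = head)
n, s = len(x), x.sum()

# Uniform prior Beta(1, 1)  =>  posterior Beta(s + 1, n - s + 1)
posterior = stats.beta(s + 1, n - s + 1)

post_mean = (s + 1) / (n + 2)        # posterior mean, equation (1)
mle = s / n                          # observed frequency of heads
ci_95 = posterior.interval(0.95)     # 95% equal-tailed credible interval

print(f"posterior mean = {post_mean:.3f}, MLE = {mle:.3f}, 95% credible interval = {ci_95}")
```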

Stat 542 (F. Liang) 10


$$\text{Post-mean} = \frac{s+1}{n+2}, \qquad \text{MLE} = \frac{s}{n}.$$

The MLE is the observed frequency of heads among n experiments; the Bayes estimator is the frequency of heads among (n + 2) experiments, in which there are two "prior" experiments, one a head and the other a tail.

Without the data, one just looks at the prior experiments, and a reasonable guess for θ is 1/2. After observing the data, the final estimate (1) is some number between 1/2 and s/n, a compromise between the prior information and the MLE. Note that when n gets large, the prior gets "washed out".

Stat 542 (F. Liang) 11


$$\text{Post-mean} = \frac{s+1}{n+2}, \qquad \text{MLE} = \frac{s}{n}.$$

The extra counts (one for head and one for tail) are often called the pseudo-counts. Having pseudo-counts is appealing in cases where θ is likely to take extreme values close to 1 or 0. For example, to estimate θ for a rare event, it is likely to observe Xi = 0 for all i = 1, . . . , n, but it may be dangerous to conclude $\hat{\theta} = 0$.

Stat 542 (F. Liang) 12


Beta distributions are often used as a prior on θ; in fact, the uniform is a special case of the Beta. Suppose the prior on θ is Beta(α, β),

$$\pi(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)},$$

then

$$\theta \mid X \sim \mathrm{Beta}(s + \alpha,\, n - s + \beta). \tag{2}$$

We call Beta distributions the conjugate family for Bernoulli models since both the prior and the posterior distributions belong to the same family.

The posterior mean of (2) is equal to $(\alpha + s)/(n + \alpha + \beta)$. So a Beta prior can be viewed as having (α + β) prior experiments in which we have α heads and β tails.

Stat 542 (F. Liang) 13


A Multinomial Example

• Suppose you randomly draw a card from an ordinary deck of playing cards, and then put it back in the deck. Repeat this exercise five times (i.e., sampling with replacement).

• Let (N1, N2, N3, N4) denote the numbers of spades, hearts, diamonds, and clubs among the five cards. We say (N1, N2, N3, N4) follows a multinomial distribution, with probability distribution function

$$P(N_1 = n_1, \dots, N_4 = n_4 \mid n, \theta_1, \dots, \theta_4) = \frac{n!}{n_1!\,n_2!\,n_3!\,n_4!}\,\theta_1^{n_1}\cdots\theta_4^{n_4},$$

where $n_1 + \cdots + n_4 = n = 5$ and $\theta_1 = \cdots = \theta_4 = 1/4$.

Stat 542 (F. Liang) 14


• In the Bernoulli example, we conduct n independent trials and each trial results in one of two possible outcomes, e.g., head or tail, with probabilities θ and (1 − θ), respectively.

• In the multinomial example, we conduct n independent trials and each trial results in one of k possible outcomes, e.g., k = 4 in the aforementioned card example, with probabilities θ1, . . . , θk, respectively.

Stat 542 (F. Liang) 15


• For multinomial distributions, the parameter of interest is θ = (θ1, . . . , θk), which lies in a simplex of $\mathbb{R}^k$,

$$S = \Bigl\{\theta = (\theta_1, \dots, \theta_k) : \sum_i \theta_i = 1,\ \theta_i \ge 0\Bigr\}.$$

• A Dirichlet distribution on S, Dir(α1, . . . , αk), is an extension of the Beta distribution, with density function

$$p(\theta \mid \alpha) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_k)} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}.$$

The Dirichlet distributions are conjugate priors for Multinomial models.
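A small sketch of the conjugate update: a Dir(α) prior combined with multinomial counts n gives a Dir(α + n) posterior (the counts and prior values below are hypothetical):

```python
import numpy as np
from scipy import stats

alpha = np.array([1.0, 1.0, 1.0, 1.0])     # Dirichlet prior: one pseudo-count per suit
counts = np.array([2, 1, 0, 2])            # hypothetical counts of spades/hearts/diamonds/clubs

alpha_post = alpha + counts                # conjugacy: posterior is Dir(alpha + counts)
post_mean = alpha_post / alpha_post.sum()  # posterior mean of (theta_1, ..., theta_4)

theta_draws = stats.dirichlet(alpha_post).rvs(size=1000)   # posterior samples
print(post_mean, theta_draws.mean(axis=0))
```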

Stat 542 (F. Liang) 16


Dirichlet Distributions

Examples of Dirichlet distributions over p = (p1, p2, p3), which can be plotted in 2D since p3 = 1 − p1 − p2:


A Normal Example

Assume $X_1, \dots, X_n \sim N(\theta, \sigma^2)$. The parameters here are (θ, σ²), and we would like to get the posterior distribution $\pi(\theta, \sigma^2 \mid X)$. If we use the Gibbs sampler, which will be introduced later, we only need to know the conditional posterior distributions $\pi(\theta \mid \sigma^2, X)$ and $\pi(\sigma^2 \mid \theta, X)$.

Stat 542 (F. Liang) 17


For the location parameter θ, the conjugate prior is normal. If

$$\bar{X} \mid \theta, \sigma^2 \sim N(\theta, \sigma^2/n), \qquad \theta \sim N(\mu_0, \tau_0^2),$$

then $\theta \mid \sigma^2, X \sim N(\mu, \tau^2)$, where

$$\mu = w\bar{X} + (1 - w)\mu_0, \qquad w = \frac{1/(\sigma^2/n)}{1/(\sigma^2/n) + 1/\tau_0^2}, \qquad \tau^2 = \Bigl(\frac{1}{\sigma^2/n} + \frac{1}{\tau_0^2}\Bigr)^{-1}.$$
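A quick numerical check of this update (a sketch only; the data and the hyperparameters μ0, τ0², and the assumed-known σ² are made up):

```python
import numpy as np

x = np.array([2.1, 1.7, 2.5, 1.9, 2.2])   # hypothetical observations
n, xbar = len(x), x.mean()

sigma2 = 1.0               # sigma^2 treated as known for this conditional
mu0, tau0_2 = 0.0, 10.0    # prior: theta ~ N(mu0, tau0^2)

w = (1 / (sigma2 / n)) / (1 / (sigma2 / n) + 1 / tau0_2)   # weight on the sample mean
mu_post = w * xbar + (1 - w) * mu0
tau2_post = 1 / (1 / (sigma2 / n) + 1 / tau0_2)

print(f"theta | sigma^2, X ~ N({mu_post:.3f}, {tau2_post:.3f})")
```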

Stat 542 (F. Liang) 18


For the scale parameter σ², the conjugate prior is the Inverse Gamma, that is, the prior on 1/σ² is Gamma. Suppose $\pi(\sigma^2) = \mathrm{InvGa}(\alpha, \beta)$; then

$$\pi(\sigma^2 \mid \mu, X) \;\propto\; \Bigl(\frac{1}{\sigma^2}\Bigr)^{n/2} \exp\Bigl\{-\frac{\sum_i (X_i - \mu)^2}{2\sigma^2}\Bigr\} \cdot \Bigl(\frac{1}{\sigma^2}\Bigr)^{\alpha+1} e^{-\beta/\sigma^2}
\;\sim\; \mathrm{InvGa}\Bigl(\frac{n}{2} + \alpha,\; \frac{\sum_i (X_i - \mu)^2}{2} + \beta\Bigr).$$

In practice, we often specify $\pi(\sigma^2)$ using Inv-$\chi^2$ distributions, which are special cases of InvGa:

$$\mathrm{Inv}\chi^2(v_0, s_0^2) = \mathrm{InvGa}\Bigl(\frac{v_0}{2},\; \frac{v_0}{2}s_0^2\Bigr).$$

With prior $\mathrm{Inv}\chi^2(v_0, s_0^2)$, the posterior distribution is also $\mathrm{Inv}\chi^2(v_n, s_n^2)$, where

$$v_n = v_0 + n, \qquad v_n s_n^2 = v_0 s_0^2 + \sum_i (X_i - \mu)^2.$$

The prior acts like v0 pseudo-samples, each contributing s0² to the residual sum of squares.
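A sketch of drawing σ² from this conditional with SciPy (`invgamma`'s `scale` argument plays the role of β here; the data and hyperparameters are hypothetical):

```python
import numpy as np
from scipy import stats

x = np.array([2.1, 1.7, 2.5, 1.9, 2.2])   # hypothetical observations
mu = 2.0                                   # current value of the mean (e.g., from a Gibbs step)
alpha, beta = 2.0, 1.0                     # prior: sigma^2 ~ InvGa(alpha, beta)

a_post = len(x) / 2 + alpha
b_post = np.sum((x - mu) ** 2) / 2 + beta

sigma2_draw = stats.invgamma(a_post, scale=b_post).rvs()   # one draw from InvGa(a_post, b_post)
print(sigma2_draw)
```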

Stat 542 (F. Liang) 19


Gibbs Sampling for Posterior Inference

Suppose the random variables X and Y have a joint probability density function p(x, y).

Sometimes it is not easy to simulate directly from the joint distribution. Instead, suppose it is possible to simulate from the individual conditional distributions $p_{X|Y}(x \mid y)$ and $p_{Y|X}(y \mid x)$.

Stat 542 (F. Liang) 20


Then a Gibbs sampler draws (X1, Y1), . . . , (XT, YT) as follows:

1. Initialization: let (X0, Y0) be some starting values; set n = 0.

2. Draw $X_{n+1} \sim p_{X|Y}(x \mid Y_n)$.

3. Draw $Y_{n+1} \sim p_{Y|X}(y \mid X_{n+1})$.

4. Set n ← n + 1 and go to step 2. (A sketch for the Normal example is given below.)
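Plugging the two conditionals from the Normal example into this scheme, a minimal sketch (the data and hyperparameters are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=50)      # hypothetical data
n, xbar = len(x), x.mean()

mu0, tau0_2 = 0.0, 100.0               # prior: theta   ~ N(mu0, tau0^2)
alpha, beta = 2.0, 1.0                 # prior: sigma^2 ~ InvGa(alpha, beta)

theta, sigma2 = 0.0, 1.0               # starting values
draws = []
for t in range(6000):
    # draw theta | sigma^2, X  (normal conditional)
    prec = n / sigma2 + 1 / tau0_2
    theta = rng.normal((n / sigma2 * xbar + mu0 / tau0_2) / prec, np.sqrt(1 / prec))
    # draw sigma^2 | theta, X  (inverse gamma conditional, via 1 / Gamma)
    a_post = n / 2 + alpha
    b_post = np.sum((x - theta) ** 2) / 2 + beta
    sigma2 = 1 / rng.gamma(a_post, 1 / b_post)
    draws.append((theta, sigma2))

burned = np.array(draws[1000:])        # drop burn-in samples
print(burned.mean(axis=0))             # posterior means of (theta, sigma^2)
```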

Stat 542 (F. Liang) 21


Gibbs samplers are MCMC algorithms, and they produce samples from the desired distribution after a so-called burn-in period. So in practice, we always drop the samples from the initial steps (say, the first 1000 or 5000 steps) and start saving samples after that.

Stat 542 (F. Liang) 22


Suppose we have multiple parameters θ = (θ1, . . . , θK). Then we can draw the posterior samples of θ using a multi-stage Gibbs sampler.

At each stage, we draw θi from the conditional distribution $\pi(\theta_i \mid \theta_{[-i]}, \text{Data})$, where $\theta_{[-i]}$ denotes the (K − 1) parameters other than θi.

Why Gibbs samplers? In many cases the joint posterior $\pi(\theta \mid \text{Data})$ is not of closed form, while all those conditionals are.

Stat 542 (F. Liang) 23


Revisit the Gaussian Mixture Model

• EM for MAP

• Collapsed Gibbs sampling

• Chinese restaurant process, nonparametric clustering

Stat 542 (F. Liang) 24


A Gaussian Mixture Model

Suppose the data x1, x2, . . . , xn are iid from

$$w\, N(\mu_1, \sigma_1^2) + (1 - w)\, N(\mu_2, \sigma_2^2).$$

For each xi, we introduce a latent variable Zi indicating which component xi is generated from, with

$$P(Z_i = 1) = w, \qquad P(Z_i = 2) = 1 - w.$$

The parameters of interest are $\theta = (w, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$, and their prior distributions are specified as follows:

$$w \sim \mathrm{Be}(1, 1), \qquad \mu_1, \mu_2 \sim N(0, \tau^2), \qquad \sigma_1^2, \sigma_2^2 \sim \mathrm{InvGa}(\alpha, \beta).$$

Stat 542 (F. Liang) 25


EM for MAP

The MAP estimate is defined to be

$$\hat{\theta} = \arg\max_\theta P(x \mid \theta)\,\pi(\theta) = \arg\max_\theta \Bigl[\sum_z P(x, z \mid \theta)\Bigr]\pi(\theta).$$

We can use the EM algorithm. Writing $\phi_{\mu,\sigma^2}$ for the $N(\mu, \sigma^2)$ density,

$$\log\bigl[p(x, Z \mid \theta)\,\pi(\theta)\bigr] = \sum_i 1(Z_i = 1)\bigl[\log \phi_{\mu_1,\sigma_1^2}(x_i) + \log w\bigr] + \sum_i 1(Z_i = 2)\bigl[\log \phi_{\mu_2,\sigma_2^2}(x_i) + \log(1 - w)\bigr]$$
$$\qquad\qquad + \log\pi(w) + \log\pi(\mu_1) + \log\pi(\mu_2) + \log\pi(\sigma_1^2) + \log\pi(\sigma_2^2).$$

Stat 542 (F. Liang) 26


Recall the EM algorithm for MLE:

• At the E-step, we replace 1(Zi = 1) and 1(Zi = 2) by their expectations, i.e., the probabilities of Zi = 1 or 2 conditioning on the data x and the current estimate of the parameter θ0:

$$\gamma_i = P(Z_i = 1 \mid x_i, \theta_0) = \frac{w\,\phi_{\mu_1,\sigma_1^2}(x_i)}{w\,\phi_{\mu_1,\sigma_1^2}(x_i) + (1 - w)\,\phi_{\mu_2,\sigma_2^2}(x_i)};$$

• at the M-step, we update θ;

• and we iterate between the E and M steps until convergence.

Stat 542 (F. Liang) 27


For MAP, the E-step is the same; the M-step is slightly different:

• Without the Beta prior, we would update w by $\gamma_+/n$, where $\gamma_+ = \sum_i \gamma_i$. But with the Beta prior on w, we need to add the pseudo-counts.

• Similarly, without the prior, we would update μ1 by

$$\frac{1}{\gamma_+}\sum_i \gamma_i x_i,$$

but with the prior, we update μ1 by a weighted average of the value above and the prior mean for μ1.
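A compact sketch of MAP EM under these priors (the hyperparameters and data are hypothetical; the closed-form σ² update shown follows from the InvGa prior in the same spirit as the other updates):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1.5, 100)])  # hypothetical data
n = len(x)

a_w, b_w = 2.0, 2.0        # Beta prior on w: pseudo-counts a_w - 1 and b_w - 1
tau2 = 100.0               # prior: mu_k ~ N(0, tau2)
alpha, beta = 2.0, 1.0     # prior: sigma_k^2 ~ InvGa(alpha, beta)

w, mu, sig2 = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])  # initial values
for it in range(200):
    # E-step: gamma_i = P(Z_i = 1 | x_i, current parameters)
    p1 = w * norm.pdf(x, mu[0], np.sqrt(sig2[0]))
    p2 = (1 - w) * norm.pdf(x, mu[1], np.sqrt(sig2[1]))
    g = p1 / (p1 + p2)
    # M-step with the priors folded in (MAP updates)
    g_sum = np.array([g.sum(), (1 - g).sum()])
    w = (g_sum[0] + a_w - 1) / (n + a_w + b_w - 2)      # pseudo-counts from the Beta prior
    for k, gk in enumerate([g, 1 - g]):
        # weighted average of the data term and the prior mean 0
        mu[k] = (gk @ x / sig2[k]) / (g_sum[k] / sig2[k] + 1 / tau2)
        sig2[k] = (gk @ (x - mu[k]) ** 2 + 2 * beta) / (g_sum[k] + 2 * alpha + 2)

print(w, mu, sig2)
```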

Stat 542 (F. Liang) 28


Gibbs Sampling from the Posterior Distribution

$$\pi(\theta, Z \mid x) \;\propto\; \prod_{i=1}^{n} \bigl[p(Z_i \mid \theta)\, p(x_i \mid Z_i, \theta)\bigr] \times \pi(w)\,\pi(\mu_1)\,\pi(\mu_2)\,\pi(\sigma_1^2)\,\pi(\sigma_2^2).$$

In a Gibbs sampler, we iteratively sample each element of

$$(w, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, Z_1, \dots, Z_n)$$

conditioning on the other elements being fixed.

Stat 542 (F. Liang) 29


For example,

• How to sample Zi? Bernoulli:

$$P(Z_i = 1 \mid x, Z_{[-i]}, \text{others}) \;\propto\; P(Z_i = 1 \mid \theta) \times P(x_i \mid Z_i = 1, \theta) = w\, p(x_i \mid \mu_1, \sigma_1^2).$$

• How to sample μ1? Normal:

$$\pi(\mu_1 \mid x, Z, \text{others}) \;\propto\; \Bigl[\prod_{i: Z_i = 1} P(X_i \mid \mu_1, \sigma_1^2)\Bigr] \times \pi(\mu_1).$$

Stat 542 (F. Liang) 30


For a general Gaussian mixture model with K components, we put the prior on the mixing weights as

$$w = (w_1, \dots, w_K) \sim \mathrm{Dir}\Bigl(\frac{\alpha}{K}, \dots, \frac{\alpha}{K}\Bigr),$$

and again, normal priors on the μk's and InvGa priors on the σk²'s.

The Gibbs sampler iterates the following steps (one sweep is sketched below):

1. Draw Zi from a Multinomial distribution for i = 1, . . . , n;

2. Draw w from a Dirichlet;

3. Draw μ1, . . . , μK from Normals;

4. Draw σ1², . . . , σK² from InvGa.
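A condensed sketch of one sweep of this sampler (the data and hyperparameters are hypothetical; the conditionals for each μk and σk² reuse the conjugate updates from the Normal example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1.5, 100)])  # hypothetical data
n, K = len(x), 3
alpha, tau2, a0, b0 = 1.0, 100.0, 2.0, 1.0    # Dir(alpha/K, ...), N(0, tau2), InvGa(a0, b0)

z = rng.integers(K, size=n)                   # initial cluster labels
w, mu, sig2 = np.full(K, 1 / K), rng.normal(size=K), np.ones(K)

for sweep in range(2000):
    # 1. draw each Z_i from a multinomial over the K components
    lik = np.column_stack([w[k] * stats.norm.pdf(x, mu[k], np.sqrt(sig2[k])) for k in range(K)])
    prob = lik / lik.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=prob[i]) for i in range(n)])
    counts = np.bincount(z, minlength=K)
    # 2. draw w from its Dirichlet conditional
    w = rng.dirichlet(alpha / K + counts)
    # 3.-4. draw mu_k and sigma_k^2 from their conjugate conditionals
    for k in range(K):
        xk = x[z == k]
        prec = len(xk) / sig2[k] + 1 / tau2
        mu[k] = rng.normal((xk.sum() / sig2[k]) / prec, np.sqrt(1 / prec))
        sig2[k] = 1 / rng.gamma(a0 + len(xk) / 2, 1 / (b0 + ((xk - mu[k]) ** 2).sum() / 2))
```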

Stat 542 (F. Liang) 31


Collapsed Gibbs sampling

The mixing weights w are used in sampling the Zi's:

$$P(Z_i = k \mid x, z_{[-i]}, \theta) \;\propto\; P(x_i \mid Z_i = k, \theta)\, P(Z_i = k \mid z_{[-i]}, \theta),$$

where the second term is equal to $P(Z_i = k \mid z_{[-i]}, w) = P(Z_i = k \mid w)$, i.e., conditioning on w, the Zi's are independent.

Stat 542 (F. Liang) 32


Let's eliminate w from the parameter list, i.e., integrate over w. So we need to compute

$$P(Z_i = k \mid z_{[-i]}) = \frac{P(z_1, \dots, z_{i-1}, Z_i = k, z_{i+1}, \dots, z_n)}{P(z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_n)}.$$

Note that the Zi's are exchangeable (the meaning of this will be made clear in class). So it suffices to compute the sampling distribution for the last observation.

Stat 542 (F. Liang) 33


$$P(Z_n = k \mid z_1, \dots, z_{n-1}) = \frac{P(z_1, \dots, z_{n-1}, Z_n = k)}{P(z_1, \dots, z_{n-1})}$$
$$= \frac{\int w_1^{n_1 + \alpha/K - 1} \cdots w_k^{n_k + 1 + \alpha/K - 1} \cdots w_K^{n_K + \alpha/K - 1}\, dw}{\int w_1^{n_1 + \alpha/K - 1} \cdots w_k^{n_k + \alpha/K - 1} \cdots w_K^{n_K + \alpha/K - 1}\, dw}$$
$$= \frac{\Gamma(n - 1 + \alpha)}{\Gamma(n - 1 + \alpha + 1)} \cdot \frac{\Gamma(n_k + \alpha/K + 1)}{\Gamma(n_k + \alpha/K)} = \frac{n_k + \alpha/K}{n - 1 + \alpha},$$

where $n_k = \#\{j : z_j = k,\ j = 1, \dots, n-1\}$.

Stat 542 (F. Liang) 34


The collapsed Gibbs sampler iterates the following steps (the label update is sketched below):

1. Draw Zi from a Multinomial(γi1, . . . , γiK) for i = 1, . . . , n, where

$$\gamma_{ik} \;\propto\; \frac{n_k^{(i)} + \alpha/K}{n - 1 + \alpha} \times P(x_i \mid \mu_k, \sigma_k^2),$$

and $n_k^{(i)}$ is the number of observations other than i currently assigned to cluster k;

2. there is no need to sample w from the Dirichlet;

3. Draw μ1, . . . , μK from Normals;

4. Draw σ1², . . . , σK² from InvGa.
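A sketch of the collapsed label update only (the μk, σk² draws are unchanged from the previous sampler; function and variable names are hypothetical):

```python
import numpy as np
from scipy import stats

def collapsed_z_sweep(x, z, mu, sig2, alpha, K, rng):
    """One sweep of collapsed label updates: the weights w have been integrated out."""
    n = len(x)
    counts = np.bincount(z, minlength=K)
    for i in range(n):
        counts[z[i]] -= 1                          # n_k^{(i)}: counts excluding observation i
        prior = (counts + alpha / K) / (n - 1 + alpha)
        lik = stats.norm.pdf(x[i], mu, np.sqrt(sig2))
        prob = prior * lik
        z[i] = rng.choice(K, p=prob / prob.sum())
        counts[z[i]] += 1
    return z
```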

Stat 542 (F. Liang) 35


Infinitely Many Clusters

K = a large value (> n)?

At the t-th iteration of the algorithm, there must be some empty clusters. Let's re-label the clusters: 1 to K* are the non-empty clusters (of course, the value of K* changes from iteration to iteration), and the remaining ones are empty.

• Update $\{\mu_k, \sigma_k^2\}_{k=1}^{K^*}$ from the posterior (Normal/InvGa);

• Update $\{\mu_k, \sigma_k^2\}_{k=K^*+1}^{K}$ from the prior (Normal/InvGa).

Stat 542 (F. Liang) 36


• Update Zi from a Multinomial(γi1, . . . , γiK), where

$$\gamma_{ik} \;\propto\; \frac{n_k^{(i)} + \alpha/K}{n - 1 + \alpha} \times P(x_i \mid \mu_k, \sigma_k^2).$$

Zi may start a new cluster, i.e., Zi > K*.

It doesn't matter which value Zi takes from (K* + 1, . . . , K), since we can always label this new cluster as the (K* + 1)-th cluster, and then immediately update $(\mu_{K^*+1}, \sigma_{K^*+1}^2)$ based on the corresponding posterior (conditioning on xi).

Stat 542 (F. Liang) 37


What's the chance that Zi starts a new cluster?

$$\sum_{k=K^*+1}^{K} \gamma_{ik} = \frac{\alpha/K}{n - 1 + \alpha} \sum_{k=K^*+1}^{K} P(x_i \mid \mu_k, \sigma_k^2)$$
$$= \frac{K - K^*}{K} \cdot \frac{\alpha}{n - 1 + \alpha}\,\Bigl[\frac{1}{K - K^*}\sum_{k=K^*+1}^{K} P(x_i \mid \mu_k, \sigma_k^2)\Bigr]$$
$$\longrightarrow\; \frac{\alpha}{n - 1 + \alpha} \iint P(x_i \mid \mu, \sigma^2)\,\pi(\mu, \sigma^2)\, d\mu\, d\sigma^2$$

as $K \to \infty$, since the empty-cluster parameters are drawn from the prior, where

$$\iint P(x_i \mid \mu, \sigma^2)\,\pi(\mu, \sigma^2)\, d\mu\, d\sigma^2 = m(x_i)$$

is the integrated (with respect to our prior π) likelihood for a sample xi.

Stat 542 (F. Liang) 38


Clustering with Chinese Restaurant Process (CRP)

A mixture model:

$$X_i \mid Z_i = k \;\sim\; P(\cdot \mid \mu_k, \sigma_k^2), \qquad i = 1, \dots, n.$$

Prior on Z1, . . . , Zn and the $(\mu_k, \sigma_k^2)$'s (known as the Chinese Restaurant Process):

• Z1 = 1 and $(\mu_1, \sigma_1^2) \sim \pi(\cdot)$;

• for i ≥ 1, suppose Z1, . . . , Zi form m clusters with sizes n1, . . . , nm; then

$$P(Z_{i+1} = k \mid Z_1, \dots, Z_i) = \frac{n_k}{i + \alpha}, \quad k = 1, \dots, m;$$
$$P(Z_{i+1} = m + 1 \mid Z_1, \dots, Z_i) = \frac{\alpha}{i + \alpha}, \qquad (\mu_{m+1}, \sigma_{m+1}^2) \sim \pi(\cdot).$$

Stat 542 (F. Liang) 39


Alternatively, you can describe the prior first and then the likelihood, which gives a clear idea of how the data are generated (a code sketch follows these steps):

• Set Z1 = 1, generate $(\mu_1, \sigma_1^2) \sim \pi(\cdot)$ and $X_1 \sim P(\cdot \mid \mu_1, \sigma_1^2)$.

• Loop over i = 1, . . . , n − 1: suppose the previous i samples form m clusters with cluster-specific parameters $(\mu_k, \sigma_k^2)_{k=1}^{m}$; then

$$P(Z_{i+1} = k \mid Z_1, \dots, Z_i) = \frac{n_k}{i + \alpha}, \quad k = 1, \dots, m;$$
$$P(Z_{i+1} = m + 1 \mid Z_1, \dots, Z_i) = \frac{\alpha}{i + \alpha}.$$

If $Z_{i+1} = m + 1$, generate $(\mu_{m+1}, \sigma_{m+1}^2) \sim \pi(\cdot)$. Then generate $X_{i+1} \sim P(\cdot \mid \mu_{Z_{i+1}}, \sigma_{Z_{i+1}}^2)$.
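A generative sketch of this process, with π(·) taken to be the Normal/InvGa prior used earlier (all hyperparameter values are hypothetical):

```python
import numpy as np

def crp_generate(n, alpha, rng, tau2=4.0, a0=2.0, b0=1.0):
    """Generate cluster labels, cluster parameters, and data from the CRP mixture prior."""
    z, mus, sig2s, x = [0], [], [], []
    sig2s.append(1 / rng.gamma(a0, 1 / b0))             # (mu_1, sigma_1^2) ~ pi(.)
    mus.append(rng.normal(0, np.sqrt(tau2)))
    x.append(rng.normal(mus[0], np.sqrt(sig2s[0])))     # X_1 ~ P(. | mu_1, sigma_1^2)
    for i in range(1, n):
        counts = np.bincount(z)                         # cluster sizes n_1, ..., n_m
        probs = np.append(counts, alpha) / (i + alpha)  # existing clusters vs. a new cluster
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                            # a new cluster is started
            sig2s.append(1 / rng.gamma(a0, 1 / b0))
            mus.append(rng.normal(0, np.sqrt(tau2)))
        z.append(k)
        x.append(rng.normal(mus[k], np.sqrt(sig2s[k])))
    return np.array(x), np.array(z)

x, z = crp_generate(n=200, alpha=1.0, rng=np.random.default_rng(3))
print(len(np.unique(z)), "clusters formed")
```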

Stat 542 (F. Liang) 40


Advantages

• We do not need to specify K.

• K is treated as a random variable, and its (posterior) distribution is learned from the data.

• Can model unseen data: for any new sample X*, there is always a positive chance that it can start a new cluster.

Stat 542 (F. Liang) 41


Exchangeability of Zi’s

In the CRP, the labels Zi are generated sequentially, but in fact they are exchangeable (up to a permutation of the cluster labels; labels should start from 1, 2, . . .):

$$P(11122) = P(12121) = P(12221) = \frac{\alpha\,(2!)(1!)}{(1 + \alpha)(2 + \alpha)\cdots(4 + \alpha)}.$$

In general, suppose z1, . . . , zn form m clusters with sizes n1, . . . , nm; then

$$P(z_1, \dots, z_n) = \frac{\alpha^{m-1}\,(n_1 - 1)! \cdots (n_m - 1)!}{\prod_{i=2}^{n}(i - 1 + \alpha)}.$$

So the order of the zi's doesn't matter; what matters is the partition of the n samples implied by the zi's.

Stat 542 (F. Liang) 42


Posterior Sampling for Clustering with CRP

Same as the Gibbs sampler we have derived for K → ∞. At the t-th iteration, repeat the following:

• Suppose $Z_{[-i]}$ forms K clusters, labeled from 1 to K, of sizes $n_k^{(i)}$. Sample Zi from a Multinomial with

$$P(Z_i = k) \;\propto\; \frac{n_k^{(i)}}{n - 1 + \alpha} \times P(x_i \mid \mu_k, \sigma_k^2), \quad k = 1, \dots, K,$$
$$P(Z_i = K + 1) \;\propto\; \frac{\alpha}{n - 1 + \alpha}\, m(x_i),$$

where $m(x_i) = \iint P(x_i \mid \mu, \sigma^2)\,\pi(\mu, \sigma^2)\, d\mu\, d\sigma^2$.

• Update $\{\mu_k, \sigma_k^2\}$ from the posterior (Normal/InvGa).

Stat 542 (F. Liang) 43


• The exchangeability of the Zi's plays an important role in the algorithm. Where do we use this property?

• The marginal likelihood m(·) is easy to compute if the prior π(μ, σ²) is conjugate; otherwise, we need to figure out a way to compute m(·).

• For other MCMC algorithms, check Neal (2000); for Variational Bayes, check Blei and Jordan (2004).

• The "ugly" side: the labeling issue.

Stat 542 (F. Liang) 44


Non-parametric Bayesian (NB) Models

• The finite mixture model:

$$X_i \mid Z_i = k \;\sim\; P_{\theta_k^*}, \qquad P(Z_i = k) = w_k,$$
$$\theta_k^* \overset{iid}{\sim} G_0, \qquad w \sim \mathrm{Dir}(\alpha/K, \dots, \alpha/K).$$

• Alternatively,

$$X_i \mid \theta_i \sim P_{\theta_i}, \qquad \theta_i \mid G \sim G,$$
$$G(\cdot) = \sum_{k=1}^{K} w_k\, \delta_{\theta_k^*}(\cdot), \qquad w \sim \mathrm{Dir}\Bigl(\frac{\alpha}{K}, \cdots\Bigr), \qquad \theta_k^* \sim G_0.$$

The prior on G is a K-element discrete distribution. In the NB approach, we'll drop this restriction.

Stat 542 (F. Liang) 45


• An NB approach for clustering:

$$X_i \mid \theta_i \sim P_{\theta_i}, \qquad \theta_i \mid G \sim G, \qquad G \sim \mathrm{DP}(\alpha, G_0),$$

where DP(α, G0) denotes a Dirichlet Process with scale (precision) parameter α and base measure G0.

Stat 542 (F. Liang) 46


Dirichlet Process (DP)

$$\theta_i \mid G \;\overset{iid}{\sim}\; G, \qquad G \sim \mathrm{DP}(\alpha, G_0).$$

• Define the DP as a distribution over distributions (Ferguson, 1973).

• Describe the DP as a stick-breaking process (Sethuraman, 1994).

• If we integrate over G (with respect to the DP), the resulting prior on $(\theta_1, \dots, \theta_n)$,

$$\pi(\theta_1, \dots, \theta_n) = \int \prod_{i=1}^{n} G(\theta_i)\, dP(G),$$

where P denotes the DP(α, G0) prior on G, is the Chinese restaurant process (CRP).



Dirichlet Process

$G \sim \mathrm{DP}(\cdot \mid G_0, \alpha)$. OK, but what does it look like?

Samples from a DP are discrete with probability one:

$$G(\theta) = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta_k}(\theta),$$

where $\delta_{\theta_k}(\cdot)$ is a Dirac delta at θk, and $\theta_k \sim G_0(\cdot)$.

Note: E(G) = G0. As α → ∞, G looks more like G0.


Dirichlet Processes: Stick Breaking Representation

$G \sim \mathrm{DP}(\cdot \mid G_0, \alpha)$. Samples G from a DP can be represented as follows:

$$G(\cdot) = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta_k}(\cdot),$$

where $\theta_k \sim G_0(\cdot)$, $\sum_{k=1}^{\infty} \pi_k = 1$,

$$\pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j), \qquad \beta_k \sim \mathrm{Beta}(\cdot \mid 1, \alpha).$$

[Figure: densities of Beta(1, 1), Beta(1, 10), and Beta(1, 0.5) for the stick-breaking proportions β.]

(Sethuraman, 1994)
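A truncated stick-breaking sketch (the truncation level is only an approximation device, not part of the definition; the base measure below is a standard normal chosen for illustration):

```python
import numpy as np

def stick_breaking_dp(alpha, base_sampler, truncation, rng):
    """Draw a truncated approximation to G ~ DP(alpha, G0)."""
    betas = rng.beta(1.0, alpha, size=truncation)         # beta_k ~ Beta(1, alpha)
    remaining = np.cumprod(np.append(1.0, 1.0 - betas[:-1]))
    weights = betas * remaining                           # pi_k = beta_k * prod_{j<k}(1 - beta_j)
    atoms = base_sampler(truncation)                      # theta_k ~ G0
    return weights, atoms

rng = np.random.default_rng(4)
w, theta = stick_breaking_dp(alpha=2.0, base_sampler=lambda m: rng.normal(0, 1, m),
                             truncation=500, rng=rng)
print(w.sum())   # close to 1 for a large truncation level
```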


Chinese Restaurant Process

[Figure: customers θ1, θ2, . . . seated at restaurant tables, each table k labeled by its parameter φk.]

Generating from a CRP: customer 1 enters the restaurant and sits at table 1; set $\theta_1 = \phi_1$ where $\phi_1 \sim G_0$, K = 1, n = 1, n1 = 1.

For n = 2, . . . :

• customer n sits at table k with probability $\frac{n_k}{n - 1 + \alpha}$ for k = 1, . . . , K, and at table K + 1 with probability $\frac{\alpha}{n - 1 + \alpha}$ (a new table);

• if a new table was chosen, draw $\phi_{K+1} \sim G_0$ and set K ← K + 1;

• set θn to the φk of the table k that customer n sat at; set nk ← nk + 1.

The resulting conditional distribution over θn:

$$\theta_n \mid \theta_1, \dots, \theta_{n-1}, G_0, \alpha \;\sim\; \frac{\alpha}{n - 1 + \alpha}\, G_0(\cdot) + \sum_{k=1}^{K} \frac{n_k}{n - 1 + \alpha}\, \delta_{\phi_k}(\cdot).$$