Bayesian Inference
Chapter 2: Conjugate models

Conchi Ausín and Mike Wiper
Department of Statistics, Universidad Carlos III de Madrid

Advanced Statistics and Data Mining Summer School, 29th June - 10th July, 2015


Objective

In this class we study the situations in which Bayesian statistics is easy!


Conjugate models

Yesterday we looked at a coin tossing example.

We found that a particular beta prior distribution led to a beta posterior.

This is an example of a conjugate family of prior distributions.


Coin tossing problems

In coin tossing problems, the likelihood function has the form

f(x|θ) = c θ^x (1 − θ)^(n−x)

where x is the number of observed heads, n is the number of observed tosses and c is a constant determined by the experimental design.

Therefore, it is clear that a beta prior

f(θ) = (1/B(a, b)) θ^(a−1) (1 − θ)^(b−1)

implies that the posterior is also beta:

f(θ|x) ∝ θ^(a+x−1) (1 − θ)^(b+n−x−1)
       = (1/B(a + x, b + n − x)) θ^(a+x−1) (1 − θ)^(b+n−x−1)

θ|x ∼ Beta(a + x, b + n − x)


Advantages of conjugate priors I: simplicity of calculation

Using a beta prior in this context has a number of advantages.

Given that we know the properties of the beta distribution, prior to posterior inference is equivalent to a change of parameter values. Prediction is also straightforward in the same way.

If θ ∼ Beta(a, b) and X|θ ∼ Binomial(n, θ), then

P(X = x) = (n choose x) B(a + x, b + n − x) / B(a, b)   for x = 0, ..., n.
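This beta-binomial predictive can be checked numerically; a minimal sketch (the values n = 12, a = b = 5 are illustrative choices, not from the slides):

```python
from math import comb
from scipy.special import beta as B
from scipy.stats import betabinom

n, a, b = 12, 5, 5  # illustrative values

# Predictive pmf from the formula on the slide
pmf = [comb(n, x) * B(a + x, b + n - x) / B(a, b) for x in range(n + 1)]

# Cross-check against SciPy's beta-binomial distribution
ref = betabinom(n, a, b).pmf(list(range(n + 1)))
```

The two computations agree term by term, and the pmf sums to one.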


Advantages of conjugate priors II: interpretability

We can see that a (b) in the prior plays the same role as x (n − x).

Therefore we can think of the information represented by the prior as equivalent to the information in a + b tosses of the coin with a heads and b tails.

This gives one way of thinking about how to elicit sensible values for a and b.

To how many tosses of a coin and how many heads does my prior information equate?

A problem is that people are often overconfident.


Prior elicitation

The previous method is a little artificial.

If we are asking a real expert to provide information it is better to ask questions about observable quantities.

For example:

What would be the average number of heads to occur in 100 tosses of thecoin?

What about the standard deviation?

Then assuming a beta prior, we can solve

µ = 100 a/(a + b)
σ = 100 √( ab / ( (a + b)² (a + b + 1) ) )

Many people don't understand means and standard deviations, so it could be even better to ask about modes or medians or quartiles.
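The two elicitation equations invert in closed form: with m = µ/100 and v = (σ/100)², we get a + b = m(1 − m)/v − 1. A minimal sketch (the elicited values 50 and 5 below are made up for illustration):

```python
def beta_from_mean_sd(mu, sigma, n=100):
    """Recover Beta(a, b) parameters from an elicited mean mu and
    standard deviation sigma of the number of heads in n tosses.
    Requires sigma^2 < mu*(n - mu)/n (variance below the feasible bound)."""
    m = mu / n             # elicited mean of theta
    v = (sigma / n) ** 2   # elicited variance of theta
    s = m * (1 - m) / v - 1    # this is a + b
    return m * s, (1 - m) * s

# "50 heads on average in 100 tosses, standard deviation 5"
a, b = beta_from_mean_sd(50, 5)
```

For this input the recovered prior is Beta(49.5, 49.5), and feeding it back through µ = 100 a/(a + b) returns the elicited mean.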


Haldane's prior

Recalling the role of a and b also gives a reasonable way of defining a default, non-informative prior by letting a, b → 0.

In this case we have a prior distribution

f(θ) ∝ 1/( θ(1 − θ) )   for 0 < θ < 1

and the posterior is θ|x ∼ Beta(x, n − x), with mean E[θ|x] = x/n = θ̂, the MLE.

This prior is improper!

Should we care?

What if we only observe a sample of heads (tails)?

Then the posterior would be improper too!

This is a big problem in modern Bayesian statistics.


Other ways of choosing a default “objective” prior

Given the Principle of Insufficient Reason we saw yesterday, a uniform prior seems a natural selection.

However, if we know nothing about θ, shouldn't we also know nothing about ϑ = log( θ/(1 − θ) ), for example?

If θ ∼ Uniform(0, 1), then the laws of probability imply that the density of ϑ is

f(ϑ) = e^ϑ / (1 + e^ϑ)²

which is clearly not uniform.

Uniform priors are sensible as default options for discrete variables but here, it is not so clear.


Jeffreys prior

Let X|θ ∼ f(·|θ). Then the Jeffreys prior is

f(θ) ∝ √I(θ)

where I(θ) = −E_X[ (d²/dθ²) log f(X|θ) ] is the expected Fisher information.

Let X|θ ∼ Binomial(n, θ). Then the Jeffreys prior is θ ∼ Beta(1/2, 1/2).

Let X|θ ∼ Negative Binomial(r, θ). The Jeffreys prior is f(θ) ∝ 1/( θ(1 − θ)^(1/2) ).

The prior depends on the experimental design.

This doesn’t comply with the stopping rule principle!

There is no truly objective prior!
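As a quick check of the binomial case (a standard derivation, not spelled out on the slide), the Beta(1/2, 1/2) form follows directly from the Fisher information:

```latex
\begin{aligned}
\log f(x\mid\theta) &= \mathrm{const} + x\log\theta + (n-x)\log(1-\theta)\\
-\frac{d^2}{d\theta^2}\log f(x\mid\theta) &= \frac{x}{\theta^2} + \frac{n-x}{(1-\theta)^2}\\
I(\theta) &= \frac{n\theta}{\theta^2} + \frac{n(1-\theta)}{(1-\theta)^2}
          = \frac{n}{\theta(1-\theta)}\\
f(\theta) &\propto \sqrt{I(\theta)} \propto \theta^{-1/2}(1-\theta)^{-1/2},
\end{aligned}
```

which is the kernel of a Beta(1/2, 1/2) density.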


Example

The following plot gives the posterior densities of θ for our coin tossing example given the Haldane (blue), uniform (green), Jeffreys I (red) and Jeffreys II (brown) priors.

[Figure: posterior densities of θ under the four priors; x-axis theta, y-axis f.]

Posterior means for θ are 0.75, 0.714, 0.731 and 0.72 respectively.

In small samples the prior can make a (small) difference ...
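These means can be reproduced directly, assuming the data are x = 9 heads in n = 12 tosses (an assumption, but it matches the likelihood θ^9(1 − θ)^3 used in the mixture example later and reproduces all four quoted means):

```python
# Posterior means under the four default priors, assuming x = 9 heads
# in n = 12 tosses. Each prior is a (possibly improper) Beta(a, b), so
# the posterior mean is (a + x)/(a + b + n).
x, n = 9, 12
priors = {
    "Haldane":     (0.0, 0.0),  # Beta(0, 0), improper
    "uniform":     (1.0, 1.0),  # Beta(1, 1)
    "Jeffreys I":  (0.5, 0.5),  # binomial design
    "Jeffreys II": (0.0, 0.5),  # negative binomial design
}
means = {name: (a + x) / (a + b + n) for name, (a, b) in priors.items()}
```

This gives 0.75, 0.714, 0.731 and 0.72 respectively, matching the slide.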


Example

[Figure: posterior densities of θ for the Chinese babies example; x-axis theta, y-axis f.]

... but in our Chinese babies example, it is impossible to differentiate between the posteriors and the posterior means are all equal to 0.5254 to 4 d.p.


Advantages of conjugate priors III: mixtures are still conjugate

A single beta prior might not represent prior beliefs well.

A mixture of k (sufficiently many) betas can.

The posterior is still a mixture of k betas.

Suppose we set f(θ) = 0.5 Beta(5, 5) + 0.5 Beta(8, 1) in the coin tossing problem of yesterday. Then, given the observed data, we have

f(θ|x) ∝ θ^9 (1 − θ)^3 [ 0.5 (1/B(5, 5)) θ^(5−1) (1 − θ)^(5−1) + 0.5 (1/B(8, 1)) θ^(8−1) (1 − θ)^(1−1) ]
       ∝ (1/B(5, 5)) θ^(14−1) (1 − θ)^(8−1) + (1/B(8, 1)) θ^(17−1) (1 − θ)^(4−1)
       = w Beta(14, 8) + (1 − w) Beta(17, 4)

where w = ( B(14, 8)/B(5, 5) ) / ( B(14, 8)/B(5, 5) + B(17, 4)/B(8, 1) ).
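The mixture weight and the implied posterior mean can be evaluated numerically; a sketch using the beta function from SciPy (the component Beta(17, 4) follows from updating Beta(8, 1) with 9 heads and 3 tails):

```python
from scipy.special import beta as B

# Posterior mixture weight for the prior 0.5*Beta(5,5) + 0.5*Beta(8,1)
# after observing 9 heads in 12 tosses.
num = B(14, 8) / B(5, 5)
den = num + B(17, 4) / B(8, 1)
w = num / den

# Posterior mean: weighted average of the two component means.
post_mean = w * 14 / (14 + 8) + (1 - w) * 17 / (17 + 4)
```

The weight comes out close to 0.48, so the two components contribute almost equally a posteriori.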


Example

The plot shows the prior (black), scaled likelihood (blue) and posterior (red) density.

[Figure: prior, scaled likelihood and posterior densities; x-axis theta, y-axis f.]


When do conjugate priors exist?

Conjugate priors are associated with exponential family distributions.

f(x|θ) = C(x) D(θ) exp( E(x)^T F(θ) )

A conjugate prior is then

f(θ) ∝ D(θ)^a exp( b^T F(θ) )

Given a sample of size n,

f(θ|x) ∝ D(θ)^(a+n) exp( (b + nĒ)^T F(θ) )

where Ē = (1/n) Σᵢ E(xᵢ) is the vector of sufficient statistics.

Letting a, b → 0 gives a natural, “objective” prior.
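As a worked instance (assuming the binomial model from earlier; this check is not on the slide), the coin-tossing likelihood fits the template:

```latex
\begin{aligned}
f(x\mid\theta) &= \binom{n}{x}(1-\theta)^{n}
  \exp\!\left(x\log\frac{\theta}{1-\theta}\right),\\
&\quad C(x)=\binom{n}{x},\quad D(\theta)=(1-\theta)^{n},\quad
  E(x)=x,\quad F(\theta)=\log\frac{\theta}{1-\theta},\\
f(\theta) &\propto (1-\theta)^{an}
  \exp\!\left(b\log\frac{\theta}{1-\theta}\right)
  = \theta^{b}(1-\theta)^{an-b},
\end{aligned}
```

which is the Beta(b + 1, an − b + 1) family, as expected.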


Rare events models

Consider models associated with rare events (Poisson process).

The likelihood function takes the form:

f(x|θ) = c θ^n e^(−xθ)

where n represents the number of events to have occurred in a time period of length x and c depends on the experimental design.

Therefore, a gamma distribution θ ∼ Gamma(a, b), that is

f(θ) = ( b^a / Γ(a) ) θ^(a−1) e^(−bθ)   for 0 < θ < ∞

is conjugate.

The posterior distribution is then θ|x ∼ Gamma(a + n, b + x).
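The conjugate update is a two-line function; a sketch with illustrative values (the Gamma(2, 1) prior and 5 events in 10 time units are made up, not from the slides):

```python
def gamma_posterior(a, b, n_events, time):
    """Conjugate update for a Poisson-process rate theta:
    Gamma(a, b) prior plus n events in a period of length `time`
    gives a Gamma(a + n, b + time) posterior."""
    return a + n_events, b + time

a_post, b_post = gamma_posterior(2.0, 1.0, n_events=5, time=10.0)
post_mean = a_post / b_post  # posterior mean of the rate
```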


The information in the prior is easily interpretable: a represents the prior equivalent of the number of rare events to occur in a time period of length b.

Letting a, b → 0 gives the natural default prior f(θ) ∝ 1/θ.

(This is the Jeffreys prior for exponential data but not for Poisson data).

In this case, given n observed events in time x, the posterior is θ|x ∼ Gamma(n, x), with mean n/x, which is equal to the MLE in experiments of this type.


Example: Software failure data

The CSIAC database provides data showing the times between 136 successive software failures. The diagram shows a histogram of the data and a classical, plug-in estimator (blue) of the predictive distribution of x as well as the Bayesian posterior predictive given a Jeffreys prior (red). The Bayesian and classical predictors are indistinguishable.

[Figure: histogram of times between failures with the classical and Bayesian predictive densities; x-axis x, y-axis f.]


Example: Inference for a queueing system

The M/M/1 queuing system assumes arrivals occur according to a Poisson process with rate λ.

There is a single server.

Service occurs on a first come first served basis.

Service times are exponential with mean service time 1/µ.

The system is stable if ρ = λ/µ < 1.

In this case, the equilibrium distribution of the number of people in the system, N, is geometric: N ∼ Geometric(1 − ρ).

Time spent in the system by an arriving customer, W ∼ Exponential(µ− λ).


Example

Hall (1991) provides collected inter-arrival and service time data for 98 users of an automatic teller machine in Berkeley, California. We shall assume that the interarrival times and service times both follow exponential distributions. The sufficient statistics were na = ns = 98, xa = 119.71 and xs = 81.35 minutes.

Given default priors for λ, µ, the posterior distributions are

λ|x ∼ Gamma(98, 119.71) µ|x ∼ Gamma(98, 81.35).


It is easy to calculate the posterior probability that the system is stable ...

... remembering that the ratio of two χ² distributions, each divided by its degrees of freedom, is F distributed.

P(ρ < 1|x) = P( (119.71/81.35) ρ < 119.71/81.35 | x )
           = P( F(196, 196) < 119.71/81.35 )
           = 0.9965.

Given this is so high, it makes sense to consider the equilibrium distributions.
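This probability can be checked with SciPy's F distribution (a sketch: with λ|x ∼ Gamma(98, 119.71) and µ|x ∼ Gamma(98, 81.35), the scaled ratio (119.71/81.35)·ρ follows an F(196, 196) law):

```python
from scipy.stats import f

# Posterior probability that the queue is stable, P(rho < 1 | x)
p_stable = f.cdf(119.71 / 81.35, dfn=196, dfd=196)
```

This reproduces the 0.9965 quoted on the slide.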


[Figures: estimated equilibrium distribution of the number in the system N (x-axis n, y-axis p) and c.d.f. of the time in the system W (x-axis w, y-axis F).]


Normal models

Consider a sample from a normal distribution X|µ, σ ∼ Normal(µ, σ²).

The likelihood function is

f(x|µ, σ) ∝ σ^(−n) exp( −(1/(2σ²)) [ (n − 1)s² + n(x̄ − µ)² ] )

Rewrite in terms of the precision, τ = 1/σ². Then

f(x|µ, τ) ∝ τ^(n/2) exp( −(τ/2) [ (n − 1)s² + n(x̄ − µ)² ] )

Define f(µ, τ) = f(τ) f(µ|τ) and assume τ ∼ Gamma(a/2, b/2) and µ|τ ∼ Normal(m, 1/(cτ)).

The marginal distribution of µ is a (scaled, shifted) Student’s t.


A posteriori, we have

µ|τ, x ∼ Normal( (cm + nx̄)/(c + n), 1/((c + n)τ) )

τ|x ∼ Gamma( (a + n)/2, ( b + (n − 1)s² + (cn/(c + n))(m − x̄)² )/2 )

The conditional posterior precision is the sum of the prior precision (cτ) and the precision of the MLE (nτ).

The posterior mean is a weighted average of the prior mean (m) and the MLE (x̄).

A default prior is obtained by letting a, b, c → 0, which implies f(µ, τ) ∝ 1/τ and

µ|τ, x ∼ Normal( x̄, 1/(nτ) )

τ|x ∼ Gamma( (n − 1)/2, (n − 1)s²/2 )

Then (µ − x̄)/(s/√n) | x ∼ Student's t(n − 1) (boring)


One sample example

The normal core body temperature of a healthy adult is supposed to be 98.6 degrees Fahrenheit or 37 degrees Celsius on average. A normal model for temperatures, say X|µ, τ ∼ Normal(µ, 1/τ), has been proposed.

Mackowiak et al (1992) measured the core body temperatures of 130 individuals.

The sample mean temperature is x̄ = 98.2492 Fahrenheit with standard deviation s = 0.7332.

Thus, a classical 95% confidence interval for µ is

98.2492 ± 1.96 × 0.7332/√130 = (98.1232, 98.3752)

and the hypothesis that the true mean is equal to 98.6 is rejected.


Consider a prior for µ centred on 98.6, for example µ|τ ∼ Normal(98.6, 1/τ) with f(τ) ∝ 1/τ. The posterior mean for µ is 98.2519 Fahrenheit and a 95% credible interval is (98.1251, 98.3787), so that there still appears to be evidence against the hypothesis.
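These numbers can be reproduced from the normal-gamma formulas above; a sketch (under this conjugate prior the marginal posterior for µ is a shifted, scaled Student's t with a + n degrees of freedom):

```python
import math
from scipy.stats import t

# Body-temperature data and the prior mu|tau ~ N(98.6, 1/tau), f(tau) ∝ 1/tau,
# i.e. m = 98.6, c = 1 and a, b -> 0 in the normal-gamma family.
n, xbar, s = 130, 98.2492, 0.7332
m, c, a, b = 98.6, 1.0, 0.0, 0.0

loc = (c * m + n * xbar) / (c + n)                       # posterior mean
q = b + (n - 1) * s**2 + c * n / (c + n) * (m - xbar)**2
df = a + n
scale = math.sqrt(q / (df * (c + n)))                    # t scale parameter

tcrit = t.ppf(0.975, df)
lo, hi = loc - tcrit * scale, loc + tcrit * scale        # 95% credible interval
```

This recovers the posterior mean 98.2519 and an interval matching (98.1251, 98.3787) to rounding.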

Also, the classical 'plug-in' density for X (blue) and the Bayesian posterior predictive density (red) are almost identical.

[Figure: classical plug-in and Bayesian posterior predictive densities for X; x-axis x, y-axis f.]


An odd feature of the conjugate prior

The prior precision of the distribution of µ is proportional to the model precision, τ.

This may be restrictive and unrealistic in practical applications.

A more natural prior for µ might be Normal(m, 1/c), independent of τ.

Then, the joint posterior distribution looks nasty.

f(µ, τ|x) ∝ τ^((a+n)/2 − 1) exp( −(τ/2) [ b + (n − 1)s² + n(x̄ − µ)² ] − (c/2)(µ − m)² )

What can we do?


In our problem, both conditional posterior distributions are available:

µ|τ, x ∼ Normal( (cm + nτx̄)/(c + nτ), 1/(c + nτ) )

τ|µ, x ∼ Gamma( (a + n)/2, ( b + (n − 1)s² + n(x̄ − µ)² )/2 )

Both these distributions are straightforward to sample from.

Can we use this to give a Monte Carlo sample from the posterior?


Introduction to Gibbs sampling

A Gibbs sampler is a technique for sampling a multivariate distribution when it is straightforward to sample from the conditionals.

Assume that we have a distribution f (θ) where θ = (θ1, ..., θk).

Let θ−i represent the remaining elements of θ when θi is removed.

Assume that we can sample from θi |θ−i .


The Gibbs sampler

The Gibbs sampler proceeds by starting from (arbitrary) initial values and successively sampling the conditional distributions.

1. Set initial values θ^(0) = (θ_1^(0), ..., θ_k^(0)). Set t = 0.
2. For i = 1, ..., k:
   2.1 Generate θ_i^(t+1) ∼ θ_i | θ_−i^(t).
   2.2 Set θ^(t) = θ_i^(t+1) ∪ θ_−i^(t).
3. Set t = t + 1.
4. Go to 2.

As t → ∞, the sampled values approach a simple Monte Carlo sample from f(θ).
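A minimal sketch of this sampler for the independent-prior normal model of the previous slide, using the body-temperature data summaries (the starting values, run length and burn-in below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data summaries; independent priors mu ~ N(98.6, 1) and f(tau) ∝ 1/tau,
# i.e. m = 98.6, c = 1 and a = b = 0 in the conditionals.
n, xbar, s = 130, 98.2492, 0.7332
m, c, a, b = 98.6, 1.0, 0.0, 0.0

T, burn = 11_000, 1_000
mu, tau = xbar, 1.0 / s**2          # arbitrary starting values
draws = np.empty(T)
for t in range(T):
    # mu | tau, x ~ Normal((c m + n tau xbar)/(c + n tau), 1/(c + n tau))
    prec = c + n * tau
    mu = rng.normal((c * m + n * tau * xbar) / prec, 1.0 / np.sqrt(prec))
    # tau | mu, x ~ Gamma((a + n)/2) with rate (b + (n-1)s^2 + n(xbar - mu)^2)/2
    rate = (b + (n - 1) * s**2 + n * (xbar - mu) ** 2) / 2.0
    tau = rng.gamma((a + n) / 2.0, 1.0 / rate)  # numpy takes scale = 1/rate
    draws[t] = mu

post = draws[burn:]
interval = np.percentile(post, [2.5, 97.5])
```

The estimated 95% interval agrees closely with the (98.1222, 98.3789) quoted on the next slide.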


The example revisited

Consider now that we use independent priors,

µ ∼ Normal(98.6, 1)   f(τ) ∝ 1/τ

Then an estimated 95% posterior interval for µ, based on a sample of size 10000, is (98.1222, 98.3789), very similar to the previous case. The diagram shows the estimated posterior density (green) and the posterior given the conjugate prior (red).

[Figure: posterior densities of µ under the independent and conjugate priors; x-axis mu, y-axis f.]

Both densities are very similar.


Two samples: the Behrens-Fisher problem

For most simple one and two sample problems, when the usual default prior for µ, τ is used, posterior means and intervals for µ coincide with their frequentist counterparts. An exception is the following two sample problem:

Consider the model

X|µ1, τ1 ∼ N(µ1, 1/τ1),   Y|µ2, τ2 ∼ N(µ2, 1/τ2)

with priors f(µi, τi) ∝ 1/τi and independent samples of size ni for i = 1, 2. Then,

(µ1 − x̄)/(s1/√n1) ∼ Student's t(n1 − 1)

and similarly for µ2.

Therefore, if δ = µ1 − µ2, we have

δ = x̄ − ȳ + (s1/√n1) T1 − (s2/√n2) T2


The distribution of δ is a scaled, shifted difference of two Student's t variables.

Quantiles, ... can be calculated to a given precision by e.g. Monte Carlo.

Writing δ′ = δ/√( s1²/n1 + s2²/n2 ) gives δ′ = sin(w) T1 + cos(w) T2, where w = tan⁻¹( (s1/√n1)/(s2/√n2) ), a Behrens-Fisher distribution.

This problem is difficult to solve classically.

Usually a t approximation to the sampling distribution of δ′ is used, but ...

the quality of the approximation depends on the true variance ratio.
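The Bayesian posterior of δ, by contrast, is trivial to simulate; a sketch (only the group means are quoted on these slides, so the standard deviations and sample sizes below are assumed values for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def delta_sample(xbar, ybar, s1, s2, n1, n2, size=10_000):
    """Monte Carlo draws of delta = mu1 - mu2 under the default priors:
    delta = xbar - ybar + (s1/sqrt(n1)) T1 - (s2/sqrt(n2)) T2,
    with T_i independent Student's t(n_i - 1) variables."""
    t1 = rng.standard_t(n1 - 1, size)
    t2 = rng.standard_t(n2 - 1, size)
    return xbar - ybar + s1 / np.sqrt(n1) * t1 - s2 / np.sqrt(n2) * t2

# The means 98.1046 and 98.3939 are from the example; the sds 0.7, 0.74
# and the 65/65 split are assumed values, not taken from the slides.
d = delta_sample(98.1046, 98.3939, 0.7, 0.74, 65, 65)
ci = np.percentile(d, [2.5, 97.5])
```

With these assumed inputs the simulated interval lands close to the (−0.5448, −0.0368) quoted in the example that follows.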


Example

Returning to the normal body temperature example, the histograms indicate there may be a difference between the sexes.

[Figures: histograms of temperatures for men (left) and women (right); x-axis x, y-axis f.]

The sample means are 98.1046 and 98.3939 respectively.


An approximate 95% confidence interval for the mean difference is (−0.5396, −0.03881), suggesting that the true mean for women is higher than that for men.

Using the Bayesian approach as earlier (based on 10000 simulated values), we have an estimate of the posterior density of δ.

[Figure: estimated posterior density of δ; x-axis delta, y-axis f.]

A Bayesian 95% credible interval is estimated as (−0.5448,−0.0368).


Multinomial models

The multinomial distribution is the extension of the binomial distribution to dice throwing problems.

Assume a die with k faces, where face i has probability θi, is thrown n times. Let X be a k × 1 vector such that Xi is the number of times face i occurs. Then

P(X = x|θ) = ( n! / (x1! ··· xk!) ) θ1^x1 ··· θk^xk,

where x = (x1, ..., xk), xi ∈ Z+, Σᵢ xi = n, and 0 ≤ θi ≤ 1 with Σᵢ θi = 1.

Consider a Dirichlet prior, θ ∼ Dirichlet(a), where a = (a1, ..., ak) and ai > 0:

f(θ) = ( Γ(a1 + ··· + ak) / ( Γ(a1) ··· Γ(ak) ) ) θ1^(a1−1) ··· θk^(ak−1).

Then θ|x ∼ Dirichlet (a + x).


Example

After the recent abdication of the King of Spain in favour of his son, 20minutos.es launched a survey asking whether this was the correct decision (X1 = 3698 votes), whether the King should have waited longer (X2 = 347), or whether he should have considered other options such as a referendum (X3 = 2446).¹

Let θ = (θ1, θ2, θ3) and assume a Dirichlet (1/2, 1/2, 1/2) prior. The posteriordistribution is Dirichlet(3698.5, 347.5, 2446.5).

Consider the difference θ1 − θ3, reflecting (?) the difference between Monarchists and Republicans. We have E[θ1 − θ3|x] = 0.193.
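Both the exact mean and a Monte Carlo credible interval are immediate from the Dirichlet posterior:

```python
import numpy as np

rng = np.random.default_rng(2)

# Posterior Dirichlet(3698.5, 347.5, 2446.5) for the survey example.
a_post = np.array([3698.5, 347.5, 2446.5])

# Exact posterior mean of theta1 - theta3: (a1 - a3) / sum(a)
mean_diff = (a_post[0] - a_post[2]) / a_post.sum()

# Monte Carlo 95% credible interval for theta1 - theta3
theta = rng.dirichlet(a_post, size=10_000)
ci = np.percentile(theta[:, 0] - theta[:, 2], [2.5, 97.5])
```

The interval sits entirely above zero, consistent with the density shown in the plot.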

[Figure: posterior density of θ1 − θ3; x-axis theta1−theta3, y-axis f.]

¹ Votes at 12:30 on 5th May 2014.


The Dirichlet process prior

Sometimes, we do not wish to assume a parametric model for the data generating distribution. How can we do this in a Bayesian context?

Assume X |F ∼ F and define a Dirichlet process prior for F .

If the support of X is C, then for any partition C = C1 ∪ C2 ∪ ... ∪ Ck and k ∈ N, we suppose that

(F(C1), F(C2), ..., F(Ck)) ∼ Dirichlet( aF0(C1), aF0(C2), ..., aF0(Ck) )

where a > 0 and F0 is a baseline, prior mean c.d.f.

We write F ∼ Dirichlet process(a,F0).

Given a sample x1, ..., xn, we have

F|x ∼ Dirichlet process( a + n, (aF0 + nF̂)/(a + n) )

where F̂ is the empirical c.d.f.

The posterior mean is a weighted average of the empirical c.d.f. and the prior mean.
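The posterior mean c.d.f. is a one-line weighted average; a sketch of the uniform-baseline setting used in the next example (the grid resolution and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def dp_posterior_mean_cdf(x, grid, a, F0):
    """Posterior mean c.d.f. under a Dirichlet process prior:
    (a*F0 + n*Fhat)/(a + n), where Fhat is the empirical c.d.f."""
    n = len(x)
    Fhat = np.array([(x <= t).mean() for t in grid])
    return (a * F0(grid) + n * Fhat) / (a + n)

# The setting of the following example: 20 draws from Beta(2, 1),
# uniform baseline F0 on (0, 1), concentration a = 5.
x = rng.beta(2.0, 1.0, size=20)
grid = np.linspace(0.0, 1.0, 101)
F_post = dp_posterior_mean_cdf(x, grid, a=5.0,
                               F0=lambda t: np.clip(t, 0.0, 1.0))
```

The result is a proper c.d.f.: non-decreasing, 0 at the left endpoint and 1 at the right.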


Example

The following plot shows the prior (green), posterior (red), empirical (blue) and true (black) c.d.f.s when 20 data were generated from a Beta(2, 1) distribution and a Dirichlet process prior with a = 5 and F0 a uniform distribution was used.

[Figure: prior, posterior, empirical and true c.d.f.s; x-axis x, y-axis F.]


Summary and next chapter

In this chapter we have illustrated the basic properties of conjugate models. When these exist, they allow for simple interpretation and straightforward inference.

Unfortunately, conjugate priors do not always exist, for example if data are t or F distributed.

Then we need numerical techniques like Gibbs sampling.

We study these in more detail in the next chapter.
