Bayesian Inference
Chapter 2: Conjugate models

Conchi Ausín and Mike Wiper
Department of Statistics, Universidad Carlos III de Madrid

Advanced Statistics and Data Mining Summer School, 29th June - 10th July, 2015


Objective

In this class we study the situations in which Bayesian statistics is easy!


Conjugate models

Yesterday we looked at a coin tossing example.

We found that a particular beta prior distribution led to a beta posterior.

This is an example of a conjugate family of prior distributions.


Coin tossing problems

In coin tossing problems, the likelihood function has the form

f(x|θ) = c θ^x (1 − θ)^(n−x)

where x is the number of observed heads, n is the number of observed tosses and c is a constant determined by the experimental design.

Therefore, it is clear that a beta prior

f(θ) = (1/B(a, b)) θ^(a−1) (1 − θ)^(b−1)

implies that the posterior is also beta:

f(θ|x) ∝ θ^(a+x−1) (1 − θ)^(b+n−x−1)
       = (1/B(a + x, b + n − x)) θ^(a+x−1) (1 − θ)^(b+n−x−1)

θ|x ∼ Beta(a + x, b + n − x)


Advantages of conjugate priors I: simplicity of calculation

Using a beta prior in this context has a number of advantages.

Given that we know the properties of the beta distribution, prior to posterior inference is equivalent to a change of parameter values. Prediction is also straightforward in the same way.

If θ ∼ Beta(a, b) and X|θ ∼ Binomial(n, θ), then

P(X = x) = (n choose x) B(a + x, b + n − x) / B(a, b)   for x = 0, ..., n.
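This beta-binomial predictive can be checked numerically; a minimal sketch (the values n = 12, a = b = 5 are illustrative choices, not from the slides):

```python
from math import comb
from scipy.special import beta as B
from scipy.stats import betabinom

n, a, b = 12, 5, 5  # illustrative values

# Predictive pmf from the formula on the slide
pmf = [comb(n, x) * B(a + x, b + n - x) / B(a, b) for x in range(n + 1)]

# Cross-check against SciPy's beta-binomial distribution
ref = betabinom(n, a, b).pmf(list(range(n + 1)))
```

The two computations agree term by term, and the pmf sums to one.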


Advantages of conjugate priors II: interpretability

We can see that a (b) in the prior plays the same role as x (n − x).

Therefore we can think of the information represented by the prior as equivalent to the information in a + b tosses of the coin with a heads and b tails.

This gives one way of thinking about how to elicit sensible values for a and b.

To how many tosses of a coin and how many heads does my prior information equate?

A problem is that people are often overconfident.


Prior elicitation

The previous method is a little artificial.

If we are asking a real expert to provide information it is better to ask questions about observable quantities.

For example:

What would be the average number of heads to occur in 100 tosses of thecoin?

What about the standard deviation?

Then assuming a beta prior, we can solve

µ = 100 a/(a + b)
σ = 100 √( ab / ( (a + b)² (a + b + 1) ) )

Many people don't understand means and standard deviations, so it could be even better to ask about modes or medians or quartiles.
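The two elicitation equations invert in closed form: with m = µ/100 and v = (σ/100)², we get a + b = m(1 − m)/v − 1. A minimal sketch (the elicited values 50 and 5 below are made up for illustration):

```python
def beta_from_mean_sd(mu, sigma, n=100):
    """Recover Beta(a, b) parameters from an elicited mean mu and
    standard deviation sigma of the number of heads in n tosses.
    Requires sigma^2 < mu*(n - mu)/n (variance below the feasible bound)."""
    m = mu / n             # elicited mean of theta
    v = (sigma / n) ** 2   # elicited variance of theta
    s = m * (1 - m) / v - 1    # this is a + b
    return m * s, (1 - m) * s

# "50 heads on average in 100 tosses, standard deviation 5"
a, b = beta_from_mean_sd(50, 5)
```

For this input the recovered prior is Beta(49.5, 49.5), and feeding it back through µ = 100 a/(a + b) returns the elicited mean.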


Haldane's prior

Recalling the role of a and b also gives a reasonable way of defining a default, non-informative prior by letting a, b → 0.

In this case we have a prior distribution

f(θ) ∝ 1/( θ(1 − θ) )   for 0 < θ < 1

and the posterior is θ|x ∼ Beta(x, n − x), with mean E[θ|x] = x/n = θ̂, the MLE.

This prior is improper!

Should we care?

What if we only observe a sample of heads (tails)?

Then the posterior would be improper too!

This is a big problem in modern Bayesian statistics.


Other ways of choosing a default “objective” prior

Given the Principle of Insufficient Reason we saw yesterday, a uniform prior seems a natural selection.

However, if we know nothing about θ, shouldn't we also know nothing about ϑ = log( θ/(1 − θ) ), for example?

If θ ∼ Uniform(0, 1), then the laws of probability imply that the density of ϑ is

f(ϑ) = e^ϑ / (1 + e^ϑ)²

which is clearly not uniform.

Uniform priors are sensible as default options for discrete variables but here, it is not so clear.


Jeffreys prior

Let X|θ ∼ f(·|θ). Then the Jeffreys prior is

f(θ) ∝ √I(θ)

where I(θ) = −E_X[ (d²/dθ²) log f(X|θ) ] is the expected Fisher information.

Let X|θ ∼ Binomial(n, θ). Then the Jeffreys prior is θ ∼ Beta(1/2, 1/2).

Let X|θ ∼ Negative Binomial(r, θ). The Jeffreys prior is f(θ) ∝ 1/( θ(1 − θ)^(1/2) ).

The prior depends on the experimental design.

This doesn’t comply with the stopping rule principle!

There is no truly objective prior!
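As a quick check of the binomial case (a standard derivation, not spelled out on the slide), the Beta(1/2, 1/2) form follows directly from the Fisher information:

```latex
\begin{aligned}
\log f(x\mid\theta) &= \mathrm{const} + x\log\theta + (n-x)\log(1-\theta)\\
-\frac{d^2}{d\theta^2}\log f(x\mid\theta) &= \frac{x}{\theta^2} + \frac{n-x}{(1-\theta)^2}\\
I(\theta) &= \frac{n\theta}{\theta^2} + \frac{n(1-\theta)}{(1-\theta)^2}
          = \frac{n}{\theta(1-\theta)}\\
f(\theta) &\propto \sqrt{I(\theta)} \propto \theta^{-1/2}(1-\theta)^{-1/2},
\end{aligned}
```

which is the kernel of a Beta(1/2, 1/2) density.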


Example

The following plot gives the posterior densities of θ for our coin tossing example given the Haldane (blue), uniform (green), Jeffreys I (red) and Jeffreys II (brown) priors.

[Figure: posterior densities of θ under the four priors; x-axis theta, y-axis f.]

Posterior means for θ are 0.75, 0.714, 0.731 and 0.72 respectively.

In small samples the prior can make a (small) difference ...
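These means can be reproduced directly, assuming the data are x = 9 heads in n = 12 tosses (an assumption, but it matches the likelihood θ^9(1 − θ)^3 used in the mixture example later and reproduces all four quoted means):

```python
# Posterior means under the four default priors, assuming x = 9 heads
# in n = 12 tosses. Each prior is a (possibly improper) Beta(a, b), so
# the posterior mean is (a + x)/(a + b + n).
x, n = 9, 12
priors = {
    "Haldane":     (0.0, 0.0),  # Beta(0, 0), improper
    "uniform":     (1.0, 1.0),  # Beta(1, 1)
    "Jeffreys I":  (0.5, 0.5),  # binomial design
    "Jeffreys II": (0.0, 0.5),  # negative binomial design
}
means = {name: (a + x) / (a + b + n) for name, (a, b) in priors.items()}
```

This gives 0.75, 0.714, 0.731 and 0.72 respectively, matching the slide.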


Example

[Figure: posterior densities of θ for the Chinese babies example; x-axis theta, y-axis f.]

... but in our Chinese babies example, it is impossible to differentiate between the posteriors and the posterior means are all equal to 0.5254 to 4 d.p.


Advantages of conjugate priors III: mixtures are still conjugate

A single beta prior might not represent prior beliefs well.

A mixture of k (sufficiently many) betas can.

The posterior is still a mixture of k betas.

Suppose we set f(θ) = 0.5 Beta(5, 5) + 0.5 Beta(8, 1) in the coin tossing problem of yesterday. Then, given the observed data, we have

f(θ|x) ∝ θ^9 (1 − θ)^3 [ 0.5 (1/B(5, 5)) θ^(5−1) (1 − θ)^(5−1) + 0.5 (1/B(8, 1)) θ^(8−1) (1 − θ)^(1−1) ]
       ∝ (1/B(5, 5)) θ^(14−1) (1 − θ)^(8−1) + (1/B(8, 1)) θ^(17−1) (1 − θ)^(4−1)
       = w Beta(14, 8) + (1 − w) Beta(17, 4)

where w = ( B(14, 8)/B(5, 5) ) / ( B(14, 8)/B(5, 5) + B(17, 4)/B(8, 1) ).
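The mixture weight and the implied posterior mean can be evaluated numerically; a sketch using the beta function from SciPy (the component Beta(17, 4) follows from updating Beta(8, 1) with 9 heads and 3 tails):

```python
from scipy.special import beta as B

# Posterior mixture weight for the prior 0.5*Beta(5,5) + 0.5*Beta(8,1)
# after observing 9 heads in 12 tosses.
num = B(14, 8) / B(5, 5)
den = num + B(17, 4) / B(8, 1)
w = num / den

# Posterior mean: weighted average of the two component means.
post_mean = w * 14 / (14 + 8) + (1 - w) * 17 / (17 + 4)
```

The weight comes out close to 0.48, so the two components contribute almost equally a posteriori.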


Example

The plot shows the prior (black), scaled likelihood (blue) and posterior (red) density.

[Figure: prior, scaled likelihood and posterior densities; x-axis theta, y-axis f.]


When do conjugate priors exist?

Conjugate priors are associated with exponential family distributions.

f(x|θ) = C(x) D(θ) exp( E(x)^T F(θ) )

A conjugate prior is then

f(θ) ∝ D(θ)^a exp( b^T F(θ) )

Given a sample of size n,

f(θ|x) ∝ D(θ)^(a+n) exp( (b + nĒ)^T F(θ) )

where Ē = (1/n) Σᵢ E(xᵢ) is the vector of sufficient statistics.

Letting a, b → 0 gives a natural, “objective” prior.
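As a worked instance (assuming the binomial model from earlier; this check is not on the slide), the coin-tossing likelihood fits the template:

```latex
\begin{aligned}
f(x\mid\theta) &= \binom{n}{x}(1-\theta)^{n}
  \exp\!\left(x\log\frac{\theta}{1-\theta}\right),\\
&\quad C(x)=\binom{n}{x},\quad D(\theta)=(1-\theta)^{n},\quad
  E(x)=x,\quad F(\theta)=\log\frac{\theta}{1-\theta},\\
f(\theta) &\propto (1-\theta)^{an}
  \exp\!\left(b\log\frac{\theta}{1-\theta}\right)
  = \theta^{b}(1-\theta)^{an-b},
\end{aligned}
```

which is the Beta(b + 1, an − b + 1) family, as expected.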


Rare events models

Consider models associated with rare events (Poisson process).

The likelihood function takes the form:

f(x|θ) = c θ^n e^(−xθ)

where n represents the number of events to have occurred in a time period of length x and c depends on the experimental design.

Therefore, a gamma distribution θ ∼ Gamma(a, b), that is

f(θ) = ( b^a / Γ(a) ) θ^(a−1) e^(−bθ)   for 0 < θ < ∞

is conjugate.

The posterior distribution is then θ|x ∼ Gamma(a + n, b + x).
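The conjugate update is a two-line function; a sketch with illustrative values (the Gamma(2, 1) prior and 5 events in 10 time units are made up, not from the slides):

```python
def gamma_posterior(a, b, n_events, time):
    """Conjugate update for a Poisson-process rate theta:
    Gamma(a, b) prior plus n events in a period of length `time`
    gives a Gamma(a + n, b + time) posterior."""
    return a + n_events, b + time

a_post, b_post = gamma_posterior(2.0, 1.0, n_events=5, time=10.0)
post_mean = a_post / b_post  # posterior mean of the rate
```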


The information in the prior is easily interpretable: a represents the prior equivalent of the number of rare events to occur in a time period of length b.

Letting a, b → 0 gives the natural default prior f(θ) ∝ 1/θ.

(This is the Jeffreys prior for exponential data but not for Poisson data).

In this case, given n observed events in time x, the posterior is θ|x ∼ Gamma(n, x), with mean n/x, which is equal to the MLE in experiments of this type.


Example: Software failure data

The CSIAC database provides data showing the times between 136 successive software failures. The diagram shows a histogram of the data and a classical, plug-in estimator (blue) of the predictive distribution of x as well as the Bayesian posterior predictive given a Jeffreys prior (red). The Bayesian and classical predictors are indistinguishable.

[Figure: histogram of times between failures with the classical and Bayesian predictive densities; x-axis x, y-axis f.]


Example: Inference for a queueing system

The M/M/1 queuing system assumes arrivals occur according to a Poisson process with rate λ.

There is a single server.

Service occurs on a first come first served basis.

Service times are exponential with mean service time 1/µ.

The system is stable if ρ = λ/µ < 1.

In this case, the equilibrium distribution of the number of people in the system, N, is geometric: N ∼ Geometric(1 − ρ).

Time spent in the system by an arriving customer, W ∼ Exponential(µ− λ).


Example

Hall (1991) provides collected inter-arrival and service time data for 98 users of an automatic teller machine in Berkeley, California. We shall assume that the interarrival times and service times both follow exponential distributions. The sufficient statistics were na = ns = 98, xa = 119.71 and xs = 81.35 minutes.

Given default priors for λ, µ, the posterior distributions are

λ|x ∼ Gamma(98, 119.71) µ|x ∼ Gamma(98, 81.35).


It is easy to calculate the posterior probability that the system is stable ...

... remembering that the ratio of two χ² distributions, each divided by its degrees of freedom, is F distributed.

P(ρ < 1|x) = P( (119.71/81.35) ρ < 119.71/81.35 | x )
           = P( F(196, 196) < 119.71/81.35 )
           = 0.9965.

Given this is so high, it makes sense to consider the equilibrium distributions.
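This probability can be checked with SciPy's F distribution (a sketch: with λ|x ∼ Gamma(98, 119.71) and µ|x ∼ Gamma(98, 81.35), the scaled ratio (119.71/81.35)·ρ follows an F(196, 196) law):

```python
from scipy.stats import f

# Posterior probability that the queue is stable, P(rho < 1 | x)
p_stable = f.cdf(119.71 / 81.35, dfn=196, dfd=196)
```

This reproduces the 0.9965 quoted on the slide.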


[Figures: estimated equilibrium distribution of the number in the system N (x-axis n, y-axis p) and c.d.f. of the time in the system W (x-axis w, y-axis F).]


Normal models

Consider a sample from a normal distribution X|µ, σ ∼ Normal(µ, σ²).

The likelihood function is

f(x|µ, σ) ∝ σ^(−n) exp( −(1/(2σ²)) [ (n − 1)s² + n(x̄ − µ)² ] )

Rewrite in terms of the precision, τ = 1/σ². Then

f(x|µ, τ) ∝ τ^(n/2) exp( −(τ/2) [ (n − 1)s² + n(x̄ − µ)² ] )

Define f(µ, τ) = f(τ) f(µ|τ) and assume τ ∼ Gamma(a/2, b/2) and µ|τ ∼ Normal(m, 1/(cτ)).

The marginal distribution of µ is a (scaled, shifted) Student’s t.


A posteriori, we have

µ|τ, x ∼ Normal( (cm + nx̄)/(c + n), 1/((c + n)τ) )

τ|x ∼ Gamma( (a + n)/2, ( b + (n − 1)s² + (cn/(c + n))(m − x̄)² )/2 )

The conditional posterior precision is the sum of the prior precision (cτ) and the precision of the MLE (nτ).

The posterior mean is a weighted average of the prior mean (m) and the MLE (x̄).

A default prior is obtained by letting a, b, c → 0, which implies f(µ, τ) ∝ 1/τ and

µ|τ, x ∼ Normal( x̄, 1/(nτ) )

τ|x ∼ Gamma( (n − 1)/2, (n − 1)s²/2 )

Then (µ − x̄)/(s/√n) | x ∼ Student's t(n − 1) (boring)


One sample example

The normal core body temperature of a healthy adult is supposed to be 98.6 degrees Fahrenheit or 37 degrees Celsius on average. A normal model for temperatures, say X|µ, τ ∼ Normal(µ, 1/τ), has been proposed.

Mackowiak et al (1992) measured the core body temperatures of 130 individuals.

The sample mean temperature is x̄ = 98.2492 Fahrenheit with standard deviation s = 0.7332.

Thus, a classical 95% confidence interval for µ is

98.2492 ± 1.96 × 0.7332/√130 = (98.1232, 98.3752)

and the hypothesis that the true mean is equal to 98.6 is rejected.


Consider a prior for µ centred on 98.6, for example µ|τ ∼ Normal(98.6, 1/τ) with f(τ) ∝ 1/τ. The posterior mean for µ is 98.2519 Fahrenheit and a 95% credible interval is (98.1251, 98.3787), so that there still appears to be evidence against the hypothesis.
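These numbers can be reproduced from the normal-gamma formulas above; a sketch (under this conjugate prior the marginal posterior for µ is a shifted, scaled Student's t with a + n degrees of freedom):

```python
import math
from scipy.stats import t

# Body-temperature data and the prior mu|tau ~ N(98.6, 1/tau), f(tau) ∝ 1/tau,
# i.e. m = 98.6, c = 1 and a, b -> 0 in the normal-gamma family.
n, xbar, s = 130, 98.2492, 0.7332
m, c, a, b = 98.6, 1.0, 0.0, 0.0

loc = (c * m + n * xbar) / (c + n)                       # posterior mean
q = b + (n - 1) * s**2 + c * n / (c + n) * (m - xbar)**2
df = a + n
scale = math.sqrt(q / (df * (c + n)))                    # t scale parameter

tcrit = t.ppf(0.975, df)
lo, hi = loc - tcrit * scale, loc + tcrit * scale        # 95% credible interval
```

This recovers the posterior mean 98.2519 and an interval matching (98.1251, 98.3787) to rounding.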

Also, the classical 'plug-in' density for X (blue) and the Bayesian posterior predictive density (red) are almost identical.

[Figure: classical plug-in and Bayesian posterior predictive densities for X; x-axis x, y-axis f.]


An odd feature of the conjugate prior

The prior precision of the distribution of µ is proportional to the model precision, τ.

This may be restrictive and unrealistic in practical applications.

A more natural prior for µ might be Normal(m, 1/c), independent of τ.

Then, the joint posterior distribution looks nasty.

f(µ, τ|x) ∝ τ^((a+n)/2 − 1) exp( −(τ/2) [ b + (n − 1)s² + n(x̄ − µ)² ] − (c/2)(µ − m)² )

What can we do?


In our problem, both conditional posterior distributions are available:

µ|τ, x ∼ Normal( (cm + nτx̄)/(c + nτ), 1/(c + nτ) )

τ|µ, x ∼ Gamma( (a + n)/2, ( b + (n − 1)s² + n(x̄ − µ)² )/2 )

Both these distributions are straightforward to sample from.

Can we use this to give a Monte Carlo sample from the posterior?


Introduction to Gibbs sampling

A Gibbs sampler is a technique for sampling a multivariate distribution when it is straightforward to sample from the conditionals.

Assume that we have a distribution f (θ) where θ = (θ1, ..., θk).

Let θ−i represent the remaining elements of θ when θi is removed.

Assume that we can sample from θi |θ−i .


The Gibbs sampler

The Gibbs sampler proceeds by starting from (arbitrary) initial values and successively sampling the conditional distributions.

1. Set initial values θ^(0) = (θ_1^(0), ..., θ_k^(0)). Set t = 0.
2. For i = 1, ..., k:
   2.1 Generate θ_i^(t+1) ∼ θ_i | θ_−i^(t).
   2.2 Set θ^(t) = θ_i^(t+1) ∪ θ_−i^(t).
3. Set t = t + 1.
4. Go to 2.

As t → ∞, the sampled values approach a simple Monte Carlo sample from f(θ).
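A minimal sketch of this sampler for the independent-prior normal model of the previous slide, using the body-temperature data summaries (the starting values, run length and burn-in below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data summaries; independent priors mu ~ N(98.6, 1) and f(tau) ∝ 1/tau,
# i.e. m = 98.6, c = 1 and a = b = 0 in the conditionals.
n, xbar, s = 130, 98.2492, 0.7332
m, c, a, b = 98.6, 1.0, 0.0, 0.0

T, burn = 11_000, 1_000
mu, tau = xbar, 1.0 / s**2          # arbitrary starting values
draws = np.empty(T)
for t in range(T):
    # mu | tau, x ~ Normal((c m + n tau xbar)/(c + n tau), 1/(c + n tau))
    prec = c + n * tau
    mu = rng.normal((c * m + n * tau * xbar) / prec, 1.0 / np.sqrt(prec))
    # tau | mu, x ~ Gamma((a + n)/2) with rate (b + (n-1)s^2 + n(xbar - mu)^2)/2
    rate = (b + (n - 1) * s**2 + n * (xbar - mu) ** 2) / 2.0
    tau = rng.gamma((a + n) / 2.0, 1.0 / rate)  # numpy takes scale = 1/rate
    draws[t] = mu

post = draws[burn:]
interval = np.percentile(post, [2.5, 97.5])
```

The estimated 95% interval agrees closely with the (98.1222, 98.3789) quoted on the next slide.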


The example revisited

Consider now that we use independent priors,

µ ∼ Normal(98.6, 1)   f(τ) ∝ 1/τ

Then an estimated 95% posterior interval for µ, based on a sample of size 10000, is (98.1222, 98.3789), very similar to the previous case. The diagram shows the estimated posterior density (green) and the posterior given the conjugate prior (red).

[Figure: posterior densities of µ under the independent and conjugate priors; x-axis mu, y-axis f.]

Both densities are very similar.


Two samples: the Behrens-Fisher problem

For most simple one and two sample problems, when the usual default prior for µ, τ is used, posterior means and intervals for µ coincide with their frequentist counterparts. An exception is the following two sample problem:

Consider the model

X|µ1, τ1 ∼ N(µ1, 1/τ1),   Y|µ2, τ2 ∼ N(µ2, 1/τ2)

with priors f(µi, τi) ∝ 1/τi and independent samples of size ni for i = 1, 2. Then,

(µ1 − x̄)/(s1/√n1) ∼ Student's t(n1 − 1)

and similarly for µ2.

Therefore, if δ = µ1 − µ2, we have

δ = x̄ − ȳ + (s1/√n1) T1 − (s2/√n2) T2


The distribution of δ is a scaled, shifted difference of two Student's t variables.

Quantiles, ... can be calculated to a given precision by e.g. Monte Carlo.

Writing δ′ = δ/√( s1²/n1 + s2²/n2 ) gives δ′ = sin(w) T1 + cos(w) T2, where w = tan⁻¹( (s1/√n1)/(s2/√n2) ), a Behrens-Fisher distribution.

This problem is difficult to solve classically.

Usually a t approximation to the sampling distribution of δ′ is used, but ...

the quality of the approximation depends on the true variance ratio.
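The Bayesian posterior of δ, by contrast, is trivial to simulate; a sketch (only the group means are quoted on these slides, so the standard deviations and sample sizes below are assumed values for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def delta_sample(xbar, ybar, s1, s2, n1, n2, size=10_000):
    """Monte Carlo draws of delta = mu1 - mu2 under the default priors:
    delta = xbar - ybar + (s1/sqrt(n1)) T1 - (s2/sqrt(n2)) T2,
    with T_i independent Student's t(n_i - 1) variables."""
    t1 = rng.standard_t(n1 - 1, size)
    t2 = rng.standard_t(n2 - 1, size)
    return xbar - ybar + s1 / np.sqrt(n1) * t1 - s2 / np.sqrt(n2) * t2

# The means 98.1046 and 98.3939 are from the example; the sds 0.7, 0.74
# and the 65/65 split are assumed values, not taken from the slides.
d = delta_sample(98.1046, 98.3939, 0.7, 0.74, 65, 65)
ci = np.percentile(d, [2.5, 97.5])
```

With these assumed inputs the simulated interval lands close to the (−0.5448, −0.0368) quoted in the example that follows.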


Example

Returning to the normal body temperature example, the histograms indicate there may be a difference between the sexes.

[Figures: histograms of temperatures for men (left) and women (right); x-axis x, y-axis f.]

The sample means are 98.1046 and 98.3939 respectively.


An approximate 95% confidence interval for the mean difference is (−0.5396, −0.03881), suggesting that the true mean for women is higher than that for men.

Using the Bayesian approach as earlier (based on 10000 simulated values), we have an estimate of the posterior density of δ.

[Figure: estimated posterior density of δ; x-axis delta, y-axis f.]

A Bayesian 95% credible interval is estimated as (−0.5448,−0.0368).


Multinomial models

The multinomial distribution is the extension of the binomial distribution to dice throwing problems.

Assume a die with k faces, where face i has probability θi, is thrown n times. Let X be a k × 1 vector such that Xi is the number of times face i occurs. Then

P(X = x|θ) = ( n! / (x1! ··· xk!) ) θ1^x1 ··· θk^xk,

where x = (x1, ..., xk), xi ∈ Z+, Σᵢ xi = n, and 0 ≤ θi ≤ 1 with Σᵢ θi = 1.

Consider a Dirichlet prior, θ ∼ Dirichlet(a), where a = (a1, ..., ak) and ai > 0:

f(θ) = ( Γ(a1 + ··· + ak) / ( Γ(a1) ··· Γ(ak) ) ) θ1^(a1−1) ··· θk^(ak−1).

Then θ|x ∼ Dirichlet (a + x).


Example

After the recent abdication of the King of Spain in favour of his son, 20minutos.es launched a survey asking whether this was the correct decision (X1 = 3698 votes), whether the King should have waited longer (X2 = 347), or whether he should have considered other options such as a referendum (X3 = 2446).¹

Let θ = (θ1, θ2, θ3) and assume a Dirichlet (1/2, 1/2, 1/2) prior. The posteriordistribution is Dirichlet(3698.5, 347.5, 2446.5).

Consider the difference θ1 − θ3, reflecting (?) the difference between Monarchists and Republicans. We have E[θ1 − θ3|x] = 0.193.
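Both the exact mean and a Monte Carlo credible interval are immediate from the Dirichlet posterior:

```python
import numpy as np

rng = np.random.default_rng(2)

# Posterior Dirichlet(3698.5, 347.5, 2446.5) for the survey example.
a_post = np.array([3698.5, 347.5, 2446.5])

# Exact posterior mean of theta1 - theta3: (a1 - a3) / sum(a)
mean_diff = (a_post[0] - a_post[2]) / a_post.sum()

# Monte Carlo 95% credible interval for theta1 - theta3
theta = rng.dirichlet(a_post, size=10_000)
ci = np.percentile(theta[:, 0] - theta[:, 2], [2.5, 97.5])
```

The interval sits entirely above zero, consistent with the density shown in the plot.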

[Figure: posterior density of θ1 − θ3; x-axis theta1−theta3, y-axis f.]

¹ Votes at 12:30 on 5th May 2014.


The Dirichlet process prior

Sometimes, we do not wish to assume a parametric model for the data generating distribution. How can we do this in a Bayesian context?

Assume X |F ∼ F and define a Dirichlet process prior for F .

If the support of X is C, then for any partition C = C1 ∪ C2 ∪ ... ∪ Ck and k ∈ N, we suppose that

(F(C1), F(C2), ..., F(Ck)) ∼ Dirichlet( aF0(C1), aF0(C2), ..., aF0(Ck) )

where a > 0 and F0 is a baseline, prior mean c.d.f.

We write F ∼ Dirichlet process(a,F0).

Given a sample x1, ..., xn, we have

F|x ∼ Dirichlet process( a + n, (aF0 + nF̂)/(a + n) )

where F̂ is the empirical c.d.f.

The posterior mean is a weighted average of the empirical c.d.f. and the prior mean.
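The posterior mean c.d.f. is a one-line weighted average; a sketch of the uniform-baseline setting used in the next example (the grid resolution and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def dp_posterior_mean_cdf(x, grid, a, F0):
    """Posterior mean c.d.f. under a Dirichlet process prior:
    (a*F0 + n*Fhat)/(a + n), where Fhat is the empirical c.d.f."""
    n = len(x)
    Fhat = np.array([(x <= t).mean() for t in grid])
    return (a * F0(grid) + n * Fhat) / (a + n)

# The setting of the following example: 20 draws from Beta(2, 1),
# uniform baseline F0 on (0, 1), concentration a = 5.
x = rng.beta(2.0, 1.0, size=20)
grid = np.linspace(0.0, 1.0, 101)
F_post = dp_posterior_mean_cdf(x, grid, a=5.0,
                               F0=lambda t: np.clip(t, 0.0, 1.0))
```

The result is a proper c.d.f.: non-decreasing, 0 at the left endpoint and 1 at the right.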


Example

The following plot shows the prior (green), posterior (red), empirical (blue) and true (black) c.d.f.s when 20 data were generated from a Beta(2, 1) distribution and a Dirichlet process prior with a = 5 and F0 a uniform distribution was used.

[Figure: prior, posterior, empirical and true c.d.f.s; x-axis x, y-axis F.]


Summary and next chapter

In this chapter we have illustrated the basic properties of conjugate models. When these exist, they allow for simple interpretation and straightforward inference.

Unfortunately, conjugate priors do not always exist, for example if data are t or F distributed.

Then we need numerical techniques like Gibbs sampling.

We study these in more detail in the next chapter.
