Bayesian Methods for Data Analysis

ENAR Annual Meeting
Tampa, Florida – March 26, 2006

Course contents

• Introduction of Bayesian concepts using single-parameter models.

• Multiple-parameter models and hierarchical models.

• Computation: approximations to the posterior, rejection and importance sampling, and MCMC.

• Model checking, diagnostics, model fit.

• Linear hierarchical models: random effects and mixed models.

• Generalized linear models: logistic, multinomial, Poisson regression.

• Hierarchical models for spatially correlated data.

Course emphasis

• Notes draw heavily on the book by Gelman et al., Bayesian Data Analysis, 2nd ed., and many of the figures are 'borrowed' directly from that book.

• We focus on the implementation of Bayesian methods and interpretation of results.

• Little theory, but some is needed to understand methods.

• Lots of examples. Some are not directly drawn from biological problems, but still serve to illustrate methodology.

• Biggest idea to get across: inference by simulation.

Page 4: Bayesian Methods for Data Analysis - ISU Public Homepage Server

• Software: R (or SPlus) early on, and WinBUGS for most of theexamples after we discuss computational methods.

• All of the programs used to construct examples can be downloadedfrom www.public.iastate.edu/ ∼ alicia.

• On my website, go to Teaching and then to ENAR Short Course.

ENAR - March 2006 4

Single parameter models

• There is no real advantage to being a Bayesian in these simple models. We discuss them to introduce important concepts:

  – Priors: non-informative, conjugate, other informative
  – Computations
  – Summary of posterior, presentation of results

• Binomial, Poisson, Normal models as examples

• Some "real" examples

Binomial model

• Of historical importance: Bayes derived his theorem and Laplace provided computations for the Binomial model.

• Before Bayes, the question was: given θ, what are the probabilities of the possible outcomes of y?

• Bayes asked: what is Pr(θ1 < θ < θ2 | y)?

• Using a uniform prior for θ, Bayes showed that

  Pr(θ1 < θ < θ2 | y) ∝ [∫_{θ1}^{θ2} θ^y (1 − θ)^{n−y} dθ] / p(y),

  with p(y) = 1/(n + 1) uniform a priori.

Page 7: Bayesian Methods for Data Analysis - ISU Public Homepage Server

• Laplace devised an approximation to the integral expression innumerator.

• First application by Laplace: estimate the probability that there weremore female births in Paris from 1745 to 1770.

• Consider n exchangeable Bernoulli trials y1, ..., yn.

• yi = 1 a “success”, yi = 0 a “failure”.

• Exchangeability:summary is # of successes in n trials y.

• For θ the probability of success, Y ∼ B(n, θ):

p(y|θ) =(

n

y

)θy(1− θ)n−y.

ENAR - March 2006 7

• The MLE is the sample proportion: θ̂ = y/n.

• Prior on θ: uniform on [0, 1] (for now),

  p(θ) ∝ 1.

• Posterior: p(θ|y) ∝ θ^y (1 − θ)^{n−y}.

• As a function of θ, the posterior is proportional to a Beta(y + 1, n − y + 1) density.

• We could also do the calculations to derive the posterior:

  p(θ|y) = (n choose y) θ^y (1 − θ)^{n−y} / ∫ (n choose y) θ^y (1 − θ)^{n−y} dθ

          = (n + 1) [n!/(y!(n − y)!)] θ^y (1 − θ)^{n−y}

          = [(n + 1)!/(y!(n − y)!)] θ^y (1 − θ)^{n−y}

          = [Γ(n + 2)/(Γ(y + 1)Γ(n − y + 1))] θ^{y+1−1} (1 − θ)^{n−y+1−1}

          = Beta(y + 1, n − y + 1)

• Point estimation

  – Posterior mean: E(θ|y) = (y + 1)/(n + 2)
  – Note the posterior mean is a compromise between the prior mean (1/2) and the sample proportion y/n.
  – Posterior mode: y/n
  – Posterior median: θ* such that Pr(θ ≤ θ*|y) = 0.5.
  – The best point estimator minimizes the expected loss (more later).

  – Posterior variance

    Var(θ|y) = (y + 1)(n − y + 1) / [(n + 2)^2 (n + 3)].

• Interval estimation

  – The 95% credible set (or central posterior interval) is (a, b) with

    ∫_0^a p(θ|y) dθ = 0.025 and ∫_0^b p(θ|y) dθ = 0.975

  – A 100(1 − α)% highest posterior density (HPD) credible set is the subset C of Θ such that

    C = {θ ∈ Θ : p(θ|y) ≥ k(α)},

    where k(α) is the largest constant such that Pr(C|y) ≥ 1 − α.

  – For symmetric unimodal posteriors, central credible sets and highest posterior density credible sets coincide.

  – In other cases, HPD sets have the smallest size.
  – Interpretation (for either): the probability that θ is in the set is equal to 1 − α. Which is what we want to say!

[Figure: credible set and HPD set]

• Inference by simulation:

  – Draw values from the posterior: easy for closed-form posteriors, can also do in other cases (later)
  – Monte Carlo estimates of point and interval estimates
  – Added MC error (due to "sampling")
  – Easy to get interval estimates and estimators for functions of parameters (see the sketch after this slide).

• Prediction:

  – Prior predictive distribution

    p(y) = ∫_0^1 (n choose y) θ^y (1 − θ)^{n−y} dθ = 1/(n + 1).

  – A priori, all values of y are equally likely.
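A minimal sketch of inference by simulation in this model, with illustrative values y = 7 successes in n = 20 trials (not data from the notes):

    # Monte Carlo inference for the binomial model with a uniform prior.
    y <- 7; n <- 20
    theta <- rbeta(5000, y + 1, n - y + 1)  # draws from the Beta(y+1, n-y+1) posterior

    mean(theta)                             # MC estimate of the posterior mean
    quantile(theta, c(0.025, 0.975))        # 95% central credible set
    quantile(theta / (1 - theta), 0.5)      # a function of theta (the odds) comes for free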

  – Posterior predictive distribution, to predict the outcome ỹ of a new trial given y successes in the previous n trials:

    Pr(ỹ = 1|y) = ∫_0^1 Pr(ỹ = 1|y, θ) p(θ|y) dθ

                = ∫_0^1 Pr(ỹ = 1|θ) p(θ|y) dθ

                = ∫_0^1 θ p(θ|y) dθ = E(θ|y) = (y + 1)/(n + 2)

Binomial model: different priors

• How do we choose priors?

  – In a purely subjective manner (orthodox)
  – Using actual information (e.g., from literature or scientific knowledge)
  – Eliciting from experts
  – For mathematical convenience
  – To express ignorance

• Unless we have reliable prior information about θ, we prefer to let the data 'speak' for themselves.

• Asymptotic argument: as the sample size increases, the likelihood should dominate the posterior.

Conjugate prior

• Suppose that we choose a Beta prior for θ:

  p(θ|α, β) ∝ θ^{α−1} (1 − θ)^{β−1}

• The posterior is now

  p(θ|y) ∝ θ^{y+α−1} (1 − θ)^{n−y+β−1}

• The posterior is again proportional to a Beta:

  p(θ|y) = Beta(y + α, n − y + β).

• For now, α, β are considered fixed and known, but they can also get their own prior distribution (hierarchical model).

• The Beta is the conjugate prior for the binomial model: the posterior is in the same form as the prior.

• To choose prior parameters, think as follows: observe α successes in α + β prior "trials". The prior "guess" for θ is α/(α + β).

Conjugate priors

• Formal definition: let F be a class of sampling distributions and P a class of prior distributions. Then P is conjugate for F if p(θ) ∈ P and p(y|θ) ∈ F implies p(θ|y) ∈ P.

• If F is an exponential family, then distributions in F have natural conjugate priors.

• A distribution in F has the form

  p(yi|θ) = f(yi) g(θ) exp(φ(θ)ᵀ u(yi)).

• For an iid sequence, the likelihood function is

  p(y|θ) ∝ g(θ)ⁿ exp(φ(θ)ᵀ t(y)),

  for φ(θ) the natural parameter and t(y) = Σ_i u(yi) a sufficient statistic.

• Consider the prior density

  p(θ) ∝ g(θ)^η exp(φ(θ)ᵀ ν)

• The posterior is also in exponential form:

  p(θ|y) ∝ g(θ)^{n+η} exp(φ(θ)ᵀ (t(y) + ν))

Conjugate prior for binomial proportion

• y is the # of successes in n exchangeable trials, so the sampling distribution is binomial with probability of success θ:

  p(y|θ) ∝ θ^y (1 − θ)^{n−y}

• Note that

  p(y|θ) ∝ θ^y (1 − θ)^n (1 − θ)^{−y}

         ∝ (1 − θ)^n exp{y [log θ − log(1 − θ)]}

• Written in exponential family form:

  p(y|θ) ∝ (1 − θ)^n exp{y log[θ/(1 − θ)]},

  where g(θ)ⁿ = (1 − θ)ⁿ, y is the sufficient statistic, and the logit, log[θ/(1 − θ)], is the natural parameter.

• Consider the Beta(α, β) prior for θ:

  p(θ) ∝ θ^{α−1} (1 − θ)^{β−1}

• If we let ν = α − 1 and η = β + α − 2, we can write

  p(θ) ∝ θ^ν (1 − θ)^{η−ν}

  or

  p(θ) ∝ (1 − θ)^η exp{ν log[θ/(1 − θ)]}

• Then the posterior is in the same form as the prior:

  p(θ|y) ∝ (1 − θ)^{n+η} exp{(y + ν) log[θ/(1 − θ)]}

• Since p(y|θ) ∝ θ^y (1 − θ)^{n−y}, the prior Beta(α, β) suggests that a priori we believe in approximately α successes in α + β trials.

• Our prior guess for the probability of success is α/(α + β).

• By varying (α + β) (with α/(α + β) fixed), we can incorporate more or less information into the prior: the "prior sample size".

• Also note:

  E(θ|y) = (α + y)/(α + β + n)

  is always between the prior mean α/(α + β) and the MLE y/n.

• Posterior variance:

  Var(θ|y) = (α + y)(β + n − y) / [(α + β + n)^2 (α + β + n + 1)]

           = E(θ|y)[1 − E(θ|y)] / (α + β + n + 1)

• As y and n − y get large:

  – E(θ|y) → y/n
  – Var(θ|y) → (y/n)[1 − (y/n)]/n, which approaches zero at rate 1/n.
  – Prior parameters have diminishing influence on the posterior as n → ∞.

Placenta previa example

• Condition in which the placenta is implanted low in the uterus, preventing normal delivery.

• In a study in Germany, it was found that 437 out of 980 births with placenta previa were female, so the observed ratio is 0.446.

• Suppose that in the general population the proportion of female births is 0.485.

• Can we say that the proportion of female babies among placenta previa births is lower than in the general population?

• If y = number of female births, θ is the probability of a female birth, and p(θ) is uniform, then

  p(θ|y) ∝ θ^y (1 − θ)^{n−y}

  and therefore

  θ|y ∼ Beta(y + 1, n − y + 1).

• Thus a posteriori, θ is Beta(438, 544) and

  E(θ|y) = 438/(438 + 544)

  Var(θ|y) = (438 × 544) / [(438 + 544)^2 (438 + 544 + 1)]

• Check sensitivity to the choice of prior: try several Beta priors with increasingly more "information" about θ.

• Fix the prior mean at 0.485 and increase the "prior sample size" α + β.

  Prior mean   α + β   Post. median   Post. 2.5th pctile   Post. 97.5th pctile
  0.500            2          0.446                0.415                 0.477
  0.485            2          0.446                0.415                 0.477
  0.485            5          0.446                0.415                 0.477
  0.485           10          0.446                0.415                 0.477
  0.485           20          0.447                0.416                 0.478
  0.485          100          0.450                0.420                 0.479
  0.485          200          0.453                0.424                 0.481

Results are robust to the choice of prior, even a very informative prior. The prior mean is not in the posterior 95% credible set.
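A sketch of how this sensitivity table can be reproduced by simulation (exact Beta quantiles via qbeta would work equally well):

    # Placenta previa: posterior median and 95% interval under Beta priors
    # with mean 0.485 and increasing "prior sample size" alpha + beta.
    y <- 437; n <- 980
    for (ab in c(2, 5, 10, 20, 100, 200)) {
      alpha <- 0.485 * ab; beta <- ab - alpha
      draws <- rbeta(10000, alpha + y, beta + n - y)
      print(round(quantile(draws, c(0.5, 0.025, 0.975)), 3))
    }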

Conjugate prior for Poisson rate

• y ∈ {0, 1, ...} are counts, with rate θ > 0. The sampling distribution is Poisson:

  p(yi|θ) = θ^{yi} exp(−θ) / yi!

• For exchangeable (y1, ..., yn):

  p(y|θ) = ∏_{i=1}^n θ^{yi} exp(−θ)/yi! = θ^{nȳ} exp(−nθ) / ∏_{i=1}^n yi!.

• Consider a Gamma(α, β) prior for θ: p(θ) ∝ θ^{α−1} exp(−βθ).

• Then the posterior is also Gamma:

  p(θ|y) ∝ θ^{nȳ} exp(−nθ) θ^{α−1} exp(−βθ) ∝ θ^{nȳ+α−1} exp(−(n + β)θ)

• For η = nȳ + α and ν = n + β, the posterior is Gamma(η, ν).

• Note:

  – The prior mean of θ is α/β.
  – The posterior mean of θ is

    E(θ|y) = (nȳ + α)/(n + β)

• If the sample size n → ∞, then E(θ|y) approaches the MLE of θ.

• If the sample size goes to zero, then E(θ|y) approaches the prior mean. A small numerical sketch follows.
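A minimal sketch of the Gamma-Poisson update, with illustrative counts and prior values (not from the notes):

    # Posterior is Gamma(alpha + sum(y), beta + n).
    y <- c(2, 5, 3, 4, 1)
    alpha <- 2; beta <- 1
    theta <- rgamma(5000, shape = alpha + sum(y), rate = beta + length(y))
    c(mean(theta), quantile(theta, c(0.025, 0.975)))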

Mean of a normal distribution

• y ∼ N(θ, σ²) with σ² known.

• For {y1, ..., yn} an iid sample, the likelihood is:

  p(y|θ) = ∏_i (2πσ²)^{−1/2} exp(−(yi − θ)²/(2σ²))

• Viewed as a function of θ, the likelihood is the exponential of a quadratic in θ:

  p(y|θ) ∝ exp(−(1/(2σ²)) Σ_i (θ² − 2yiθ + yi²))

• The conjugate prior for θ must belong to the family of the form

  p(θ) = exp(Aθ² + Bθ + C),

  which can be parameterized as

  p(θ) ∝ exp(−(θ − µ0)²/(2τ0²)),

  and then p(θ) is N(µ0, τ0²).

• (µ0, τ0²) are hyperparameters. For now, consider them known.

• If p(θ) is conjugate, then the posterior p(θ|y) must also be normal, with parameters (µn, τn²).

• Recall that p(θ|y) ∝ p(θ) p(y|θ). Then

  p(θ|y) ∝ exp(−(1/2) [Σ_i (yi − θ)²/σ² + (θ − µ0)²/τ0²])

• Expand the squares and collect terms in θ² and in θ:

  p(θ|y) ∝ exp(−(1/2) [(Σ_i yi² − 2Σ_i yiθ + nθ²)/σ² + (θ² − 2µ0θ + µ0²)/τ0²])

         ∝ exp(−(1/2) [(nτ0² + σ²)θ² − 2(nȳτ0² + µ0σ²)θ] / (σ²τ0²))

         ∝ exp(−(1/2) [((σ²/n) + τ0²)/((σ²/n)τ0²)] [θ² − 2 ((ȳτ0² + µ0(σ²/n))/((σ²/n) + τ0²)) θ])

• Then p(θ|y) is normal with

  – Mean: µn = (ȳτ0² + µ0(σ²/n)) / ((σ²/n) + τ0²)
  – Variance: τn² = ((σ²/n)τ0²) / ((σ²/n) + τ0²)

Posterior mean

• Note that

  µn = (ȳτ0² + µ0(σ²/n)) / ((σ²/n) + τ0²) = [(n/σ²)ȳ + (1/τ0²)µ0] / [(n/σ²) + (1/τ0²)]

  (by dividing the numerator and denominator by (σ²/n)τ0²).

• The posterior mean is a weighted average of the prior mean and the sample mean.

• Weights are given by the precisions n/σ² and 1/τ0².

• As the data precision increases ((σ²/n) → 0), because σ² → 0 or because n → ∞, µn → ȳ.

• Also, note that

  µn = (ȳτ0² + µ0(σ²/n)) / ((σ²/n) + τ0²)

     = µ0 [(σ²/n)/((σ²/n) + τ0²)] + ȳ [τ0²/((σ²/n) + τ0²)]

  Add and subtract µ0 τ0²/((σ²/n) + τ0²) to see that

  µn = µ0 + (ȳ − µ0) [τ0²/((σ²/n) + τ0²)]

• The posterior mean is the prior mean shrunken towards the observed value. The amount of shrinkage depends on the relative size of the precisions.

Posterior variance:

• Recall that

  p(θ|y) ∝ exp(−(1/2) [((σ²/n) + τ0²)/((σ²/n)τ0²)] [θ² − 2 ((ȳτ0² + µ0(σ²/n))/((σ²/n) + τ0²)) θ])

• Then

  1/τn² = ((σ²/n) + τ0²) / ((σ²/n)τ0²) = n/σ² + 1/τ0²

• Posterior precision = sum of the prior and data precisions.

Posterior predictive distribution

• We want to predict the next observation ỹ.

• Recall p(ỹ|y) = ∫ p(ỹ|θ) p(θ|y) dθ.

• We know that

  – Given θ, ỹ ∼ N(θ, σ²)
  – θ|y ∼ N(µn, τn²)

• Then

  p(ỹ|y) ∝ ∫ exp(−(ỹ − θ)²/(2σ²)) exp(−(θ − µn)²/(2τn²)) dθ

• The integrand is the kernel of a bivariate normal, so (ỹ, θ) have a bivariate normal joint posterior.

• The marginal p(ỹ|y) must be normal.

• We need E(ỹ|y) and var(ỹ|y):

  E(ỹ|y) = E[E(ỹ|y, θ)|y] = E(θ|y) = µn,

  because E(ỹ|y, θ) = E(ỹ|θ) = θ;

  var(ỹ|y) = E[var(ỹ|y, θ)|y] + var[E(ỹ|y, θ)|y]

           = E(σ²|y) + var(θ|y)

           = σ² + τn²,

  because var(ỹ|y, θ) = var(ỹ|θ) = σ².

• The variance includes the additional term τn² as a penalty for not knowing the true value of θ.

• Recall µn = [(1/τ0²)µ0 + (n/σ²)ȳ] / [(1/τ0²) + (n/σ²)]

• In µn, the prior precision 1/τ0² and the data precision n/σ² are "equivalent". Then:

  – For n large, (ȳ, σ²) determine the posterior.
  – For τ0² = σ²/n, the prior has the same weight as adding one more observation with value µ0.
  – When τ0² → ∞ with n fixed, or when n → ∞ with τ0² fixed:

    p(θ|y) → N(ȳ, σ²/n)

  – A good approximation in practice when prior beliefs about θ are vague or when the sample size is large. A numerical sketch is given below.
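A minimal sketch of the conjugate update for a normal mean with known variance, using illustrative values (not from the notes):

    set.seed(1)
    y <- rnorm(25, mean = 10, sd = 2)          # pretend data, sigma = 2 known
    sigma2 <- 4; n <- length(y)
    mu0 <- 0; tau02 <- 100                     # vague prior

    tau2n <- 1 / (1 / tau02 + n / sigma2)      # posterior variance
    mun <- tau2n * (mu0 / tau02 + n * mean(y) / sigma2)  # posterior mean

    theta <- rnorm(5000, mun, sqrt(tau2n))     # posterior draws
    ytilde <- rnorm(5000, theta, sqrt(sigma2)) # predictive draws: var ~ sigma2 + tau2n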

Normal variance

• Example of a scale model.

• Assume that y ∼ N(θ, σ²), with θ known.

• For iid y1, ..., yn:

  p(y|σ²) ∝ (σ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^n (yi − θ)²)

  p(y|σ²) ∝ (σ²)^{−n/2} exp(−nv/(2σ²))

  for the sufficient statistic

  v = (1/n) Σ_{i=1}^n (yi − θ)²

• The likelihood is in exponential family form with natural parameter φ(σ²) = σ^{−2}. Then the natural conjugate prior must be of the form

  p(σ²) ∝ (σ²)^{−η} exp(−β φ(σ²))

• Consider an inverse Gamma prior:

  p(σ²) ∝ (σ²)^{−(α+1)} exp(−β/σ²)

• For ease of interpretation, reparameterize as a scaled-inverse χ² distribution, with ν0 d.f. and scale σ0²:

  – The scale σ0² corresponds to the prior "guess" for σ².
  – Large ν0: lots of confidence in σ0² as a good value for σ².

• If σ² ∼ Inv-χ²(ν0, σ0²), then

  p(σ²) ∝ (σ²/σ0²)^{−(ν0/2+1)} exp(−ν0σ0²/(2σ²))

• This corresponds to Inv-Gamma(ν0/2, ν0σ0²/2).

• The prior mean is σ0²ν0/(ν0 − 2).

• The prior variance behaves like σ0⁴/ν0: large ν0, large prior precision.

• Posterior:

  p(σ²|y) ∝ p(σ²) p(y|σ²)

          ∝ (σ²)^{−n/2} exp(−nv/(2σ²)) (σ²/σ0²)^{−(ν0/2+1)} exp(−ν0σ0²/(2σ²))

          ∝ (σ²)^{−(ν1/2+1)} exp(−ν1σ1²/(2σ²)),

  with ν1 = ν0 + n and

  σ1² = (nv + ν0σ0²)/(n + ν0)

• p(σ²|y) is also a scaled inverted χ².

• The posterior scale is a weighted average of the prior "guess" and the data estimate.

• Weights are given by the prior and sample degrees of freedom.

• The prior provides information equivalent to ν0 observations with average squared deviation equal to σ0².

• As n → ∞, σ1² → v. A simulation sketch follows.
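A minimal sketch of sampling from this scaled-inverse χ² posterior, using the fact that ν1σ1²/σ² ∼ χ²_{ν1} (illustrative values, not from the notes):

    theta <- 0; nu0 <- 5; sigma02 <- 1
    y <- rnorm(30, theta, 1.5)
    n <- length(y); v <- mean((y - theta)^2)

    nu1 <- nu0 + n
    sigma12 <- (n * v + nu0 * sigma02) / nu1

    sigma2 <- nu1 * sigma12 / rchisq(5000, df = nu1)  # Inv-chi^2(nu1, sigma1^2) draws
    quantile(sigma2, c(0.025, 0.5, 0.975))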

Poisson model

• Appropriate for count data such as number of cases, number of accidents, etc. For a vector of n iid observations:

  p(y|θ) = ∏_i θ^{yi} e^{−θ}/yi! = θ^{t(y)} e^{−nθ} / ∏_{i=1}^n yi!,

  where θ is the rate, yi = 0, 1, ..., and t(y) = Σ_{i=1}^n yi is the sufficient statistic for θ.

• We can write the model in the exponential family form:

  p(y|θ) ∝ e^{−nθ} e^{t(y) log θ},

  where φ(θ) = log θ is the natural parameter.

• The natural conjugate prior must have the form

  p(θ) ∝ e^{−ηθ} e^{ν log θ},

  or p(θ) ∝ e^{−βθ} θ^{α−1}, which looks like a Gamma(α, β).

• The posterior is Gamma(α + nȳ, β + n).

Poisson model - Digression

• In the conjugate case, we can often derive p(y) using

  p(y) = p(θ) p(y|θ) / p(θ|y)

• In the case of the Gamma-Poisson model for a single observation:

  p(y) = Poisson(θ) Gamma(α, β) / Gamma(α + y, β + 1)

       = Γ(α + y) β^α / [Γ(α) y! (1 + β)^{α+y}]

       = (α + y − 1 choose y) (β/(β + 1))^α (1/(β + 1))^y,

  the density of a negative binomial with parameters (α, β).

• Thus we can interpret negative binomial distributions as arising from a mixture of Poissons with rate θ, where θ follows a Gamma distribution:

  p(y) = ∫ Poisson(y|θ) Gamma(θ|α, β) dθ

• The negative binomial is a robust alternative to the Poisson. The mixture representation is easy to check by simulation, as sketched below.
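A quick check of the mixture representation (this matches R's rnbinom with size = α and prob = β/(β + 1); the values are illustrative):

    alpha <- 3; beta <- 2
    theta <- rgamma(1e5, shape = alpha, rate = beta)
    y_mix <- rpois(1e5, theta)                                   # Gamma mixture of Poissons
    y_nb <- rnbinom(1e5, size = alpha, prob = beta / (beta + 1)) # negative binomial
    rbind(table(y_mix)[1:6] / 1e5, table(y_nb)[1:6] / 1e5)       # frequencies should agree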

Poisson model (cont'd)

• Given the rate, observations are assumed to be exchangeable.

• Add an exposure to the model: observations are exchangeable within small exposure intervals.

• Examples:

  – In a city of size 500,000, define the rate of death from cancer per million people; then the exposure is 0.5.
  – For an intersection in Ames with traffic of one million vehicles per year, consider the number of traffic accidents per million vehicles per year. The exposure for the intersection is 1.
  – Exposure is typically known and reflects the fact that in each unit, the number of persons or cars or plants or animals that are 'at risk' is different.

• Now

  p(y|θ) ∝ θ^{Σ yi} exp(−θ Σ xi),

  for xi the exposure of unit i.

• With prior p(θ) = Gamma(α, β), the posterior is

  p(θ|y) = Gamma(α + Σ_i yi, β + Σ_i xi)

Non-informative prior distributions

• A non-informative prior distribution

  – Has minimal impact on the posterior
  – Lets the data "speak for themselves"
  – Is also called vague, flat, or diffuse

• So far, we have concentrated on the conjugate family.

• Conjugate priors can also be almost non-informative.

• If y ∼ N(θ, σ²) with σ² known, the natural conjugate for θ is N(µ0, τ0²), and the posterior is N(µ1, τ1²), where

  µ1 = (µ0/τ0² + nȳ/σ²) / (1/τ0² + n/σ²),  τ1² = 1 / (1/τ0² + n/σ²)

• For τ0 → ∞:

  – µ1 → ȳ
  – τ1² → σ²/n

• The same result could have been obtained using p(θ) ∝ 1.

• The uniform prior is a natural non-informative prior for location parameters (see later).

• It is improper:

  ∫ p(θ) dθ = ∫ dθ = ∞,

  yet it leads to a proper posterior for θ. This is not always the case.

• Poisson example: y1, ..., yn iid Poisson(θ); consider p(θ) ∝ θ^{−1/2}.

  – ∫_0^∞ θ^{−1/2} dθ = ∞: improper.

  – Yet

    p(θ|y) ∝ θ^{Σ_i yi} e^{−nθ} θ^{−1/2}

           = θ^{Σ_i yi + 1/2 − 1} e^{−nθ},

    proportional to a Gamma(Σ_i yi + 1/2, n): proper.

• Uniform priors arise when we assign equal density to each θ ∈ Θ: no preference for one value over another.

• But they are not invariant under one-to-one transformations.

• Example:

  – η = exp{θ}, so that θ = log{η} is the inverse transformation.
  – dθ/dη = 1/η is the Jacobian.
  – Then, if p(θ) is the prior for θ, p*(η) = η^{−1} p(log η) is the corresponding prior for the transformation.
  – For p(θ) ∝ c, p*(η) ∝ η^{−1}: informative.
  – An informative prior is needed to arrive at the same answer in both parameterizations.

Jeffreys prior

• In 1961, Jeffreys proposed a method for finding non-informative priors that are invariant to one-to-one transformations.

• Jeffreys proposed

  p(θ) ∝ [I(θ)]^{1/2},

  where I(θ) is the expected Fisher information:

  I(θ) = −E_θ [d²/dθ² log p(y|θ)]

• If θ is a vector, then I(θ) is a matrix with (i, j) element equal to

  −E_θ [d²/(dθi dθj) log p(y|θ)]

  and

  p(θ) ∝ |I(θ)|^{1/2}

• Theorem: the Jeffreys prior is locally uniform and therefore non-informative.

Jeffreys prior for binomial proportion

• If y ∼ B(n, θ), then log p(y|θ) ∝ y log θ + (n − y) log(1 − θ) and

  d²/dθ² log p(y|θ) = −y θ^{−2} − (n − y)(1 − θ)^{−2}

• Taking expectations:

  I(θ) = −E_θ [d²/dθ² log p(y|θ)] = E(y) θ^{−2} + (n − E(y))(1 − θ)^{−2}

       = nθ θ^{−2} + (n − nθ)(1 − θ)^{−2} = n / [θ(1 − θ)].

• Then

  p(θ) ∝ [I(θ)]^{1/2} ∝ θ^{−1/2} (1 − θ)^{−1/2} ∝ Beta(1/2, 1/2).

Jeffreys prior for normal mean

• y1, ..., yn iid N(θ, σ²), σ² known.

  p(y|θ) ∝ exp(−(1/(2σ²)) Σ (yi − θ)²)

  log p(y|θ) ∝ −(1/(2σ²)) Σ (yi − θ)²

  d²/dθ² log p(y|θ) = −n/σ²,

  constant with respect to θ.

• Then I(θ) is constant and p(θ) ∝ constant.

Jeffreys prior for normal variance

• y1, ..., yn iid N(θ, σ²), θ known.

  p(y|σ) ∝ σ^{−n} exp(−(1/(2σ²)) Σ (yi − θ)²)

  log p(y|σ) ∝ −n log σ − (1/(2σ²)) Σ (yi − θ)²

  d²/dσ² log p(y|σ) = n/σ² − (3/σ⁴) Σ (yi − θ)²

• Take the negative of the expectation (using E[Σ (yi − θ)²] = nσ²), so that

  I(σ) = −n/σ² + (3/σ⁴) nσ² = 2n/σ²,

  and therefore the Jeffreys prior is p(σ) ∝ σ^{−1}.

Invariance property of Jeffreys' prior

• The Jeffreys' prior is invariant to one-to-one transformations φ = h(θ).

• Then:

  p(φ) = p(θ) |dθ/dφ| = p(θ) |dh(θ)/dθ|^{−1}

• Under invariance, choose p(θ) such that p(φ) constructed as above matches what would be obtained directly.

• Consider p(θ) = [I(θ)]^{1/2} and evaluate I(φ) at θ = h^{−1}(φ):

  I(φ) = −E [d²/dφ² log p(y|φ)]

       = −E [d²/dθ² log p(y|θ = h^{−1}(φ))] [dθ/dφ]²

       = I(θ) [dθ/dφ]²

• Then [I(φ)]^{1/2} = [I(θ)]^{1/2} |dθ/dφ|, as required.

Multiparameter models - Intro

• Most realistic problems require models with more than one parameter.

• Typically, we are interested in one or a few of those parameters.

• Classical approach for estimation in multiparameter models:

  1. Maximize a joint likelihood: can get nasty when there are many parameters
  2. Proceed in steps

• Bayesian approach: base inference on the marginal posterior distributions of the parameters of interest.

• Parameters that are not of interest are called nuisance parameters.

Nuisance parameters

• Consider a model with two parameters (θ1, θ2) (e.g., a normal distribution with unknown mean and variance).

• We are interested in θ1, so θ2 is a nuisance parameter.

• The marginal posterior distribution of interest is p(θ1|y).

• It can be obtained directly from the joint posterior density

  p(θ1, θ2|y) ∝ p(θ1, θ2) p(y|θ1, θ2)

  by integrating with respect to θ2:

  p(θ1|y) = ∫ p(θ1, θ2|y) dθ2

• Note too that

  p(θ1|y) = ∫ p(θ1|θ2, y) p(θ2|y) dθ2

• The marginal of θ1 is a mixture of the conditionals on θ2, or a weighted average of the conditional evaluated at different values of θ2. Weights are given by the marginal p(θ2|y).

Nuisance parameters (cont'd)

• Important difference with frequentists!

• By averaging the conditional p(θ1|θ2, y) over possible values of θ2, we explicitly recognize our uncertainty about θ2.

• Two extreme cases:

  1. Almost certainty about the value of θ2: if the prior and the sample are very informative about θ2, the marginal p(θ2|y) will be concentrated around some value θ̂2. In that case,

     p(θ1|y) ≈ p(θ1|θ̂2, y)

  2. Lots of uncertainty about θ2: the marginal p(θ2|y) will assign relatively high probability to a wide range of values of θ2. A point estimate θ̂2 is no longer "reliable"; it is important to average over the range of values of θ2.

• In most cases, the integral is not computed explicitly.

• Instead, use a two-step simulation approach:

  1. Marginal simulation step: draw a value θ2^{(k)} of θ2 from p(θ2|y), for k = 1, 2, ...
  2. Conditional simulation step: for each θ2^{(k)}, draw a value of θ1 from the conditional density p(θ1|θ2^{(k)}, y)

• This is an effective approach when the marginal and conditional are of standard form.

• More sophisticated simulation approaches later.

Example: Normal model

• yi iid from N(µ, σ²), both unknown.

• Non-informative prior for (µ, σ²), assuming prior independence:

  p(µ, σ²) ∝ 1 × σ^{−2}

• Joint posterior:

  p(µ, σ²|y) ∝ p(µ, σ²) p(y|µ, σ²)

             ∝ σ^{−n−2} exp(−(1/(2σ²)) Σ_{i=1}^n (yi − µ)²)

• Note that

  Σ_{i=1}^n (yi − µ)² = Σ_i (yi² − 2µyi + µ²)

                      = Σ_i yi² − 2µnȳ + nµ²

                      = Σ_i (yi − ȳ)² + n(ȳ − µ)²,

  by adding and subtracting nȳ².

Example: Normal model (cont'd)

• Let

  s² = (1/(n − 1)) Σ_i (yi − ȳ)²

• Then we can write the posterior for (µ, σ²) as

  p(µ, σ²|y) ∝ σ^{−n−2} exp(−(1/(2σ²)) [(n − 1)s² + n(ȳ − µ)²])

• The sufficient statistics are (ȳ, s²).

Example: Conditional posterior p(µ|σ², y)

• Conditional on σ²:

  p(µ|σ², y) = N(ȳ, σ²/n)

• We know this from the earlier chapter (posterior of a normal mean when the variance is known).

• We can also see this by noting that, viewed as a function of µ only:

  p(µ|σ², y) ∝ exp(−(n/(2σ²)) (ȳ − µ)²),

  which we recognize as the kernel of a N(ȳ, σ²/n).

Example: Marginal posterior p(σ²|y)

• To get p(σ²|y), we need to integrate p(µ, σ²|y) over µ:

  p(σ²|y) ∝ ∫ σ^{−n−2} exp(−(1/(2σ²)) [(n − 1)s² + n(ȳ − µ)²]) dµ

          ∝ σ^{−n−2} exp(−(n − 1)s²/(2σ²)) ∫ exp(−(n/(2σ²)) (ȳ − µ)²) dµ

          ∝ σ^{−n−2} exp(−(n − 1)s²/(2σ²)) √(2πσ²/n)

• Then

  p(σ²|y) ∝ (σ²)^{−(n+1)/2} exp(−(n − 1)s²/(2σ²)),

  which is proportional to a scaled-inverse χ² distribution with (n − 1) degrees of freedom and scale s².

• Recall the classical result: conditional on σ², the distribution of the scaled sufficient statistic (n − 1)s²/σ² is χ²_{n−1}.

Normal model: analytical derivation

• For the normal model, we can derive the marginal p(µ|y) analytically:

  p(µ|y) = ∫ p(µ, σ²|y) dσ²

         ∝ ∫ (σ²)^{−(n/2+1)} exp(−(1/(2σ²)) [(n − 1)s² + n(ȳ − µ)²]) dσ²

• Use the transformation

  z = A/(2σ²),

  where A = (n − 1)s² + n(ȳ − µ)². Then

  dσ²/dz = −A/(2z²)

  and

  p(µ|y) ∝ ∫_0^∞ (z/A)^{n/2+1} (A/z²) exp(−z) dz

         ∝ A^{−n/2} ∫ z^{n/2−1} exp(−z) dz

• The integrand is an unnormalized Gamma(n/2, 1) density, so the integral is constant w.r.t. µ.

• Recall that A = (n − 1)s² + n(ȳ − µ)². Then

  p(µ|y) ∝ A^{−n/2}

         ∝ [(n − 1)s² + n(ȳ − µ)²]^{−n/2}

         ∝ [1 + n(µ − ȳ)²/((n − 1)s²)]^{−n/2},

  the kernel of a t-distribution with n − 1 degrees of freedom, centered at ȳ and with scale parameter s²/n.

• For the non-informative prior p(µ, σ²) ∝ σ^{−2}, the posterior distribution of µ is a non-standard t. Then

  (µ − ȳ)/(s/√n) | y ∼ t_{n−1},

  the standard t distribution.

Normal model: analytical derivation (cont'd)

• We saw that

  (µ − ȳ)/(s/√n) | y ∼ t_{n−1}

• Notice the similarity to the classical result: for iid normal observations from N(µ, σ²), given (µ, σ²), the pivotal quantity

  (ȳ − µ)/(s/√n) | µ, σ² ∼ t_{n−1}

• A pivot is a non-trivial function of the data and the parameter(s) θ whose distribution, given θ, is independent of θ. The property is deduced from the sampling distribution, as above.

• Baby example of the pivot property: y ∼ N(θ, 1). The pivot is x = y − θ. Given θ, x ∼ N(0, 1), independent of θ.

Posterior predictive for a future observation

• The posterior predictive distribution for a future observation ỹ is a mixture:

  p(ỹ|y) = ∫∫ p(ỹ|y, σ², µ) p(µ, σ²|y) dµ dσ²

• The first factor in the integrand is just the normal model, and it does not depend on y at all.

• To simulate ỹ from the posterior predictive distribution, do the following (see the sketch after this list):

  1. Draw σ² from Inv-χ²(n − 1, s²)
  2. Draw µ from N(ȳ, σ²/n)
  3. Draw ỹ from N(µ, σ²)
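A minimal sketch of these three steps, with illustrative data (not from the notes):

    y <- c(9.8, 10.4, 10.1, 9.6, 10.7, 10.0)
    n <- length(y); ybar <- mean(y); s2 <- var(y)

    sigma2 <- (n - 1) * s2 / rchisq(5000, df = n - 1)  # Inv-chi^2(n-1, s^2)
    mu <- rnorm(5000, ybar, sqrt(sigma2 / n))          # N(ybar, sigma^2/n)
    ytilde <- rnorm(5000, mu, sqrt(sigma2))            # future observation
    quantile(ytilde, c(0.025, 0.975))                  # predictive interval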

• We can also derive the posterior predictive distribution in analytic form. Note that

  p(ỹ|σ², y) = ∫ p(ỹ|σ², µ) p(µ|σ², y) dµ

             ∝ ∫ exp(−(1/(2σ²))(ỹ − µ)²) exp(−(n/(2σ²))(µ − ȳ)²) dµ

• After some algebra:

  p(ỹ|σ², y) = N(ȳ, (1 + 1/n)σ²)

• Using the same approach as in deriving the posterior distribution of µ, we find that

  p(ỹ|y) = t_{n−1}(ȳ, (1 + 1/n)^{1/2} s)

Normal data and conjugate prior

• Recall that using a non-informative prior, we found that

  p(µ|σ², y) = N(ȳ, σ²/n)

  p(σ²|y) = Inv-χ²(n − 1, s²)

• Then, factoring p(µ, σ²) = p(µ|σ²) p(σ²), the conjugate prior for σ² would also be a scaled inverse χ², and for µ (conditional on σ²) it would be normal. Consider

  µ|σ² ∼ N(µ0, σ²/κ0)

  σ² ∼ Inv-χ²(ν0, σ0²)

• Jointly:

  p(µ, σ²) ∝ σ^{−1} (σ²)^{−(ν0/2+1)} exp(−(1/(2σ²)) [ν0σ0² + κ0(µ0 − µ)²])

• Note that µ and σ² are not independent a priori.

• Posterior density for (µ, σ²):

  – Multiply the likelihood by the N-Inv-χ²(µ0, σ²/κ0; ν0, σ0²) prior
  – Expand the two squares in µ
  – Complete the square by adding and subtracting a term depending on ȳ and µ0

  Then p(µ, σ²|y) = N-Inv-χ²(µn, σn²/κn; νn, σn²), where

  µn = [κ0/(κ0 + n)] µ0 + [n/(κ0 + n)] ȳ

  κn = κ0 + n,  νn = ν0 + n

  νnσn² = ν0σ0² + (n − 1)s² + [κ0n/(κ0 + n)] (ȳ − µ0)²

Normal data and conjugate prior (cont'd)

• Interpretation of the posterior parameters:

  – µn is a weighted average, as earlier.
  – νnσn²: sum of the sample sum of squares, the prior sum of squares, and additional uncertainty due to the difference between the sample mean and the prior mean.

• Conditional posterior of µ: as before, µ|σ², y ∼ N(µn, σ²/κn).

• Marginal posterior of σ²: as before, σ²|y ∼ Inv-χ²(νn, σn²).

• Marginal posterior of µ: as before, µ|y ∼ t_{νn}(µ|µn, σn²/κn).

• Two ways to sample from the joint posterior distribution:

  1. Sample µ from the t, and σ² from the Inv-χ²
  2. Sample σ² from the Inv-χ², and given σ², sample µ from the normal

Semi-conjugate prior for the normal model

• Consider setting independent priors for µ and σ²:

  µ ∼ N(µ0, τ0²)

  σ² ∼ Inv-χ²(ν0, σ0²)

• Example: for the mean weight of students, visual inspection shows weights between 100 and 200 pounds.

• p(µ, σ²) = p(µ) p(σ²) is not conjugate and does not lead to a posterior of known form.

• Can factor as earlier:

  µ|σ², y ∼ N(µn, τn²),

  with

  µn = [(1/τ0²)µ0 + (n/σ²)ȳ] / [(1/τ0²) + (n/σ²)],  τn² = 1 / [(1/τ0²) + (n/σ²)]

• NOTE: even though µ and σ² are independent a priori, they are not independent in the posterior.

Semi-conjugate prior and p(σ²|y)

• The marginal posterior p(σ²|y) can be obtained by integrating the joint p(µ, σ²|y) w.r.t. µ:

  p(σ²|y) ∝ ∫ N(µ|µ0, τ0²) Inv-χ²(σ²|ν0, σ0²) ∏ N(yi|µ, σ²) dµ

• The integration can be performed by noting that the integrand, as a function of µ, is proportional to a normal density.

• Keeping track of the normalizing constants that depend on σ² is messy. It is easier to note that

  p(σ²|y) = p(µ, σ²|y) / p(µ|σ², y),

  so that

  p(σ²|y) ∝ N(µ|µ0, τ0²) Inv-χ²(σ²|ν0, σ0²) ∏ N(yi|µ, σ²) / N(µ|µn, τn²),

  which is still a mess.

• Factors that depend on µ must cancel, and therefore p(σ²|y) does not depend on µ, in the sense that we can evaluate p(σ²|y) for a grid of values of σ² and any arbitrary value of µ.

• Choose µ = µn; then the denominator simplifies to something proportional to τn^{−1}. Then

  p(σ²|y) ∝ τn N(µn|µ0, τ0²) Inv-χ²(σ²|ν0, σ0²) ∏ N(yi|µn, σ²),

  which can be evaluated for a grid of values of σ².

Implementing the inverse CDF method

• We often need to draw values from distributions that do not have a "nice" standard form. An example is expression 3.14 in the text, for p(σ²|y) in the normal model with semi-conjugate priors for µ and σ².

• One approach is the inverse cdf method (see earlier notes), implemented numerically.

• We assume that we can evaluate the pdf (even in unnormalized form) for a grid of values of θ in the appropriate range.

• To generate M draws (e.g., M = 1000) from a distribution p(θ|y), do (see the sketch after this list):

  1. Evaluate p(θ|y) for a grid of m values of θ. Denote the (normalized) evaluations by (p1, p2, ..., pm).
  2. Compute the CDF over the grid: (p1, p1 + p2, ..., Σ_{i=1}^{m−1} pi, 1), and denote those values (f1, f2, ..., fm).
  3. Generate M uniform random variables u in [0, 1].
  4. For each u, if u ∈ (f_{i−1}, f_i], draw θi.
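A generic sketch of this grid sampler in R (not code from the course site):

    # Grid-based inverse CDF sampling from an unnormalized density.
    draw_grid <- function(theta_grid, dens, M = 1000) {
      p <- dens / sum(dens)               # normalize the evaluations
      f <- cumsum(p)                      # CDF over the grid
      u <- runif(M)
      theta_grid[findInterval(u, f) + 1]  # invert the CDF
    }

    # Example: sample from a Beta(3, 5) known only up to a constant.
    grid <- seq(0.001, 0.999, length.out = 200)
    draws <- draw_grid(grid, grid^2 * (1 - grid)^4)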

Inverse CDF method - Example

  θ     Prob(θ)   CDF (F)
  1     0.03      0.03
  2     0.04      0.07
  3     0.08      0.15
  4     0.15      0.30
  5     0.20      0.50
  6     0.30      0.80
  7     0.10      0.90
  8     0.05      0.95
  9     0.03      0.98
  10    0.02      1.00

• Consider a parameter θ with values in {1, ..., 10}, with probability distribution and cumulative distribution functions as above. Under this distribution, values of θ between 4 and 7 are more likely.

• To implement the method, do:

  1. Draw u ∼ U(0, 1). For example, in the first draw u = 0.45.
  2. For u ∈ (f_{i−1}, f_i], draw θi.
  3. For u = 0.45 ∈ (0.30, 0.50], draw θ = 5.
  4. Alternative approach:
     (a) Flip another coin v ∼ U(0, 1).
     (b) Pick θ_{i−1} if v ≤ 0.5 and pick θi if v > 0.5.
  5. In the example, for u = 0.45, we would either choose θ = 5 directly, or choose θ = 4 or θ = 5 each with probability 1/2.
  6. Repeat many times M.

• If M is very large, we expect that about 65% of our draws will be θ = 4, 5, or 6, about 2% will be θ = 10, etc.

Example: Football point spreads

• Data di, i = 1, ..., 672 are differences between the predicted outcome of a football game and the actual score.

• Normal model:

  di ∼ N(µ, σ²)

• Priors:

  µ ∼ N(0, 2²),  p(σ²) ∝ σ^{−2}

• To draw values of (µ, σ²) from the posterior, do:

  – First draw σ² from p(σ²|y) using the inverse cdf method
  – Then draw µ from p(µ|σ², y), a normal.

• To draw σ², evaluate p(σ²|y) on the grid [150, 250].

Example: Rounded measurements (prob. 3.5)

• Sometimes measurements are rounded, and we do not observe the "true" values.

• yi are the observed rounded values, and zi are the unobserved true measurements.

• If zi ∼ N(µ, σ²), then

  p(y|µ, σ²) = ∏_i [Φ((yi + 0.5 − µ)/σ) − Φ((yi − 0.5 − µ)/σ)]

• Prior: p(µ, σ²) ∝ σ^{−2}

• We are interested in posterior inference about (µ, σ²) and in the differences between the rounded and exact analyses.

[Figure: joint posterior contours of (µ, σ²) under the exact normal analysis (left) and the rounded-data analysis (right); µ on the horizontal axis (8 to 13), σ² on the vertical axis (0 to 4).]

Multinomial model - Intro

• Generalization of the binomial model, for the case where observations can have more than two possible values.

• Sampling distribution: multinomial with parameters (θ1, ..., θk), the probabilities associated with each of the k possible outcomes.

• Example: in a survey, respondents may Strongly Agree, Agree, Disagree, Strongly Disagree, or have No Opinion when presented with a statement such as "Instructor for Stat 544 is spectacular".

Multinomial model - Sampling dist'n

• Formally:

  – y = k × 1 vector of counts of the #s of observations in each outcome
  – θj: probability of the jth outcome
  – Σ_{j=1}^k θj = 1 and Σ_{j=1}^k yj = n

• Sampling distribution:

  p(y|θ) ∝ ∏_{j=1}^k θj^{yj}

Multinomial model - Prior

• The conjugate prior for (θ1, ..., θk) is the Dirichlet distribution, a multivariate generalization of the Beta:

  p(θ|α) ∝ ∏_{j=1}^k θj^{αj−1},

  with

  αj > 0 ∀j, and α0 = Σ_{j=1}^k αj

  θj > 0 ∀j, and Σ_{j=1}^k θj = 1

• The αj can be thought of as "prior counts" associated with the jth outcome, so that α0 would then be a "prior sample size".

• For the Dirichlet:

  E(θj) = αj/α0,  Var(θj) = αj(α0 − αj) / [α0²(α0 + 1)]

  Cov(θi, θj) = −αiαj / [α0²(α0 + 1)]

Dirichlet distribution

• The Dirichlet distribution is the conjugate prior for the parameters of the multinomial model.

• If (θ1, θ2, ..., θK) ∼ D(α1, α2, ..., αK), then

  p(θ1, ..., θK) = [Γ(α0) / ∏_j Γ(αj)] ∏_j θj^{αj−1},

  where θj ≥ 0, Σ_j θj = 1, αj ≥ 0, and α0 = Σ_j αj.

• Some properties:

  E(θj) = αj/α0

  Var(θj) = αj(α0 − αj) / [α0²(α0 + 1)]

  Cov(θi, θj) = −αiαj / [α0²(α0 + 1)]

  Note that the θj are negatively correlated.

• Because of the sum-to-one restriction, the pdf of the K-dimensional random vector can be written in terms of K − 1 random variables.

• The marginal distributions can be shown to be Beta.

• Proof for K = 3:

  p(θ1, θ2) = [Γ(α1 + α2 + α3) / (Γ(α1)Γ(α2)Γ(α3))] θ1^{α1−1} θ2^{α2−1} (1 − θ1 − θ2)^{α3−1}

• To get the marginal for θ1, integrate p(θ1, θ2) with respect to θ2, with limits of integration 0 and 1 − θ1. Call the normalizing constant Q and use the change of variable

  v = θ2/(1 − θ1),

  so that θ2 = v(1 − θ1) and dθ2 = (1 − θ1) dv.

• The marginal is then:

  p(θ1) = Q ∫_0^1 θ1^{α1−1} (v(1 − θ1))^{α2−1} (1 − θ1 − v(1 − θ1))^{α3−1} (1 − θ1) dv

        = Q θ1^{α1−1} (1 − θ1)^{α2+α3−1} ∫_0^1 v^{α2−1} (1 − v)^{α3−1} dv

        = Q θ1^{α1−1} (1 − θ1)^{α2+α3−1} Γ(α2)Γ(α3)/Γ(α2 + α3)

• Then,

  θ1 ∼ Beta(α1, α2 + α3).

• This generalizes to what is called the clumping property of the Dirichlet. In general, if (θ1, ..., θK) ∼ D(α1, ..., αK):

  p(θi) = Beta(αi, α0 − αi)

Multinomial model - Posterior

• The posterior must have the Dirichlet form also:

  p(θ|y) ∝ ∏_{j=1}^k θj^{αj−1} θj^{yj} ∝ ∏_{j=1}^k θj^{yj+αj−1}

• αn = Σ_j (αj + yj) = α0 + n is the "total" number of "observations".

• The posterior mean (a point estimate) of θj is:

  E(θj|y) = (αj + yj)/(α0 + n) = "#" of obs. of the jth outcome / "total" # of obs.

• For αj = 1 ∀j: a uniform non-informative prior on all vectors of θj such that Σ_j θj = 1.

• For αj = 0 ∀j: the uniform prior is on the log(θj), with the same restriction.

• In either case, the posterior is proper if yj ≥ 1 ∀j.

Multinomial model - samples from p(θ|y)

• Gamma method (sketched below): two steps for each θj:

  1. Draw x1, x2, ..., xk from independent Gamma(αj + yj, δ) distributions, for any common δ
  2. Set θj = xj / Σ_{i=1}^k xi

• Beta method: relies on properties of the Dirichlet:

  – Marginal: p(θj|y) = Beta(αj + yj, αn − (αj + yj))
  – Conditional: p(θj|θ−j, y) is Dirichlet
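A minimal sketch of the Gamma method as an R function (the name rdirichlet is ours; the MCMCpack package provides a similar function):

    # Draw M vectors from Dirichlet(a1, ..., ak) via independent Gammas.
    rdirichlet <- function(M, a) {
      x <- matrix(rgamma(M * length(a), shape = a), ncol = length(a), byrow = TRUE)
      x / rowSums(x)   # each row sums to one
    }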

Multinomial model - Example

• Pre-election polling in 1988:

  – n = 1,447 adults in the US
  – y1 = 727 supported G. Bush (the elder)
  – y2 = 583 supported M. Dukakis
  – y3 = 137 supported other or had no opinion

• If no other information is available, we can assume that observations are exchangeable given θ. (However, if information on party affiliation is available, it is unreasonable to assume exchangeability.)

• Polling is done under a complex survey design. Ignore this for now and assume simple random sampling. Then (y1, y2, y3) ∼ Mult(θ1, θ2, θ3).

• Of interest: θ1 − θ2, the difference in the population in support of the two major candidates.

• Non-informative prior, for now: set α1 = α2 = α3 = 1.

• The posterior distribution is Dirichlet(728, 584, 138). Then:

  E(θ1|y) = 0.502,  E(θ2|y) = 0.403

• Other quantities are obtained by simulation (see next).

• To derive p(θ1 − θ2|y), do:

  1. Draw m values of (θ1, θ2, θ3) from the posterior
  2. For each draw, compute θ1 − θ2

• Results and program: see the quantiles of the posterior distributions of θ1 and θ2, the credible set for θ1 − θ2, and Prob(θ1 > θ2|y); a sketch follows.
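Using the rdirichlet sketch above, the quantities of interest can be approximated as follows:

    theta <- rdirichlet(10000, c(728, 584, 138))  # posterior draws
    diff <- theta[, 1] - theta[, 2]
    quantile(diff, c(0.025, 0.975))   # credible set for theta1 - theta2
    mean(theta[, 1] > theta[, 2])     # Prob(theta1 > theta2 | y)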

[Figures: posterior histograms of θ1 (proportion voting for Bush the elder, roughly 0.46 to 0.54), θ1 − θ2 (difference between the two candidates, roughly 0.0 to 0.20), and θ2 (proportion voting for Dukakis, roughly 0.36 to 0.44).]

Example: Bioassay experiment

• Does mortality in lab animals increase with increased dose of some drug?

• Experiment: 20 animals randomly allocated to four doses ("treatments"), and the number of dead animals within each dose recorded.

  Dose xi    ni    No. of deaths yi
  −0.863      5     0
  −0.296      5     1
  −0.053      5     3
   0.727      5     5

• Animals are exchangeable within dose.

Bioassay example (cont'd)

• Model:

  yi ∼ Bin(ni, θi)

• The θi are not exchangeable, because the probability of death depends on dose.

• One way to model this is with a linear relationship:

  θi = α + βxi.

  Not a good idea, because θi ∈ (0, 1).

• Transform θi:

  logit(θi) = log[θi/(1 − θi)]

• Since logit(θi) ∈ (−∞, +∞), we can use a linear model (logistic regression):

  logit(θi) = α + βxi.

• Likelihood: if yi ∼ Bin(ni, θi), then

  p(yi|ni, θi) ∝ θi^{yi} (1 − θi)^{ni−yi}.

• But recall that

  log[θi/(1 − θi)] = α + βxi,

  so that

  θi = exp(α + βxi) / [1 + exp(α + βxi)].

• Then

  p(yi|α, β, ni, xi) ∝ [exp(α + βxi)/(1 + exp(α + βxi))]^{yi} [1 − exp(α + βxi)/(1 + exp(α + βxi))]^{ni−yi}

• Prior: p(α, β) ∝ 1.

• Posterior:

  p(α, β|y, n, x) ∝ ∏_{i=1}^k p(yi|α, β, ni, xi).

• We first evaluate p(α, β|y, n, x) on the grid

  (α, β) ∈ [−5, 10] × [−10, 40]

  and use the inverse cdf method to sample from the posterior.

Bioassay example (cont'd)

• The posterior is evaluated on a 200 × 200 grid of values of (α, β), with columns indexed by α1, ..., α200 and rows by β1, ..., β200.

• Entry (j, i) in the grid is p(α = αi, β = βj | y).

• The sum of the entries in column i is proportional to p(αi|y), because

  p(αi|y) = ∫ p(αi, β|y) dβ ≈ Σ_j p(αi, βj|y)

• Likewise, the sum of the entries in row j is proportional to p(βj|y).

Bioassay example (cont'd)

• We sample α from p(α|y), and then β from p(β|α, y) (or the other way around); a code sketch follows.

• To sample α from p(α|y):

  1. Obtain the empirical p(α = αi|y), i = 1, ..., 200, by summing over the β.
  2. Use the inverse cdf method to sample from p(α|y).

• To sample β from p(β|α, y):

  1. Given a draw α*, choose the appropriate column in the grid.
  2. Use the inverse cdf method on p(β|α = α*, y).
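A sketch of the grid scheme in R (plogis is the logistic CDF, so plogis(eta) = exp(eta)/(1 + exp(eta))):

    x <- c(-0.863, -0.296, -0.053, 0.727); n <- rep(5, 4); y <- c(0, 1, 3, 5)
    a_grid <- seq(-5, 10, length.out = 200)
    b_grid <- seq(-10, 40, length.out = 200)

    # Log posterior (flat prior) on the grid; rows index alpha, columns beta.
    logpost <- outer(a_grid, b_grid, function(a, b) {
      lp <- 0
      for (i in 1:4) {
        eta <- a + b * x[i]
        lp <- lp + y[i] * plogis(eta, log.p = TRUE) +
              (n[i] - y[i]) * plogis(-eta, log.p = TRUE)
      }
      lp
    })
    post <- exp(logpost - max(logpost))

    # Sample alpha from its gridded marginal, then beta given alpha.
    ia <- sample(200, 1000, replace = TRUE, prob = rowSums(post))
    alpha <- a_grid[ia]
    beta <- sapply(ia, function(i) sample(b_grid, 1, prob = post[i, ]))
    ld50 <- -alpha / beta   # dose at which Pr(death) = 0.5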

Bioassay example (cont'd)

• The LD50 is the dose at which the probability of death is 0.5, i.e.,

  E(yi/ni) = θi = 0.5,

  so that

  0.5 = logit^{−1}(α + βxi)

  logit(0.5) = α + βxi

  0 = α + βxi

  Then the LD50 is computed as xi = −α/β.

[Figure: box plots of the posterior distribution of the probability of death θ for each of the four doses, on the scale 0 to 1.]

[Figure: posterior distributions of the regression parameters, from 2000 samples. Alpha: posterior mean = 1.297, with 95% credible set [−0.365, 3.755]. Beta: posterior mean = 11.52, with 95% credible set [3.572, 25.79].]

[Figure: posterior distribution of the LD50 on the log scale, from 2000 samples. Posterior mean: −0.1064; 95% credible set: [−0.268, 0.1141].]

Multivariate normal model - Intro

• Generalization of the univariate normal model, where now observations are d × 1 vectors of measurements.

• For the ith sample unit: yi = (yi1, yi2, ..., yid)′, and yi ∼ N(µ, Σ), with (µ, Σ) d × 1 and d × d respectively.

• For a sample of n iid observations:

  p(y|µ, Σ) ∝ |Σ|^{−n/2} exp[−(1/2) Σ_i (yi − µ)′ Σ^{−1} (yi − µ)]

            ∝ |Σ|^{−n/2} exp[−(1/2) tr(Σ^{−1} S0)],

  with

  S0 = Σ_i (yi − µ)(yi − µ)′

• S0 is the d × d matrix of sample squared deviations and cross-deviations from µ.

Multivariate normal model - Known Σ

• We want the posterior distribution of the mean, p(µ|y), under the assumption that the variance matrix is known.

• The conjugate prior for µ is p(µ|µ0, Λ0) = N(µ0, Λ0), as in the univariate case.

• Posterior for µ:

  p(µ|y, Σ) ∝ exp{−(1/2) [(µ − µ0)′ Λ0^{−1} (µ − µ0) + Σ_i (yi − µ)′ Σ^{−1} (yi − µ)]},

  which is a quadratic function of µ. Completing the square in the exponent:

  p(µ|y, Σ) = N(µn, Λn),

  with

  µn = (Λ0^{−1} + nΣ^{−1})^{−1} (Λ0^{−1} µ0 + nΣ^{−1} ȳ)

  Λn = (Λ0^{−1} + nΣ^{−1})^{−1}

Multivariate normal model - Known Σ (cont'd)

• The posterior precision is equal to the sum of the prior and sample precisions:

  Λn^{−1} = Λ0^{−1} + nΣ^{−1}

• The posterior mean is a weighted average of the prior and sample means:

  µn = (Λ0^{−1} + nΣ^{−1})^{−1} (Λ0^{−1} µ0 + nΣ^{−1} ȳ)

• From the usual properties of the multivariate normal distribution, we can derive the marginal posterior distribution of any subvector of µ, or the conditional posterior distribution of µ(1) given µ(2).

• Let µ′ = (µ(1)′, µ(2)′)′ and, correspondingly, partition

  µn = [µn(1); µn(2)]

  and

  Λn = [Λn(1,1) Λn(1,2); Λn(2,1) Λn(2,2)]

Multivariate normal model - Known Σ (cont'd)

• The marginal of the subvector µ(1) is p(µ(1)|Σ, y) = N(µn(1), Λn(1,1)).

• The conditional of the subvector µ(1) given µ(2) is

  p(µ(1)|µ(2), Σ, y) = N(µn(1) + β^{1|2}(µ(2) − µn(2)), Λ^{1|2}),

  where

  β^{1|2} = Λn(1,2) (Λn(2,2))^{−1}

  Λ^{1|2} = Λn(1,1) − Λn(1,2) (Λn(2,2))^{−1} Λn(2,1)

• We recognize β^{1|2} as the coefficient matrix in a regression of µ(1) on µ(2).

Multivariate normal - Posterior predictive

• We seek p(ỹ|y), and note that

  p(ỹ, µ|y) = N(ỹ; µ, Σ) N(µ; µn, Λn)

• The exponential in p(ỹ, µ|y) is a quadratic form in (ỹ, µ), so (ỹ, µ) have a normal joint posterior distribution.

• The marginal p(ỹ|y) is also normal, with mean and variance:

  E(ỹ|y) = E(E(ỹ|µ, y)|y) = E(µ|y) = µn

  var(ỹ|y) = E(var(ỹ|µ, y)|y) + var(E(ỹ|µ, y)|y) = E(Σ|y) + var(µ|y) = Σ + Λn.

Multivariate normal - Sampling from p(ỹ|y)

• With Σ known, to draw a value ỹ from p(ỹ|y), note that

  p(ỹ|y) = ∫ p(ỹ|µ, y) p(µ|y) dµ

• Then:

  1. Draw µ from p(µ|y) = N(µn, Λn)
  2. Draw ỹ from p(ỹ|µ, y) = N(µ, Σ)

• Alternatively (better), draw ỹ directly from

  p(ỹ|y) = N(µn, Σ + Λn)

Sampling from a multivariate normal

• Want to sample y, d × 1, from N(µ, Σ).

• Two common approaches. First, using the Cholesky decomposition of Σ (sketched below):

  1. Get A, a lower triangular matrix such that AA′ = Σ
  2. Draw z1, z2, ..., zd iid N(0, 1) and let z = (z1, ..., zd)′
  3. Compute y = µ + Az
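A minimal sketch in R; note that chol() returns the upper-triangular factor R with R′R = Σ, so the lower-triangular A is its transpose:

    rmvn <- function(mu, Sigma) {
      A <- t(chol(Sigma))               # lower triangular, A %*% t(A) = Sigma
      as.vector(mu + A %*% rnorm(length(mu)))
    }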

Sampling from a multivariate normal (cont'd)

• Second, using sequential conditional sampling, which relies on the fact that all conditionals in a multivariate normal are also normal:

  1. Draw y1 from N(µ(1), σ²(1))
  2. Then draw y2|y1 from N(µ^{2|1}, σ²^{2|1})
  3. Etc.

• For example, if d = 2:

  1. Draw y1 from N(µ(1), σ²(1))
  2. Draw y2 from

     N(µ(2) + [σ(12)/σ²(1)] (y1 − µ(1)), σ²(2) − [σ(12)]²/σ²(1))

Non-informative prior for µ

• A non-informative prior for µ is obtained by letting the prior precision Λ0^{−1} go to zero.

• With the uniform prior, the posterior for µ is proportional to the likelihood.

• The posterior is proper only if n > d; otherwise, S is not of full rank.

• If n > d,

  p(µ|Σ, y) = N(ȳ, Σ/n)

Multivariate normal - unknown µ, Σ

• The conjugate family of priors for (µ, Σ) is the normal-inverse-Wishart family.

• The inverse Wishart is the multivariate generalization of the inverse χ² distribution.

• If Σ|ν0, Λ0 ∼ Inv-Wishart_{ν0}(Λ0^{−1}), then

  p(Σ|ν0, Λ0) ∝ |Σ|^{−(ν0+d+1)/2} exp{−(1/2) tr(Λ0 Σ^{−1})},

  for Σ positive definite and symmetric.

• For the Inv-Wishart with ν0 degrees of freedom and scale Λ0: E(Σ) = (ν0 − d − 1)^{−1} Λ0.

• The conjugate prior is then

  Σ ∼ Inv-Wishart_{ν0}(Λ0^{−1})

  µ|Σ ∼ N(µ0, Σ/κ0),

  which corresponds to

  p(µ, Σ) ∝ |Σ|^{−((ν0+d)/2+1)} exp(−(1/2) tr(Λ0 Σ^{−1}) − (κ0/2)(µ − µ0)′ Σ^{−1} (µ − µ0))

Multivariate normal - unknown µ, Σ (cont'd)

• The posterior must also be in the Normal-Inv-Wishart form.

• Results from the univariate normal generalize directly:

  – Σ|y ∼ Inv-Wishart_{νn}(Λn^{−1})
  – µ|Σ, y ∼ N(µn, Σ/κn)
  – µ|y ∼ Mult-t_{νn−d+1}(µn, Λn/(κn(νn − d + 1)))
  – ỹ|y ∼ Mult-t

• Here:

  µn = [κ0/(κ0 + n)] µ0 + [n/(κ0 + n)] ȳ

  κn = κ0 + n

  νn = ν0 + n

  Λn = Λ0 + S + [nκ0/(κ0 + n)] (ȳ − µ0)(ȳ − µ0)′

  S = Σ_i (yi − ȳ)(yi − ȳ)′

Multivariate normal - sampling µ, Σ

• To sample from p(µ, Σ|y), do: (1) sample Σ from p(Σ|y); (2) sample µ from p(µ|Σ, y). A sketch follows.

• To sample from p(ỹ|y), use the drawn (µ, Σ) and obtain a draw ỹ from N(µ, Σ).

• Sampling from Wishart_ν(S):

  1. Draw ν independent d × 1 vectors α1, α2, ..., αν from N(0, S)
  2. Let Q = Σ_{i=1}^ν αi αi′

• The method works when ν > d. If Q ∼ Wishart, then Q^{−1} = Σ ∼ Inv-Wishart.

• We have already seen how to sample from a multivariate normal given the mean vector and covariance matrix.
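A minimal sketch using R's built-in rWishart (an Inv-Wishart_{νn}(Λn^{−1}) draw is the inverse of a Wishart_{νn}(Λn^{−1}) draw), with the rmvn helper from above:

    draw_mu_sigma <- function(mun, kappan, nun, Lambdan) {
      W <- rWishart(1, df = nun, Sigma = solve(Lambdan))[, , 1]
      Sigma <- solve(W)                  # Inv-Wishart draw
      mu <- rmvn(mun, Sigma / kappan)    # N(mu_n, Sigma/kappa_n)
      list(mu = mu, Sigma = Sigma)
    }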

Multivariate normal - Jeffreys prior

• If we start from the conjugate normal-inverse-Wishart prior and let

  κ0 → 0,  ν0 → −1,  |Λ0| → 0,

  then the resulting prior is the Jeffreys prior for (µ, Σ):

  p(µ, Σ) ∝ |Σ|^{−(d+1)/2}

Page 137: Bayesian Methods for Data Analysis - ISU Public Homepage Server

Example: bullet lead concentrations

• In the US, four major manufacturers produce all bullets fired. One of them is Cascade.

• A sample of 200 round-nosed .38 caliber bullets with lead tips was obtained from Cascade.

• The concentration of five elements (antimony, copper, arsenic, bismuth, and silver) in the lead alloy was obtained. The data for Cascade are stored in federal.data on the course web site.

• Assuming that the 200 5 × 1 observation vectors for Cascade can be represented by a multivariate normal distribution (perhaps after transformation), we are interested in the posterior distribution of the mean vector and of functions of the covariance parameters.


• Particular quantities of interest for inference include:

– The mean trace element concentrations (µ1, ..., µ5)
– The correlations between trace element concentrations (ρ12, ρ13, ..., ρ45)
– The largest eigenvalue of the covariance matrix

• In the data file, columns correspond to antimony, copper, arsenic, bismuth, and silver, in that order.

• For this example, the antimony concentration was divided by 100.


Bullet lead example - Model and prior

• We concentrate on the correlations that involve copper (four of them).

• Sampling distribution:

y|µ,Σ ∼ N (µ,Σ)

• We use the conjugate normal-Inv-Wishart family of priors, and choose the parameters of p(Σ|ν0, Λ0) first:

  Λ0,ii = (100, 5000, 15000, 1000, 100)

  Λ0,ij = 0 for all i ≠ j

  ν0 = 7


• For p(µ|Σ, µ0, κ0):

  µ0 = (200, 200, 200, 100, 50)

  κ0 = 10

• The low values for ν0 and κ0 suggest little confidence in the prior guesses µ0, Λ0.

• We set Λ0 to be diagonal a priori: we have some information about the variances of the five element concentrations, but no information about their correlation.

• Sampling from the posterior distribution: we follow the usual approach:

1. Draw Σ from Inv-Wishart_{νn}(Λn^{-1})

2. Draw µ from N(µn, Σ/κn)


3. For each draw, compute:
(a) the ratio between λ1 (the largest eigenvalue of Σ) and λ2 (the second largest eigenvalue of Σ)
(b) the four correlations ρ_{copper,j} = σ_{copper,j}/(σ_copper σ_j), for j ∈ {antimony, arsenic, bismuth, silver}.

• We are interested in the posterior distributions of the five means, the eigenvalue ratio, and the four correlation coefficients.


Scatter plot of the posterior mean concentrations

(Figure: matrix of pairwise scatter plots of the posterior draws of the five mean concentrations.)

Results from R program

# antimony
c(mean(sample.mu[,1]), sqrt(var(sample.mu[,1])))
[1] 265.237302   1.218594
quantile(sample.mu[,1], probs=c(0.025,0.05,0.5,0.95,0.975))
    2.5%     5.0%    50.0%    95.0%    97.5%
 262.875 263.3408 265.1798  267.403 267.7005
# copper
c(mean(sample.mu[,2]), sqrt(var(sample.mu[,2])))
[1] 259.36335   5.20864
quantile(sample.mu[,2], probs=c(0.025,0.05,0.5,0.95,0.975))
    2.5%     5.0%    50.0%    95.0%    97.5%
249.6674 251.1407 259.5157 268.0606 269.3144
# arsenic
c(mean(sample.mu[,3]), sqrt(var(sample.mu[,3])))
[1] 231.196891   8.390751
quantile(sample.mu[,3], probs=c(0.025,0.05,0.5,0.95,0.975))
    2.5%     5.0%    50.0%    95.0%    97.5%
213.5079 216.7256 231.6443 243.5567 247.1686
# bismuth
c(mean(sample.mu[,4]), sqrt(var(sample.mu[,4])))
[1] 127.248741   1.553327
quantile(sample.mu[,4], probs=c(0.025,0.05,0.5,0.95,0.975))
    2.5%     5.0%    50.0%    95.0%    97.5%
124.3257 124.6953 127.2468 129.7041  130.759
# silver
c(mean(sample.mu[,5]), sqrt(var(sample.mu[,5])))
[1] 38.2072916  0.7918199
quantile(sample.mu[,5], probs=c(0.025,0.05,0.5,0.95,0.975))
    2.5%     5.0%    50.0%    95.0%    97.5%
36.70916 36.93737  38.1971 39.54205 39.73169


Posterior of λ1/λ2 of Σ


Summary statistics of posterior dist. of ratio

c(mean(ratio.l), sqrt(var(ratio.l)))
[1] 5.7183875 0.8522507
quantile(ratio.l, probs=c(0.025,0.05,0.5,0.95,0.975))
    2.5%     5.0%    50.0%    95.0%    97.5%
4.336424 4.455404 5.606359 7.203735 7.619699


Correlation of copper with other elements


Summary statistics for correlations

# j = antimony
c(mean(sample.rho[,1]), sqrt(var(sample.rho[,1])))
[1] 0.50752063 0.05340247
quantile(sample.rho[,1], probs=c(0.025,0.05,0.5,0.95,0.975))
     2.5%      5.0%     50.0%     95.0%     97.5%
0.4014274 0.4191201 0.5076544 0.5901489 0.6047564
# j = arsenic
c(mean(sample.rho[,3]), sqrt(var(sample.rho[,3])))
[1] -0.56623609 0.04896403
quantile(sample.rho[,3], probs=c(0.025,0.05,0.5,0.95,0.975))
      2.5%       5.0%      50.0%      95.0%      97.5%
-0.6537461 -0.6448817  -0.570833 -0.4857224  -0.465808
# j = bismuth
c(mean(sample.rho[,4]), sqrt(var(sample.rho[,4])))
[1] 0.63909311 0.04087149
quantile(sample.rho[,4], probs=c(0.025,0.05,0.5,0.95,0.975))
     2.5%      5.0%     50.0%     95.0%     97.5%
0.5560137 0.5685715 0.6429459 0.7013519 0.7135556
# j = silver
c(mean(sample.rho[,5]), sqrt(var(sample.rho[,5])))
[1] 0.03642082 0.07232010
quantile(sample.rho[,5], probs=c(0.025,0.05,0.5,0.95,0.975))
       2.5%       5.0%      50.0%    95.0%     97.5%
-0.09464575 -0.0831816 0.03370379 0.164939 0.1765523


S-Plus code

# Cascade example
# We need to define two functions, one to generate random vectors from
# a multivariate normal distribution and the other to generate random
# matrices from a Wishart distribution.
# Note that a function generating random matrices from an inverse
# Wishart distribution is not needed: if W ~ Wishart(S) then
# W^(-1) ~ Inv-Wishart(S^(-1)).

# A function that generates random observations from a
# multivariate normal distribution: rmnorm(a, b)
# a = a k x 1 column vector (the mean)
# b = a positive definite k x k matrix (the covariance)
rmnorm <- function(a, b) {
  k <- nrow(b)
  zz <- t(chol(b)) %*% matrix(rnorm(k), nrow = k) + a
  zz
}

# A function that generates random observations from a
# Wishart distribution: rwishart(a, b)
# a = degrees of freedom (must be > k)
# b = a positive definite k x k matrix
rwishart <- function(a, b) {
  k <- ncol(b)
  m <- matrix(0, nrow = k, ncol = 1)
  cc <- matrix(0, nrow = a, ncol = k)
  for (i in 1:a) { cc[i, ] <- rmnorm(m, b) }
  w <- t(cc) %*% cc
  w
}

# Read the data
y <- scan(file = "cascade.data")
y <- matrix(y, ncol = 5, byrow = T)
y[, 1] <- y[, 1]/100
means <- rep(0, 5)
for (i in 1:5) { means[i] <- mean(y[, i]) }
means <- matrix(means, ncol = 1, byrow = T)
n <- nrow(y)

# Assign values to the prior parameters
v0 <- 7
Delta0 <- diag(c(100, 5000, 15000, 1000, 100))
k0 <- 10
mu0 <- matrix(c(200, 200, 200, 100, 50), ncol = 1, byrow = T)

# Calculate the values of the parameters of the posterior
mu.n <- (k0*mu0 + n*means)/(k0 + n)
v.n <- v0 + n
k.n <- k0 + n
ones <- matrix(1, nrow = n, ncol = 1)
S <- t(y - ones %*% t(means)) %*% (y - ones %*% t(means))
Delta.n <- Delta0 + S + (k0*n/(k0 + n))*(means - mu0) %*% t(means - mu0)

# Draw Sigma and mu from their posteriors
samplesize <- 200
sample.l <- matrix(0, nrow = samplesize, ncol = 5)    # stores the eigenvalues of Sigma
sample.rho <- matrix(0, nrow = samplesize, ncol = 5)  # correlations with copper
sample.mu <- matrix(0, nrow = samplesize, ncol = 5)   # draws of mu
for (j in 1:samplesize) {
  # Sigma
  SS <- solve(Delta.n)
  # The following makes sure that SS is symmetric
  for (pp in 1:5) { for (jj in 1:5) { if (pp < jj) { SS[pp, jj] <- SS[jj, pp] } } }
  Sigma <- solve(rwishart(v.n, SS))
  # The following makes sure that Sigma is symmetric
  for (pp in 1:5) { for (jj in 1:5) { if (pp < jj) { Sigma[pp, jj] <- Sigma[jj, pp] } } }
  # Eigenvalues of Sigma
  sample.l[j, ] <- eigen(Sigma)$values
  # Correlation coefficients (copper is column 2)
  for (pp in 1:5) { sample.rho[j, pp] <- Sigma[pp, 2]/sqrt(Sigma[pp, pp]*Sigma[2, 2]) }
  # mu
  sample.mu[j, ] <- rmnorm(mu.n, Sigma/k.n)
}

# Graphics and summary statistics
sink("cascade.output")
# Ratio between the largest and the second largest eigenvalue of Sigma
ratio.l <- sample.l[, 1]/sample.l[, 2]
# Histogram of the draws of the ratio
postscript("cascade_lambda.eps", height = 5, width = 6)
hist(ratio.l, nclass = 30, axes = F, xlab = "eigenvalue1/eigenvalue2", xlim = c(3, 9))
axis(1)
dev.off()
# Summary statistics of the ratio of the eigenvalues
c(mean(ratio.l), sqrt(var(ratio.l)))
quantile(ratio.l, probs = c(0.025, 0.05, 0.5, 0.95, 0.975))

# Correlations with copper
postscript("cascade_corr.eps", height = 8, width = 8)
par(mfrow = c(2, 2))
hist(sample.rho[, 1], nclass = 30, axes = F, xlab = "corr(antimony,copper)", xlim = c(0.3, 0.7))
axis(1)
hist(sample.rho[, 3], nclass = 30, axes = F, xlab = "corr(arsenic,copper)")
axis(1)
hist(sample.rho[, 4], nclass = 30, axes = F, xlab = "corr(bismuth,copper)", xlim = c(0.5, 0.8))
axis(1)
hist(sample.rho[, 5], nclass = 25, axes = F, xlab = "corr(silver,copper)", xlim = c(-0.2, 0.3))
axis(1)
dev.off()
# Summary statistics of the dsn. of the correlation of copper with
# antimony
c(mean(sample.rho[, 1]), sqrt(var(sample.rho[, 1])))
quantile(sample.rho[, 1], probs = c(0.025, 0.05, 0.5, 0.95, 0.975))
# arsenic
c(mean(sample.rho[, 3]), sqrt(var(sample.rho[, 3])))
quantile(sample.rho[, 3], probs = c(0.025, 0.05, 0.5, 0.95, 0.975))
# bismuth
c(mean(sample.rho[, 4]), sqrt(var(sample.rho[, 4])))
quantile(sample.rho[, 4], probs = c(0.025, 0.05, 0.5, 0.95, 0.975))
# silver
c(mean(sample.rho[, 5]), sqrt(var(sample.rho[, 5])))
quantile(sample.rho[, 5], probs = c(0.025, 0.05, 0.5, 0.95, 0.975))

# Means
mulabels <- c("Antimony", "Copper", "Arsenic", "Bismuth", "Silver")
postscript("cascade_means.eps", height = 8, width = 8)
pairs(sample.mu, labels = mulabels)
dev.off()
# Summary statistics of the dsn. of the mean of
# antimony
c(mean(sample.mu[, 1]), sqrt(var(sample.mu[, 1])))
quantile(sample.mu[, 1], probs = c(0.025, 0.05, 0.5, 0.95, 0.975))
# copper
c(mean(sample.mu[, 2]), sqrt(var(sample.mu[, 2])))
quantile(sample.mu[, 2], probs = c(0.025, 0.05, 0.5, 0.95, 0.975))
# arsenic
c(mean(sample.mu[, 3]), sqrt(var(sample.mu[, 3])))
quantile(sample.mu[, 3], probs = c(0.025, 0.05, 0.5, 0.95, 0.975))
# bismuth
c(mean(sample.mu[, 4]), sqrt(var(sample.mu[, 4])))
quantile(sample.mu[, 4], probs = c(0.025, 0.05, 0.5, 0.95, 0.975))
# silver
c(mean(sample.mu[, 5]), sqrt(var(sample.mu[, 5])))
quantile(sample.mu[, 5], probs = c(0.025, 0.05, 0.5, 0.95, 0.975))

q()


Advanced Computation

• Approximations based on posterior modes

• Simulation from posterior distributions

• Markov chain simulation

• Why do we need advanced computational methods?

• Except for the simpler cases, computation is not possible with the methods available so far:

– Logistic regression with random effects
– Normal-normal model with unknown sampling variances σ²j
– Poisson-lognormal hierarchical model for counts


Strategy for computation

• If possible, work in the log-posterior scale.

• Factor posterior distribution: p(γ, φ|y) = p(γ|φ, y)p(φ|y)

– Reduces to a lower-dimensional problem
– May be able to sample on a grid
– Helps to identify parameters most influenced by the prior

• Re-parametrizing sometimes helps:

– Create parameters with easier interpretation
– Permit normal approximation (e.g., log of a variance, log of a Poisson rate, or logit of a probability)


Normal approximation to the posterior

• It is often reasonable to approximate a complicated posterior distribution using a normal (or a mixture of normals) approximation.

• The approximation may be a good starting point for more sophisticated methods.

• Computational strategy:

1. Find the joint posterior mode or the mode of marginal posterior distributions (a better strategy, if possible)

2. Fit normal approximations at the mode (or use a mixture of normals if the posterior is multimodal)

• Notation:

– p(θ|y): joint posterior of interest (target distribution)
– q(θ|y): un-normalized density, typically p(θ)p(y|θ)
– θ = (γ, φ), with φ typically of lower dimension than γ

• Practical advice: computations are often easier with log posteriors than with posteriors.


Finding posterior modes

• To find the mode of a density, we maximize the function with respect to the parameter(s): an optimization problem.

• Modes are not interesting per se, but provide the first step for analytically approximating a density.

• Modes are easier to find than means: no integration, and we can work with the un-normalized density.

• Multi-modal posteriors pose a problem: the only way to find multiple modes is to run mode-finding algorithms several times, starting from different locations in the parameter space.

• Whenever possible, shoot for finding the mode of marginal and conditional posteriors. If θ = (γ, φ) and

  p(θ|y) = p(γ|φ, y)p(φ|y),

  then:

1. Find the mode φ̂
2. Then find γ̂ by maximizing p(γ|φ = φ̂, y)

• Many algorithms exist to find modes. The most popular include Newton-Raphson and Expectation-Maximization (EM).


Taylor approximation to p(θ|y)

• The second-order expansion of log p(θ|y) around the mode θ̂ is

  log p(θ|y) ≈ log p(θ̂|y) + (1/2)(θ − θ̂)′ [ d²/dθ² log p(θ|y) ]_{θ=θ̂} (θ − θ̂).

• The linear term in the expansion vanishes because the log posterior has a zero derivative at the mode.

• Considering the approximation to log p(θ|y) as a function of θ, we see that:

1. log p(θ̂|y) is constant, and
2. the quadratic term

   (1/2)(θ − θ̂)′ [ d²/dθ² log p(θ|y) ]_{θ=θ̂} (θ − θ̂)

   is proportional to the logarithm of a normal density.


• Then, for large n and θ̂ in the interior of the parameter space:

  p(θ|y) ≈ N(θ̂, [I(θ̂)]^{-1})

  with

  I(θ̂) = − [ d²/dθ² log p(θ|y) ]_{θ=θ̂},

  the observed information matrix.
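• As a sketch in R: given a user-written log-posterior log_post (a hypothetical function, known up to a constant), a general-purpose optimizer returns both the mode and the curvature needed for the approximation:

normal_approx <- function(log_post, theta0) {
  nlp <- function(theta) -log_post(theta)            # minimize the negative log posterior
  fit <- optim(theta0, nlp, method = "BFGS", hessian = TRUE)
  list(mode = fit$par,                               # theta.hat
       Sigma = solve(fit$hessian))                   # Hessian of -log p at the mode = I(theta.hat)
}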


Normal and normal-mixture approximation

• For one mode, we saw that

  p_{n-approx}(θ|y) = N(θ̂, V_θ)

  with

  V_θ = [−L″(θ̂)]^{-1} or [I(θ̂)]^{-1}.

• L″(θ̂) is called the curvature of the log posterior at the mode.

• Suppose now that p(θ|y) has K modes.

• The approximation to p(θ|y) is now a mixture of normals:

  p_{n-approx}(θ|y) ∝ Σ_k ωk N(θ̂k, V_{θk})


• The ωk reflect the relative mass associated with each mode.

• It can be shown that ωk = q(θ̂k|y) |V_{θk}|^{1/2}.

• The mixture-normal approximation is then

  p_{n-approx}(θ|y) ∝ Σ_k q(θ̂k|y) exp{ -(1/2)(θ − θ̂k)′ V_{θk}^{-1} (θ − θ̂k) }.

• A more robust approximation can be obtained by using a t-kernel instead of the normal kernel:

  p_{t-approx}(θ|y) ∝ Σ_k q(θ̂k|y) [ ν + (θ − θ̂k)′ V_{θk}^{-1} (θ − θ̂k) ]^{-(d+ν)/2},

  with d the dimension of θ and ν relatively low. For most problems, ν = 4 works well.


Sampling from a mixture approximation

• To sample from the normal-mixture approximation:

1. First choose one of the K normal components, using the relative probability masses ωk as multinomial probabilities.
2. Given a component, draw a value θ from either the normal or the t density.

• Reasons not to use the normal approximation:

– When the mode of a parameter is near the edge of the parameter space (e.g., τ in the SAT example)
– When even a transformation of the parameter makes the normal approximation crazy
– Can do better than the normal approximation using more advanced methods


Simulation from posterior - Rejection sampling

• The idea is to draw values from p(θ|y), perhaps by making use of an instrumental or auxiliary distribution g(θ|y) from which it is easier to sample.

• The target density p(θ|y) need only be known up to the normalizing constant.

• Let q(θ|y) = p(θ)p(y; θ) be the un-normalized posterior distribution, so that

  p(θ|y) = q(θ|y) / ∫ p(θ)p(y; θ) dθ.

• Simpler notation:

1. Target density: f(x)
2. Instrumental density: g(x)
3. Constant M such that f(x) ≤ M g(x)

• The following algorithm produces a variable Y that is distributed according to f:

1. Generate X ∼ g and U ∼ U[0,1]
2. Set Y = X if U ≤ f(X)/(M g(X))
3. Reject the draw otherwise.
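• A small sketch in R, with a concrete pair chosen only for illustration: the target f is Beta(3, 2), the instrumental g is Uniform(0, 1), and M = 16/9 bounds f/g on [0, 1]:

f <- function(x) dbeta(x, 3, 2)          # target density
g <- function(x) dunif(x)                # instrumental density
M <- 16/9                                # max of f/g, attained at x = 2/3

rejection_draw <- function() {
  repeat {
    x <- runif(1)                        # step 1: X ~ g
    u <- runif(1)                        #         U ~ U[0,1]
    if (u <= f(x)/(M * g(x))) return(x)  # step 2: accept; otherwise loop (step 3)
  }
}
draws <- replicate(1000, rejection_draw())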

• Proof: The distribution function of Y is given by:

  P(Y ≤ y) = P( X ≤ y | U ≤ f(X)/(M g(X)) )

           = P( X ≤ y, U ≤ f(X)/(M g(X)) ) / P( U ≤ f(X)/(M g(X)) )

  Then

  P(Y ≤ y) = [ ∫_{-∞}^{y} ∫_{0}^{f(x)/(M g(x))} du g(x) dx ] / [ ∫_{-∞}^{∞} ∫_{0}^{f(x)/(M g(x))} du g(x) dx ]

           = [ M^{-1} ∫_{-∞}^{y} f(x) dx ] / [ M^{-1} ∫_{-∞}^{∞} f(x) dx ].

• Since the last expression equals ∫_{-∞}^{y} f(x) dx, we have proven the result.

• In the Bayesian context, q(θ|y) (the un-normalized posterior) plays the role of f(x) above.

• When both f and g are normalized densities:

1. The probability of accepting a draw is 1/M, and the expected number of draws until one is accepted is M.
2. For each f, there will be many instrumental densities g1, g2, .... Choose the g that requires the smallest bound M.
3. M is necessarily larger than 1, and will approach the minimum value 1 when g closely imitates f.

• In general, g needs to have thicker tails than f for f/g to remain bounded for all x.

• Cannot use a normal g to generate values from a Cauchy f.

• Can do the opposite, however.

• Rejection sampling can be used within other algorithms, such as the Gibbs sampler (see later).


Rejection sampling - Getting M

• Finding the bound M can be a problem, but consider the following implementation approach, recalling that

  q(θ|y) = p(θ)p(y; θ).

– Let g(θ) = p(θ) and draw a value θ∗ from the prior.
– Draw u ∼ U(0, 1).
– Let M = p(y; θ̂), where θ̂ is the mode of p(y; θ).
– Accept the draw θ∗ if

  u ≤ q(θ∗|y)/(M p(θ∗)) = p(θ∗)p(y; θ∗) / (p(y; θ̂)p(θ∗)) = p(y; θ∗)/p(y; θ̂).

• Those θ from the prior that are likely under the likelihood are kept in the posterior sample.
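• A sketch of this scheme in R; rprior (a sampler from the prior), loglik (the log-likelihood), and theta.hat (the likelihood maximizer) are hypothetical user-supplied pieces, and the comparison is done on the log scale:

prior_rejection <- function(rprior, loglik, y, theta.hat, n.draws = 1000) {
  out <- numeric(0)
  while (length(out) < n.draws) {
    theta.star <- rprior(1)                # draw theta* from the prior g = p(theta)
    # accept if log(u) <= log p(y; theta*) - log p(y; theta.hat)
    if (log(runif(1)) <= loglik(theta.star, y) - loglik(theta.hat, y))
      out <- c(out, theta.star)
  }
  out
}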


Importance sampling

• Also known as SIR (Sampling Importance Resampling), the method is no more than a weighted bootstrap.

• Suppose we have a sample of draws (θ1, θ2, ..., θn) from the proposal distribution g(θ). Can we 'convert' it to a sample from p(θ|y)?

• For each θi, compute

  φi = q(θi|y)/g(θi)

  wi = φi / Σj φj.

• Draw θ∗ from the discrete distribution over {θ1, ..., θn} with weight wi on θi, without replacement.
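• A sketch of the weighted bootstrap in R; q_un (the un-normalized posterior), g_dens, and g_draw (the proposal density and a sampler from it) are assumed, vectorized user-supplied functions:

sir <- function(q_un, g_dens, g_draw, n = 10000, k = 1000) {
  theta <- g_draw(n)                              # draws from the proposal g
  phi <- q_un(theta)/g_dens(theta)                # phi_i = q(theta_i|y)/g(theta_i)
  w <- phi/sum(phi)                               # normalized weights w_i
  theta[sample(n, k, replace = FALSE, prob = w)]  # re-sample without replacement
}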


• The sample of re-sampled θ's is approximately a sample from p(θ|y).

• Proof: suppose that θ is univariate (for convenience). Then:

  Pr(θ∗ ≤ a) = Σ_{i=1}^{n} wi 1_{(-∞,a]}(θi)

             = [ n^{-1} Σi φi 1_{(-∞,a]}(θi) ] / [ n^{-1} Σi φi ]

             → E_g[ (q(θ|y)/g(θ)) 1_{(-∞,a]}(θ) ] / E_g[ q(θ|y)/g(θ) ]

             = [ ∫_{-∞}^{a} q(θ|y) dθ ] / [ ∫_{-∞}^{∞} q(θ|y) dθ ]

             = ∫_{-∞}^{a} p(θ|y) dθ.


• The size of the re-sample can be as large as desired.

• The more g resembles q, the smaller the sample size that is needed for the re-sample to approximate the target distribution p(θ|y) well.

• A consistent estimator of the normalizing constant is

  n^{-1} Σi φi.


Importance sampling in a different context

• Suppose that we wish to estimate E[h(θ)] for θ ∼ p(θ|y) (e.g., a posterior mean).

• If N values θi can be drawn from p(θ|y), we can compute the Monte Carlo estimate:

  E[h(θ)] ≈ (1/N) Σ h(θi).

• Sampling from p(θ|y) may be difficult. But note:

  E[h(θ)|y] = ∫ h(θ) [ p(θ|y)/g(θ) ] g(θ) dθ.


• Generate values from g(θ), and estimate

  E[h(θ)] ≈ (1/N) Σ h(θi) p(θi|y)/g(θi).

• The w(θi) = p(θi|y)/g(θi) are the same importance weights from earlier.

• Will not work if tails of g are short relative to tails of p.
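• A sketch in R, with g_draw, g_dens, p_dens, and h as assumed, vectorized user-supplied functions; the last line shows the self-normalized version, useful when p(θ|y) is known only up to a constant (it divides by the weight-based estimate of the normalizing constant noted above):

theta <- g_draw(10000)               # N draws from g
w <- p_dens(theta)/g_dens(theta)     # importance weights w(theta_i)
est <- mean(h(theta) * w)            # estimate of E[h(theta)|y]
est_sn <- sum(h(theta) * w)/sum(w)   # self-normalized estimate (un-normalized p)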


Importance sampling (cont’d)

• Importance sampling is most often used to improve the normal approximation to the posterior.

• Example: suppose that you have a normal (or t) approximation to the posterior and use it to draw a sample that you hope approximates a sample from p(θ|y).

• Importance sampling can be used to improve the sample:

1. Obtain a large sample of size L: (θ1, ..., θL) from the approximation g(θ).
2. Construct importance weights w(θl) = q(θl|y)/g(θl).
3. Sample k < L values from (θ1, ..., θL) with probability proportional to the weights, without replacement.


• Why without replacement? If the weights are approximately constant, we can re-sample with replacement. If some weights are large, the re-sample would repeatedly favor values of θ with large weights unless we sample without replacement.

• It is difficult to determine whether importance sampling draws approximate draws from the posterior.

• Diagnostic: monitor the weights and look for outliers.

• Note: draws from g(θ) can be reused for other p(θ|y)!

• Given draws (θ1, ..., θm) from g(θ), we can investigate sensitivity to the prior by re-computing importance weights. For posteriors p1(θ|y) and p2(θ|y), compute (for the same draws of θ)

  wj(θi) = pj(θi|y)/g(θi),  j = 1, 2.


Markov chain Monte Carlo

• Methods based on stochastic process theory, useful for approximating target posterior distributions.

• Iterative: must decide when convergence has happened

• Quite general, can be implemented where other methods fail

• Can be used even in high dimensional examples

• Three methods: Metropolis-Hastings, Metropolis and Gibbs sampler.

• The Metropolis algorithm and the Gibbs sampler are special cases of M-H.


Markov chains

• A process Xt in discrete time t = 0, 1, 2, ..., T whose conditional distribution satisfies

  p(Xt|X0, X1, ..., Xt−1) = p(Xt|Xt−1)

  is called a Markov chain.

• A Markov chain is irreducible if it is possible to reach all states from any other state:

  pn(j|i) > 0 and pm(i|j) > 0 for some m, n > 0,

  where pn(j|i) denotes the probability of getting to state j from state i in n steps.


• A Markov chain is periodic with period d if pn(i|i) = 0 unless n = kd for some integer k. If d = 2, the chain returns to i in cycles of 2k steps.

• If d = 1, the chain is aperiodic.

• Theorem: Finite-state, irreducible, aperiodic Markov chains have a limiting distribution: lim_{n→∞} pn(j|i) = π, with pn(j|i) the probability that we reach j from i after n steps or transitions.


Properties of Markov chains (cont’d)

• Ergodicity: If a MC is irreducible and aperiodic and has stationary distribution π, then we have ergodicity:

  an = (1/n) Σt a(Xt) → E{a(X)} as n → ∞.

• an is an ergodic average.

• Also, rate of convergence can be calculated and is geometric.


Numerical standard error

• Sequence {X1, X2, ..., Xn} is not iid.

• The asymptotic standard error of an is approximately

  sqrt( (σ²a/n) { 1 + 2 Σi ρi(a) } ),

  where ρi is the lag-i autocorrelation of the function a(Xt).

• The first term, σ²a/n, is the usual sampling variance under iid sampling.

• The second term is a 'penalty' for the fact that the sample is not iid; the bracketed factor is usually bigger than 1.

• Often, the standard error is computed assuming a finite number of lags.
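• For example, a sketch in R that truncates the autocorrelation sum at max.lag lags (the helper name mcmc_se is ours):

mcmc_se <- function(x, max.lag = 50) {
  n <- length(x)
  rho <- acf(x, lag.max = max.lag, plot = FALSE)$acf[-1]  # lags 1, ..., max.lag
  sqrt((var(x)/n) * (1 + 2*sum(rho)))                     # sqrt{(sigma_a^2/n)(1 + 2 sum rho_i)}
}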


Markov chain simulation

• Idea: Suppose that sampling from p(θ|y) is hard, but that we can generate (somehow) a Markov chain {θ(t), t ∈ T} with stationary distribution p(θ|y).

• The situation is different from the usual stochastic process case:

– Here we know the stationary distribution.
– We seek an algorithm to transition from θ(t) to θ(t+1) that will take us to the stationary distribution.

• Idea: start from some initial guess θ0 and let the chain run for n steps (n large), so that it reaches its stationary distribution.

• After convergence, all additional steps in the chain are draws from the stationary distribution p(θ|y).


• MCMC methods are all based on the same idea; the difference is just in how the transitions in the MC are created.

• In MCMC simulation, we generate at least one MC for each parameter in the model. Often, more than one (independent) chain for each parameter.


The Gibbs Sampler

• An iterative algorithm that produces Markov chains with joint stationary distribution p(θ|y) by cycling through all possible conditional posterior distributions.

• Example: suppose that θ = (θ1, θ2, θ3), and that the target distribution is p(θ|y). The steps in the Gibbs sampler are:

1. Start with a guess (θ1^(0), θ2^(0), θ3^(0))
2. Draw θ1^(1) from p(θ1 | θ2 = θ2^(0), θ3 = θ3^(0), y)
3. Draw θ2^(1) from p(θ2 | θ1 = θ1^(1), θ3 = θ3^(0), y)
4. Draw θ3^(1) from p(θ3 | θ1 = θ1^(1), θ2 = θ2^(1), y)

• The steps above complete one iteration of the GS.

• Repeat the steps above n times; after convergence (see later), the draws (θ^(n+1), θ^(n+2), ...) are a sample from the stationary distribution p(θ|y).


The Gibbs Sampler (cont’d)

• Baby example: θ = (θ1, θ2) is bivariate normal with mean y = (y1, y2), variances σ1² = σ2² = 1, and covariance ρ. Then:

  p(θ1|θ2, y) ∝ N(y1 + ρ(θ2 − y2), 1 − ρ²)

  and

  p(θ2|θ1, y) ∝ N(y2 + ρ(θ1 − y1), 1 − ρ²)

• In this case we would not need the GS; this is just for illustration.

• See figure: y1 = 0, y2 = 0, and ρ = 0.8.


Example: Gibbs sampling in the bivariate normal

• Just as an illustration, we consider the trivial problem of sampling from the posterior distribution of a bivariate normal mean vector.

• Suppose that we have a single observation vector (y1, y2) where

  (y1, y2)′ ∼ N( (θ1, θ2)′, ((1, ρ), (ρ, 1)) ).   (1)

• With a uniform prior on (θ1, θ2), the joint posterior distribution is

  (θ1, θ2)′ | y ∼ N( (y1, y2)′, ((1, ρ), (ρ, 1)) ).   (2)


• The conditional distributions are

  θ1|θ2, y ∼ N(y1 + ρ(θ2 − y2), 1 − ρ²)

  θ2|θ1, y ∼ N(y2 + ρ(θ1 − y1), 1 − ρ²)

• We generate four independent chains, starting from four different corners of the 2-dimensional parameter space.
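• A sketch of one such chain in R (defaults mirror the example: y = (0, 0) and ρ = 0.8; the starting corner is arbitrary):

gibbs_binorm <- function(n.iter, y = c(0, 0), rho = 0.8, start = c(-4, 4)) {
  draws <- matrix(NA, n.iter, 2)
  theta <- start
  s <- sqrt(1 - rho^2)                                     # conditional sd
  for (t in 1:n.iter) {
    theta[1] <- rnorm(1, y[1] + rho*(theta[2] - y[2]), s)  # theta1 | theta2, y
    theta[2] <- rnorm(1, y[2] + rho*(theta[1] - y[1]), s)  # theta2 | theta1, y
    draws[t, ] <- theta
  }
  draws
}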


After 50 iterations…..


After 1000 iterations, and eliminating the path lines….


The non-conjugate normal example

• Let y ∼ N(µ, σ2), with (µ, σ2) unknown.

• Consider the semi-conjugate prior:

  µ ∼ N(µ0, τ0²)

  σ² ∼ Inv-χ²(ν0, σ0²).

• The joint posterior distribution is

  p(µ, σ²|y) ∝ (σ²)^{-n/2} exp( -(1/(2σ²)) Σ(yi − µ)² )
             × exp( -(1/(2τ0²)) (µ − µ0)² )
             × (σ²)^{-(ν0/2+1)} exp( -ν0σ0²/(2σ²) ).


Non-conjugate example (cont’d)

• Derive full conditional distributions p(µ|σ2, y) and p(σ2|µ, y).

• For µ:

  p(µ|σ², y) ∝ exp{ -(1/2)[ (1/σ²) Σ(yi − µ)² + (1/τ0²)(µ − µ0)² ] }

  ∝ exp{ -(1/(2σ²τ0²))[ (nτ0² + σ²)µ² − 2(nτ0² ȳ + σ²µ0)µ ] }

  ∝ exp{ -((nτ0² + σ²)/(2σ²τ0²))[ µ² − 2((nτ0² ȳ + σ²µ0)/(nτ0² + σ²))µ ] }

  ∝ N(µn, τn²),


where

  µn = ( (n/σ²) ȳ + (1/τ0²) µ0 ) / ( n/σ² + 1/τ0² )   and   τn² = 1 / ( n/σ² + 1/τ0² ).


Non-conjugate normal example (cont’d)

• The full conditional for σ² is

  p(σ²|µ, y) ∝ (σ²)^{-((n+ν0)/2+1)} exp{ -(1/(2σ²))[ Σ(yi − µ)² + ν0σ0² ] }

  ∝ Inv-χ²(νn, σn²),

  where

  νn = ν0 + n,

  σn² = (nS² + ν0σ0²)/νn,

  and S² = n^{-1} Σ(yi − µ)².


Non-conjugate normal example (cont’d)

• For the non-conjugate normal model, Gibbs sampling consists of the following steps:

1. Start with a guess σ²(0) for σ².
2. Draw µ(1) from a normal with mean µn(σ²(0)) and variance τn²(σ²(0)), where

   µn(σ²(0)) = ( (n/σ²(0)) ȳ + (1/τ0²) µ0 ) / ( n/σ²(0) + 1/τ0² ),

   and

   τn²(σ²(0)) = 1 / ( n/σ²(0) + 1/τ0² ).


3. Next draw σ²(1) from Inv-χ²(νn, σn²(µ(1))), where

   σn²(µ(1)) = ( n S²(µ(1)) + ν0σ0² ) / νn,

   and

   S²(µ(1)) = n^{-1} Σ(yi − µ(1))².

4. Repeat many times.
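• A compact sketch of the whole sampler in R (the function name gibbs_seminorm is ours); the σ² draw uses the fact that if X ∼ χ²_ν then ν s²/X ∼ Inv-χ²(ν, s²):

gibbs_seminorm <- function(y, mu0, tau20, nu0, s20, n.iter = 1000) {
  n <- length(y); ybar <- mean(y)
  mu <- ybar; sigma2 <- var(y)                    # starting guesses
  out <- matrix(NA, n.iter, 2, dimnames = list(NULL, c("mu", "sigma2")))
  for (t in 1:n.iter) {
    tau2n <- 1/(n/sigma2 + 1/tau20)               # tau_n^2(sigma2)
    mun <- tau2n*(n*ybar/sigma2 + mu0/tau20)      # mu_n(sigma2)
    mu <- rnorm(1, mun, sqrt(tau2n))              # draw mu | sigma2, y
    s2n <- (nu0*s20 + sum((y - mu)^2))/(nu0 + n)  # sigma_n^2(mu)
    sigma2 <- (nu0 + n)*s2n/rchisq(1, nu0 + n)    # draw sigma2 ~ Inv-chi2(nu_n, sigma_n^2)
    out[t, ] <- c(mu, sigma2)
  }
  out
}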


A more interesting example

• Poisson counts, with a change point: consider a sample of size n of counts (y1, ..., yn) distributed as Poisson with some rate, and suppose that the rate changes, so that

  for i = 1, ..., m:  yi ∼ Poi(λ)

  for i = m + 1, ..., n:  yi ∼ Poi(φ),

  and m is unknown.

• Priors on (λ, φ,m):

λ ∼ Ga(α, β)

φ ∼ Ga(γ, δ)

m ∼ U[1,n]


• Joint posterior distribution:

  p(λ, φ, m|y) ∝ [ ∏_{i=1}^{m} e^{-λ} λ^{yi} ] [ ∏_{i=m+1}^{n} e^{-φ} φ^{yi} ] λ^{α-1} e^{-λβ} φ^{γ-1} e^{-φδ} n^{-1}

  ∝ λ^{y1*+α-1} e^{-λ(m+β)} φ^{y2*+γ-1} e^{-φ(n-m+δ)}

  with y1* = Σ_{i=1}^{m} yi and y2* = Σ_{i=m+1}^{n} yi.

• Note: if we knew m, the problem would be trivial to solve. Thus Gibbs, where we condition on all other parameters to construct the Markov chains, appears to be the right approach.

• To implement Gibbs sampling, we need the full conditional distributions. Pick and choose the pieces of the joint posterior that depend on each parameter.


• Conditional of λ:

  p(λ|m, φ, y) ∝ λ^{Σ_{i=1}^{m} yi + α - 1} exp(-λ(m + β))

  ∝ Ga(α + Σ_{i=1}^{m} yi, m + β)

• Conditional of φ:

  p(φ|λ, m, y) ∝ φ^{Σ_{i=m+1}^{n} yi + γ - 1} exp(-φ(n - m + δ))

  ∝ Ga(γ + Σ_{i=m+1}^{n} yi, n - m + δ)

• Conditional of m = 1, 2, ..., n:

  p(m|λ, φ, y) = c^{-1} q(m|λ, φ, y)

  = c^{-1} λ^{y1*+α-1} exp(-λ(m + β)) φ^{y2*+γ-1} exp(-φ(n - m + δ)).

• Note that all terms in the joint posterior depend on m, so the conditional of m is proportional to the joint posterior.

• The distribution does not look like any standard form, so we cannot sample directly. We need the normalizing constant, which is easy to obtain for relatively small n in this discrete case:

  c = Σ_{k=1}^{n} q(k|λ, φ, y).

• To sample from p(m|λ, φ, y), we can use the inverse cdf on a grid, or other methods.
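• A sketch in R of the draw of m on the grid 1, ..., n; with the uniform prior on m, q(m|λ, φ, y) is just the two-piece Poisson likelihood, and the computation is done on the log scale for numerical stability:

draw_m <- function(y, lambda, phi) {
  n <- length(y)
  logq <- sapply(1:n, function(m) {
    sum(dpois(y[1:m], lambda, log = TRUE)) +
      (if (m < n) sum(dpois(y[(m+1):n], phi, log = TRUE)) else 0)
  })
  p <- exp(logq - max(logq))       # un-normalized q(k|lambda, phi, y) on the grid
  sample(1:n, 1, prob = p/sum(p))  # normalize by c and draw m
}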


Non-standard distributions

• It may happen that one or more of the full conditionals is not a standard distribution.

• What to do then?

– Try direct simulation: grid approximation, rejection sampling
– Try approximation: normal or t approximation; need the mode at each iteration (see later)
– Try more general Markov chain algorithms: Metropolis or Metropolis-Hastings.


Metropolis-Hastings algorithm

• More flexible transition kernel: rather than requiring sampling from conditional distributions, M-H permits using many other "proposal" densities.

• Idea: instead of drawing sequentially from conditionals as in Gibbs, M-H "jumps" around the parameter space.

• The algorithm is the following:

1. Given a draw θt in iteration t, sample a candidate draw θ∗ from a proposal distribution J(θ∗|θ)
2. Accept the draw with probability

   r = [ p(θ∗|y)/J(θ∗|θ) ] / [ p(θ|y)/J(θ|θ∗) ].


3. Otherwise stay in place (do not accept the draw), i.e., θ(t+1) = θ(t).

• Remarkably, the proposal distribution (the text calls it the jump distribution) can have just about any form.

• When the proposal distribution is symmetric, i.e.

J(θ∗|θ) = J(θ|θ∗),

the acceptance ratio simplifies to

r = p(θ∗|y) / p(θ|y).

This is the Metropolis algorithm.
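A generic Metropolis sketch in R with a symmetric normal proposal, assuming the user supplies an unnormalized log posterior log_post (all names are illustrative):

# Random-walk Metropolis: V is the proposal variance (scalar or vector)
metropolis <- function(log_post, theta0, V, n_iter = 5000) {
  d <- length(theta0)
  draws <- matrix(NA, n_iter, d)
  theta <- theta0
  lp <- log_post(theta)
  accept <- 0
  for (t in 1:n_iter) {
    cand <- theta + rnorm(d, 0, sqrt(V))     # symmetric proposal
    lp_cand <- log_post(cand)
    if (log(runif(1)) < lp_cand - lp) {      # accept with prob min(1, r)
      theta <- cand; lp <- lp_cand; accept <- accept + 1
    }
    draws[t, ] <- theta
  }
  message("acceptance rate: ", round(accept / n_iter, 2))
  draws
}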


Proposal distributions

• Convergence does not depend on J , but rate of convergence does.

• Optimal J is p(θ|y) in which case r = 1.

• Else, how do we choose J?

1. It is easy to get samples from J
2. It is easy to compute r
3. It leads to rapid convergence and mixing: jumps should be large enough to take us everywhere in the parameter space, but not so large that draws are rarely accepted (see figure from Gilks et al. (1995)).

• Three main approaches: random walk M-H (most popular), independence sampler, and approximation M-H.

Independence sampler

• Proposal distribution Jt does not depend on θt.

• Just find a distribution g(θ) and generate values from it

• Can work very well if g(·) is a good approximation to p(θ|y) and g has heavier tails than p.

• Can work awfully badly otherwise.

• Acceptance probability is

r = [ p(θ∗|y) / g(θ∗) ] / [ p(θ|y) / g(θ) ]

• With large samples (central limit theorem operating), the proposal could be a normal distribution centered at the mode of p(θ|y), with variance larger than the inverse of the Fisher information evaluated at the mode.

Random walk M-H

• Most popular, easy to use.

• Idea: generate candidate using random walk.

• Proposal is often normal centered at current draw:

J(θ|θ(t−1)) = N(θ|θ(t−1), V )

• Symmetric: think of drawing values θ∗ − θ(t−1) from a N(0, V). Thus, r simplifies.

• Difficult to choose V :

1. V too small: takes long to explore parameter space


2. V too large: jumps to extremes are less likely to be accepted; the chain stays in the same place too long

3. Ideal V: the posterior variance. Unknown, so one might do some trial-and-error runs

• Optimal acceptance rates (from some theory results) are between 25% and 50% for this type of proposal distribution, and get lower in higher-dimensional problems.

Approximation M-H

• Idea is to improve the approximation of J to p(θ|y) as we learn more about θ.

• E.g., in random walk M-H, can perhaps increase the acceptance rate by considering

J(θ|θ(t−1)) = N(θ|θ(t−1), Vθ(t−1))

where the variance also depends on the current draw.

• Proposals here are typically not symmetric, which requires the full r expression.

Starting values

• If chain is irreducible, choice of θ0 will not affect convergence.

• With multiple chains (see later), can choose over-dispersed starting values for each chain. Possible algorithm:

1. Find the posterior modes
2. Create an over-dispersed approximation to the posterior at the mode (e.g., t4)
3. Sample starting values from that distribution

• Not much research on this topic.

How many chains?

• Even after convergence, draws from the stationary distribution are correlated: the sample is not i.i.d.

• An i.i.d. sample of size n can be obtained by keeping only the last draw from each of n independent chains. Too inefficient.

• Compromise: if the autocorrelation at lag k and larger is negligible, then one can generate an almost i.i.d. sample by keeping only every kth draw in the chain after convergence. Need a very long chain.

• To check convergence (see later), multiple chains can be generated independently (in parallel).

• Burn-in: iterations required to reach the stationary distribution (approximately). Not used for inference.

Convergence

• Impossible to decide whether a chain has converged; can only monitor its behavior.

• Easiest approach: graphical displays (trace plots in WinBUGS).

• A bit more formal (and most popular in terms of use): the √R̂ statistic of Gelman and Rubin (1992).

• To use the G-R diagnostic, must generate multiple independent chains for each parameter.

• The G-R diagnostic compares the within-chain sd to the between-chain sd.

• Before convergence, the within-chain sd underestimates sd(θ|y), and the between-chain sd overestimates it.

• If the chains converge, all draws are from the stationary distribution, so within-chain and between-chain variances should be similar.

• Consider a scalar parameter η, and let ηij be the ith draw in the jth chain, with i = 1, ..., n and j = 1, ..., J.

• Between-chain variance:

B = (n/(J − 1)) Σj (η̄j − η̄..)²,  where η̄j = (1/n) Σi ηij and η̄.. = (1/J) Σj η̄j

• Within-chain variance:

W = (1/(J(n − 1))) Σj Σi (ηij − η̄j)².

• An unbiased estimate of var(η|y) is the weighted average

var̂(η|y) = ((n − 1)/n) W + (1/n) B

• Early in the iterations, var̂(η|y) overestimates the true posterior variance.

• The diagnostic measures the potential reduction in scale if iterations are continued:

√R̂ = √( var̂(η|y) / W ),

which goes to 1 as n → ∞.
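In R, the diagnostic can be computed directly from these formulas (a sketch; draws is assumed to be an n × J matrix of post-burn-in draws, one column per chain):

# Gelman-Rubin potential scale reduction
psrf <- function(draws) {
  n <- nrow(draws); J <- ncol(draws)
  B <- n * var(colMeans(draws))        # n/(J-1) * sum((chain mean - grand mean)^2)
  W <- mean(apply(draws, 2, var))      # average within-chain sample variance
  var_hat <- (n - 1) / n * W + B / n
  sqrt(var_hat / W)
}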


Hierarchical models - Introduction

• Consider the following study:

– J counties are sampled from the population of 99 counties in the state of Iowa.

– Within each county, nj farms are selected to participate in a fertilizer trial.

– Outcome yij corresponds to the ith farm in the jth county, and is modeled as p(yij|θj).

– We are interested in estimating the θj.

• Two obvious approaches:

1. Conduct J separate analyses and obtain p(θj|yj): all θ's are independent
2. Get one posterior p(θ|y1, y2, ..., yJ): the point estimate is the same for all θ's.

• Neither approach is appealing.

• Hierarchical approach: view the θ's as a sample from a common population distribution indexed by a parameter φ.

• Observed data yij can be used to draw inferences about the θ's even though the θj are not observed.

• A simple hierarchical model:

p(y, θ, φ) = p(y|θ)p(θ|φ)p(φ).

p(y|θ) = Usual sampling distribution

p(θ|φ) = Population dist. for θ or prior

p(φ) = Hyperprior


Hierarchical models - Rat tumor example

• Imagine a single toxicity experiment performed on rats.

• θ is probability that a rat receiving no treatment develops a tumor.

• Data: n = 14, and y = 4 develop a tumor.

• From the sample: tumor rate is 4/14 = 0.286.

• Simple analysis:

y|θ ∼ Bin(n, θ)

θ|α, β ∼ Beta(α, β)


• Posterior for θ is

p(θ|y) = Beta(α + 4, β + 10)

• Where can we get good “guesses” for α and β?

• One possibility: from the literature, look at data from similar experiments. In this example, information from 70 other experiments is available.

• In the jth study, yj is the number of rats with tumors and nj is the sample size, j = 1, ..., 70.

• See Table 5.1 and Figure 5.1 in Gelman et al.

• Model the yj as independent binomial data given the study-specific θj and sample size nj.

From Gelman et al.: hierarchical representation of tumor experiments


• The representation of the hierarchical model in Figure 5.1 assumes that the Beta(α, β) is a good population distribution for the θj.

• Empirical Bayes as a first step to estimating the mean tumor rate in experiment number 71:

1. Get point estimates of (α, β) from the earlier 70 experiments as follows:
2. The mean tumor rate is r̄ = (1/70) Σ(j=1..70) yj/nj, and the standard deviation is [(1/69) Σj (yj/nj − r̄)²]^(1/2). The values are 0.136 and 0.103, respectively.

• Using the method of moments:

α/(α + β) = 0.136  −→  α + β = α/0.136

αβ/[(α + β)²(α + β + 1)] = 0.103²

• The resulting estimate for (α, β) is (1.4, 8.6).
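A quick R check of these moment calculations (a sketch; rates would hold the 70 observed values yj/nj):

# Method-of-moments estimates of (alpha, beta) for a Beta population
beta_mom <- function(rates) {
  m <- mean(rates); v <- var(rates)
  ab <- m * (1 - m) / v - 1            # solves the variance equation for alpha+beta
  c(alpha = m * ab, beta = (1 - m) * ab)
}
# beta_mom(rates) returns approximately (1.4, 8.6) for the rat tumor data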


• Then the posterior is

p(θ|y) = Beta(5.4, 18.6),

with posterior mean 0.223, lower than the sample mean, and posterior standard deviation 0.083.

• The posterior point estimate is lower than the crude estimate; this indicates that in the current experiment, the number of tumors was unusually high.

• Why not go back now and use the prior to obtain better estimates for the tumor rates in the earlier 70 experiments?

• Should not do that because:

1. Can't use the data twice: we used the historical data to get the prior, and cannot now combine that prior with the data from the same experiments for inference
2. Using a point estimate for (α, β) suggests that there is no uncertainty about (α, β): not true

3. If the prior Beta(α, β) is the appropriate prior, shouldn't we know the values of the parameters prior to observing any data?

• Approach: model the tumor rates θj hierarchically, with a hyperprior on the parameters (α, β).

• Can still use all the data to estimate the hyperparameters.

• Idea: Bayesian analysis on the joint distribution of all parameters, (θ1, ..., θ71, α, β|y).

Hierarchical models - Exchangeability

• J experiments. In experiment j, yj ∼ p(yj|θj).

• To create joint probability model, use idea of exchangeability

• Recall: a set of random variables (z1, ..., zK) is exchangeable if their joint distribution p(z1, ..., zK) is invariant to permutations of their labels.

• In our set-up, we have two opportunities for assuming exchangeability:

1. At the level of the data: conditional on θj, the yij's are exchangeable if we cannot "distinguish" between them
2. At the level of the parameters: unless something (other than the data) distinguishes the θ's, we assume that they are exchangeable as well.


• In the rat tumor example, we have no information to distinguish between the 71 experiments, except sample size. Since nj is presumably not related to θj, we choose an exchangeable model for the θj.

• Simplest exchangeable distribution for the θj:

p(θ|φ) = Π(j=1..J) p(θj|φ).

• The hyperparameter φ is typically unknown, so the marginal for θ must be obtained by averaging:

p(θ) = ∫ Π(j=1..J) p(θj|φ) p(φ) dφ.

• The exchangeable distribution for (θ1, ..., θJ) is written in the form of a mixture distribution.

• The mixture model characterizes the parameters (θ1, ..., θJ) as independent draws from a superpopulation that is determined by unknown parameters φ.

• Exchangeability does not hold when we have additional information on covariates xj to distinguish between the θj.

• Can still model exchangeability with covariates:

p(θ1, ..., θJ|x1, ..., xJ) = ∫ Π(j=1..J) p(θj|xj, φ) p(φ|x) dφ,

with x = (x1, ..., xJ). In the Iowa example, we might know that different counties have very different soil quality.

• Exchangeability does not imply that the θj are all the same; just that they can be assumed to be draws from some common superpopulation distribution.

Hierarchical models - Bayesian treatment

• Since φ is unknown, posterior is now p(θ, φ|y).

• Model formulation:

p(φ, θ) = p(φ)p(θ|φ) = joint prior

p(φ, θ|y) ∝ p(φ, θ)p(y|φ, θ) = p(φ, θ)p(y|θ)

• The hyperparameter φ gets its own prior distribution.

• Important to check whether the posterior is proper when using improper priors in hierarchical models!

Posterior predictive distributions

• There are two posterior predictive distributions potentially of interest:

1. Distribution of future observations y corresponding to an existing θj. Draw y from the posterior predictive given existing draws for θj.

2. Distribution of new observations y corresponding to new θj's drawn from the same superpopulation. First draw φ from its posterior, then draw θ for a new experiment, and then draw y from the posterior predictive given the simulated θ.

• In rat tumor example:

1. More rats from experiment #71, for example
2. Experiment #72, and then rats from experiment #72

Hierarchical models - Computation

• Harder than before because we have more parameters

• Easiest when the population distribution p(θ|φ) is conjugate to the likelihood p(y|θ).

• In non-conjugate models, must use more advanced computation.

• Usual steps:

1. Write p(φ, θ|y) ∝ p(φ)p(θ|φ)p(y|θ)
2. Analytically determine p(θ|y, φ). Easy for conjugate models.
3. Derive the marginal p(φ|y) by

p(φ|y) = ∫ p(φ, θ|y) dθ,

or, if convenient, by p(φ|y) = p(φ, θ|y) / p(θ|y, φ).

• Normalizing “constant” in denominator may depend on φ as well as y.

• Steps for computation in hierarchical models are:

1. Draw the vector φ from p(φ|y). If φ is low-dimensional, can use the inverse cdf method as before; else, need more advanced methods.
2. Draw θ from the conditional p(θ|φ, y). If the θj are conditionally independent, p(θ|φ, y) factors as

p(θ|φ, y) = Πj p(θj|φ, y),

so the components of θ can be drawn one at a time.
3. Draw y from the appropriate posterior predictive distribution.

Rat tumor example

• Sampling distribution for data from experiments j = 1, ..., 71:

yj ∼ Bin(nj, θj)

• Tumor rates θj assumed to be independent draws from Beta:

θj ∼ Beta(α, β)

• Choose a non-informative prior for (α, β) to indicate prior ignorance.

• Since the hyperprior will be non-informative, and perhaps improper, must check the integrability of the posterior.

• Defer choice of p(α, β) until a bit later.

• Joint posterior distribution:

p(θ, α, β|y) ∝ p(α, β) p(θ|α, β) p(y|θ, α, β)

∝ p(α, β) Πj [ Γ(α + β) / (Γ(α)Γ(β)) ] θj^(α−1) (1 − θj)^(β−1) Πj θj^(yj) (1 − θj)^(nj−yj).

• Conditional of θ: notice that given (α, β), the θj are independent, with Beta distributions:

p(θj|α, β, y) = [ Γ(α + β + nj) / (Γ(α + yj)Γ(β + nj − yj)) ] θj^(α+yj−1) (1 − θj)^(β+nj−yj−1).

• The marginal posterior distribution of the hyperparameters is obtained using

p(α, β|y) = p(α, β, θ|y) / p(θ|y, α, β)

• Substituting into the expression above, we get

p(α, β|y) ∝ p(α, β) Πj [ Γ(α + β) / (Γ(α)Γ(β)) ] [ Γ(α + yj)Γ(β + nj − yj) / Γ(α + β + nj) ]

• Not a standard distribution, but easy to evaluate, and only in two dimensions.

• Can evaluate p(α, β|y) over a grid of values of (α, β), and then use the inverse cdf method to draw α from its marginal and β from the conditional.

• But what is p(α, β)? As it turns out, many of the 'obvious' choices for the prior lead to an improper posterior. (See the solution to Exercise 5.7, on the course web site.)

• Some obvious choices, such as p(α, β) ∝ 1, lead to a non-integrable posterior.

• Integrability can be checked analytically by evaluating the behavior of the posterior as α, β (or functions of them) go to ∞.

• An empirical assessment can be made by looking at the contour plot of p(α, β|y) over a grid. Significant mass extending towards infinity suggests that the posterior will not integrate to a constant.

• Other choices leading to non-integrability of p(α, β|y) include:


– A flat prior on the prior guess and "degrees of freedom":

p( α/(α + β), α + β ) ∝ 1.

– A flat prior on the log(mean) and log(degrees of freedom):

p( log(α/β), log(α + β) ) ∝ 1.

• A reasonable choice for the prior is a flat distribution on the prior mean and the square root of the inverse of the degrees of freedom:

p( α/(α + β), (α + β)^(−1/2) ) ∝ 1.

• This is equivalent to

p(α, β) ∝ (α + β)^(−5/2).

• Also equivalent to

p( log(α/β), log(α + β) ) ∝ αβ(α + β)^(−5/2).

• We use the parameterization (log(α/β), log(α + β)).

• The idea for drawing values of α and β from their posterior is the usual one: evaluate p(log(α/β), log(α + β)|y) over a grid of values, and then use the inverse cdf method.

• For this problem, it is easier to evaluate the log posterior and then exponentiate.

• Grid: from the earlier estimates, potential centers for the grid are α = 1.4, β = 8.6. In the new parameterization, this translates into u = log(α/β) = −1.8, v = log(α + β) = 2.3.

• A grid that is too narrow leaves mass outside. Try [−2.3, −1.3] × [1, 5].

• Steps for computation:

1. Given draws (u, v), transform back to (α, β).
2. Finally, sample θj from Beta(α + yj, β + nj − yj).


From Gelman et al.: Contours of joint posterior distribution of alpha and beta in reparametrized scale

Page 241: Bayesian Methods for Data Analysis - ISU Public Homepage Server

Posterior means and 95% credible sets for θi


Normal hierarchical model

• What we know as one-way random effects models

• Set-up: J independent experiments; in each we want to estimate a mean θj.

• Sampling model:

yij|θj, σ² ∼ N(θj, σ²),

for i = 1, ..., nj and j = 1, ..., J.

• Assume for now that σ² is known.

• If y.j = (1/nj) Σi yij, then

y.j|θj ∼ N(θj, σj²)

with σj² = σ²/nj.

• The sampling model for y.j is quite general: for nj large, y.j is normal even if yij is not.

• Need now to think of priors for the θj.

• What type of posterior estimates for θj might be reasonable?

1. θ̂j = y.j: reasonable if nj is large

2. θ̂j = y.. = (Σj σj⁻²)⁻¹ Σj σj⁻² y.j: the pooled estimate, reasonable if we believe that all means are the same.

• To decide between those two choices, do an F-test for differences between groups.

• ANOVA approach (for nj = n):


Source           df        SS    MS             E(MS)
Between groups   J − 1     SSB   SSB/(J − 1)    nτ² + σ²
Within groups    J(n − 1)  SSE   SSE/(J(n−1))   σ²

where τ², the variance of the group means, can be estimated as

τ̂² = (MSB − MSE) / n

• If MSB >>> MSE then τ̂² > 0 and the F-statistic is significant: do not pool, and use θ̂j = y.j.

• Else, the F-test cannot reject H0: τ² = 0, and we must pool.

• Alternative: why not a more general estimator,

θ̂j = λj y.j + (1 − λj) y..

for λj ∈ [0, 1]. Factor (1− λj) is a shrinkage factor.

• All three estimates have a Bayesian justification:

1. θ̂j = y.j is the posterior mean of θj if the sample means are normal and p(θj) ∝ 1
2. θ̂j = y.. is the posterior mean if θ1 = ... = θJ and p(θ) ∝ 1
3. θ̂j = λj y.j + (1 − λj) y.. is the posterior mean if θj ∼ N(µ, τ²) independent of the other θ's and the sampling distribution for y.j is normal

• Latter is called the Normal-Normal model.


Normal-normal model

• Set-up (for σ² known):

y.j|θj ∼ N(θj, σj²)

θ1, ..., θJ |µ, τ ∼ Πj N(µ, τ²)

p(µ, τ²)

• It follows that

p(θ1, ..., θJ) = ∫ Πj N(θj; µ, τ²) p(µ, τ²) dµ dτ²

• The joint prior can be written as p(µ, τ) = p(µ|τ)p(τ). For µ, we will consider a conditional flat prior, so that

p(µ, τ) ∝ p(τ)

• Joint posterior:

p(θ, µ, τ|y) ∝ p(µ, τ) p(θ|µ, τ) p(y|θ)

∝ p(µ, τ) Πj N(θj; µ, τ²) Πj N(y.j; θj, σj²)

Conditional distribution of group means

• For p(µ|τ) ∝ 1, the conditional distributions of the θj given (µ, τ, y.j) are independent, with

p(θj|µ, τ, y.j) = N(θ̂j, Vj)

with

θ̂j = (σj² µ + τ² y.j) / (σj² + τ²),   Vj⁻¹ = 1/σj² + 1/τ²

Marginal distribution of hyperparameters

• To get p(µ, τ|y) we would typically either integrate p(µ, τ, θ|y) with respect to θ1, ..., θJ, or use the algebraic approach

p(µ, τ|y) = p(µ, τ, θ|y) / p(θ|µ, τ, y)

• In the normal-normal model, the data provide information about µ, τ, as shown below.

• Consider the marginal posterior:

p(µ, τ |y) ∝ p(µ, τ)p(y|µ, τ)


where p(y|µ, τ) is the marginal likelihood:

p(y|µ, τ) = ∫ p(y|θ, µ, τ) p(θ|µ, τ) dθ

= ∫ Πj N(y.j; θj, σj²) Πj N(θj; µ, τ²) dθ1 · · · dθJ

• The integrand above is a product of exponentials of quadratic forms in y.j and θj, so (y.j, θj) are jointly normal.

• Then the y.j|µ, τ are also normal, with mean and variance:

E(y.j|µ, τ) = E[E(y.j|θj, µ, τ)|µ, τ] = E(θj|µ, τ) = µ

var(y.j|µ, τ) = E[var(y.j|θj, µ, τ)|µ, τ] + var[E(y.j|θj, µ, τ)|µ, τ]

= E(σj²|µ, τ) + var(θj|µ, τ) = σj² + τ²

• Therefore

p(µ, τ|y) ∝ p(τ) Πj N(y.j; µ, σj² + τ²)

• We know that

p(µ, τ|y) ∝ p(τ) p(µ|τ) × Πj (σj² + τ²)^(−1/2) exp[ −(y.j − µ)² / (2(σj² + τ²)) ]

• Now fix τ, and think of p(y.j|µ) as the "likelihood" in a problem in which µ is the only parameter.

• Recall that p(µ|τ) ∝ 1.

• From earlier, we know that p(µ|y, τ) = N(µ̂, Vµ), where

µ̂ = [ Σj y.j/(σj² + τ²) ] / [ Σj 1/(σj² + τ²) ],   Vµ⁻¹ = Σj 1/(σj² + τ²)

• To get p(τ|y) we use the old trick:

p(τ|y) = p(µ, τ|y) / p(µ|τ, y)

∝ p(τ) Πj N(y.j; µ, σj² + τ²) / N(µ; µ̂, Vµ)

• The expression holds for any µ, so set µ = µ̂. The denominator is then just Vµ^(−1/2).

• Then

p(τ|y) ∝ p(τ) Vµ^(1/2) Πj N(y.j; µ̂, σj² + τ²).

• What do we use for p(τ)?

• A safe choice is always a proper prior. For example, consider

p(τ²) = Inv−χ²(ν0, τ0²),

where

– τ0² can be a "best guess" for τ²
– ν0 can be small to reflect prior uncertainty.

• Non-informative (and improper) choice:

– Beware: an improper prior in a hierarchical model can easily lead to a non-integrable posterior.
– In the normal-normal model, the "natural" non-informative p(τ) ∝ τ⁻¹, or equivalently p(log τ) ∝ 1, results in an improper posterior.
– p(τ) ∝ 1 leads to a proper posterior p(τ|y).

Normal-normal model - Computation

1. Evaluate the one-dimensional p(τ |y) on a grid.

2. Sample τ from p(τ |y) using inverse cdf method.

3. Sample µ from N(µ̂, Vµ)

4. Sample θj from N(θ̂j, Vj)
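The four steps in a compact R sketch (illustrative names: ybar holds the group means y.j, s2 their known variances σj², and tau_grid a grid of positive τ values):

nn_draws <- function(ybar, s2, tau_grid, n_draws = 1000) {
  # step 1: log p(tau|y) on the grid, up to a constant (flat p(tau))
  lp <- sapply(tau_grid, function(tau) {
    v   <- s2 + tau^2
    Vmu <- 1 / sum(1 / v)
    muh <- Vmu * sum(ybar / v)
    0.5 * log(Vmu) + sum(dnorm(ybar, muh, sqrt(v), log = TRUE))
  })
  p_tau <- exp(lp - max(lp)); p_tau <- p_tau / sum(p_tau)
  out <- replicate(n_draws, {
    tau <- sample(tau_grid, 1, prob = p_tau)   # step 2: draw tau on the grid
    v   <- s2 + tau^2
    Vmu <- 1 / sum(1 / v)
    mu  <- rnorm(1, Vmu * sum(ybar / v), sqrt(Vmu))   # step 3
    Vj  <- 1 / (1 / s2 + 1 / tau^2)
    theta <- rnorm(length(ybar), Vj * (ybar / s2 + mu / tau^2), sqrt(Vj))  # step 4
    c(tau = tau, mu = mu, theta)
  })
  t(out)   # one row per draw: (tau, mu, theta_1, ..., theta_J)
}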


Normal-normal model - Prediction

• Predicting future data y from the current experiments with means θ = (θ1, ..., θJ):

1. Obtain draws of (τ, µ, θ1, ..., θJ)
2. Draw y from N(θj, σj²)

• Predicting future data y from a future experiment with mean θ and sample size n:

1. Draw µ, τ from their posterior
2. Draw θ from the population distribution p(θ|µ, τ) (also known as the prior for θ)
3. Draw y from N(θ, σ̃²), where σ̃² = σ²/n

Example: effect of diet on coagulation times

• Normal hierarchical model for coagulation times of 24 animals randomized to four diets. Model:

– Observations: yij with i = 1, ..., nj, j = 1, ..., J.
– Given θ, coagulation times yij are exchangeable
– Treatment means are normal N(µ, τ²), and the variance σ² is constant across treatments
– Prior: p(µ, log σ, log τ) ∝ τ. A uniform prior on log τ leads to an improper posterior.

p(θ, µ, log σ, log τ|y) ∝ τ Πj N(θj|µ, τ²) Πj Πi N(yij|θj, σ²)

• Crude initial estimates: θ̂j = (1/nj) Σi yij = y.j, µ̂ = (1/J) Σj y.j = y.., σ̂j² = (1/(nj − 1)) Σi (yij − y.j)², σ̂² = (1/J) Σj σ̂j², and τ̂² = (1/(J − 1)) Σj (y.j − y..)².

Data

The following table contains coagulation times in seconds for blood drawn from 24 animals randomly allocated to four different diets. Different treatments have different numbers of observations because the randomization was unrestricted.

Diet  Measurements
A     62, 60, 63, 59
B     63, 67, 71, 64, 65, 66
C     68, 66, 71, 67, 68, 68
D     56, 62, 60, 61, 63, 64, 63, 59

Conditional maximization for joint mode

• Conjugacy makes conditional maximization easy.

• Conditional modes of the treatment means are

θj|µ, σ, τ, y ∼ N(θ̂j, Vj)

with

θ̂j = (µ/τ² + nj y.j/σ²) / (1/τ² + nj/σ²),   Vj⁻¹ = 1/τ² + nj/σ².

• For j = 1, ..., J, maximize the conditional posteriors by using θ̂j in place of the current estimate of θj.

• Conditional mode of µ:

µ|θ, σ, τ, y ∼ N(µ̂, τ²/J),   µ̂ = (1/J) Σj θj.

• Conditional maximization: replace the current estimate of µ with µ̂.

Conditional maximization

• Conditional mode of log σ: first derive the conditional posterior for σ²:

σ²|θ, µ, τ, y ∼ Inv−χ²(n, σ̂²),

with σ̂² = (1/n) Σj Σi (yij − θj)².

• The mode of the Inv−χ² is σ̂² n/(n + 2). To get the mode for log σ, use the transformation; the term n/(n + 2) disappears with the Jacobian, so the conditional mode of log σ is log σ̂.

• Conditional mode of log τ: same reasoning. Note that

τ²|θ, µ, σ², y ∼ Inv−χ²(J − 1, τ̂²),

with τ̂² = (1/(J − 1)) Σj (θj − µ)². After accounting for the Jacobian of the transformation, the conditional mode of log τ is log τ̂.

Conditional maximization

• Starting from the crude estimates, conditional maximization required only three iterations to converge approximately (see table).

• The log posterior increased at each step.

• The values in the final iteration give the approximate joint mode.

• When J is large relative to the nj, the joint mode may not provide a good summary of the posterior. Try to get marginal modes.

• In this problem, factor:

p(θ, µ, log σ, log τ |y) = p(θ|µ, log σ, log τ, y)p(µ, log σ, log τ |y).

• Marginal of µ, log σ, log τ is three-dimensional regardless of J and nj.


Conditional maximization results

Table shows iterations for conditional maximization.

                    Stepwise ascent
Parameter          Crude     First      Second     Third      Fourth
                   estimate  iteration  iteration  iteration  iteration
θ1                 61.000    61.282     61.288     61.290     61.290
θ2                 66.000    65.871     65.869     65.868     65.868
θ3                 68.000    67.742     67.737     67.736     67.736
θ4                 61.000    61.148     61.152     61.152     61.152
µ                  64.000    64.010     64.011     64.011     64.011
σ                  2.291     2.160      2.160      2.160      2.160
τ                  3.559     3.318      3.318      3.312      3.312
log p(params.|y)   −61.604   −61.420    −61.420    −61.420    −61.420

Marginal maximization

• Recall the algebraic trick:

p(µ, log σ, log τ|y) = p(θ, µ, log σ, log τ|y) / p(θ|µ, log σ, log τ, y)

∝ τ Πj N(θj|µ, τ²) Πj Πi N(yij|θj, σ²) / Πj N(θj|θ̂j, Vj)

• Using θ̂ in place of θ:

p(µ, log σ, log τ|y) ∝ τ Πj N(θ̂j|µ, τ²) Πj Πi N(yij|θ̂j, σ²) Πj Vj^(1/2),

with θ̂j and Vj as earlier.

• Can maximize p(µ, log σ, log τ|y) using EM. Here, the (θj) are the "missing data".

• Steps in EM:

1. Average over the missing data θ in the E-step
2. Maximize over (µ, log σ, log τ) in the M-step

EM algorithm for marginal mode

Log joint posterior:

log p(θ, µ, log σ, log τ|y) ∝ −n log σ − (J − 1) log τ − (1/(2τ²)) Σj (θj − µ)² − (1/(2σ²)) Σj Σi (yij − θj)²

• E-step: average over θ using the conditioning trick (and p(θ|rest)). Need two expectations:

E_old[(θj − µ)²] = E[(θj − µ)²|µ_old, σ_old, τ_old, y]

= [E_old(θj) − µ]² + var_old(θj) = (θ̂j − µ)² + Vj

Similarly:

E_old[(yij − θj)²] = (yij − θ̂j)² + Vj.

EM algorithm for marginal mode (cont’d)

• M-step: maximize the expected log posterior (expectations just taken in the E-step) with respect to (µ, log σ, log τ). Differentiate and equate to zero to get the maximizing values (µ_new, σ_new, τ_new):

µ_new = (1/J) Σj θ̂j,

σ_new = [ (1/n) Σj Σi ((yij − θ̂j)² + Vj) ]^(1/2),

τ_new = [ (1/(J − 1)) Σj ((θ̂j − µ_new)² + Vj) ]^(1/2).
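One E-step/M-step pass in R (a sketch; y is assumed to be a list of J numeric vectors, one per treatment, and all names are ours):

em_step <- function(y, mu, sigma, tau) {
  nj <- sapply(y, length); n <- sum(nj); J <- length(y)
  ybar <- sapply(y, mean)
  # E-step quantities: conditional means and variances of the theta_j
  Vj   <- 1 / (1 / tau^2 + nj / sigma^2)
  thet <- Vj * (mu / tau^2 + nj * ybar / sigma^2)
  # M-step updates
  mu_new    <- mean(thet)
  sigma_new <- sqrt(sum(mapply(function(yj, th, v) sum((yj - th)^2) + length(yj) * v,
                               y, thet, Vj)) / n)
  tau_new   <- sqrt(sum((thet - mu_new)^2 + Vj) / (J - 1))
  c(mu = mu_new, sigma = sigma_new, tau = tau_new)
}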


EM algorithm for marginal mode (cont’d)

• Beginning from joint mode, algorithm converged in three iterations.

• Important to check that the log posterior increases at each step; otherwise there is a programming or formulation error!

• Results:

Parameter   Value at joint mode   First iteration   Second iteration   Third iteration
µ           64.01                 64.01             64.01              64.01
σ           2.17                  2.33              2.36               2.36
τ           3.31                  3.46              3.47               3.47

Using EM results in simulation

• In a problem with standard form such as the normal-normal, can do the following:

1. Use EM to find marginal and conditional modes
2. Approximate the posterior at the mode using a normal or t approximation
3. Draw values of the parameters from p_approx, and act as if they were from p.

• Importance re-sampling can be used to improve the accuracy of the draws. For each draw, compute importance weights, and re-sample draws (without replacement) with probability proportional to the weights.

• In the diet and coagulation example, the approximate approach above is likely to produce quite reasonable results.

• But one must be careful, especially with scale parameters.

Gibbs sampling in normal-normal case

• The Gibbs sampler can be easily implemented in the normal-normal example because all posterior conditionals are of standard form.

• Refer back to the Gibbs sampler discussion, and see the derivation of the conditionals in the hierarchical normal example.

• Full conditionals are:

– θj|all ∼ N (J of them)
– µ|all ∼ N
– σ²|all ∼ Inv−χ²
– τ²|all ∼ Inv−χ²

• Starting values for the Gibbs sampler can be drawn from, e.g., a t4 approximation to the marginal and conditional mode.

• Multiple parallel chains for each parameter permit monitoring convergence using the potential scale reduction statistic.

Results from Gibbs sampler

• Summary of posterior distributions in coagulation example.

• Posterior quantiles and estimated potential scale reductions were computed from the second halves of the Gibbs sampler sequences, each of length 1000.

• Potential scale reductions for τ and σ were computed on the log scale.

• The hierarchical variance, τ², is estimated less precisely than the unit-level variance, σ², as is typical in hierarchical models with a small number of batches.

Posterior quantiles from Gibbs sampler

Burn-in = 50% of chain length.

                               Posterior quantiles
Parameter                      2.5%     25.0%    50.0%    75.0%    97.5%
θ1                             58.92    60.44    61.23    62.08    63.69
θ2                             63.96    65.26    65.91    66.57    67.94
θ3                             65.72    67.11    67.77    68.44    69.75
θ4                             59.39    60.56    61.14    61.71    62.89
µ                              55.64    62.32    63.99    65.69    73.00
σ                              1.83     2.17     2.41     2.70     3.47
τ                              1.98     3.45     4.97     7.76     24.60
log p(µ, log σ, log τ|y)       −70.79   −66.87   −65.36   −64.20   −62.71
log p(θ, µ, log σ, log τ|y)    −71.07   −66.88   −65.25   −64.00   −62.42

Checking convergence

Parameter                      R̂       upper
θ1                             1.001   1.003
θ2                             1.001   1.003
θ3                             1.000   1.002
θ4                             1.000   1.001
µ                              1.005   1.005
σ                              1.000   1.001
τ                              1.012   1.013
log p(µ, log σ, log τ|y)       1.000   1.000
log p(θ, µ, log σ, log τ|y)    1.000   1.001

Chains for diet means

[Figure: trace plots of θ1–θ4 over iterations 500–1000.]

Posteriors for diet means

[Figure: histograms of the posterior draws of θ1–θ4.]

Chains for µ, σ, τ

[Figure: trace plots of µ, σ², and τ² over iterations 500–1000.]

Posteriors for µ, σ, τ

[Figure: histograms of the posterior draws of µ, σ, and τ.]

Hierarchical modeling for meta-analysis

• Idea: summarize and integrate the results of research studies in a specific area.

• Example 5.6 in Gelman et al.: 22 clinical trials conducted to study the effect of beta-blockers on reducing mortality after cardiac infarction.

• Considering the studies separately, there is no obvious effect of beta-blockers.

• Data are 22 2×2 tables: in the jth study, n0j and n1j are the numbers of individuals assigned to the control and treatment groups, respectively, and y0j and y1j are the numbers of deaths in each group.

• Sampling model for the jth experiment: two independent Binomials, with probabilities of death p0j and p1j.

• Possible quantities of interest:

1. difference p1j − p0j
2. probability ratio p1j/p0j
3. odds ratio ρj = [p1j/(1 − p1j)] / [p0j/(1 − p0j)]

• We parametrize in terms of the log odds ratios, θj = log ρj, because the posterior is almost normal even for small samples.

Normal approximation to the likelihood

• Consider estimating θj by the empirical logits:

yj = log( y1j / (n1j − y1j) ) − log( y0j / (n0j − y0j) )

• Approximate sampling variance (e.g., using a Taylor expansion):

σj² = 1/y1j + 1/(n1j − y1j) + 1/y0j + 1/(n0j − y0j)

• See Table 5.4 for the estimated log-odds ratios and their estimated standard errors.
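These two formulas are one line each in R (a sketch; y0, n0, y1, n1 would be the data columns of the table below):

# Empirical log-odds ratios and approximate standard errors, 22 trials
yj <- log(y1 / (n1 - y1)) - log(y0 / (n0 - y0))
sj <- sqrt(1/y1 + 1/(n1 - y1) + 1/y0 + 1/(n0 - y0))
round(cbind(yj, sj), 4)   # compare with Table 5.4 of Gelman et al.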


                   Raw data
        Control          Treated          Log-odds,   sd,
Study   deaths   total   deaths   total   yj          σj
 1        3        39      3        38     0.0282     0.8503
 2       14       116      7       114    −0.7410     0.4832
 3       11        93      5        69    −0.5406     0.5646
 4      127      1520    102      1533    −0.2461     0.1382
 5       27       365     28       355     0.0695     0.2807
 6        6        52      4        59    −0.5842     0.6757
 7      152       939     98       945    −0.5124     0.1387
 8       48       471     60       632    −0.0786     0.2040
 9       37       282     25       278    −0.4242     0.2740
10      188      1921    138      1916    −0.3348     0.1171
11       52       583     64       873    −0.2134     0.1949
12       47       266     45       263    −0.0389     0.2295
13       16       293      9       291    −0.5933     0.4252
14       45       883     57       858     0.2815     0.2054
15       31       147     25       154    −0.3213     0.2977
16       38       213     33       207    −0.1353     0.2609
17       12       122     28       251     0.1406     0.3642
18        6       154      8       151     0.3220     0.5526
19        3       134      6       174     0.4444     0.7166
20       40       218     32       209    −0.2175     0.2598
21       43       364     27       391    −0.5911     0.2572
22       39       674     22       680    −0.6081     0.2724

Goals of analysis

• Goal 1: if the studies can be assumed to be exchangeable, we wish to estimate the mean of the distribution of effect sizes, or "overall average effect".

• Goal 2: the average effect size in each of the exchangeable studies.

• Goal 3: the effect size that could be expected if a new, exchangeable study were to be conducted.

Normal-normal model for meta-analysis

• First level: sampling distribution yj|θj, σj² ∼ N(θj, σj²)

• Second level: population distribution θj|µ, τ ∼ N(µ, τ²)

• Third level: prior for the hyperparameters, p(µ, τ) = p(µ|τ)p(τ) ∝ 1, mostly for convenience. Can incorporate information if available.

Posterior quantiles of effect θj, based on the normal approximation (on the log-odds scale)

Study j   2.5%    25.0%   50.0%   75.0%   97.5%
 1       −0.58   −0.32   −0.24   −0.15    0.13
 2       −0.63   −0.36   −0.28   −0.20   −0.03
 3       −0.64   −0.34   −0.26   −0.18    0.06
 4       −0.44   −0.31   −0.25   −0.18   −0.04
 5       −0.44   −0.28   −0.21   −0.12    0.13
 6       −0.62   −0.36   −0.27   −0.19    0.04
 7       −0.62   −0.44   −0.36   −0.27   −0.16
 8       −0.43   −0.28   −0.20   −0.13    0.08
 9       −0.56   −0.36   −0.27   −0.20   −0.05
10       −0.48   −0.35   −0.29   −0.23   −0.12
11       −0.47   −0.31   −0.24   −0.17   −0.01
12       −0.42   −0.28   −0.21   −0.12    0.09
13       −0.65   −0.37   −0.27   −0.20    0.02
14       −0.34   −0.22   −0.12    0.00    0.30
15       −0.54   −0.32   −0.26   −0.17    0.00
16       −0.50   −0.30   −0.23   −0.15    0.06
17       −0.46   −0.28   −0.21   −0.11    0.15
18       −0.53   −0.30   −0.22   −0.13    0.15
19       −0.51   −0.31   −0.22   −0.13    0.17
20       −0.51   −0.32   −0.24   −0.17    0.04
21       −0.67   −0.40   −0.30   −0.23   −0.09
22       −0.69   −0.40   −0.30   −0.22   −0.07

                         Posterior quantiles
Estimand                 2.5%    25.0%   50.0%   75.0%   97.5%
Mean, µ                 −0.38   −0.29   −0.25   −0.21   −0.12
Standard deviation, τ    0.01    0.08    0.13    0.18    0.32
Predicted effect, θj    −0.57   −0.33   −0.25   −0.17    0.08

Marginal posterior density p(τ|y)

[Figure: p(τ|y) plotted over τ ∈ [0, 2].]

Conditional posterior means E(θj|τ, y)

[Figure: E(θj|τ, y) for the 22 studies, plotted as functions of τ over [0, 2].]

Conditional posterior standard deviations sd(θj|τ, y)

[Figure: sd(θj|τ, y) for the 22 studies, plotted as functions of τ over [0, 2].]

Histograms of 1000 simulations of θ2, θ18, θ7, and θ14

[Figure: four histograms, one per study.]

Overall mean effect, E(µ|τ, y)

[Figure: E(µ|τ, y) as a function of τ over [0, 2].]

Example: Mixed model analysis

• Sheffield Food Company produces dairy products.

• The government cited the company because the actual fat content of yogurt sold by Sheffield appears to be higher than the label amount.

• Sheffield believes that the discrepancy is due to the method employed by the government to measure fat content, and conducted a multi-laboratory study to investigate.

• Four laboratories in the United States were randomly chosen.

• Each laboratory received 12 carefully mixed samples of yogurt, with instructions to analyze 6 using the government method and 6 using Sheffield's method.

• Fat content in all samples was known to be very close to 3%.

• Because of technical difficulties, none of the labs managed to analyze all six samples using the government method within the allotted time.

Method       Lab 1   Lab 2   Lab 3   Lab 4
Government   5.19    4.09    4.62    3.71
             5.09    3.0     4.32    3.86
             3.75    4.35    3.79
             4.04    4.59    3.63
             4.06
Sheffield    3.26    3.02    3.08    2.98
             3.38    3.32    2.95    2.89
             3.24    2.83    2.98    2.75
             3.41    2.96    2.74    3.04
             3.35    3.23    3.07    2.88
             3.04    3.07    2.70    3.20

• We fitted a mixed linear model to the data, where

– Method is a fixed effect with two levels
– Laboratory is a random effect with four levels
– Method by laboratory interaction is random with six levels

yijk = αi + βj + (αβ)ij + eijk,

with αi = µ + γi and eijk ∼ N(0, σ²).

• Priors:

p(αi) ∝ 1

p(βj|σβ²) = N(0, σβ²)

p((αβ)ij|σαβ²) = N(0, σαβ²)

• The three variance components were assigned diffuse inverted gamma priors.

[WinBUGS trace plots: alpha[1] and alpha[2], chains 1:3, iterations 5001–10000.]

[WinBUGS posterior density plots: beta[1]–beta[4], chains 1:3, 15000 samples each.]

[WinBUGS autocorrelation plots: beta[1]–beta[4], chains 1:3, lags 0–40.]


[WinBUGS convergence-diagnostic plots for beta[1]–beta[4], chains 1:3, plotted against iteration from 5001 onward.]

[WinBUGS box plots: posterior distributions of alpha[1]–alpha[2] and of beta[1]–beta[4].]

Model checking

• What do we need to check?

– Model fit: does the model fit the data?
– Sensitivity to prior and other assumptions
– Model selection: is this the best model?
– Robustness: do conclusions change if we change the data?

• Remember: models are never true; they may just fit the data well and allow for useful inference.

• Model checking strategies must address various parts of models:

– priors
– sampling distribution
– hierarchical structure

– other model characteristics, such as covariates and the form of dependence between the response variable and covariates

• Classical approaches to model checking:

– Do the parameter estimates make sense
– Does the model generate data like the observed sample
– Are predictions reasonable
– Is the model "best" in some sense (e.g., AIC, likelihood ratio, etc.)

• Bayesian approach to model checking:

– Does the posterior distribution of the parameters correspond with what we know from subject-matter knowledge?
– The predictive distribution for future data must also be consistent with substantive knowledge
– Future data generated by the predictive distribution are compared to the current sample

– Sensitivity to prior and other model components

• The most popular model checking approach has a frequentist flavor: we generate replications of the sample from the posterior and observe the behavior of sample summaries over repeated sampling.

Posterior predictive model checking

• Basic idea is that data generated from the model must look like the observed data.

• Posterior predictive model checks are based on replicated data yrep generated from the posterior predictive distribution:

p(yrep|y) = ∫ p(yrep|θ) p(θ|y) dθ.

• yrep is not the same as ỹ. Predictive outcomes ỹ can be anything (e.g., a regression prediction using different covariates x̃). But yrep is a replication of y.

• yrep are data that could be observed if we repeated the exact same experiment again tomorrow (if in fact the θ's in our analysis gave rise to the data y).

• Definition of a replicate in a hierarchical model:

p(φ|y) → p(θ|φ, y) → p(yrep|θ) : rep from same units

p(φ|y) → p(θ|φ) → p(yrep|θ) : rep from new units

• Test quantity: T (y, θ) is a discrepancy statistic used as a standard.

• We use T(y, θ) to determine the discrepancy between model and data on some specific aspect we wish to check.

• Example of T(y, θ): the proportion of standardized residuals outside of (−3, 3) in a regression model, to check for outliers.

• In classical statistics, one uses T(y), a test statistic that depends only on the data: a special case of the Bayesian T(y, θ).

• For model checking:


– Determine an appropriate T(y, θ)
– Compare the posterior predictive distribution of T(yrep, θ) to the posterior distribution of T(y, θ)

Bayes p-values

• p-values attempt to measure tail-area probabilities.

• Classical definition:

classical p-value = Pr(T(yrep) ≥ T(y)|θ)

– Probability is taken over the distribution of yrep with θ fixed
– A point estimate θ̂ is typically used to compute the p-value.

• Posterior predictive p-values:

Bayes p-value = Pr(T(yrep, θ) ≥ T(y, θ)|y)

• Probability is taken over the joint posterior distribution of (θ, yrep):

Bayes p-value = ∫∫ I{T(yrep, θ) ≥ T(y, θ)} p(θ|y) p(yrep|θ) dθ dyrep

Relation between p-values

• Small example:

– Sample y1, ..., yn ∼ N(µ, σ²)
– Fit N(0, σ²) and check whether µ = 0 is a good fit

• In real life, we would fit the more general N(µ, σ²) and would decide whether µ = 0 is plausible.

• Classical approach:

– Test statistic is the sample mean T(y) = ȳ:

p-value = Pr(T(yrep) ≥ T(y)|σ²)

= Pr(ȳrep ≥ ȳ|σ²)

= Pr( √n ȳrep/S ≥ √n ȳ/S | σ² )

= P( tn−1 ≥ √n ȳ/S )

• This is a special case: it is not always possible to get rid of nuisance parameters.

• Bayes approach:

p-value = Pr(T(yrep) ≥ T(y)|y)

= ∫∫ I{T(yrep) ≥ T(y)} p(yrep|σ²) p(σ²|y) dyrep dσ²

• Note that

∫ I{T(yrep) ≥ T(y)} p(yrep|σ²) dyrep = P(T(yrep) ≥ T(y)|σ²)

= classical p-value

• Then:

Bayes p-value = E{classical p-value | y},

where the expectation is taken with respect to p(σ²|y).

• In this example, the classical p-value and the Bayes p-value are the same.

• In general, the Bayesian approach can easily handle nuisance parameters.

Interpreting posterior predictive p−values

• We look for tail-area probabilities that are not too small or too large.

• The ideal posterior predictive p-value is 0.5: the test quantity falls right in the middle of its posterior predictive distribution.

• Posterior predictive p-values are actual posterior probabilities

• Wrong interpretation: Pr(model is true |data).


Example: Independence of Bernoulli trials

• A sequence of binary outcomes y1, ..., yn is modeled as iid Bernoulli trials with probability of success θ.

• A uniform prior on θ leads to p(θ|y) ∝ θ^s (1 − θ)^(n−s), with s = Σ yi.

• Sample: 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0

• Is the assumption of independence warranted?

• Consider discrepancy statistic

T (y, θ) = T (y) = number of switches between 0 and 1.


• In sample, T (y) = 3.

• Posterior is Beta(8,14).

• To test assumption of independence, do:

1. For j = 1, ..., M, draw θ(j) from Beta(8, 14).
2. Draw (y1rep,j, ..., y20rep,j), independent Bernoulli variables with probability θ(j).
3. In each of the M replicate samples, compute T(yrep), the number of switches between 0 and 1.

pB = Prob(T(yrep) ≥ T(y)|y) = 0.98.

• If the observed trials were independent, the expected number of switches in 20 trials is approximately 8, and 3 is very unlikely.
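The full check is a few lines of R (a sketch):

# Posterior predictive check for independence: T(y) = number of switches
y <- c(1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0)
n_switches <- function(x) sum(diff(x) != 0)
M <- 10000
Trep <- replicate(M, {
  theta <- rbeta(1, 8, 14)                    # draw from p(theta|y)
  n_switches(rbinom(length(y), 1, theta))     # replicate data, compute T
})
mean(Trep >= n_switches(y))                   # Bayes p-value, approx. 0.98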


Posterior predictive dist. of number of switches


Example: Newcomb’s speed of light

• Newcomb obtained 66 measurements of the speed of light (Chapter 3). One measurement of −44 is a potential outlier under a normal model.

• From the data: ȳ = 26.21 and s = 10.75.

• Model: yi|µ, σ² ∼ N(µ, σ²), with p(µ, σ²) ∝ σ⁻².

• Posteriors:

p(σ²|y) = Inv−χ²(65, s²)

p(µ|σ², y) = N(ȳ, σ²/66)

or

p(µ|y) = t65(ȳ, s²/66).


• For posterior predictive checks, do for i = 1, ..., M:

1. Generate (µ(i), σ2(i)) from p(µ, σ²|y)
2. Generate y1rep(i), ..., y66rep(i) from N(µ(i), σ2(i))

Example: Newcomb’s data


Example: Replicated datasets


• Is the observed minimum value of −44 consistent with the model?

– Define T(y) = min{yi}
– Get the distribution of T(yrep) = min{yirep}.
– The large negative value −44 is inconsistent with the distribution of T(yrep), because the observed T(y) is very unlikely under it.
– The model is inadequate; it must account for the long tail to the left.

Posterior predictive of smallest observation


• Is the sampling variance captured by the model?

– Define T(y) = s²y = (1/65) Σi (yi − ȳ)²
– Obtain the distribution of T(yrep).
– The observed variance of 115 is very consistent with the distribution (across reps) of sample variances.
– The probability of observing a larger sample variance is 0.48: the observed variance is in the middle of the distribution of T(yrep).
– But this is meaningless: T(y) is the sufficient statistic for σ², so the model MUST have fit it well.

• Symmetry in the center of the distribution:

– Look at the difference in the distances of the 10th and 90th percentiles from the mean.
– The approximate 10th percentile is the 6th order statistic y(6), and the approximate 90th percentile is y(61).
– Define T(y, θ) = |y(61) − θ| − |y(6) − θ|.
– Need the joint distribution of T(y, θ) and of T(yrep, θ).
– The model accounts for symmetry in the center of the distribution: the joint distribution of T(y, θ) and T(yrep, θ) is about evenly distributed along the line T(y, θ) = T(yrep, θ).
– The probability that T(yrep, θ) > T(y, θ) is about 0.26, plausible under sampling variability.

Posterior predictive of variance and symmetry


Choice of discrepancy measure

• Often, choose more than one measure, to investigate different attributes of the model.

– Can smoking behavior in adolescents be predicted using information on covariates? (Example on page 172 of Gelman et al.)
– Data: six observations of smoking habits in 2,000 adolescents, every six months

• Two models:

– Logistic regression model
– Latent-class model: propensity to smoke modeled as a nonlinear function of the predictors. Conditional on smoking, fit the same logistic regression model as above.

• Three different test statistics chosen:


– % who never smoked
– % who always smoked
– % of "incident smokers": began not smoking but then switched to smoking and continued doing so.

• Generate replicate datasets from both models.

• Compute the three test statistics in each of the replicated datasets

• Compare the posterior predictive distributions of each of the three statistics with the value of the statistic in the observed dataset

• Results: Model 2 is slightly better than Model 1 at predicting the % of always-smokers, but both models fail to predict the % of incident smokers.

                                Model 1                       Model 2
Test variable        T(y)   95% CI for T(yrep)   p-value   95% CI for T(yrep)   p-value
% never-smokers      77.3   [75.5, 78.2]         0.27      [74.8, 79.9]         0.53
% always-smokers      5.1   [5.0, 6.5]           0.95      [3.8, 6.3]           0.44
% incident smokers    8.4   [5.3, 7.9]           0.005     [4.9, 7.8]           0.004

Omnibus tests

• It is often useful to also consider summary statistics:

χ²-discrepancy: T(y, θ) = Σi (yi − E(yi|θ))² / var(yi|θ)

Deviance: T(y, θ) = −2 log p(y|θ)

• Deviance is proportional to the mean squared error if the model is normal with constant variance.

• In the classical approach, insert θnull or θmle in place of θ in T(y, θ) and compare the test statistic to a reference distribution derived under asymptotic arguments.

• In the Bayesian approach, the reference distribution for the test statistic is automatically calculated from posterior predictive simulations.

• A Bayesian χ² test is carried out as follows:

– Compute the distribution of T(y, θ) for many draws of θ from the posterior and for the observed data.
– Compute the distribution of T(yrep, θ) for replicated datasets and posterior draws of θ.
– Compute the Bayesian p-value: the probability of observing a more extreme value of T(yrep, θ).
– Calculate the p-value empirically from the simulations over θ and yrep.

Criticisms of posterior predictive checks

• Too conservative because data get used twice

• Posterior predictive p-values are not really p-values: distribution undernull hypothesis is not uniform, as it should be

• What is high/low if pp p-values not uniform?

• Some Bayesians object to frequentist slant of using unobserved datafor anything

• Lots of work recently on improving posterior predictive p-values: Bayarriand Berger, JASA, 2000, is good reference

• In spite of criticisms, pp checks are easy to use in very general cases,and intuitively appealing.


Posterior predictive checks can be conservative

• One criticism of pp model checks is that they tend to reject models only in the face of extreme evidence.

• Example (from Stern):

– y ∼ N(µ, 1) and µ ∼ N(0, 9).
– Observation: yobs = 10.
– Posterior: p(µ|y) = N(0.9 yobs, 0.9) = N(9, 0.9).
– Posterior predictive distribution: N(9, 1.9).
– The posterior predictive p-value is 0.23; we would not reject the model.

• The effect of the prior is minimized because the prior variance (9) is large relative to the sampling variance (1), so the posterior predictive mean is “close” to the observed datapoint.

• To reject, we would need to observe y ≥ 23.
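These numbers are easy to verify in R, since yrep|y ∼ N(0.9 yobs, 1.9):

  1 - pnorm(10, mean = 9, sd = sqrt(1.9))          # p-value, approx. 0.23
  1 - pnorm(23, mean = 0.9 * 23, sd = sqrt(1.9))   # approx. 0.05 at yobs = 23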


Graphical posterior predictive checks

• The idea is to display the observed data together with simulated data.

• Three kinds of checks:

– Direct data display
– Display of data summaries or parameter estimates
– Graphs of residuals or other measures of discrepancy


Model comparison

• Which model fits the data best?

• Often, models are nested: a model with parameters θ is nested within a model with parameters (θ, φ).

• Comparison involves deciding whether adding φ to the model improves its fit. An improvement in fit may not justify the additional complexity.

• It is also possible to compare non-nested models.

• We focus on predictive performance and on posterior model probabilities to compare models.


Expected deviance

• We compare the observed data to several models to see which predicts more accurately.

• We summarize model fit using the deviance

D(y, θ) = −2 log p(y|θ).

• It can be shown that the model with the lowest expected deviance is best, in the sense of minimizing the (Kullback-Leibler) distance between the model p(y|θ) and the true distribution of y, f(y).

• We compute the expected deviance by simulation:

D̄avg(y) = E(D(y, θ) | y).


• An estimate based on M posterior draws θ^j is

D̂avg(y) = (1/M) Σj D(y, θ^j).


Deviance information criterion - DIC

• The idea is to estimate the error that would be expected when applying the model to future data.

• Expected mean squared predictive error:

D̄pred_avg = E[ (1/n) Σi (yrep_i − E(yrep_i | y))² ].

• A model that minimizes the expected predictive deviance is best in the sense of out-of-sample predictive power.

• An approximation to D̄pred_avg is the Deviance Information Criterion, or DIC:

DIC = D̂pred_avg = 2 D̂avg(y) − Dθ̂(y),


where Dθ̂(y) = D(y, θ̂(y)) and θ̂(y) is a point estimate of θ, such as the posterior mean.
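A minimal R sketch of the DIC computation from posterior draws, for an assumed toy normal model with known variance 1 and a flat prior on the mean (the data are illustrative):

  set.seed(1)
  y <- c(2.1, 1.4, 3.3, 0.7, 2.8)                # hypothetical data
  n <- length(y)
  theta <- rnorm(2000, mean(y), 1 / sqrt(n))     # posterior draws of the mean
  dev  <- function(th) -2 * sum(dnorm(y, th, 1, log = TRUE))  # D(y, theta)
  Davg <- mean(sapply(theta, dev))               # estimated expected deviance
  Dhat <- dev(mean(theta))                       # deviance at posterior mean
  DIC  <- 2 * Davg - Dhat
  pD   <- Davg - Dhat                            # usual effective-parameter count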


Example: SAT study

• We compare three models: no pooling, complete pooling, and the hierarchical model (partial pooling).

• The deviance is

D(y, θ) = −2 Σj log N(yj|θj, σj²)
        = Σj [ log(2πσj²) + (yj − θj)²/σj² ].

• The DIC values for the three models were: 70.3 (no pooling), 61.5 (complete pooling), 63.4 (hierarchical model).

• Based on DIC, we would pick the complete pooling model.


• We still prefer the hierarchical model, because the assumption that all schools have an identical mean is strong.


Bayes Factors

• Suppose that we wish to decide between two models M1 and M2 (different priors, sampling distributions, parameters).

• Priors on models are p(M1) and p(M2) = 1− p(M1).

• The posterior odds favoring M1 over M2 are

p(M1|y) / p(M2|y) = [ p(y|M1) / p(y|M2) ] × [ p(M1) / p(M2) ].

• The ratio p(y|M1)/p(y|M2) is called a Bayes factor.

• It tells us how much the data support one model over the other.


• The Bayes factor BF12 is computed as

BF12 = ∫ p(y|θ1, M1) p(θ1|M1) dθ1 / ∫ p(y|θ2, M2) p(θ2|M2) dθ2.

• Note that we need p(y) under each model to compute p(θi|Mi). The BF is therefore only defined when the marginal distribution of the data p(y) is proper, which requires that the prior distribution in each model be proper.

• For example, if y ∼ N(θ, 1) with p(θ) ∝ 1, we get

p(y) ∝ (2π)^(−1/2) ∫ exp{ −(1/2)(y − θ)² } dθ = 1,

which is constant for any y and thus is not a proper distribution. This creates an indeterminacy, because we can increase or decrease the BF simply by using a different constant in the numerator and denominator.


Computation of Bayes Factors

• The difficulty lies in the computation of p(y):

p(y) = ∫ p(y|θ) p(θ) dθ.

• The simplest approach is to draw many values of θ from p(θ) and get a Monte Carlo approximation to the integral.

• This may not work well, because the θ’s from the prior may not come from the region of the parameter space where the sampling distribution has most mass.
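For intuition, a toy R version of the naive estimator, in a case where p(y) is known exactly (all values illustrative):

  set.seed(2)
  y <- 1.5                                 # single observation, y ~ N(theta, 1)
  theta <- rnorm(10000)                    # draws from the prior N(0, 1)
  mean(dnorm(y, mean = theta, sd = 1))     # Monte Carlo estimate of p(y)
  dnorm(y, 0, sqrt(2))                     # exact answer: marginally y ~ N(0, 2)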

• A better Monte Carlo approximation is similar to importance sampling.


• Note that, for any density h(θ) that integrates to 1,

p(y)⁻¹ = ∫ [ h(θ) / p(y) ] dθ = ∫ [ h(θ) / (p(y|θ) p(θ)) ] p(θ|y) dθ.

• h(θ) can be any density (e.g., a normal approximation to the posterior).

• Draw values of θ from the posterior and evaluate the integral numerically.

• Computations can get tricky because the denominator can be small.

• As n → ∞,

log(BF) ≈ log p(y|θ̂2, M2) − log p(y|θ̂1, M1) − (1/2)(d2 − d1) log(n),


where θ̂i is the posterior mode under model Mi and di is the dimension of θi.

• Notice that the criterion penalizes a model for additional parameters.

• Using log(BF) is equivalent to ranking models using the BIC criterion, given by

BIC = −log p(y|θ̂, M) + (1/2) d log(n).


Ordinary Linear Regression - Introduction

• Question of interest: how does an outcome y vary as a function of a vector of covariates X?

• We want the conditional distribution p(y|θ, x), where the observations (y, x)i are assumed exchangeable.

• Covariates (x1, x2, ..., xk) can be discrete or continuous, and typically x1 = 1 for all n units. The n × k matrix X is the model matrix.

• The most common version of the model is the normal linear model, with

E(yi|β, X) = Σ_(j=1)^k βj xij.


Ordinary Linear Regression - Model

• In ordinary linear regression models, the conditional variance is equal across observations: var(yi|θ, X) = σ². Thus, θ = (β1, β2, ..., βk, σ²).

• Likelihood: for the ordinary normal linear regression model, we have

p(y|X, β, σ²) = N(Xβ, σ²In),

with In the n × n identity matrix.

• Priors: a non-informative prior distribution for (β, σ²) is

p(β, σ²) ∝ σ⁻²

(more on informative priors later)


• Joint posterior:

p(β, σ²|y) ∝ (σ²)^(−n/2−1) exp[ −(1/(2σ²)) (y − Xβ)′(y − Xβ) ]

• Consider

p(β, σ2|y) = p(β|σ2, y)p(σ2|y)

• Conditional posterior for β: expand and complete the square in p(β|σ², y), viewed as a function of β. We get:

p(β|σ², y) ∝ exp{ −(1/(2σ²)) [ β′X′Xβ − 2β′X′Xβ̂ ] }
           ∝ exp{ −(1/(2σ²)) (β − β̂)′X′X(β − β̂) }

for

β̂ = (X′X)⁻¹X′y.

• Then: β|σ², y ∼ N(β̂, σ²(X′X)⁻¹).


Ordinary Linear Regression - p(σ²|y)

• The marginal posterior distribution for σ² is obtained by integrating the joint posterior with respect to β:

p(σ²|y) = ∫ p(β, σ²|y) dβ
        ∝ (σ²)^(−n/2−1) ∫ exp[ −(1/(2σ²)) (y − Xβ)′(y − Xβ) ] dβ

• By expanding the square in the integrand, and adding and subtracting 2y′X(X′X)⁻¹X′y, we can write

p(σ²|y) ∝ (σ²)^(−n/2−1) ∫ exp{ −(1/(2σ²)) [ (y − Xβ̂)′(y − Xβ̂) + (β − β̂)′X′X(β − β̂) ] } dβ
        ∝ (σ²)^(−n/2−1) exp[ −(1/(2σ²)) (n − k)S² ] ∫ exp[ −(1/(2σ²)) (β − β̂)′X′X(β − β̂) ] dβ,

where (n − k)S² = (y − Xβ̂)′(y − Xβ̂).

• The integrand is the kernel of a k-dimensional normal, so the result of the integration is proportional to (σ²)^(k/2). Then

p(σ²|y) ∝ (σ²)^(−(n−k)/2−1) exp[ −(n − k)S²/(2σ²) ],

which is proportional to an Inv-χ²(n − k, S²).


Ordinary Linear Regression - Notes

• Note that β̂ is the MLE of β, and S² = (y − Xβ̂)′(y − Xβ̂)/(n − k) is the classical unbiased estimate of σ².

• When is the posterior proper? p(β, σ²|y) is proper if:

1. n > k
2. rank(X) = k: the columns of X are linearly independent, i.e. |X′X| ≠ 0

• To sample from the joint posterior (a small R sketch follows):

1. Draw σ² from Inv-χ²(n − k, S²)
2. Given σ², draw the vector β from N(β̂, σ²(X′X)⁻¹)

• For efficiency, compute β̂, S², and (X′X)⁻¹ once, before starting the repeated drawing.
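A small R sketch of this scheme on simulated data (the design and coefficients are illustrative; MASS supplies the multivariate normal draw):

  set.seed(1)
  library(MASS)
  n <- 50; k <- 3
  X <- cbind(1, matrix(rnorm(n * (k - 1)), n))   # model matrix with intercept
  y <- X %*% c(1, 2, -1) + rnorm(n)              # simulated outcomes
  XtXinv  <- solve(crossprod(X))                 # (X'X)^{-1}, computed once
  betahat <- XtXinv %*% crossprod(X, y)
  S2      <- sum((y - X %*% betahat)^2) / (n - k)
  M <- 1000
  sigma2 <- (n - k) * S2 / rchisq(M, df = n - k) # Inv-chi^2(n - k, S2) draws
  beta   <- t(sapply(sigma2, function(s2)
    mvrnorm(1, betahat, s2 * XtXinv)))           # beta | sigma2, y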


Regression - Posterior predictive distribution

• In regression, we often want to predict the outcome ỹ for a new set of covariates x̃. Thus, we wish to draw values from p(ỹ|y, X̃).

• By simulation:

1. Draw σ² from Inv-χ²(n − k, S²)
2. Draw β from N(β̂, σ²(X′X)⁻¹)
3. Draw ỹi, for i = 1, ..., m, from N(x̃i′β, σ²)
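Continuing the R sketch above, posterior predictive draws at a hypothetical new covariate vector:

  xnew  <- c(1, 0.5, -0.2)                        # illustrative new covariates
  ypred <- rnorm(M, beta %*% xnew, sqrt(sigma2))  # draws from p(ytilde | y)
  quantile(ypred, c(0.025, 0.5, 0.975))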


Regression - Posterior predictive distribution

• We can derive p(ỹ|y) in steps, first considering p(ỹ|y, σ²).

• Note that

p(ỹ|y, σ²) = ∫ p(ỹ|β, σ²) p(β|σ², y) dβ

is normal, because the exponential in the integrand is a quadratic function in (β, ỹ).

• To get the mean and variance, use the conditioning trick:

E(ỹ|σ², y) = E[ E(ỹ|β, σ², y) | σ², y ] = E(X̃β|σ², y) = X̃β̂,


where the inner expectation averages over ỹ conditional on β, and the outer averages over β.

var(ỹ|σ², y) = E[ var(ỹ|β, σ², y) | σ², y ] + var[ E(ỹ|β, σ², y) | σ², y ]
             = E[ σ²I | σ², y ] + var[ X̃β | σ², y ]
             = σ²(I + X̃(X′X)⁻¹X̃′)

• The variance has two terms: σ²I is sampling variation, and σ²X̃(X′X)⁻¹X̃′ is due to uncertainty about β.

• To complete the specification of p(ỹ|y), we must integrate p(ỹ|y, σ²) with respect to the marginal posterior distribution of σ².

• The result is:

p(ỹ|y) = t_(n−k)( X̃β̂, S²[I + X̃(X′X)⁻¹X̃′] ).


Regression example: radon measurements in Minnesota

• Radon measurements yi were taken in three counties in Minnesota: Blue Earth, Clay, and Goodhue.

• 14 houses in each of Blue Earth and Clay counties and 13 houses in Goodhue were sampled.

• Measurements were taken in the basement and on the first floor.

• We fit an ordinary regression model to the log radon measurements, without an intercept.

• We define dummy variables as follows: X1 = 1 if the county is Blue Earth, and 0 otherwise. Similarly, X2 and X3 are dummies for Clay and Goodhue counties, respectively.

• X4 = 1 if the measurement was taken on the first floor.

• The model is

log(yi) = β1x1i + β2x2i + β3x3i + β4x4i + ei,

with ei ∼ N(0, σ²).

• Thus

E(y|Blue Earth, basement) = exp(β1)
E(y|Blue Earth, first floor) = exp(β1 + β4)
E(y|Clay, basement) = exp(β2)
E(y|Clay, first floor) = exp(β2 + β4)
E(y|Goodhue, basement) = exp(β3)
E(y|Goodhue, first floor) = exp(β3 + β4)

• We used noninformative N(0, 1000) priors for the regression coefficients and a noninformative Gamma(0.01, 0.01) prior for the error variance.


[Figure: WinBUGS trace plots of beta[1]–beta[4], iterations 3001–5000]

[Figure: box plot of the posteriors of beta[1]–beta[4]]

[Figure: posterior densities (2,000 draws each) of the expected radon levels becb, becff, ccb, ccff, gcb, gcff (county × floor combinations)]

Regression - Posterior predictive checks

• For regression models, there are well-known methods for checking the model and its assumptions using estimated residuals:

εi = yi − xi′β̂

• Two useful test statistics are:

– the proportion of outliers among the residuals
– the correlation ρ between the squared residuals and the fitted values ŷ

• Posterior predictive distributions of both statistics can be obtained via simulation.

• Define a standardized residual as

ei = (yi − xi′β)/σ


• If the normal model is correct, the standardized residuals should be approximately N(0, 1); therefore |ei| > 3 suggests that the ith observation may be an outlier.

• To derive the posterior predictive distributions of the proportion of outliers q and of ρ, the correlation between (e², ŷ), do the following (a sketch in R follows the list):

1. Draw (σ², β) from the joint posterior distribution
2. Draw yrep from N(Xβ, σ²I), given the existing X
3. Run the regression of yrep on X, and save the residuals
4. Compute ρ
5. Compute the proportion q of “large” standardized residuals
6. Repeat for another yrep
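One way to implement this in R, reusing the draws (beta, sigma2) and the matrix X from the regression sketch earlier (everything simulated):

  q <- rho <- numeric(M)
  for (m in 1:M) {
    yrep <- rnorm(n, X %*% beta[m, ], sqrt(sigma2[m]))
    fit  <- lm(yrep ~ X - 1)                   # regression of yrep on X
    e    <- residuals(fit) / sqrt(sigma2[m])   # standardized residuals
    q[m]   <- mean(abs(e) > 3)                 # proportion of outliers
    rho[m] <- cor(e^2, fitted(fit))            # corr(squared resid., fitted)
  }
  quantile(q,   c(0.025, 0.975))               # posterior predictive summaries
  quantile(rho, c(0.025, 0.975))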

• The approach is frequentist in nature: we act as if we could repeat the experiment many times.

• Inspection of the posterior predictive distributions of ρ and q provides information about model adequacy.


• The results of the posterior predictive checks suggest that there are no outliers and that the correlation between the squared residuals and the fitted values is negligible.

• The 95% credible set for the proportion of absolute residuals (in the 41 observations) above 3 was (0, 4.88)%, with a mean of 0.59% and a median of 0.

• The 95% credible set for ρ was (−0.31, 0.32) with a mean of 0.011.

• Notice that the posterior distribution of the proportion of outliers is skewed. This sometimes happens when the quantity of interest is bounded and most of the mass of its distribution is close to the boundary.


Regression with unequal variances

• Consider now the case where y ∼ N(Xβ, Σy), with Σy ≠ σ²I.

• The covariance matrix Σy is n × n and has n(n + 1)/2 distinct parameters; it cannot be estimated from n observations. We must either specify Σy or assign it an informative prior distribution.

• Typically, some structure is imposed on Σy to reduce the number of free parameters.


Regression - known Σy

• As before, let p(β) ∝ 1 be the non-informative prior

• Since Σy is positive definite and symmetric, it has an upper-triangular square-root matrix (Cholesky factor) Σy^(1/2) such that Σy^(1/2) Σy^(1/2)′ = Σy, so that if

y = Xβ + e,  e ∼ N(0, Σy),

then

Σy^(−1/2) y = Σy^(−1/2) Xβ + Σy^(−1/2) e,  Σy^(−1/2) e ∼ N(0, I).

• With Σy known, just proceed as in ordinary linear regression, but use the transformed y and X as above and fix σ² = 1. Algebraically, this is equivalent to computing

β̂ = (X′Σy⁻¹X)⁻¹ X′Σy⁻¹ y
Vβ = (X′Σy⁻¹X)⁻¹

• Note: by using the Cholesky factor you avoid computing the n × n inverse Σy⁻¹, as in the sketch below.
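A sketch of the Cholesky computation in R, with an arbitrary illustrative Σy and the (X, y) from the earlier regression sketch:

  Sigma <- 0.5^abs(outer(1:n, 1:n, "-"))      # assumed known covariance
  U <- chol(Sigma)                            # upper triangular: t(U) %*% U = Sigma
  ystar <- backsolve(U, y, transpose = TRUE)  # Sigma^{-1/2} y, no explicit inverse
  Xstar <- backsolve(U, X, transpose = TRUE)
  betahat <- solve(crossprod(Xstar), crossprod(Xstar, ystar))
  Vbeta   <- solve(crossprod(Xstar))          # (X' Sigma^{-1} X)^{-1}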


Prediction of new y with known Σy

• Even if we know Σy, prediction of new observations is more complicated: we must know the covariance matrix of the old and new data.

• Example: the heights of children from the same family are correlated. To predict the height of a new child when a brother is in the old dataset, we must include that correlation in the prediction.

• Let ỹ be ñ new observations, given an ñ × k matrix of regressors X̃.

• The joint distribution of (y, ỹ) is

(y, ỹ) | X, X̃, θ ∼ N( (Xβ, X̃β), [ Σy  Σyỹ ; Σỹy  Σỹỹ ] ),

so that

ỹ | y, β, Σy ∼ N(µ̃, Ṽ),

where

µ̃ = X̃β + Σỹy Σy⁻¹ (y − Xβ)
Ṽ = Σỹỹ − Σỹy Σy⁻¹ Σyỹ
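A short R sketch of these conditional-normal formulas (the joint covariance and mean are illustrative stand-ins for the regression quantities):

  library(MASS)
  nold <- 40; nnew <- 10; ntot <- nold + nnew
  Sig <- 0.5^abs(outer(1:ntot, 1:ntot, "-"))   # joint covariance of (y, ytilde)
  mu  <- rep(0, ntot)                          # stands in for (X beta, Xtilde beta)
  io <- 1:nold; iq <- (nold + 1):ntot
  yobs    <- mvrnorm(1, mu[io], Sig[io, io])   # pretend these were observed
  mu_cond <- mu[iq] + Sig[iq, io] %*% solve(Sig[io, io], yobs - mu[io])
  V_cond  <- Sig[iq, iq] - Sig[iq, io] %*% solve(Sig[io, io], Sig[io, iq])
  ypred   <- mvrnorm(1, drop(mu_cond), V_cond) # one predictive draw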


Regression - unknown Σy

• To draw inferences about (β, Σy), we proceed in steps: derive p(β|Σy, y) and then p(Σy|y).

• Let p(β) ∝ 1 as before

• We know that

p(Σy|y) = p(β, Σy|y) / p(β|Σy, y)
        ∝ p(Σy) p(y|β, Σy) / p(β|Σy, y)
        ∝ p(Σy) N(y|Xβ, Σy) / N(β|β̂, Vβ),

where (β̂, Vβ) depend on Σy.


• The expression for p(Σy|y) must hold for any β, so we set β = β̂.

• Note that the density N(β|β̂, Vβ), evaluated at β = β̂, is proportional to |Vβ|^(−1/2). Then

p(Σy|y) ∝ p(Σy) |Vβ|^(1/2) |Σy|^(−1/2) exp[ −(1/2)(y − Xβ̂)′Σy⁻¹(y − Xβ̂) ].

• In principle, p(Σy|y) could be evaluated for a range of values of Σy. However:

1. It is difficult to determine a prior for an n × n unstructured matrix.
2. β̂ and Vβ depend on Σy, and it is very difficult to draw values of Σy from p(Σy|y).

• Need to put some structure on Σy


Regression - Σy = σ²Qy

• Suppose that we know Σy up to a scalar factor σ².

• The non-informative prior on (β, σ²) is p(β, σ²) ∝ σ⁻².

• Results follow directly from ordinary linear regression, by using the transformed data Qy^(−1/2) y and Qy^(−1/2) X, which is equivalent to computing

β̂ = (X′Qy⁻¹X)⁻¹ X′Qy⁻¹ y
Vβ = (X′Qy⁻¹X)⁻¹
s² = (n − k)⁻¹ (y − Xβ̂)′Qy⁻¹(y − Xβ̂)

• To estimate the joint posterior distribution p(β, σ²|y), do:

1. Draw σ² from Inv-χ²(n − k, s²)
2. Draw β from N(β̂, Vβ σ²)

• In large datasets, use the Cholesky factor transformation and unweighted regression to avoid computation of Qy⁻¹.


Regression - other covariance structures

Weighted regression

• In some applications, Σy = diag(σ²/wi), for known weights wi and σ² unknown.

• Inference is the same as before, but now Qy⁻¹ = diag(wi).

Parametric model for unequal variances:

• Variances may depend on the weights in a non-linear fashion: Σii = σ²v(wi, φ), for an unknown parameter φ ∈ (0, 1) and a known function v, such as v(wi, φ) = wi^(−φ).

• Note that for that function v:

φ = 0 −→ Σii = σ² ∀i
φ = 1 −→ Σii = σ²/wi ∀i

• In practice, to uncouple φ from the scale of the weights, we multiply the weights by a factor so that their product equals 1.


Parametric model for unequal variances

• Three parameters to estimate: β, σ², and φ.

• A possible prior for φ is uniform on [0, 1].

• The non-informative prior for (β, σ²) is ∝ σ⁻².

• Joint posterior distribution:

p(β, σ², φ|y) ∝ p(φ) p(β, σ²) Πi N(yi; (Xβ)i, σ²v(wi, φ))

• For a given φ, we are back in the earlier weighted regression case, with

Qy = diag(v(w1, φ), ..., v(wn, φ))


• This suggests the following scheme:

1. Draw φ from the marginal posterior p(φ|y)
2. Compute Qy^(−1/2) y and Qy^(−1/2) X
3. Draw σ² from p(σ²|φ, y) as in ordinary regression
4. Draw β from p(β|σ², φ, y) as in ordinary regression

• Marginal posterior distribution of φ:

p(φ|y) = p(β, σ², φ|y) / p(β, σ²|φ, y)
       = p(β, σ², φ|y) / [ p(β|σ², φ, y) p(σ²|φ, y) ]
       ∝ p(φ) σ⁻² Πi N(yi|(Xβ)i, σ²v(wi, φ)) / [ Inv-χ²(n − k, s²) N(β̂, Vβ) ]

• The expression must hold for any (β, σ²), so we set β = β̂ and σ² = s².


• Recall that β̂ and s² depend on φ.

• Also recall that weights are scaled to have product equal to 1.

• Then

p(φ|y) ∝ p(φ) |Vβ|^(1/2) (s²)^(−(n−k)/2)

• To carry out the computation (a grid-sampling sketch follows):

– First sample φ from p(φ|y):
  1. Evaluate p(φ|y) for a range of values of φ ∈ [0, 1]
  2. Use the inverse-cdf method to sample φ
– Given φ, compute y* = Qy^(−1/2) y and X* = Qy^(−1/2) X
– Compute

  β̂ = (X*′X*)⁻¹ X*′y*
  Vβ = (X*′X*)⁻¹
  s² = (n − k)⁻¹ (y* − X*β̂)′(y* − X*β̂)

– Draw σ² from Inv-χ²(n − k, s²)
– Draw β from N(β̂, σ²Vβ)
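A sketch of the grid/inverse-cdf step in R; the unnormalized density below is a stand-in for p(φ|y) ∝ p(φ)|Vβ|^(1/2)(s²)^(−(n−k)/2), which in practice you would evaluate as above:

  grid <- seq(0.001, 0.999, length.out = 500)
  logp <- dbeta(grid, 2, 3, log = TRUE)        # stand-in for log p(phi | y)
  p    <- exp(logp - max(logp))                # stabilize before exponentiating
  cdf  <- cumsum(p) / sum(p)                   # cdf on the grid
  u    <- runif(1000)
  phi  <- grid[findInterval(u, cdf) + 1]       # inverse-cdf draws of phi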


Including prior information about β

• Suppose we wish to add prior information about a single regression coefficient βj, of the form

βj ∼ N(βj0, σ²βj),

with βj0 and σ²βj known.

• Prior information can be added in the form of an additional ‘data point’.

• An ordinary observation y is normal with mean xβ and variance σ2.

• As a function of βj, the prior can be viewed as an ‘observation’ with

– 0 on all x’s except xj
– variance σ²βj

• To include the prior information, do the following:

1. Append one more ‘data point’ to the vector y, with value βj0.
2. Add one row to X with zeroes except in the jth column.
3. Add a diagonal element with value σ²βj to Σy.

• Now apply the computational methods for the non-informative prior.

• Given Σy, the posterior for β is obtained by weighted linear regression.


Adding prior information for several βs

• Suppose that, for the entire vector β,

β ∼ N(β0, Σβ).

• Proceed as before: add k ‘data points’ and draw posterior inferences by weighted linear regression applied to ‘observations’ y*, explanatory variables X*, and variance matrix Σ*:

y* = [ y ; β0 ],   X* = [ X ; Ik ],   Σ* = [ Σy  0 ; 0  Σβ ]

• Computation can be carried out conditional on Σ* first, with Σ* then drawn by the inverse-cdf method, or by using Gibbs sampling.


Prior information about σ²

• Typically we do not wish to include prior information about σ².

• If we do, we can use the conjugate prior

σ² ∼ Inv-χ²(n0, σ0²).

• The marginal posterior of σ² is

σ²|y ∼ Inv-χ²( n0 + n, (n0σ0² + nS²)/(n0 + n) ).

• If prior information on β is also incorporated, S² is replaced by the corresponding value from the regression of y* on X* and Σ*, and n is replaced by the length of y*.


Inequality constraints on β

• Sometimes we wish to impose inequality constraints, such as

β1 > 0   or   β2 < β3 < β4.

• The easiest way is to ignore the constraint until the end:

– Simulate (β, σ²) from the posterior.
– Discard all draws that do not satisfy the constraint.

• This is typically a reasonably efficient way to proceed, unless the constraint eliminates a large portion of the unconstrained posterior distribution.

• If so, the data tend to contradict the model.


Generalized Linear Models

• Generalized linear models are an extension of linear models to the case where the relationship between E(y|X) and X is not linear, or the normal assumption is not appropriate.

• Sometimes a transformation suffices to return to the linear setup. Consider the multiplicative model

yi = xi1^b1 xi2^b2 xi3^b3 εi.

A simple log transformation leads to

log(yi) = b1 log(xi1) + b2 log(xi2) + b3 log(xi3) + ei.

• When simple approaches do not work, we use GLIMs.


• There are three main components in the model:

1. Linear predictor η = Xβ
2. Link function g(·) relating the linear predictor to the mean of the outcome variable: E(y|X) = µ = g⁻¹(η) = g⁻¹(Xβ)
3. Distribution of the outcome variable y with mean µ = E(y|X). The distribution can also depend on a dispersion parameter φ:

p(y|X, β, φ) = Π_(i=1)^n p(yi|(Xβ)i, φ)

• In standard GLIMs for Poisson and binomial data, φ = 1.

• In many applications, however, excess dispersion is present.


Some standard GLIMs

• Linear model:

– The simplest GLIM, with identity link function g(µ) = µ.

• Poisson model:

– Mean and variance µ, and link function log(µ) = Xβ, so that

µ = exp(Xβ) = exp(η).

– For y = (y1, ..., yn):

p(y|β) = Π_(i=1)^n (1/yi!) exp(−exp(ηi)) (exp(ηi))^yi,

with ηi = (Xβ)i.


• Binomial model: suppose that yi ∼ Bin(ni, µi), with ni known. The standard link function is the logit of the probability of success µi:

g(µi) = log( µi/(1 − µi) ) = (Xβ)i = ηi.

• For a vector of data y:

p(y|β) = Π_(i=1)^n (ni choose yi) ( exp(ηi)/(1 + exp(ηi)) )^yi ( 1/(1 + exp(ηi)) )^(ni−yi).

• Another link, used in econometrics, is the probit link:

Φ⁻¹(µi) = ηi,

with Φ(·) the normal cdf.


• In practice, inference from logit and probit models is almost the same, except in the extremes of the tails of the distribution.


Overdispersion

• In many applications, the model can be formulated to allow for extra variability, or overdispersion.

• E.g., in the Poisson model, the variance is constrained to be equal to the mean.

• As an example, suppose that the data are the number of fatal car accidents at K intersections over T years. Covariates might include intersection characteristics and traffic control devices (stop lights, etc).

• To accommodate overdispersion, we model the (log) rate as a linear combination of covariates, and add a random effect for intersection with its own population distribution.


Setting up GLIMs

• Canonical link functions: the canonical link is the function of the mean that appears in the exponent of the exponential-family form of the sampling distribution.

• All links discussed so far are canonical except for the probit.

• Offset: arises when counts are obtained from different population sizes, volumes, or time periods, and we need to use an exposure. An offset is a covariate with a known coefficient.

• Example: the number of incidents in a given exposure time T is Poisson with rate µ per unit of time. The mean number of incidents is µT.

• The link function would be log(µ) = ηi, but here the mean of y is not µ but µT.


• To apply the Poisson GLIM, add a column to X with values log(T) and fix its coefficient to 1. This is an offset (see the R illustration below).
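For intuition, this is what the same device looks like in a classical R fit (simulated data; the WinBUGS models later use the identical trick):

  set.seed(3)
  Texp <- c(2, 5, 1, 3, 4)                          # illustrative exposure times
  ycnt <- rpois(5, lambda = 3 * Texp)               # true rate 3 per unit time
  glm(ycnt ~ 1 + offset(log(Texp)), family = poisson)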


Interpreting GLIMs

• In linear models, βj represents the change in the outcome when xj is changed by one unit.

• Here, βj reflects changes in g(µ) when xj is changed.

• The effect of changing xj depends on the current value of x.

• To translate effects into the scale of y, measure changes relative to a baseline

y0 = g⁻¹(x0β).

• A change in x of ∆x takes the outcome from y0 to y, where

g(y0) = x0β −→ y0 = g⁻¹(x0β)


and

y = g⁻¹( g(y0) + (∆x)β ).


Priors in GLIM

• We focus on priors for β, although sometimes φ is present and has its own prior.

• Non-informative prior for β:

– With p(β) ∝ 1, the posterior mode equals the MLE of β.
– Approximate posterior inference can be based on a normal approximation to the posterior at the mode.

• Conjugate prior for β:

– As in regression, express prior information about β in terms of hypothetical data obtained under the same model.
– Augment the data vector and model matrix with n0 hypothetical observations y0 and an n0 × k matrix of hypothetical predictors X0.


– Non-informative prior for β in augmented model.

• Non-conjugate priors:

– It is often more natural to model p(β|β0, Σ0) = N(β0, Σ0), with (β0, Σ0) known.
– Approximate computation based on a normal approximation (see next) is particularly suitable.

• Hierarchical GLIM:

– Same approach as in linear models.
– Model some of the β as exchangeable, with a common population distribution with unknown parameters, and hyperpriors for those parameters.


Computation

• Posterior distributions of the parameters can be estimated using MCMC methods in WinBUGS or other software.

• Metropolis-within-Gibbs will often be necessary: in GLIMs, the full conditionals most often do not have standard forms.

• An alternative is to approximate the sampling distribution with a cleverly chosen normal approximation.

• Idea:

– Find the mode of the likelihood in (β, φ), perhaps conditional on hyperparameters.
– Create pseudo-data with their pseudo-variances (see later).
– Model the pseudo-data as normal with known (pseudo-)variances.

Page 396: Bayesian Methods for Data Analysis - ISU Public Homepage Server

Normal approximation to likelihood

• Objective: find zi and σ2i such that normal likelihood

N(zi|(Xβ)i, σ2i )

is good approximation to GLIM likelihood p(yi|(Xβ)i, φ).

• Let (β, φ) be mode of (β, φ) so that ηi is the mode of ηi.

• For L the loglikelihood, write

p(y1, ..., yn) = Πip(yi|ηi, φ)

= Πi exp(L(yi|ηi, φ))

ENAR - March 26, 2006 14

Page 397: Bayesian Methods for Data Analysis - ISU Public Homepage Server

• Approximate factor in exponent by normal density in ηi:

L(yi|ηi, φ) ≈ − 12σ2

i

(zi − ηi)2,

where (zi, σ2i ) depend on (yi, ηi, φ).

• Now need to find expressions for (zi, σ2i ).

• To get (zi, σ2i ), match first and second order terms in Taylor approx

around ηi to (ηi, σ2i ) and solve for zi and for σ2

i .

• Let L′ = ∂L/∂ηi; for the normal approximation,

L′ = (1/σi²)(zi − ηi).


• Let L″ = ∂²L/∂ηi²; then

L″ = −1/σi².

• Solving for zi and σi²:

zi = η̂i − L′(yi|η̂i, φ) / L″(yi|η̂i, φ)
σi² = −1 / L″(yi|η̂i, φ)

• Example: binomial model with logit link.

L(yi|ηi) = yi log( exp(ηi)/(1 + exp(ηi)) ) + (ni − yi) log( 1/(1 + exp(ηi)) )
         = yi ηi − ni log(1 + exp(ηi))

• Then

L′ = yi − ni exp(ηi)/(1 + exp(ηi))
L″ = −ni exp(ηi)/(1 + exp(ηi))²

• Pseudo-data and pseudo-variances:

zi = η̂i + [ (1 + exp(η̂i))²/exp(η̂i) ] ( yi/ni − exp(η̂i)/(1 + exp(η̂i)) )
σi² = (1/ni) (1 + exp(η̂i))²/exp(η̂i)
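In R, the pseudo-data computation for the logistic model is one line each; the values of η̂i, ni, yi below are illustrative:

  eta.hat <- c(-0.5, 0.2, 1.0)                   # current linear predictors
  ni <- c(20, 15, 30); yi <- c(6, 8, 25)
  p.hat <- plogis(eta.hat)                       # exp(eta)/(1 + exp(eta))
  z  <- eta.hat + (yi / ni - p.hat) / (p.hat * (1 - p.hat))  # pseudo-data
  s2 <- 1 / (ni * p.hat * (1 - p.hat))                       # pseudo-variances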


Models for multinomial responses

• Multinomial data: the outcomes y = (y1, ..., yK) are counts in K categories.

• Examples:

– Number of students receiving grades A, B, C, D, or F
– Number of alligators that prefer to eat reptiles, birds, fish, invertebrates, or other (see example later)
– Number of survey respondents who prefer Coke, Pepsi, or tap water

• In Chapter 3, we saw non-hierarchical multinomial models:

p(y|α) ∝ Π_(j=1)^k αj^yj,

with αj the probability of the jth outcome, Σj αj = 1, and Σj yj = n.


• Here we model αj as a function of covariates (or predictors) X, with corresponding regression coefficients βj.

• For a full hierarchical structure, the βj are modeled as exchangeable, with some common population distribution p(β|µ, τ).

• The model can be developed as an extension of either the binomial or the Poisson model.


Logit model for multinomial data

• Here i = 1, ..., I indexes covariate patterns. E.g., in the alligator example, 2 sizes × 4 lakes = 8 covariate categories.

• Let yi be a multinomial random variable with sample size ni and k possible outcomes. Then

yi ∼ Mult(ni; αi1, ..., αik),

with Σj yij = ni and Σ_(j=1)^k αij = 1.

• αij is the probability of jth outcome for ith covariate combination.

• Standard parametrization: the log of the probability of the jth outcome relative to a baseline category, j = 1:

log( αij / αi1 ) = ηij = (Xβj)i,

with βj a vector of regression coefficients for the jth category.

• Sampling distribution:

p(y|β) ∝ Π_(i=1)^I Π_(j=1)^k ( exp(ηij) / Σ_(l=1)^k exp(ηil) )^yij.

• For identifiability, β1 = 0 and thus ηi1 = 0 for all i.

• βj is the effect of changing X on the probability of category j relative to category 1.


• Typically, indicators for each outcome category are added to the predictors, to capture the relative frequency of each category when X = 0. Then

ηij = δj + (Xβj)i,

with δ1 = β1 = 0, typically.


Example from WinBUGS - Alligators

• Agresti (1990) analyzes the feeding choices of 221 alligators.

• The response is one of five categories: fish, invertebrate, reptile, bird, other.

• Two covariates: length of alligator (less than 2.3 meters, or larger than 2.3 meters) and lake (Hancock, Oklawaha, Trafford, George).

• 2 × 4 = 8 covariate combinations (see data).

• For i, j a combination of size and lake, we have counts in five possible categories: yij = (yij1, ..., yij5).

• Model:

p(yij|αij, nij) = Mult(yij|nij, θij1, ..., θij5),

with

θijk = exp(ηijk) / Σ_(l=1)^5 exp(ηijl)

and

ηijk = δk + βik + γjk.

• Here,

– δk is the baseline indicator for category k
– βik is the coefficient of the lake indicator
– γjk is the coefficient of the size indicator


[Figure: box plot of the posteriors of alpha[2]–alpha[5]]

node      Mean     Std     2.5%    Median   97.5%
alpha[2]  -1.838   0.5278  -2.935  -1.823   -0.8267
alpha[3]  -2.655   0.706   -4.261  -2.593   -1.419
alpha[4]  -2.188   0.58    -3.382  -2.153   -1.172
alpha[5]  -0.7844  0.3691  -1.531  -0.7687  -0.1168


[Figure: box plot of the posteriors of beta[2,2]–beta[4,5]]

node       Mean     Std     2.5%     Median   97.5%
beta[2,2]  2.706    0.6431  1.51     2.659    4.008
beta[2,3]  1.398    0.8571  -0.2491  1.367    3.171
beta[2,4]  -1.799   1.413   -5.1     -1.693   0.5832
beta[2,5]  -0.9353  0.7692  -2.555   -0.867   0.4413
beta[3,2]  2.932    0.6864  1.638    2.922    4.297
beta[3,3]  1.935    0.8461  0.37     1.886    3.842
beta[3,4]  0.3768   0.7936  -1.139   0.398    1.931
beta[3,5]  0.7328   0.5679  -0.3256  0.7069   1.849
beta[4,2]  1.75     0.6116  0.636    1.73     3.023
beta[4,3]  -1.595   1.447   -4.847   -1.433   0.9117
beta[4,4]  -0.7617  0.8026  -2.348   -0.7536  0.746
beta[4,5]  -0.843   0.5717  -1.97    -0.8492  0.281

[Figure: box plot of the posteriors of gamma[2,2]–gamma[2,5]]

node        Mean    Std     2.5%    Median   97.5%
gamma[2,2]  -1.523  0.4101  -2.378  -1.523   -0.7253
gamma[2,3]  0.342   0.5842  -0.788  0.3351   1.476
gamma[2,4]  0.7098  0.6808  -0.651  0.7028   2.035
gamma[2,5]  -0.353  0.4679  -1.216  -0.3461  0.518

[Figures: box plots of the posteriors of the category probabilities p[i,j,k], by lake i and size j]

node      mean     sd        2.5%      median   97.5%
p[1,1,1]  0.5392   0.07161   0.4046    0.5384   0.676
p[1,1,2]  0.09435  0.04288   0.03016   0.08735  0.2067
p[1,1,3]  0.04585  0.02844   0.007941  0.04043  0.1213
p[1,1,4]  0.06837  0.03418   0.01923   0.06237  0.1457
p[1,1,5]  0.2523   0.06411   0.1389    0.2496   0.3808
p[1,2,1]  0.5679   0.09029   0.3959    0.5684   0.7342
p[1,2,2]  0.02322  0.01428   0.005301  0.02013  0.06186
p[1,2,3]  0.06937  0.04539   0.01191   0.05918  0.1768
p[1,2,4]  0.1469   0.07437   0.03647   0.1344   0.3182
p[1,2,5]  0.1927   0.07171   0.0793    0.1852   0.349
p[2,1,1]  0.2578   0.07217   0.1367    0.2496   0.4265
p[2,1,2]  0.5968   0.08652   0.4159    0.6      0.7551
p[2,1,3]  0.08137  0.0427    0.01939   0.0728   0.1841
p[2,1,4]  0.0095   0.01186   0.00533   0.04105  0.143
p[2,1,5]  0.05445  0.03517   0.009716  0.0476   0.1398
p[2,2,1]  0.4612   0.08211   0.3055    0.4632   0.6175
p[2,2,2]  0.2455   0.06879   0.1263    0.2426   0.3867
p[2,2,3]  0.1947   0.07042   0.07776   0.189    0.3502
p[2,2,4]  0.0302   0.02921   7.653E-4  0.02009  0.1122
p[2,2,5]  0.06833  0.03984   0.01357   0.06097  0.1643
p[3,1,1]  0.1794   0.05581   0.08555   0.1732   0.296
p[3,1,2]  0.5178   0.09136   0.3334    0.5185   0.6933
p[3,1,3]  0.09403  0.04716   0.02652   0.08517  0.2219
p[3,1,4]  0.03554  0.02504   0.006063  0.02996  0.1055
p[3,1,5]  0.1732   0.06253   0.07053   0.167    0.3261
p[3,2,1]  0.2937   0.07225   0.1618    0.2901   0.4469


Interpretation of results

Because we set several model parameters to zero, interpreting the results is tricky. For example:

• beta[2,2] has posterior mean 2.706. This is the effect of lake Oklawaha (relative to Hancock) on the alligators’ preference for invertebrates relative to fish. Since beta[2,2] > 0, we conclude that alligators in Oklawaha eat more invertebrates than do alligators in Hancock (even though both may prefer fish!).

• gamma[2,2] is the effect of size 2 relative to size 1 on the relative preference for invertebrates. Since gamma[2,2] < 0, we conclude that large alligators prefer fish more than do small alligators.

• The alpha are baseline counts for each type of food relative to fish.


Hierarchical Poisson model

• Count data are often modeled using a Poisson model.

• If y ∼ Poisson(µ), then E(y) = var(y) = µ.

• When the counts are assumed exchangeable given µ, and the rates µ can also be assumed exchangeable, a Gamma population model for the rates is often chosen.

• The hierarchical model is then

yi ∼ Poisson(µi)
µi ∼ Gamma(α, β).


• Priors for the hyperparameters are often taken to be Gamma (or exponential):

α ∼ Gamma(a, b)
β ∼ Gamma(c, d),

with (a, b, c, d) known.

• The joint posterior distribution is

p(µ, α, β|y) ∝ Πi [ µi^yi e^(−µi) ] Πi [ (β^α/Γ(α)) µi^(α−1) e^(−βµi) ] α^(a−1) e^(−bα) β^(c−1) e^(−dβ).

• To carry out Gibbs sampling, we need to find the full conditional distributions.


• The conditional for µi is

p(µi| all) ∝ µi^(yi+α−1) e^(−µi(β+1)),

which is proportional to a Gamma with parameters (yi + α, β + 1).

• The full conditional for α,

p(α| all) ∝ (β^α/Γ(α))^n (Πi µi)^(α−1) α^(a−1) e^(−bα),

does not have a standard form.

• For β:

p(β| all) ∝ β^(nα) Πi e^(−βµi) β^(c−1) e^(−βd) ∝ β^(nα+c−1) e^(−β(Σi µi + d)),

which is proportional to a Gamma with parameters (nα + c, Σi µi + d).

• Computation (an R sketch follows):

– Given α, β, draw each µi from the corresponding Gamma conditional.
– Draw α using a Metropolis step, rejection sampling, or the inverse-cdf method.
– Draw β from the Gamma conditional.

• See the Italian marriages example.
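A compact R sketch of this sampler, written for the Italian marriage counts used in the example that follows (the Metropolis proposal scale and the hyperparameters a = b = c = d = 1 are arbitrary choices):

  set.seed(4)
  y <- c(7, 9, 8, 7, 7, 6, 6, 5, 5, 7, 9, 10, 8, 8, 8, 7)
  n <- length(y); a <- b <- cc <- d <- 1   # cc stands for c (c() is a base function)
  M <- 5000
  mu <- matrix(NA, M, n); alpha <- beta <- numeric(M)
  alpha[1] <- beta[1] <- 1; mu[1, ] <- y + 0.5
  lcond_alpha <- function(al, mu, be)          # log full conditional of alpha
    n * (al * log(be) - lgamma(al)) + (al - 1) * sum(log(mu)) +
      (a - 1) * log(al) - b * al
  for (m in 2:M) {
    mu[m, ] <- rgamma(n, shape = y + alpha[m - 1], rate = beta[m - 1] + 1)
    prop <- abs(alpha[m - 1] + rnorm(1, 0, 0.5))   # reflected random-walk proposal
    logr <- lcond_alpha(prop, mu[m, ], beta[m - 1]) -
            lcond_alpha(alpha[m - 1], mu[m, ], beta[m - 1])
    alpha[m] <- if (log(runif(1)) < logr) prop else alpha[m - 1]
    beta[m]  <- rgamma(1, shape = n * alpha[m] + cc, rate = sum(mu[m, ]) + d)
  }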


Italian marriages – Example

Gill (2002) collected data on the number of marriages per 1,000 people in Italy during 1936–1951. Question: did the number of marriages decrease during the WWII years (1939–1945)?

Model: the numbers of marriages yi are Poisson with year-specific means λi. Assuming that the rates of marriage are exchangeable across years, we model the λi as Gamma(α, β). To complete the model specification, place independent Gamma priors on (α, β), with known hyperparameter values.


WinBUGS code:

model {
  for (i in 1:16) {
    y[i] ~ dpois(l[i])
    l[i] ~ dgamma(alpha, beta)
  }
  alpha ~ dgamma(1,1)
  beta ~ dgamma(1,1)
  warave <- (l[4]+l[5]+l[6]+l[7]+l[8]+l[9]+l[10]) / 7
  nonwarave <- (l[1]+l[2]+l[3]+l[11]+l[12]+l[13]+l[14]+l[15]+l[16]) / 9
  diff <- nonwarave - warave
}

list(y = c(7,9,8,7,7,6,6,5,5,7,9,10,8,8,8,7))


Results

[Figure: 95% credible sets and box plots of the posteriors of lambda[1]–lambda[16]]

Difference between non-war and war years marriage rate:

[Figure: posterior density of the non-war minus war average marriage rate]

Overall marriage rate: if λi ∼ Gamma(α, β), then E(λi | α, β) = α/β.

node         mean   sd     2.5%   median  97.5%
overallrate  7.362  1.068  5.508  7.285   9.665
overallstd   2.281  0.394  1.59   2.266   3.125

[Figure: posterior densities of overallrate and overallstd, chains 1:3, 3,000 draws]


Poisson regression

• When the rates are not exchangeable, we need to incorporate covariates into the model. Often we are interested in the association between one or more covariates and the outcome.

• It is possible (but not easy) to incorporate covariates into the Poisson-Gamma model.

• Christiansen and Morris (1997, JASA) propose the following model.

• Sampling distribution, where ei is a known exposure:

yi|λi ∼ Poisson(λi ei).

Under the model, E(yi/ei) = λi.


• Population distribution for the rates:

λi|α ∼ Gamma(ζ, ζ/µi),

with log(µi) = xi′β and α = (β0, β1, ..., βk−1, ζ).

• ζ is thought of as an unobserved prior count.

• Under the population model,

E(λi) = ζ/(ζ/µi) = µi
CV²(λi) = (µi²/ζ)(1/µi²) = 1/ζ


• For k = 0, µi is known. For k = 1, the µi are exchangeable. For k ≥ 2, the µi are (unconditionally) nonexchangeable.

• In all cases, the standardized rates λi/µi are Gamma(ζ, ζ), are exchangeable, and have expectation 1.

• The covariates can include random effects.

• To complete the specification of the model, we need priors on α.

• Christensen and Morris (1997) suggest:

– β and ζ independent a priori.
– A non-informative prior on the β’s associated with ‘fixed’ effects.
– For ζ, a proper prior of the form

p(ζ|y0) ∝ y0/(ζ + y0)²,

where y0 is the prior guess for the median of ζ.

• Small values of y0 (for example, y0 < ζ̂, with ζ̂ the MLE of ζ) provide less information.

• When the rates cannot be assumed exchangeable, it is common to choose a generalized linear model of the form

p(y|β) ∝ Πi e^(−λi) λi^yi ∝ Πi exp{−exp(ηi)} [exp(ηi)]^yi,

for ηi = xi′β, i.e. log(λi) = ηi.

• The vector of covariates can include one or more random effects, to accommodate additional dispersion (see the epilepsy example).


• The second-level distribution for the β’s will typically be flat (if the covariate is a ‘fixed’ effect) or normal,

βj ∼ Normal(βj0, σ²βj),

if the jth covariate is a random effect. The variance σ²βj represents the between-‘batch’ variability.


Epilepsy example

• From Breslow and Clayton, 1993, JASA.

• Fifty-nine epileptic patients in a clinical trial were randomized to a new drug: T = 1 is the drug and T = 0 is the placebo.

• Covariates included:

– Baseline data: the number of seizures during the eight weeks preceding the trial
– Age in years

• Outcomes: the number of seizures during the two weeks preceding each of four clinical visits.

• The data suggest that the number of seizures was significantly lower prior to the fourth visit, so an indicator was used for V4 versus the others.


• Two random effects in the model:

– A patient-level effect, to introduce between-patient variability.
– A patient-by-visit effect, to introduce between-visit, within-patient dispersion.


Epilepsy study – Program and results

model {
  for (j in 1:N) {
    for (k in 1:T) {
      log(mu[j, k]) <- a0 + alpha.Base * (log.Base4[j] - log.Base4.bar)
                          + alpha.Trt * (Trt[j] - Trt.bar)
                          + alpha.BT * (BT[j] - BT.bar)
                          + alpha.Age * (log.Age[j] - log.Age.bar)
                          + alpha.V4 * (V4[k] - V4.bar)
                          + b1[j] + b[j, k]
      y[j, k] ~ dpois(mu[j, k])
      b[j, k] ~ dnorm(0.0, tau.b)    # subject*visit random effects
    }
    b1[j] ~ dnorm(0.0, tau.b1)       # subject random effects
    BT[j] <- Trt[j] * log.Base4[j]   # interaction
    log.Base4[j] <- log(Base[j] / 4)
    log.Age[j] <- log(Age[j])
    diff[j] <- mu[j,4] - mu[j,1]
  }
  # covariate means:
  log.Age.bar <- mean(log.Age[])
  Trt.bar <- mean(Trt[])
  BT.bar <- mean(BT[])
  log.Base4.bar <- mean(log.Base4[])
  V4.bar <- mean(V4[])
  # priors:
  a0 ~ dnorm(0.0,1.0E-4)
  alpha.Base ~ dnorm(0.0,1.0E-4)
  alpha.Trt ~ dnorm(0.0,1.0E-4)
  alpha.BT ~ dnorm(0.0,1.0E-4)
  alpha.Age ~ dnorm(0.0,1.0E-4)
  alpha.V4 ~ dnorm(0.0,1.0E-4)
  tau.b1 ~ dgamma(1.0E-3,1.0E-3); sigma.b1 <- 1.0 / sqrt(tau.b1)
  tau.b ~ dgamma(1.0E-3,1.0E-3); sigma.b <- 1.0 / sqrt(tau.b)
  # re-calculate intercept on original scale:
  alpha0 <- a0 - alpha.Base * log.Base4.bar - alpha.Trt * Trt.bar
               - alpha.BT * BT.bar - alpha.Age * log.Age.bar - alpha.V4 * V4.bar
}

Results

Parameter   Mean     Std      2.5th    Median    97.5th
alpha.Age   0.4677   0.3557   -0.2407  0.4744    1.172
alpha.Base  0.8815   0.1459   0.5908   0.8849    1.165
alpha.Trt   -0.9587  0.4557   -1.794   -0.9637   -0.06769
alpha.V4    -0.1013  0.08818  -0.273   -0.09978  0.07268
alpha.BT    0.3778   0.2427   -0.1478  0.3904    0.7886
sigma.b1    0.4983   0.07189  0.3704   0.4931    0.6579
sigma.b     0.3641   0.04349  0.2871   0.362     0.4552


Individual random effects

[Figure: box plot of the posteriors of the subject random effects b1[1]–b1[59]]

[Figure: posterior distributions of the difference in seizure counts between V4 and V1, by patient]

Patients 5 and 49 are ‘different’. For #5, the number of events increases from visits 1 through 4. For #49, there is a significant decrease.

[Figure: box plots of the visit random effects b[5, 1]–b[5, 4] and b[49, 1]–b[49, 4]]

Convergence and autocorrelation

We ran three parallel chains for 10,000 iterations, discarded the first 5,000 iterations from each chain as burn-in, and thinned by 10, so there are 1,500 draws for inference.

[Figures: autocorrelation plots (lags 0–40) and trace plots (iterations 5001–10000) for sigma.b and sigma.b1, chains 1:3]

Example: hierarchical Poisson model

• The USDA collects data on the number of farmers who adopt conservation tillage practices.

• In a recent survey, 10 counties in a midwestern state were randomly sampled, and within each county a random number of farmers were interviewed and asked whether they had adopted conservation practices.

• The number of farmers interviewed in each county varied from a low of 2 to a high of 100.

• Of interest to the USDA is the estimation of the overall adoption rate in the state.


• We fitted the following model:

yi ∼ Poisson(λi)
λi = θi Ei
θi ∼ Gamma(α, β),

where yi is the number of adopters in the ith county, Ei is the number of farmers interviewed, and θi is the expected adoption rate; the expected number of adopters in a county can be obtained by multiplying θi by the number of farms in the county.

• The hierarchical structure establishes that even though counties may vary significantly in terms of the rate of adoption, they are still exchangeable, so the rates are generated from a common population distribution.

• Information about the overall rate of adoption across the state is contained in the posterior distribution of (α, β).

• We fitted the model using WinBUGS and chose non-informative priors for the hyperparameters.

Observed data:

County  Ei   yi     County  Ei   yi
1       94   5      6       31   19
2       15   1      7       2    1
3       62   5      8       2    1
4       126  14     9       9    4
5       5    3      10      100  22


Posterior distribution of the rate of adoption in each county:

[Figure: box plot of the posteriors of theta[1]–theta[10]]

node       mean     sd       2.5%      median   97.5%
theta[1]   0.06033  0.02414  0.02127   0.05765  0.1147
theta[2]   0.1074   0.08086  0.009868  0.08742  0.306
theta[3]   0.09093  0.03734  0.03248   0.08618  0.1767
theta[4]   0.1158   0.03085  0.06335   0.1132   0.1831
theta[5]   0.5482   0.2859   0.131     0.4954   1.242
theta[6]   0.6028   0.1342   0.3651    0.5962   0.8932
theta[7]   0.4586   0.3549   0.0482    0.3646   1.343
theta[8]   0.4658   0.3689   0.04501   0.3805   1.41
theta[9]   0.4378   0.2055   0.1352    0.3988   0.9255
theta[10]  0.2238   0.0486   0.1386    0.2211   0.33


Posterior distribution of the number of adopters in each county, given the exposures:

[Figure: box plot of the posteriors of lambda[1]–lambda[10]]

node        mean    sd      2.5%     median  97.5%
lambda[1]   5.671   2.269   2.0      5.419   10.78
lambda[2]   1.611   1.213   0.148    1.311   4.59
lambda[3]   5.638   2.315   2.014    5.343   10.95
lambda[4]   14.59   3.887   7.982    14.26   23.07
lambda[5]   2.741   1.43    0.6551   2.477   6.21
lambda[6]   18.69   4.16    11.32    18.48   27.69
lambda[7]   0.9171  0.7097  0.09641  0.7292  2.687
lambda[8]   0.9317  0.7377  0.09002  0.761   2.819
lambda[9]   3.94    1.85    1.217    3.589   8.33
lambda[10]  22.38   0.1028  13.86    22.11   33.0

node   mean    sd      2.5%    median  97.5%
alpha  0.8336  0.3181  0.3342  0.7989  1.582
beta   2.094   1.123   0.4235  1.897   4.835
mean   0.4772  0.2536  0.1988  0.4227  1.14


Poisson model for small area deaths

Taken from Bayesian Statistical Modeling, Peter Congdon, 2001.

• Congdon considers the incidence of heart disease mortality in 758 electoral wards in the Greater London area over three years (1990–92). These small areas are grouped administratively into 33 boroughs.

• Regressors:

– at the ward level: xij, an index of socio-economic deprivation;
– at the borough level: wj, where

wj = 1 for inner London boroughs, 0 for outer suburban boroughs.

• We assume borough-level variation in the intercepts and in the impacts of deprivation; this variation is linked to the category of borough (inner vs. outer).


Model

• First level of the hierarchy:

Oij|µij ∼ Poisson(µij)
log(µij) = log(Eij) + β1j + β2j(xij − x̄) + δij
δij ∼ N(0, σδ²)

– The death counts Oij are Poisson with means µij.
– log(Eij) is an offset, i.e. an explanatory variable with known coefficient (equal to 1).


– βj = (β1j, β2j)′ are random coefficients for the intercepts and the impacts of deprivation at the borough level.
– δij is a random error for Poisson over-dispersion. We fitted two models: one without and another with this random error term.


Model (cont’d)

• Second level of the hierarchy:

βj = (β1j, β2j)′ ∼ N2(µβj, Σ),

where

Σ = [ ψ11  ψ12 ; ψ21  ψ22 ]
µβ1j = γ11 + γ12 wj
µβ2j = γ21 + γ22 wj

– γ11, γ12, γ21, and γ22 are the population coefficients for the intercepts, the impact of borough category, the impact of deprivation, and the interaction of the level-2 and level-1 regressors, respectively.

• Hyperparameters:

Σ⁻¹ ∼ Wishart( (ρR)⁻¹, ρ )
γij ∼ N(0, 0.1),  i, j ∈ {1, 2}
σδ² ∼ Inv-Gamma(a, b)


Computation of Eij

Suppose that p* is an overall disease rate. Then Eij = nij p* and µij = pij/p*.

The µ’s are said to be:

1. externally standardized if p* is obtained from another data source (such as a standard reference table);
2. internally standardized if p* is obtained from the given dataset, e.g.

p* = Σij Oij / Σij nij.

• In our example we rely on the latter approach.

• Under (1), the joint distribution of the Oij is product Poisson; under (2), it is multinomial.

• However, since likelihood inference is unaffected by whether we condition on Σij Oij, the product Poisson likelihood is commonly retained.


Model 1: without overdispersion term (i.e. without δij)

                  std.     Posterior quantiles
node      mean    dev.     2.5%     median   97.5%
γ11       -0.075  0.074    -0.224   -0.075   0.070
γ12       0.078   0.106    -0.128   0.078    0.290
γ21       0.621   0.138    0.354    0.620    0.896
γ22       0.104   0.197    -0.282   0.103    0.486
σβ1       0.294   0.038    0.231    0.290    0.380
σβ2       0.445   0.077    0.318    0.437    0.618
Deviance  945.800 11.320   925.000  945.300  969.400


Model 2: with overdispersion term

                  std.     Posterior quantiles
node      mean    dev.     2.5%     median   97.5%
γ11       -0.069  0.074    -0.218   -0.069   0.076
γ12       0.068   0.106    -0.141   0.068    0.278
γ21       0.616   0.141    0.335    0.615    0.892
γ22       0.105   0.198    -0.276   0.104    0.499
σβ1       0.292   0.037    0.228    0.289    0.376
σβ2       0.431   0.075    0.306    0.423    0.600
Deviance  802.400 40.290   726.600  800.900  883.500

• Including ward-level random variability δij reduces the average Poisson GLM deviance

to 802, with a 95% credible interval from 726 to 883. This is in line with the expected

value of the GLM deviance for N = 758 areas if the Poisson model is appropriate.


Hierarchical models for spatial data

Based on the book by Banerjee, Carlin and Gelfand, Hierarchical Modeling and Analysis for Spatial Data, 2004. We focus on Chapters 1, 2 and 5.

• Geo-referenced data arise in agriculture, climatology, economics, epidemiology, transportation and many other areas.

• What does geo-referenced mean? In a nutshell, we know the geographic location at which an observation was collected.

• Why does it matter? Sometimes, relative location can provide information about an outcome beyond that provided by covariates.

• Example: infant mortality is typically higher in high-poverty areas. Even after incorporating poverty as a covariate, residuals may still be spatially correlated due to other factors such as nearness to pollution sites, distance to pre- and post-natal care centers, etc.


• Often data are also collected over time, so models that include spatial and temporal correlations of outcomes are needed.

• We focus on spatial, rather than spatio-temporal, models.

• We also focus on models for univariate rather than multivariate outcomes.


Types of spatial data

• Point-referenced data: Y(s) is a random outcome (perhaps vector-valued) at location s, where s varies continuously over some region D. The location s is typically two-dimensional (latitude and longitude) but may also include altitude. Known as geostatistical data.

• Areal data: the outcome Yi is an aggregate value over an areal unit with well-defined boundaries. Here, D is divided into a finite collection of areal units. Known as lattice data, even though lattices can be irregular.

• Point-pattern data: the outcome Y(s) is the occurrence or not of an event, and the locations s are random. Example: locations of trees of a species in a forest, or addresses of persons with a particular disease. Interest is often in deciding whether points occur independently in space or whether there is clustering.


• Marked point process data: if covariate information is available, we talk about a marked point process. The covariate value at each site marks the site as belonging to a certain covariate batch or group.

• Combinations: e.g., daily ozone levels collected at monitoring stations with precise locations, and the number of children in a zip code reporting to the ER on that day. These require data re-alignment to combine outcomes and covariates.


Models for point-level data

The basics

• Location index s varies continuously over region D.

• We often assume that the covariance between two observations at locations si and sj depends only on the distance dij between the points.

• The spatial covariance is often modeled as exponential:

Cov(Y(si), Y(si′)) = C(dii′) = σ² exp(−φ dii′),

where σ² > 0 and φ > 0 are the partial sill and decay parameters, respectively.

• Covariogram: a plot of C(dii′) against dii′.
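As a quick illustration, the R sketch below builds this exponential covariance matrix for a few locations; the coordinates and the values of σ² and φ are assumed, not estimated:

   # Exponential covariance matrix C(d) = sigma2 * exp(-phi * d)
   coords <- cbind(x = c(0, 1, 3), y = c(0, 2, 1))   # three hypothetical locations
   d <- as.matrix(dist(coords))                      # pairwise distances d_ii'
   sigma2 <- 2.0                                     # partial sill (assumed)
   phi <- 0.5                                        # decay (assumed)
   C <- sigma2 * exp(-phi * d)
   C                                                 # diagonal entries equal sigma2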


• For i = i′, dii′ = 0 and C(dii′) = var(Y(si)).

• Sometimes var(Y(si)) = τ² + σ², with τ² the nugget effect and τ² + σ² the sill.


Models for point-level data (cont’d)

Covariance structure

• Suppose that outcomes are normally distributed and that we choose an exponential model for the covariance matrix. Then:

Y | µ, θ ∼ N(µ, Σ(θ)),

with

Y = (Y(s1), Y(s2), ..., Y(sn))′

Σ(θ)ii′ = cov(Y(si), Y(si′))

θ = (τ², σ², φ).

• Then

Σ(θ)ii′ = σ² exp(−φ dii′) + τ² I(i = i′),

with τ², σ², φ > 0.

• This is an example of an isotropic covariance function: the spatial correlation is a function of the distance d only.


Models for point-level data, details

• Basic model:

Y(s) = µ(s) + w(s) + e(s),

where µ(s) = x′(s)β, and the residual is divided into two components: w(s) is a realization of a zero-centered stationary Gaussian process, and e(s) is uncorrelated pure error.

• The w(s) are functions of the partial sill σ² and decay φ parameters.

• The e(s) introduce the nugget effect τ².

• τ² is interpreted as pure sampling variability or as microscale variability, i.e., spatial variability at distances smaller than the distance between any two outcomes: the e(s) are sometimes viewed as a spatial process with rapid decay.


The variogram and semivariogram

• A spatial process is said to be:

– Strictly stationary if the distributions of Y(s) and Y(s + h) are equal, for h a separation vector.

– Weakly stationary if µ(s) = µ and Cov(Y(s), Y(s + h)) = C(h).

– Intrinsically stationary if

E[Y(s + h) − Y(s)] = 0, and

E[Y(s + h) − Y(s)]² = Var[Y(s + h) − Y(s)] = 2γ(h),

defined for differences and depending only on the separation h.

• 2γ(h) is the variogram and γ(h) is the semivariogram.


Examples of semi-variograms

Semi-variograms for the linear, spherical and exponential models.


Stationarity

• Strict stationarity implies weak stationarity, but the converse is not true except in Gaussian processes.

• Weak stationarity implies intrinsic stationarity, but the converse is not true in general.

• Notice that intrinsic stationarity is defined on the differences between outcomes at two locations and thus says nothing about the joint distribution of outcomes.


Semivariogram (cont’d)

• If γ(h) depends on h only through its length ||h||, then the spatial process is isotropic; else it is anisotropic.

• There are many choices for isotropic models. The exponential model is popular and has good properties. For t = ||h||:

γ(t) = τ² + σ²(1 − exp(−φt)) if t > 0,
     = 0                     otherwise.

• See figures, page 24.

• The powered exponential model has an extra parameter κ for smoothness:

γ(t) = τ² + σ²(1 − exp(−φ t^κ)) if t > 0.


• Another popular choice is the Gaussian variogram model, equal to the exponential except for the exponent term, which becomes exp(−φ²t²).

• Fitting of the variogram has traditionally been done "by eye":

– Plot an empirical estimate of the variogram, akin to the sample variance estimate or the autocorrelation function in time series.

– Choose a theoretical functional form to fit the empirical γ.

– Choose values for (τ², σ², φ) that fit the data well.

• If a distribution for the outcomes is assumed and a functional form for the variogram is chosen, parameters can be estimated via some likelihood-based method.

• Of course, we can also be Bayesians.
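A rough R sketch of the "by eye" approach, with simulated stand-in data (packages such as geoR or gstat automate this):

   # Binned empirical semivariogram: average of 0.5 * (y_i - y_j)^2 by distance bin
   set.seed(1)
   coords <- matrix(runif(100), ncol = 2)   # 50 hypothetical locations
   y <- rnorm(50)                           # hypothetical outcomes
   d <- as.matrix(dist(coords))
   sq <- 0.5 * outer(y, y, "-")^2           # half squared differences
   ut <- upper.tri(d)
   bins <- cut(d[ut], breaks = seq(0, max(d), length = 11))
   gamma.hat <- tapply(sq[ut], bins, mean)  # empirical gamma(h) per bin
   plot(gamma.hat, type = "b", xlab = "distance bin", ylab = "semivariogram")
   # overlay candidate curves and adjust (tau2, sigma2, phi) until the fit looks right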


Point-level data (cont’d)

• For point-referenced data, frequentists focus on spatial prediction using kriging.

• Problem: given observations {Y(s1), ..., Y(sn)}, how do we predict Y(so) at a new site so?

• Consider the model

Y = Xβ + ε, where ε ∼ N(0, Σ),

and where

Σ = σ²H(φ) + τ²I.

Here, H(φ)ii′ = ρ(φ, dii′).


• Kriging consists in finding a function f(y) of the observations that minimizes the mean squared error of prediction

Q = E[(Y(so) − f(y))² | y].


Classical kriging (cont’d)

• (Not a surprising) Result: the f(y) that minimizes Q is the conditional mean of Y(so) given the observations y (see pages 50-52 for a proof):

E[Y(so) | y] = x′oβ̂ + γ′Σ⁻¹(y − Xβ̂)

Var[Y(so) | y] = σ² + τ² − γ′Σ⁻¹γ,

where

γ = (σ²ρ(φ, do1), ..., σ²ρ(φ, don))′

β̂ = (X′Σ⁻¹X)⁻¹X′Σ⁻¹y

Σ = σ²H(φ) + τ²I.

• The solution assumes that we have observed the covariate vector xo at the new site.
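These formulas translate directly into R; the sketch below uses simulated inputs and assumed values of (σ², τ², φ) rather than estimates:

   # Classical kriging predictor at a new site s_o (all inputs hypothetical)
   set.seed(2)
   n <- 20
   coords <- matrix(runif(2 * n), ncol = 2)
   X <- cbind(1, runif(n)); y <- rnorm(n, mean = 5)
   so <- c(0.5, 0.5); xo <- c(1, 0.5)
   sigma2 <- 1; tau2 <- 0.1; phi <- 2                 # assumed, not estimated
   H <- exp(-phi * as.matrix(dist(coords)))           # H(phi)
   Sigma <- sigma2 * H + tau2 * diag(n)               # data covariance
   gam <- sigma2 * exp(-phi * sqrt(colSums((t(coords) - so)^2)))
   beta.hat <- solve(t(X) %*% solve(Sigma, X), t(X) %*% solve(Sigma, y))
   pred <- drop(xo %*% beta.hat + gam %*% solve(Sigma, y - X %*% beta.hat))
   pvar <- sigma2 + tau2 - drop(gam %*% solve(Sigma, gam))
   c(predictor = pred, prediction.variance = pvar)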


• If not, in the classical framework Y(so) and xo are jointly estimated using an EM-type iterative algorithm.


Bayesian methods for estimation

• The Gaussian isotropic kriging model is just a general linear model, similar to those in Chapter 15 of the textbook.

• We just need to define the appropriate covariance structure.

• For an exponential covariance structure with a nugget effect, the parameters to be estimated are θ = (β, σ², τ², φ).

• Steps:

– Choose priors and define the sampling distribution.

– Obtain the posterior for all parameters, p(θ|y).

– Bayesian kriging: get the posterior predictive distribution for the outcome at a new location, p(yo|y, X, xo).


• Sampling distribution (marginal data model):

y | θ ∼ N(Xβ, σ²H(φ) + τ²I)

• Priors: typically chosen so that parameters are independent a priori.

• As in the linear model:

– A non-informative prior for β is uniform; a normal prior can be used too.

– Conjugate priors for the variances σ², τ² are inverse gamma priors.

• For φ, the appropriate prior depends on the covariance model. For the simple exponential, where

ρ(si − sj; φ) = exp(−φ||si − sj||),

a Gamma prior can be a good choice.


• Be cautious with improper priors for anything but β.


Hierarchical representation of model

• Hierarchical model representation: first condition on the spatial random effects W = (w(s1), ..., w(sn))′:

y | θ, W ∼ N(Xβ + W, τ²I)

W | φ, σ² ∼ N(0, σ²H(φ)).

• The model specification is then completed by choosing priors for β, τ² and for the hyperparameters φ, σ².

• Note that the hierarchical model has n more parameters (the w(si)) than the marginal model.

• Computation with the marginal model is preferable because σ²H(φ) + τ²I tends to be better behaved than σ²H(φ) at small distances.


Estimation of the spatial surface W|y

• Interest is sometimes in estimating the spatial surface using p(W|y).

• If the marginal model is fitted, we can still get the marginal posterior for W as

p(W|y) = ∫ p(W|σ², φ) p(σ², φ|y) dσ² dφ.

• Given draws (σ²(g), φ(g)) from the Gibbs sampler on the marginal model, we can generate W from

p(W|σ²(g), φ(g)) = N(0, σ²(g)H(φ(g))).

• Analytical marginalization over W is possible only if the model has a Gaussian form.


Bayesian kriging

• Let yo = Y(so) and xo = x(so). Kriging is accomplished by obtaining the posterior predictive distribution

p(yo|xo, X, y) = ∫ p(yo, θ|y, X, xo) dθ

              = ∫ p(yo|θ, y, xo) p(θ|y, X) dθ.

• Since (yo, Y) are jointly multivariate normal (see expressions 2.18 and 2.19 on page 51), p(yo|θ, y, xo) is a conditional normal distribution.

• Given MCMC draws (θ(1), ..., θ(G)) from the posterior distribution p(θ|y, X), we draw a value yo(g) for each θ(g) as

yo(g) ∼ p(yo|θ(g), y, xo).

• The draws {yo(1), yo(2), ..., yo(G)} are a sample from the posterior predictive distribution of the outcome at the new location so.

• To predict Y at a set of m new locations so1, ..., som, it is best to do joint prediction, to be able to estimate the posterior association among the m predictions.

• Beware of joint prediction at many new locations with WinBUGS. It can take forever.
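For a single new site, the composition sampling step is simple to code. A sketch in R, where theta.draws stands in for real MCMC output and all other inputs are hypothetical:

   # Composition sampling for p(yo | y): one predictive draw per posterior draw
   set.seed(3)
   n <- 20
   coords <- matrix(runif(2 * n), ncol = 2)
   X <- matrix(1, n, 1); y <- rnorm(n, mean = 5)
   so <- c(0.5, 0.5); xo <- 1
   D <- as.matrix(dist(coords))
   d.o <- sqrt(colSums((t(coords) - so)^2))
   # stand-in 'posterior draws' of (beta, sigma2, tau2, phi); use real MCMC output in practice
   theta.draws <- cbind(beta = rnorm(200, 5, 0.1), sigma2 = 1, tau2 = 0.1, phi = 2)
   yo.draws <- apply(theta.draws, 1, function(th) {
     Sigma <- th["sigma2"] * exp(-th["phi"] * D) + th["tau2"] * diag(n)
     gam <- th["sigma2"] * exp(-th["phi"] * d.o)          # cov(Y(so), Y)
     m <- xo * th["beta"] + drop(gam %*% solve(Sigma, y - X %*% th["beta"]))
     v <- th["sigma2"] + th["tau2"] - drop(gam %*% solve(Sigma, gam))
     rnorm(1, mean = m, sd = sqrt(v))
   })
   quantile(yo.draws, c(0.025, 0.5, 0.975))               # predictive summary at so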


Kriging example from WinBUGS

• Data were first published by Davis (1973) and consist of heights at 52 locations in a 310-foot square area.

• We have 52 s = (x, y) coordinates and outcomes (heights).

• Unit of distance is 50 feet and unit of elevation is 10 feet.

• The model is

height = β + ε, where ε ∼ N(0, Σ),

and where

Σ = σ²H(φ).


• Here, H(φ)ij = ρ(si − sj; φ) = exp(−φ||si − sj||^κ).

• Priors on (β, φ, κ).

• We predict elevations at 225 new locations.


model {

# Spatially structured multivariate normal likelihood

height[1:N] ~ spatial.exp(mu[], x[], y[], tau, phi, kappa) # exponential correlation function

for(i in 1:N) {

mu[i] <- beta

}

# Priors

beta ~ dflat()

tau ~ dgamma(0.001, 0.001)

sigma2 <- 1/tau

# priors for spatial.exp parameters

phi ~ dunif(0.05, 20) # prior decay for correlation at min distance (0.2 x 50 ft) is 0.02 to 0.99

# prior range for correlation at max distance (8.3 x 50 ft) is 0 to 0.66

kappa ~ dunif(0.05,1.95)

# Spatial prediction

# Single site prediction

for(j in 1:M) {

height.pred[j] ~ spatial.unipred(beta, x.pred[j], y.pred[j], height[])

}

# Only use joint prediction for small subset of points, due to length of time it takes to run

for(j in 1:10) { mu.pred[j] <- beta }

height.pred.multi[1:10] ~ spatial.pred(mu.pred[], x.pred[1:10], y.pred[1:10], height[])

}


[Trace plots of the MCMC samples (iterations 1 to 2000) for beta, phi and kappa.]

[Map of posterior means of height.pred at the prediction locations, binned from < 700 to >= 950.]

Hierarchical models for areal data

• Areal data: often aggregate outcomes over a well-defined area. Example: the number of cancer cases per county, or the proportion of people living in poverty in a set of census tracts.

• What are the inferential issues?

1. Identifying a spatial pattern and its strength. If data are spatially correlated, measurements from areas that are 'close' will be more alike.

2. Smoothing, and to what degree. Observed measurements often present extreme values due to small samples in small areas. Maximal smoothing: substitute observed measurements by the overall mean in the region. Something less extreme is what we discuss later.

3. Prediction: for a new area, how would we predict Y given measurements in other areas?


Defining neighbors

• A proximity matrix W with entries wij spatially connects areas i and j in some fashion.

• Typically, wii = 0.

• There are many choices for the wij:

– Binary: wij = 1 if areas i, j share a common boundary, and 0 otherwise.

– Continuous: a decreasing function of intercentroidal distance.

– Combination: wij = 1 if areas are within a certain distance.

• W need not be symmetric.

• Entries are often standardized by dividing row i by wi+ = Σj wij. If entries are standardized, W will often be asymmetric.
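A small R sketch, using a hypothetical adjacency list, showing that the binary W is symmetric but its row-standardized version generally is not:

   # Binary proximity matrix from a neighbor list, then row-standardized
   nbrs <- list(c(2, 3), c(1, 3, 4), c(1, 2), c(2))   # neighbors of areas 1..4 (hypothetical)
   I <- length(nbrs)
   W <- matrix(0, I, I)
   for (i in seq_len(I)) W[i, nbrs[[i]]] <- 1         # w_ij = 1 for neighbors, w_ii = 0
   W.std <- W / rowSums(W)                            # divide row i by w_i+
   c(isSymmetric(W), isSymmetric(W.std))              # TRUE, FALSE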


• The wij can be thought of as weights and provide a means to introduce spatial structure into the model.

• Areas that are 'closer by' in some sense are more alike.

• For any problem, we can define first, second, third, etc. order neighbors. For distance bins (0, d1], (d1, d2], (d2, d3], ... we can define:

1. W(1), the first-order proximity matrix, with w(1)ij = 1 if the distance between i and j is less than d1.

2. W(2), the second-order proximity matrix, with w(2)ij = 1 if the distance between i and j is more than d1 but less than d2.


Areal data models

• Because these models are used mostly in epidemiology, we begin with an application in disease mapping to introduce concepts.

• Typical data:

Yi observed number of cases in area i, i = 1, ..., I

Ei expected number of cases in area i.

• The Yi are assumed to be random, and the Ei are assumed to be known and to depend on the number of persons ni at risk.

• An internally standardized estimate of Ei is

Ei = ni r̄ = ni (Σi yi / Σi ni),

corresponding to a constant disease rate across areas.

• An externally standardized estimate is

Ei = Σj nij rj,

where rj is the risk for persons of age group j (from some existing table of risks by age) and nij is the number of persons of age j in area i.


Standard frequentist approach

• For small Ei,

Yi | ηi ∼ Poisson(Ei ηi),

with ηi the true relative risk in area i.

• The MLE is the standardized mortality ratio

η̂i = SMRi = Yi / Ei.

• The variance of SMRi is

var(SMRi) = var(Yi)/Ei² = ηi/Ei,

estimated by plugging in η̂i to obtain

var̂(SMRi) = Yi/Ei².

• To get a confidence interval for ηi, first assume that log(SMRi) is approximately normal.

• From a Taylor expansion:

Var[log(SMRi)] ≈ (1/SMRi²) Var(SMRi)

             = (Ei²/Yi²) × (Yi/Ei²)

             = 1/Yi.


• An approximate 95% CI for log(ηi) is

log(SMRi) ± 1.96/√Yi.

• Transforming back, an approximate 95% CI for ηi is

(SMRi exp(−1.96/√Yi), SMRi exp(1.96/√Yi)).

• Suppose we wish to test whether the risk in area i is high relative to other areas. Then we test

H0 : ηi = 1 versus Ha : ηi > 1.

• This is a one-sided test.


• Under H0, Yi ∼ Poisson(Ei), so the p-value for the test is

p = Prob(X ≥ Yi | Ei)

  = 1 − Prob(X < Yi | Ei)

  = 1 − Σ_{x=0}^{Yi−1} exp(−Ei) Ei^x / x!.

• If p < 0.05 we reject H0.
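These quantities take a few lines of R; the counts below are hypothetical:

   # SMR, approximate 95% CI for eta_i, and exact one-sided Poisson p-value
   Y <- 28; E <- 19.6                            # hypothetical observed and expected counts
   SMR <- Y / E
   ci <- SMR * exp(c(-1, 1) * 1.96 / sqrt(Y))    # back-transformed CI for eta_i
   p <- 1 - ppois(Y - 1, lambda = E)             # Prob(X >= Y | E) under H0
   c(SMR = SMR, lower = ci[1], upper = ci[2], p.value = p)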


Hierarchical models for areal data

• To estimate and map the underlying relative risks, we might wish to fit a random effects model.

• Assumption: the true risks come from a common underlying distribution.

• Random effects models permit borrowing strength across areas to obtain better area-level estimates.

• Alas, the models may be complex:

– High-dimensional: one random effect for each area.

– Non-normal if data are counts or binomial proportions.

• We have already discussed hierarchical Poisson models, so the material in the next few transparencies is a review.


Poisson-Gamma model

• Consider

Yi | ηi ∼ Poisson(Ei ηi)

ηi | a, b ∼ Gamma(a, b).

• Since E(ηi) = a/b and Var(ηi) = a/b², we can fix a, b as follows:

– A priori, let E(ηi) = 1, the null value.

– Let Var(ηi) = (0.5)², large on that scale.

• The resulting prior is Gamma(4, 4).

• The posterior is also Gamma:

p(ηi|yi) = Gamma(yi + a, Ei + b).


• A point estimate of ηi is

E(ηi|y) = E(ηi|yi) = (yi + a) / (Ei + b)

        = (yi + E(ηi)²/Var(ηi)) / (Ei + E(ηi)/Var(ηi))

        = Ei (yi/Ei) / (Ei + E(ηi)/Var(ηi)) + (E(ηi)/Var(ηi)) E(ηi) / (Ei + E(ηi)/Var(ηi))

        = wi SMRi + (1 − wi) E(ηi),

where wi = Ei / [Ei + (E(ηi)/Var(ηi))].

• The Bayesian point estimate is a weighted average of the data-based SMRi and the prior mean E(ηi).
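A short R check of this identity under the Gamma(4, 4) prior, with hypothetical counts:

   # Posterior mean as a weighted average of SMR_i and the prior mean
   a <- 4; b <- 4                                   # E(eta) = a/b = 1, Var(eta) = a/b^2 = 0.25
   y <- c(5, 12, 0, 30); E <- c(4.2, 8.7, 2.1, 25)  # hypothetical counts
   w <- E / (E + b)                                 # b = E(eta)/Var(eta)
   cbind(SMR = y / E,
         weighted = w * (y / E) + (1 - w) * (a / b),
         post.mean = (y + a) / (E + b))             # last two columns agree

The small-Ei areas are shrunk strongly toward the prior mean of 1, while areas with large Ei keep estimates close to their SMR.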


Poisson-lognormal models with spatial errors

• The Poisson-Gamma model does not (easily) allow for spatial correlation among the ηi.

• Instead, consider the Poisson-lognormal model, where in the second stage we model the log-relative risks log(ηi) = ψi:

Yi | ψi ∼ Poisson(Ei exp(ψi))

ψi = x′iβ + θi + φi,

where the xi are area-level covariates.

• The θi are assumed to be exchangeable and model between-area variability:

θi ∼ N(0, 1/τh).


• The θi incorporate global extra-Poisson variability in the log-relative risks (across the entire region).

• The φi are the 'spatial' parameters; they capture regional clustering.

• They model extra-Poisson variability in the log-relative risks at the local level, so that 'neighboring' areas have similar risks.

• One way to model the φi is to proceed as in the point-referenced data case. For φ = (φ1, ..., φI), consider

φ | µ, λ ∼ NI(µ, H(λ)),

with H(λ)ii′ = cov(φi, φi′) and hyperparameters λ.

• Possible models for H(λ) include the exponential, the powered exponential, etc.


• While sensible, this model is difficult to fit because:

– Lots of matrix inversions are required.

– The distance between φi and φi′ may not be obvious.


CAR model

• It is more reasonable to think of a neighbor-based proximity measure and consider a conditionally autoregressive (CAR) model for φ:

φi | φj, j ≠ i ∼ N(φ̄i, 1/(τc mi)),

where

φ̄i = (1/mi) Σ_{j≠i} wij φj,

and mi is the number of neighbors of area i. Earlier we called this wi+.

• The weights wij are (typically) 0 if areas i and j are not neighbors and 1 if they are.

• CAR models lend themselves to the Gibbs sampler. Each φi can be sampled from its conditional distribution, so no matrix inversion is needed:

p(φi | all) ∝ Poisson(yi | Ei exp(x′iβ + θi + φi)) × N(φi | φ̄i, 1/(mi τc)).
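For binary weights, the conditional mean is just the average of the neighboring φj, so each update touches only a few entries. A sketch in R of the ingredients of one such update (the adjacency list, current values and τc are all hypothetical):

   # Mean and precision of the CAR full conditional for one area
   nbrs <- list(c(2, 3), c(1, 3, 4), c(1, 2), c(2))   # hypothetical adjacency
   phi <- c(0.1, -0.2, 0.3, 0.0)                      # current values of the phi_j
   tau.c <- 2                                         # current CAR precision
   i <- 2
   m.i <- length(nbrs[[i]])                           # number of neighbors m_i
   mean.i <- mean(phi[nbrs[[i]]])                     # phi_bar_i
   prec.i <- tau.c * m.i                              # conditional precision
   # the Poisson likelihood term makes the full conditional non-standard, so in
   # practice phi_i is updated with a Metropolis step around N(mean.i, 1/prec.i)
   c(mean.i = mean.i, prec.i = prec.i)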


Difficulties with CAR model

• The CAR prior is improper. The prior is a pairwise-difference prior, identified only up to a constant.

• The posterior will still be proper, but to identify an intercept β0 for the log-relative risks we need to impose a constraint: Σi φi = 0.

• In simulation, the constraint is imposed numerically by recentering each vector φ around its own mean.

• τh and τc cannot be too large because θi and φi become unidentifiable. We observe only one Yi in each area, yet we try to fit two random effects. There is very little data information.

• Hyperpriors for τh, τc need to be chosen carefully.


• Consider

τh ∼ Gamma(ah, bh), τc ∼ Gamma(ac, bc).

• To place equal emphasis on heterogeneity and spatial clustering, it is tempting to make ah = ac and bh = bc. This is not correct, because:

1. The τh prior is defined marginally, whereas the τc prior is conditional.

2. The conditional prior precision is τc mi. Thus, a scale that satisfies

sd(θi) = 1/√τh ≈ 1/(0.7 √(m̄ τc)) ≈ sd(φi),

with m̄ the average number of neighbors, is more 'fair' (Bernardinelli et al. 1995, Statistics in Medicine).
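Solving sd(θi) = sd(φi) for the precisions gives τc ≈ τh/(0.49 m̄). A quick R check, using the assumed value τh = 1 and an average of 4.8 neighbors:

   # 'Fair' balance of heterogeneity and clustering priors (Bernardinelli et al. 1995)
   m.bar <- 4.8                  # average number of neighbors
   tau.h <- 1                    # hypothetical heterogeneity precision
   tau.c <- tau.h / (0.49 * m.bar)
   c(sd.theta = 1 / sqrt(tau.h),
     sd.phi = 1 / (0.7 * sqrt(m.bar * tau.c)))   # both equal 1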


Example: Incidence of lip cancer in 56 areas in Scotland

• Data on the number of lip cancer cases in 56 counties in Scotland were obtained. Expected lip cancer counts Ei were available.

• The covariate was the proportion of the population in each county working in agriculture.

• The model included only one random effect bi to introduce spatial association between counties.

• Two counties are neighbors if they have a common border (i.e., they are adjacent). Three counties that are islands have no neighbors, and for those WinBUGS sets the random spatial effect to 0. The relative risk for the islands is thus based on the baseline rate alpha0 and on the value of the covariate xi.

• We wish to smooth and map the relative risks RR.


Model

model {
   # Likelihood
   for (i in 1 : N) {
      O[i] ~ dpois(mu[i])
      log(mu[i]) <- log(E[i]) + alpha0 + alpha1 * X[i]/10 + b[i]
      RR[i] <- exp(alpha0 + alpha1 * X[i]/10 + b[i])   # Area-specific relative risk (for maps)
   }
   # CAR prior distribution for random effects:
   b[1:N] ~ car.normal(adj[], weights[], num[], tau)
   for(k in 1:sumNumNeigh) { weights[k] <- 1 }
   # Other priors:
   alpha0 ~ dflat()
   alpha1 ~ dnorm(0.0, 1.0E-5)
   tau ~ dgamma(0.5, 0.0005)   # prior on precision
   sigma <- sqrt(1 / tau)      # standard deviation
}


Data list(N = 56, O = c( 9, 39, 11, 9, 15, 8, 26, 7, 6, 20, 13, 5, 3, 8, 17, 9, 2, 7, 9, 7, 16, 31, 11, 7, 19, 15, 7, 10, 16, 11, 5, 3, 7, 8, 11, 9, 11, 8, 6, 4, 10, 8, 2, 6, 19, 3, 2, 3, 28, 6, 1, 1, 1, 1, 0, 0), E = c( 1.4, 8.7, 3.0, 2.5, 4.3, 2.4, 8.1, 2.3, 2.0, 6.6, 4.4, 1.8, 1.1, 3.3, 7.8, 4.6, 1.1, 4.2, 5.5, 4.4, 10.5,22.7, 8.8, 5.6,15.5,12.5, 6.0, 9.0,14.4,10.2, 4.8, 2.9, 7.0, 8.5,12.3,10.1,12.7, 9.4, 7.2, 5.3, 18.8,15.8, 4.3,14.6,50.7, 8.2, 5.6, 9.3,88.7,19.6, 3.4, 3.6, 5.7, 7.0, 4.2, 1.8), X = c(16,16,10,24,10,24,10, 7, 7,16, 7,16,10,24, 7,16,10, 7, 7,10, 7,16,10, 7, 1, 1, 7, 7,10,10, 7,24,10, 7, 7, 0,10, 1,16, 0, 1,16,16, 0, 1, 7, 1, 1, 0, 1, 1, 0, 1, 1,16,10), num = c(3, 2, 1, 3, 3, 0, 5, 0, 5, 4, 0, 2, 3, 3, 2, 6, 6, 6, 5, 3, 3, 2, 4, 8, 3, 3, 4, 4, 11, 6, 7, 3, 4, 9, 4, 2, 4, 6, 3, 4, 5, 5, 4, 5, 4, 6, 6, 4, 9, 2,


4, 4, 4, 5, 6, 5), adj = c( 19, 9, 5, 10, 7, 12, 28, 20, 18, 19, 12, 1, 17, 16, 13, 10, 2, 29, 23, 19, 17, 1, 22, 16, 7, 2, 5, 3, 19, 17, 7, 35, 32, 31, 29, 25, 29, 22, 21, 17, 10, 7, 29, 19, 16, 13, 9, 7, 56, 55, 33, 28, 20, 4, 17, 13, 9, 5, 1, 56, 18, 4, 50, 29, 16, 16, 10, 39, 34, 29, 9, 56, 55, 48, 47, 44, 31, 30, 27, 29, 26, 15, 43, 29, 25, 56, 32, 31, 24,


45, 33, 18, 4, 50, 43, 34, 26, 25, 23, 21, 17, 16, 15, 9, 55, 45, 44, 42, 38, 24, 47, 46, 35, 32, 27, 24, 14, 31, 27, 14, 55, 45, 28, 18, 54, 52, 51, 43, 42, 40, 39, 29, 23, 46, 37, 31, 14, 41, 37, 46, 41, 36, 35, 54, 51, 49, 44, 42, 30, 40, 34, 23, 52, 49, 39, 34, 53, 49, 46, 37, 36, 51, 43, 38, 34, 30, 42, 34, 29, 26, 49, 48, 38, 30, 24, 55, 33, 30, 28, 53, 47, 41, 37, 35, 31, 53, 49, 48, 46, 31, 24, 49, 47, 44, 24, 54, 53, 52, 48, 47, 44, 41, 40, 38, 29, 21, 54, 42, 38, 34, 54, 49, 40, 34, 49, 47, 46, 41,


52, 51, 49, 38, 34, 56, 45, 33, 30, 24, 18, 55, 27, 24, 20, 18), sumNumNeigh = 234)

Results

The proportion of individuals working in agriculture appears to be associated with the incidence of cancer:

[Posterior density of alpha1, based on 10,000 samples.]


There appears to be spatial correlation among counties:

[Box plots of the posterior distributions of the spatial random effects b[1] to b[56].]


[Map of posterior mean relative risks RR for the 56 counties, binned from < 1.0 to >= 4.0.]


[Map of the observed counts O for the 56 counties, binned from < 10 to >= 30.]


Example: Lung cancer in London

Data were obtained from an annual report by the London Health Authority in which the association between health and poverty was investigated. The population under study is men 65 years of age and older. Available were:

• Observed and expected lung cancer counts in 44 census ward districts.

• A ward-level index of socio-economic deprivation.

A model with two random effects was fitted to the data. One random effect introduces region-wide heterogeneity (between-ward variability) and the other introduces regional clustering. The priors for the two components of random variability are sometimes known as convolution priors.

Interest is in:

• Determining the association between health outcomes and poverty.

• Smoothing and mapping relative risks.

• Assessing the degree of spatial clustering in these data.


Model

model {
   for (i in 1 : N) {
      # Likelihood
      O[i] ~ dpois(mu[i])
      log(mu[i]) <- log(E[i]) + alpha + beta * depriv[i] + b[i] + h[i]
      RR[i] <- exp(alpha + beta * depriv[i] + b[i] + h[i])   # Area-specific relative risk (for maps)
      # Exchangeable prior on unstructured random effects
      h[i] ~ dnorm(0, tau.h)
   }
   # CAR prior distribution for spatial random effects:
   b[1:N] ~ car.normal(adj[], weights[], num[], tau.b)
   for(k in 1:sumNumNeigh) { weights[k] <- 1 }
   # Other priors:
   alpha ~ dflat()
   beta ~ dnorm(0.0, 1.0E-5)
   tau.b ~ dgamma(0.5, 0.0005)
   sigma.b <- sqrt(1 / tau.b)
   tau.h ~ dgamma(0.5, 0.0005)
   sigma.h <- sqrt(1 / tau.h)
   propstd <- sigma.b / (sigma.b + sigma.h)
}

Note that the priors on the precisions for the exchangeable and the spatial random effects are both Gamma(0.5, 0.0005). This means that, a priori, the expected value of the standard deviations is approximately 0.03, with a relatively large prior standard deviation. This is not a "fair" prior as discussed in class: the average number of neighbors is 4.8, and this is not taken into account in the choice of priors.


Results

Parameter                      Mean      Std      2.5th perc.   97.5th perc.
Alpha                         -0.208     0.1        -0.408        -0.024
Beta                           0.0474    0.0179      0.0133        0.0838
Relative size of spatial std   0.358     0.243       0.052         0.874

[Posterior densities of sigma.b and sigma.h, each based on 1,000 samples.]


Posterior distribution of RR for the first 15 wards

node     mean    sd      2.5%    median  97.5%
RR[1]    0.828   0.1539  0.51    0.8286  1.148
RR[2]    1.18    0.2061  0.7593  1.188   1.6
RR[3]    0.8641  0.1695  0.5621  0.8508  1.247
RR[4]    0.8024  0.1522  0.5239  0.7873  1.154
RR[5]    0.7116  0.1519  0.379   0.7206  0.9914
RR[6]    1.05    0.2171  0.7149  1.015   1.621
RR[7]    1.122   0.1955  0.7556  1.116   1.589
RR[8]    0.821   0.1581  0.4911  0.8284  1.132
RR[9]    1.112   0.2167  0.7951  1.07    1.702
RR[10]   1.546   0.2823  1.072   1.505   2.146
RR[11]   0.7697  0.1425  0.4859  0.7788  1.066
RR[12]   0.865   0.16    0.6027  0.8464  1.26
RR[13]   1.237   0.3539  0.8743  1.117   2.172
RR[14]   0.8359  0.1807  0.5279  0.8084  1.3
RR[15]   0.7876  0.1563  0.489   0.7869  1.141


[Map of the observed counts O across the 44 wards, binned from < 5 to >= 20.]


[Map of the posterior means of the spatial random effects b, binned from < -0.1 to >= 0.1; most wards fall near zero.]


[Map of the smoothed relative risks RR, binned from < 0.8 to >= 1.2.]