(8) Hierarchical models
ST440/540: Applied Bayesian Statistics
ST440/540: Applied Bayesian Statistics (8) Hierarchical models
Hierarchical models
I Hierarchical modeling provides a framework for building complex and high-dimensional models from simple and low-dimensional building blocks
I Of course, it is possible to analyze these models using non-Bayesian methods
I However, this modeling framework is popular in the Bayesian literature because MCMC is conducive to hierarchical models
I Both “divide and conquer” big problems by splitting them into a series of smaller problems in the same way
Hierarchical models
Often Bayesian models can be written in the following layers of the hierarchy:
1. Data layer: [Y|θ,α] is the likelihood for the observed data Y
2. Process layer: [θ|α] is the model for the parameters θ that define the latent data-generating process
3. Prior layer: [α] is the prior for the hyperparameters α
Epidemiology example - Data layer
I Let St and It be the number of susceptible and infected individuals in a population, respectively, at time t
I The data Yt is the number of observed cases at time t
I The data layer models our ability to measure the process It
I Data layer: Yt | It ∼ Binomial(It, p)
I This assumes there are no false positives and that each infected individual is observed with probability p
Epidemiology example - Process layer
I Scientific understanding of the disease is used to model disease propagation
I We might select the simple Reed-Frost model
I Process layer: It+1 ∼ Binomial[St, 1 − (1 − q)^It] and St+1 = St − It+1
I This assumes all infected individuals are removed from the population before the next time step
I Also that q is the probability of a non-infected person coming into contact with and contracting the disease from an infected individual
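The process layer above can be simulated forward. The course materials use R/JAGS; the following is a minimal illustrative sketch in Python (the function name, parameter values, and use of NumPy are assumptions for the sketch, not part of the slides):

```python
import numpy as np

def reed_frost(S0, I0, q, T, rng):
    """Simulate the Reed-Frost chain: each susceptible escapes infection
    from each of the I_t infectives independently with probability 1 - q,
    so the per-susceptible infection probability is 1 - (1 - q)**I_t."""
    S, I = [S0], [I0]
    for _ in range(T - 1):
        p_inf = 1.0 - (1.0 - q) ** I[-1]
        new_I = int(rng.binomial(S[-1], p_inf))  # I_{t+1} | S_t, I_t
        I.append(new_I)
        S.append(S[-1] - new_I)                  # infecteds are removed
    return np.array(S), np.array(I)

rng = np.random.default_rng(42)
S, I = reed_frost(S0=1000, I0=5, q=0.002, T=20, rng=rng)
```

Pairing these draws with Yt ∼ Binomial(It, p) would complete a forward simulation of the full hierarchy.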
Epidemiology example - Prior layer
I The epidemiological process-layer model expresses the disease dynamics up to a few unknown parameters
I The Bayesian model is completed using priors, say,
I Prior layer:
I1 ∼ Poisson(λ1)
S1 ∼ Poisson(λ2)
p, q ∼ Beta(a, b)
Hierarchical models and MCMC
I Consider the classic one-way random effects model:
Yij ∼ N(θi, σ²) and θi ∼ N(µ, τ²)
where Yij is the jth replicate for unit i and α = (µ, σ², τ²) has an uninformative prior
I This hierarchy can be written using a directed acyclic graph (DAG; also called a Bayesian network or belief network)
I See “DAG” on http://www4.stat.ncsu.edu/~reich/ABA/derivations8.pdf
Epidemiology example - DAG

[DAG: the prior-layer hyperparameters (λ1, λ2, a, b) point to (I1, S1, p, q); the process layer propagates (I2, S2) → (I3, S3) → (I4, S4) → …; and each (It, St) points to the observed count Yt in the data layer]
Hierarchical models and MCMC
I MCMC is efficient in this case even if the number of parameters or levels of the hierarchy is large
I You only need to consider “connected nodes” when you update each parameter:
1. [θi |·]
2. [µ|·]
3. [σ²|·]
4. [τ²|·]
I Each of these updates is a draw from a standard one-dimensional normal or inverse gamma distribution
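The four updates above can be sketched in a few lines. This is a minimal Python illustration (the simulated balanced data, the InvGamma(0.1, 0.1) hyperpriors, and all variable names are assumptions made for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated balanced data: m units, n replicates each
m, n = 20, 10
theta_true = rng.normal(5.0, 2.0, m)                  # true unit means
Y = rng.normal(theta_true[:, None], 1.0, (m, n))

a = b = 0.1                     # InvGamma hyperparameters (assumed)
mu, sig2, tau2 = 0.0, 1.0, 1.0  # initial values
keep_mu = []

def rinvgamma(shape, rate):
    # If X ~ Gamma(shape, rate) then 1/X ~ InvGamma(shape, rate)
    return 1.0 / rng.gamma(shape, 1.0 / rate)

for it in range(2000):
    # 1. [theta_i | .]: normal, combining n replicates with the N(mu, tau2) prior
    prec = n / sig2 + 1.0 / tau2
    mean = (Y.sum(axis=1) / sig2 + mu / tau2) / prec
    theta = rng.normal(mean, np.sqrt(1.0 / prec))
    # 2. [mu | .]: flat prior, so normal around the mean of the thetas
    mu = rng.normal(theta.mean(), np.sqrt(tau2 / m))
    # 3. [sigma^2 | .]: inverse gamma
    sig2 = rinvgamma(a + m * n / 2, b + ((Y - theta[:, None]) ** 2).sum() / 2)
    # 4. [tau^2 | .]: inverse gamma
    tau2 = rinvgamma(a + m / 2, b + ((theta - mu) ** 2).sum() / 2)
    if it >= 500:
        keep_mu.append(mu)

post_mean_mu = np.mean(keep_mu)
```

Because every full conditional is a standard distribution, no Metropolis tuning is needed.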
Two-way random effects model
I http://www4.stat.ncsu.edu/~reich/ABA/code/Ozone
I The ozone measurement at spatial location i and day j is denoted Yij
I We fit the model
Yij ∼ Normal(µ + αi + γj, σ²)
I µ is the overall mean
I αi is the random effect for location i
I γj is the random effect for day j
Two-way random effects model
I Model: Yij ∼ Normal(µ + αi + γj, σ²)
I Priors for the fixed effects model:
αi ∼ Normal(0, 10²) and γj ∼ Normal(0, 10²)
I Priors for the random effects model:
αi ∼ Normal(0, σa²) and γj ∼ Normal(0, σg²)
I These look similar; when do we say something is a random effect in a Bayesian analysis?
Random slopes model
I Let Yij be the jth observation for subject i
I As an example, consider the data plotted on the next slide, where Yij is the bone density for child i at age Xj
I Here we might specify a different regression for each child to capture variability over the population of children:
Yij ∼ Normal(γ0i + Xjγ1i, σ²)
I γi = (γ0i, γ1i)ᵀ controls the growth curve for child i
I These separate regressions are tied together by the prior, γi ∼ Normal(β, Σ), which borrows strength across children
I This is a linear mixed model: the γi are random effects specific to one child and β are fixed effects common to all children
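To see what the γi ∼ Normal(β, Σ) prior implies, one can simulate child-specific growth curves from it. A small Python sketch (the population values β, Σ, the ages, and the number of children are all hypothetical illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population-level values (for illustration only)
beta = np.array([30.0, 2.0])                  # mean intercept and slope
Sigma = np.array([[4.0, -0.5],                # covariance of (gamma_0i, gamma_1i):
                  [-0.5, 0.1]])               # negative intercept-slope covariance
ages = np.array([8.0, 8.5, 9.0, 9.5])

# One regression line per child, tied together by the common prior
gamma = rng.multivariate_normal(beta, Sigma, size=5)          # 5 children
curves = gamma[:, [0]] + gamma[:, [1]] * ages[None, :]        # 5 x 4 array
```

Each row of `curves` is one child's line γ0i + Xjγ1i evaluated at the four ages.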
Bone height data

[Figure: bone height versus age (ages 8.0–9.5, heights roughly 46–54) with one trajectory per child]
Prior for a covariance matrix
I The random-effects covariance matrix is

Σ = [ σ1²  σ12 ]
    [ σ12  σ2² ]

I σ1² is the variance of the intercepts across children
I σ2² is the variance of the slopes across children
I σ12 is the covariance between the intercepts and slopes
I Prior 1: σ1², σ2² ∼ InvGamma and ρ = σ12/(σ1σ2) ∼ Unif(−1, 1)
I Prior 2: Inverse Wishart works better in higher dimensions
Inverse Wishart distribution
I The inverse Wishart distribution is the most common prior for a p × p covariance matrix
I It reduces to the inverse gamma distribution if p = 1
I Say Σ ∼ InvW(κ, R), where the degrees of freedom κ > p + 1 and the p × p covariance matrix R are hyperparameters
I The PDF is

f(Σ) ∝ |Σ|^−(κ+p+1)/2 exp[−(1/2) trace(RΣ⁻¹)]

I The mean is R/(κ − p − 1)
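For p = 1 the reduction to the inverse gamma, and the mean formula, can be checked by simulation: Σ ∼ InvW(κ, R) with p = 1 is equivalent to Σ = R/χ²κ, an InvGamma(κ/2, R/2) draw. A small NumPy sketch (κ = 10 and R = 3 are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# For p = 1, Sigma ~ InvW(kappa, R) is the same as Sigma = R / chi2_kappa,
# i.e. an InvGamma(kappa / 2, R / 2) draw.
kappa, R = 10.0, 3.0
draws = R / rng.chisquare(kappa, size=200_000)

theory = R / (kappa - 2.0)   # mean formula R / (kappa - p - 1) with p = 1
```

The Monte Carlo mean of `draws` agrees with `theory` = 0.375.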
Full conditional distributions
I The hierarchical model is:
I Yij ∼ Normal(γ0i + Xjγ1i, σ²)
I γi ∼ Normal(β, Σ)
I p(β) ∝ 1
I σ² ∼ InvGamma(a, b)
I Σ ∼ InvWishart(κ, R)
I The full conditionals are all conjugate
I MCMC code: http://www4.stat.ncsu.edu/~reich/ABA/code/lmm.R
I JAGS: http://www4.stat.ncsu.edu/~reich/ABA/code/Jaw
Bone height data - fitted values

[Figure: six panels (Subjects 1–6) showing the data and fitted growth curves, bone height versus age; plus posterior histograms of the population mean intercept β1 (roughly 25–45), the population mean slope β2 (roughly 1.0–3.0), and Corr(γ1, γ2) (concentrated between −1.00 and −0.70)]
Missing data models
I We will deal with missing data in the linear regression context, but the ideas apply to all models
I The model is
Yi ∼ Normal(β0 + β1Xi1 + ... + βpXip, σ²)
I Often either Yi or elements Xij are missing
I We will study separately the cases of missing responses and missing covariates
Missing responses
I If the response is missing, this is essentially a prediction problem
I We have seen how to handle this in JAGS
I We obtain samples from the PPD of Yi
I At each MCMC iteration we simply draw
Yi ∼ Normal(β0 + β1Xi1 + ... + βpXip, σ²)
I This distribution accounts for random error as well as uncertainty in the model parameters
I For the other updates the data are essentially complete
I If only responses are missing, can we delete them for the purpose of estimating β?
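Drawing the missing Yi at each iteration is a one-liner once the current parameter values are in hand. A Python sketch using fabricated "posterior draws" in place of real MCMC output (all numerical values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws of (beta0, beta1, sigma^2) standing in
# for the output of an MCMC run (one covariate for simplicity)
beta0 = rng.normal(1.0, 0.1, 5000)
beta1 = rng.normal(2.0, 0.1, 5000)
sig2 = 1.0 / rng.gamma(50.0, 1.0 / 45.0, 5000)   # InvGamma draws near 0.9

x_new = 0.5   # covariate value of the observation with missing response

# One PPD draw per iteration: parameter uncertainty plus random error
ppd = rng.normal(beta0 + beta1 * x_new, np.sqrt(sig2))
```

The spread of `ppd` reflects both the residual variance and the iteration-to-iteration variation in the parameter draws.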
Missing covariates
I Now say all responses are observed, but some covariates are missing
I The simplest approach is imputation, e.g., just plug in the sample mean of the covariate for the missing values
I This doesn’t account for uncertainty in the imputations
I Bayesian methods handle this well using MCMC
Missing covariates
I The main idea is to treat the missing values as unknown parameters in the Bayesian model
I Unknown parameters need priors, so missing Xi = (Xi1, ..., Xip)ᵀ must have priors such as
Xi ∼ Normal(µX, ΣX)
I Assumptions about missing data:
I Missing status is independent of Y and X
I Covariates are Gaussian
I There are ways to relax both assumptions, but it becomes complicated
Missing covariates
I Of course, if the prior is way off, the results will be invalid
I For example, if in reality the data are not missing at random, the Bayesian model will likely give bad results
I Example of non-random missingness:
I If specified correctly, the model will lead to inference for β that properly accounts for uncertainty about the missing data
Hierarchical linear regression model with missing data
I Yi | Xi, β, σ² ∼ Normal(Xiᵀβ, σ²)
I Xi | µ, Σ ∼ Normal(µ, Σ)
I p(β) ∝ 1
I σ² ∼ InvGamma(0.01, 0.01)
I µ ∼ Normal(0, 100²Ip)
I Σ ∼ InvWishart(0.01, 0.01Ip)
If some observations have missing Y and some have missing X, can we delete those with missing Y? Can we delete those with missing X?
Overview of the Gibbs sampling algorithm
I The full conditional of missing Yi is:
I The full conditional of missing Xi is:
I In fact, all the full conditionals are conjugate
I JAGS does not handle missing X well
I Let’s use this as an opportunity to explore OpenBUGS: http://www4.stat.ncsu.edu/~reich/ABA/code/Missing.R
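For a single missing scalar covariate (p = 1), with Xi ∼ N(µX, s²X) and Yi = β0 + β1Xi + εi, completing the square shows the full conditional of Xi is normal with precision 1/s²X + β1²/σ² and mean weighted between the covariate prior and the regression information. A Python sketch of one imputation step, with hypothetical current values of the chain:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical current values of the Markov chain (p = 1 covariate)
beta0, beta1, sig2 = 1.0, 2.0, 1.0   # regression parameters
mu_x, s2_x = 0.0, 4.0                # covariate model X_i ~ N(mu_x, s2_x)
y_i = 5.0                            # observed response; X_i is missing

# Normal full conditional of the missing X_i:
prec = 1.0 / s2_x + beta1 ** 2 / sig2
mean = (mu_x / s2_x + beta1 * (y_i - beta0) / sig2) / prec
x_draw = rng.normal(mean, np.sqrt(1.0 / prec))
```

Note the draw is pulled toward values of Xi consistent with the observed Yi, unlike plug-in mean imputation.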
Non- and Semi-parametric modeling
I Nonparametric (NP) methods attempt to analyze the data while making as few assumptions as possible
I NP methods are generally more robust and flexible, but less powerful than correctly specified parametric models
I Most frequentist NP methods completely avoid specifying a model
I For example, a rank or sign test to compare two means
Non- and Semi-parametric modeling
I Bayesian methods need a likelihood in order to obtain a posterior, so you can’t completely avoid specifying a model
I Bayesian NP (BNP) then attempts to specify a model that is so flexible that it almost certainly captures the true model
I One definition of a BNP model is one that has infinitely many parameters
I In some cases, NP models are difficult conceptually and computationally, and so semiparametric models with a large but finite number of parameters are useful approximations
Parametric simple linear regression

Consider the classic parametric model:

Yi = β0 + β1Xi + εi where εi ∼ N(0, σ²).

Assumptions:
1. The εi are independent
2. The εi are Gaussian
3. The mean of Yi is linear in X
4. The residual distribution does not depend on X

Alternatives:
1. Parametric alternatives such as a time series model
2. Let εi ∼ F, and place a prior on the distribution F
3. Let E(Y|X) = g(X) and put a prior on the function g
4. Heteroskedastic regression: Var(εi) = exp(α0 + α1X)

In 2–4 we are placing priors on functions, not parameters.
Nonparametric regression
I Let’s relax the assumption of linearity in the mean
I The mean is g(X), where g is some function that relates X to E(Y|X)
I Parametric models include:
1. Linear: g(X) = β0 + β1X
2. Quadratic: g(X) = β0 + β1X + β2X²
3. Logistic: g(X) = β0 + β1 exp[β2 + β3X] / (1 + exp[β2 + β3X])
I NP regression puts a prior on the curve g(X), rather than on the parameters β1, ..., βp that determine the parametric model
Semiparametric regression

I Semiparametric regression approximates the function g using a finite basis expansion

g(X) = ∑_{j=1}^{J} Bj(X)βj

where the Bj(X) are known basis functions and the βj are unknown coefficients that determine the shape of g
I Example: the cubic spline basis functions are

Bj(X) = (X − vj)³₊

where the vj are fixed knots that span the range of X
I Many other expansions exist: wavelets, Fourier, etc.
I Fact: a basis expansion of J terms can match the true curve g at any J points X1, ..., XJ
I So increasing J gives an arbitrarily flexible model
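The truncated cubic basis is easy to construct. A Python sketch (the evaluation points and knot placement are illustrative choices):

```python
import numpy as np

def spline_basis(X, knots):
    """Truncated cubic basis B_j(X) = (X - v_j)^3_+ for fixed knots v_j."""
    return np.maximum(X[:, None] - knots[None, :], 0.0) ** 3

X = np.linspace(0.0, 1.0, 5)
knots = np.array([0.0, 0.25, 0.5, 0.75])
B = spline_basis(X, knots)    # 5 x 4 design matrix
```

Each column of `B` is one basis function evaluated at all the X values; column j is zero for X below its knot vj.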
Model fitting
I The model is Yi ∼ N(Biᵀβ, σ²), where βj ∼ N(0, τ²) and Bi comprises the known basis functions Bj(Xi)
I Therefore, the model is a usual linear regression model and is straightforward to fit using MCMC
I http://www4.stat.ncsu.edu/~reich/ABA/Code/Moto
I How to pick J?
I Can we have more basis functions than observations?
I What would you do if your prior was that g was probably quadratic, but you were not 100% sure? That is, your prior is that g(X) ≈ β0 + β1X + β2X²
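With σ² and τ² held fixed, the conditional posterior of β in this model is normal with mean (BᵀB/σ² + I/τ²)⁻¹BᵀY/σ², the familiar ridge-regression estimator. A Python sketch on simulated data (the sample size, knots, and variance values are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data from a smooth curve in the span of the basis
n, J = 100, 10
X = np.sort(rng.uniform(0.0, 1.0, n))
knots = np.linspace(0.0, 0.9, J)
B = np.maximum(X[:, None] - knots[None, :], 0.0) ** 3   # basis matrix
beta_true = rng.normal(0.0, 1.0, J)
Y = B @ beta_true + 0.1 * rng.normal(size=n)

# Conditional posterior mean of beta given sigma^2 and tau^2 (ridge form)
sig2, tau2 = 0.01, 1.0
A = B.T @ B / sig2 + np.eye(J) / tau2
beta_hat = np.linalg.solve(A, B.T @ Y / sig2)
fitted = B @ beta_hat
```

The ratio σ²/τ² acts as the ridge penalty, so the prior variance τ² controls how much the fitted curve is smoothed.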