(8) Hierarchical models
ST440/540: Applied Bayesian Statistics
ST440/540: Applied Bayesian Statistics (8) Hierarchical models
Hierarchical models
I Hierarchical modeling provides a framework for building complex and high-dimensional models from simple and low-dimensional building blocks
I Of course, it is possible to analyze these models using non-Bayesian methods
I However, this modeling framework is popular in the Bayesian literature because MCMC is conducive to hierarchical models
I Both “divide and conquer” big problems by splitting them into a series of smaller problems in the same way
Hierarchical models
Often Bayesian models can be written in the following layers of the hierarchy:
1. Data layer: [Y|θ,α] is the likelihood for the observed data Y
2. Process layer: [θ|α] is the model for the parameters θ that define the latent data-generating process
3. Prior layer: [α] is the prior for the hyperparameters α
Epidemiology example - Data layer
I Let St and It be the number of susceptible and infected individuals in a population, respectively, at time t
I The data Yt is the number of observed cases at time t
I The data layer models our ability to measure the process It
I Data layer: Yt | It ∼ Binomial(It, p)
I This assumes there are no false positives and that each infected individual is observed with probability p
Epidemiology example - Process layer
I Scientific understanding of the disease is used to model disease propagation
I We might select the simple Reed-Frost model
I Process layer: It+1 ∼ Binomial[St, 1 − (1 − q)^It] and St+1 = St − It+1
I This assumes all infected individuals are removed from the population before the next time step
I Also that q is the probability of a non-infected person coming into contact with and contracting the disease from an infected individual
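The process layer above can be simulated forward. The course materials use R/JAGS; the following is a minimal illustrative sketch in Python (the function name, parameter values, and use of NumPy are assumptions for the sketch, not part of the slides):

```python
import numpy as np

def reed_frost(S0, I0, q, T, rng):
    """Simulate the Reed-Frost chain: each susceptible escapes infection
    from each of the I_t infectives independently with probability 1 - q,
    so the per-susceptible infection probability is 1 - (1 - q)**I_t."""
    S, I = [S0], [I0]
    for _ in range(T - 1):
        p_inf = 1.0 - (1.0 - q) ** I[-1]
        new_I = int(rng.binomial(S[-1], p_inf))  # I_{t+1} | S_t, I_t
        I.append(new_I)
        S.append(S[-1] - new_I)                  # infecteds are removed
    return np.array(S), np.array(I)

rng = np.random.default_rng(42)
S, I = reed_frost(S0=1000, I0=5, q=0.002, T=20, rng=rng)
```

Pairing these draws with Yt ∼ Binomial(It, p) would complete a forward simulation of the full hierarchy.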
Epidemiology example - Prior layer
I The epidemiological process-layer model expresses the disease dynamics up to a few unknown parameters
I The Bayesian model is completed using priors, say,
I Prior layer:
I1 ∼ Poisson(λ1)
S1 ∼ Poisson(λ2)
p, q ∼ Beta(a, b)
Hierarchical models and MCMC
I Consider the classic one-way random effects model:
Yij ∼ N(θi, σ²) and θi ∼ N(µ, τ²)
where Yij is the jth replicate for unit i and α = (µ, σ², τ²) has an uninformative prior
I This hierarchy can be written using a directed acyclic graph (DAG; also called a Bayesian network or belief network)
I See “DAG” on http://www4.stat.ncsu.edu/~reich/ABA/derivations8.pdf
Epidemiology example - DAG

[DAG: the prior-layer hyperparameters (λ1, λ2, a, b) point to (I1, S1, p, q); the process layer propagates (I2, S2) → (I3, S3) → (I4, S4) → …; and each (It, St) points to the observed count Yt in the data layer]
Hierarchical models and MCMC
I MCMC is efficient in this case even if the number of parameters or levels of the hierarchy is large
I You only need to consider “connected nodes” when you update each parameter:
1. [θi |·]
2. [µ|·]
3. [σ²|·]
4. [τ²|·]
I Each of these updates is a draw from a standard one-dimensional normal or inverse gamma distribution
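The four updates above can be sketched in a few lines. This is a minimal Python illustration (the simulated balanced data, the InvGamma(0.1, 0.1) hyperpriors, and all variable names are assumptions made for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated balanced data: m units, n replicates each
m, n = 20, 10
theta_true = rng.normal(5.0, 2.0, m)                  # true unit means
Y = rng.normal(theta_true[:, None], 1.0, (m, n))

a = b = 0.1                     # InvGamma hyperparameters (assumed)
mu, sig2, tau2 = 0.0, 1.0, 1.0  # initial values
keep_mu = []

def rinvgamma(shape, rate):
    # If X ~ Gamma(shape, rate) then 1/X ~ InvGamma(shape, rate)
    return 1.0 / rng.gamma(shape, 1.0 / rate)

for it in range(2000):
    # 1. [theta_i | .]: normal, combining n replicates with the N(mu, tau2) prior
    prec = n / sig2 + 1.0 / tau2
    mean = (Y.sum(axis=1) / sig2 + mu / tau2) / prec
    theta = rng.normal(mean, np.sqrt(1.0 / prec))
    # 2. [mu | .]: flat prior, so normal around the mean of the thetas
    mu = rng.normal(theta.mean(), np.sqrt(tau2 / m))
    # 3. [sigma^2 | .]: inverse gamma
    sig2 = rinvgamma(a + m * n / 2, b + ((Y - theta[:, None]) ** 2).sum() / 2)
    # 4. [tau^2 | .]: inverse gamma
    tau2 = rinvgamma(a + m / 2, b + ((theta - mu) ** 2).sum() / 2)
    if it >= 500:
        keep_mu.append(mu)

post_mean_mu = np.mean(keep_mu)
```

Because every full conditional is a standard distribution, no Metropolis tuning is needed.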
Two-way random effects model
I http://www4.stat.ncsu.edu/~reich/ABA/code/Ozone
I The ozone measurement at spatial location i and day j is denoted Yij
I We fit the model
Yij ∼ Normal(µ + αi + γj, σ²)
I µ is the overall mean
I αi is the random effect for location i
I γj is the random effect for day j
Two-way random effects model
I Model: Yij ∼ Normal(µ + αi + γj, σ²)
I Priors for the fixed effects model:
αi ∼ Normal(0, 10²) and γj ∼ Normal(0, 10²)
I Priors for the random effects model:
αi ∼ Normal(0, σa²) and γj ∼ Normal(0, σg²)
I These look similar; when do we say something is a random effect in a Bayesian analysis?
Random slopes model
I Let Yij be the jth observation for subject i
I As an example, consider the data plotted on the next slide, where Yij is the bone density for child i at age Xj
I Here we might specify a different regression for each child to capture variability over the population of children:
Yij ∼ Normal(γ0i + Xjγ1i, σ²)
I γi = (γ0i, γ1i)ᵀ controls the growth curve for child i
I These separate regressions are tied together by the prior, γi ∼ Normal(β, Σ), which borrows strength across children
I This is a linear mixed model: the γi are random effects specific to one child and β are fixed effects common to all children
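To see what the γi ∼ Normal(β, Σ) prior implies, one can simulate child-specific growth curves from it. A small Python sketch (the population values β, Σ, the ages, and the number of children are all hypothetical illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population-level values (for illustration only)
beta = np.array([30.0, 2.0])                  # mean intercept and slope
Sigma = np.array([[4.0, -0.5],                # covariance of (gamma_0i, gamma_1i):
                  [-0.5, 0.1]])               # negative intercept-slope covariance
ages = np.array([8.0, 8.5, 9.0, 9.5])

# One regression line per child, tied together by the common prior
gamma = rng.multivariate_normal(beta, Sigma, size=5)          # 5 children
curves = gamma[:, [0]] + gamma[:, [1]] * ages[None, :]        # 5 x 4 array
```

Each row of `curves` is one child's line γ0i + Xjγ1i evaluated at the four ages.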
Bone height data

[Figure: bone height versus age (ages 8.0–9.5, heights roughly 46–54) with one trajectory per child]
Prior for a covariance matrix
I The random-effects covariance matrix is

Σ = [ σ1²  σ12 ]
    [ σ12  σ2² ]

I σ1² is the variance of the intercepts across children
I σ2² is the variance of the slopes across children
I σ12 is the covariance between the intercepts and slopes
I Prior 1: σ1², σ2² ∼ InvGamma and ρ = σ12/(σ1σ2) ∼ Unif(−1, 1)
I Prior 2: Inverse Wishart works better in higher dimensions
Inverse Wishart distribution
I The inverse Wishart distribution is the most common prior for a p × p covariance matrix
I It reduces to the inverse gamma distribution if p = 1
I Say Σ ∼ InvW(κ, R), where the degrees of freedom κ > p + 1 and the p × p covariance matrix R are hyperparameters
I The PDF is

f(Σ) ∝ |Σ|^−(κ+p+1)/2 exp[−(1/2) trace(RΣ⁻¹)]

I The mean is R/(κ − p − 1)
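For p = 1 the reduction to the inverse gamma, and the mean formula, can be checked by simulation: Σ ∼ InvW(κ, R) with p = 1 is equivalent to Σ = R/χ²κ, an InvGamma(κ/2, R/2) draw. A small NumPy sketch (κ = 10 and R = 3 are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# For p = 1, Sigma ~ InvW(kappa, R) is the same as Sigma = R / chi2_kappa,
# i.e. an InvGamma(kappa / 2, R / 2) draw.
kappa, R = 10.0, 3.0
draws = R / rng.chisquare(kappa, size=200_000)

theory = R / (kappa - 2.0)   # mean formula R / (kappa - p - 1) with p = 1
```

The Monte Carlo mean of `draws` agrees with `theory` = 0.375.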
Full conditional distributions
I The hierarchical model is:
I Yij ∼ Normal(γ0i + Xjγ1i, σ²)
I γi ∼ Normal(β, Σ)
I p(β) ∝ 1
I σ² ∼ InvGamma(a, b)
I Σ ∼ InvWishart(κ, R)
I The full conditionals are all conjugate
I MCMC code: http://www4.stat.ncsu.edu/~reich/ABA/code/lmm.R
I JAGS: http://www4.stat.ncsu.edu/~reich/ABA/code/Jaw
Bone height data - fitted values

[Figure: six panels (Subjects 1–6) showing the data and fitted growth curves, bone height versus age; plus posterior histograms of the population mean intercept β1 (roughly 25–45), the population mean slope β2 (roughly 1.0–3.0), and Corr(γ1, γ2) (concentrated between −1.00 and −0.70)]
Missing data models
I We will deal with missing data in the linear regression context, but the ideas apply to all models
I The model is
Yi ∼ Normal(β0 + β1Xi1 + ... + βpXip, σ²)
I Often either Yi or elements Xij are missing
I We will study separately the cases of missing responses and missing covariates
Missing responses
I If the response is missing, this is essentially a prediction problem
I We have seen how to handle this in JAGS
I We obtain samples from the PPD of Yi
I At each MCMC iteration we simply draw
Yi ∼ Normal(β0 + β1Xi1 + ... + βpXip, σ²)
I This distribution accounts for random error as well as uncertainty in the model parameters
I For the other updates the data are essentially complete
I If only responses are missing, can we delete them for the purpose of estimating β?
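Drawing the missing Yi at each iteration is a one-liner once the current parameter values are in hand. A Python sketch using fabricated "posterior draws" in place of real MCMC output (all numerical values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws of (beta0, beta1, sigma^2) standing in
# for the output of an MCMC run (one covariate for simplicity)
beta0 = rng.normal(1.0, 0.1, 5000)
beta1 = rng.normal(2.0, 0.1, 5000)
sig2 = 1.0 / rng.gamma(50.0, 1.0 / 45.0, 5000)   # InvGamma draws near 0.9

x_new = 0.5   # covariate value of the observation with missing response

# One PPD draw per iteration: parameter uncertainty plus random error
ppd = rng.normal(beta0 + beta1 * x_new, np.sqrt(sig2))
```

The spread of `ppd` reflects both the residual variance and the iteration-to-iteration variation in the parameter draws.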
Missing covariates
I Now say all responses are observed, but some covariates are missing
I The simplest approach is imputation, e.g., just plug in the sample mean of the covariate for the missing values
I This doesn’t account for uncertainty in the imputations
I Bayesian methods handle this well using MCMC
Missing covariates
I The main idea is to treat the missing values as unknown parameters in the Bayesian model
I Unknown parameters need priors, so missing Xi = (Xi1, ..., Xip)ᵀ must have priors such as
Xi ∼ Normal(µX, ΣX)
I Assumptions about missing data:
I Missing status is independent of Y and X
I Covariates are Gaussian
I There are ways to relax both assumptions, but it becomes complicated
Missing covariates
I Of course, if the prior is way off, the results will be invalid
I For example, if in reality the data are not missing at random, the Bayesian model will likely give bad results
I Example of non-random missingness:
I If specified correctly, the model will lead to inference for β that properly accounts for uncertainty about the missing data
Hierarchical linear regression model with missing data
I Yi | Xi, β, σ² ∼ Normal(Xiᵀβ, σ²)
I Xi | µ, Σ ∼ Normal(µ, Σ)
I p(β) ∝ 1
I σ² ∼ InvGamma(0.01, 0.01)
I µ ∼ Normal(0, 100²Ip)
I Σ ∼ InvWishart(0.01, 0.01Ip)
If some observations have missing Y and some have missing X, can we delete those with missing Y? Can we delete those with missing X?
Overview of the Gibbs sampling algorithm
I The full conditional of missing Yi is:
I The full conditional of missing Xi is:
I In fact, all the full conditionals are conjugate
I JAGS does not handle missing X well
I Let’s use this as an opportunity to explore OpenBUGS: http://www4.stat.ncsu.edu/~reich/ABA/code/Missing.R
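For a single missing scalar covariate (p = 1), with Xi ∼ N(µX, s²X) and Yi = β0 + β1Xi + εi, completing the square shows the full conditional of Xi is normal with precision 1/s²X + β1²/σ² and mean weighted between the covariate prior and the regression information. A Python sketch of one imputation step, with hypothetical current values of the chain:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical current values of the Markov chain (p = 1 covariate)
beta0, beta1, sig2 = 1.0, 2.0, 1.0   # regression parameters
mu_x, s2_x = 0.0, 4.0                # covariate model X_i ~ N(mu_x, s2_x)
y_i = 5.0                            # observed response; X_i is missing

# Normal full conditional of the missing X_i:
prec = 1.0 / s2_x + beta1 ** 2 / sig2
mean = (mu_x / s2_x + beta1 * (y_i - beta0) / sig2) / prec
x_draw = rng.normal(mean, np.sqrt(1.0 / prec))
```

Note the draw is pulled toward values of Xi consistent with the observed Yi, unlike plug-in mean imputation.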
Non- and Semi-parametric modeling
I Nonparametric (NP) methods attempt to analyze the data while making as few assumptions as possible
I NP methods are generally more robust and flexible, but less powerful than correctly specified parametric models
I Most frequentist NP methods completely avoid specifying a model
I For example, a rank or sign test to compare two means
Non- and Semi-parametric modeling
I Bayesian methods need a likelihood in order to obtain a posterior, so you can’t completely avoid specifying a model
I Bayesian NP (BNP) then attempts to specify a model that is so flexible that it almost certainly captures the true model
I One definition of a BNP model is one that has infinitely many parameters
I In some cases, NP models are difficult conceptually and computationally, and so semiparametric models with a large but finite number of parameters are useful approximations
Parametric simple linear regression

Consider the classic parametric model:

Yi = β0 + β1Xi + εi where εi ∼ N(0, σ²).

Assumptions:
1. The εi are independent
2. The εi are Gaussian
3. The mean of Yi is linear in X
4. The residual distribution does not depend on X

Alternatives:
1. Parametric alternatives such as a time series model
2. Let εi ∼ F, and place a prior on the distribution F
3. Let E(Y|X) = g(X) and put a prior on the function g
4. Heteroskedastic regression: Var(εi) = exp(α0 + α1X)

In 2–4 we are placing priors on functions, not parameters.
Nonparametric regression
I Let’s relax the assumption of linearity in the mean
I The mean is g(X), where g is some function that relates X to E(Y|X)
I Parametric models include:
1. Linear: g(X) = β0 + β1X
2. Quadratic: g(X) = β0 + β1X + β2X²
3. Logistic: g(X) = β0 + β1 exp[β2 + β3X] / (1 + exp[β2 + β3X])
I NP regression puts a prior on the curve g(X), rather than on the parameters β1, ..., βp that determine the parametric model
Semiparametric regression

I Semiparametric regression approximates the function g using a finite basis expansion

g(X) = ∑_{j=1}^{J} Bj(X)βj

where the Bj(X) are known basis functions and the βj are unknown coefficients that determine the shape of g
I Example: the cubic spline basis functions are

Bj(X) = (X − vj)³₊

where the vj are fixed knots that span the range of X
I Many other expansions exist: wavelets, Fourier, etc.
I Fact: a basis expansion of J terms can match the true curve g at any J points X1, ..., XJ
I So increasing J gives an arbitrarily flexible model
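The truncated cubic basis is easy to construct. A Python sketch (the evaluation points and knot placement are illustrative choices):

```python
import numpy as np

def spline_basis(X, knots):
    """Truncated cubic basis B_j(X) = (X - v_j)^3_+ for fixed knots v_j."""
    return np.maximum(X[:, None] - knots[None, :], 0.0) ** 3

X = np.linspace(0.0, 1.0, 5)
knots = np.array([0.0, 0.25, 0.5, 0.75])
B = spline_basis(X, knots)    # 5 x 4 design matrix
```

Each column of `B` is one basis function evaluated at all the X values; column j is zero for X below its knot vj.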
Model fitting
I The model is Yi ∼ N(Biᵀβ, σ²), where βj ∼ N(0, τ²) and Bi comprises the known basis functions Bj(Xi)
I Therefore, the model is a usual linear regression model and is straightforward to fit using MCMC
I http://www4.stat.ncsu.edu/~reich/ABA/Code/Moto
I How to pick J?
I Can we have more basis functions than observations?
I What would you do if your prior was that g was probably quadratic, but you were not 100% sure? That is, your prior is that g(X) ≈ β0 + β1X + β2X²
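With σ² and τ² held fixed, the conditional posterior of β in this model is normal with mean (BᵀB/σ² + I/τ²)⁻¹BᵀY/σ², the familiar ridge-regression estimator. A Python sketch on simulated data (the sample size, knots, and variance values are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data from a smooth curve in the span of the basis
n, J = 100, 10
X = np.sort(rng.uniform(0.0, 1.0, n))
knots = np.linspace(0.0, 0.9, J)
B = np.maximum(X[:, None] - knots[None, :], 0.0) ** 3   # basis matrix
beta_true = rng.normal(0.0, 1.0, J)
Y = B @ beta_true + 0.1 * rng.normal(size=n)

# Conditional posterior mean of beta given sigma^2 and tau^2 (ridge form)
sig2, tau2 = 0.01, 1.0
A = B.T @ B / sig2 + np.eye(J) / tau2
beta_hat = np.linalg.solve(A, B.T @ Y / sig2)
fitted = B @ beta_hat
```

The ratio σ²/τ² acts as the ridge penalty, so the prior variance τ² controls how much the fitted curve is smoothed.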