Introduction to Bayesian Data Analysis and Markov Chain Monte Carlo

Jeffrey S. Morris
University of Texas M.D. Anderson Cancer Center
Department of Biostatistics
jeffmo@mdanderson.org

September 20, 2002

Abstract
The purpose of this talk is to give a brief overview of Bayesian inference and Markov Chain Monte Carlo methods, including the Gibbs sampler and the Metropolis-Hastings algorithm.
MCMC OVERVIEW 1
Outline
• Bayesian vs. Frequentist paradigm
• Bayesian Inference and MCMC
◦ Gibbs Sampler
◦ Metropolis-Hastings Algorithm
• Assessing Convergence of MCMC
• Hierarchical Model Example
• MCMC: Benefits and Cautions
Frequentist vs. Bayesian paradigms
• Data: X    Parameters: Θ
• To a frequentist:
◦ The data X are random, and the parameters Θ are fixed.
◦ (ML) Inference is performed by finding Θ such that f(X|Θ) is maximized.
◦ We cannot make probability statements about parameters; we can only make statements about the performance of estimators over repeated sampling (e.g., confidence intervals).
• To a Bayesian:
◦ The current data X are fixed, and the unknown parameters Θ are random.
◦ Inference is performed via the posterior distribution f(Θ|X).
◦ We can make probability statements about parameters, since they are random quantities (e.g., credible intervals).
Bayes' Rule
• The posterior distribution is computed by applying Bayes' Rule:

  f(Θ|X) = f(X|Θ) f(Θ) / f(X)

• f(X|Θ) = Likelihood
• f(Θ) = Prior Distribution
◦ Reflects prior knowledge about Θ
◦ Sometimes controversial
◦ If little information is available, just use diffuse priors (avoid improper priors)
• f(X) = Marginal Distribution = ∫ f(X|Θ) f(Θ) dΘ
◦ Difficult to compute (usually an intractable integral)
◦ Often not necessary to compute
Conjugate priors
• Conjugate prior: f(Θ) and f(Θ|X) have the same distributional form.
• Examples: Normal-Normal, Beta-Binomial, Gamma-Poisson
• Ex: (X|θ) ∼ Binomial(n, θ); θ ∼ Beta(α, β)

  f(θ|X) ∝ f(X|θ) f(θ)
         ∝ θ^X (1−θ)^(n−X) · θ^(α−1) (1−θ)^(β−1)
         = θ^(α+X−1) (1−θ)^(β+n−X−1)
         = kernel of Beta(α+X, β+n−X)

• For single-parameter problems, conjugate priors allow closed-form posterior distributions.
• What if we don't want to use conjugate priors?
  What if we have multiple parameters?
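The Beta-Binomial update above can be checked numerically. The sketch below (plain Python; the hyperparameters α = β = 2 and the data n = 20, X = 14 are illustrative, not from the talk) compares the closed-form posterior mean against a grid approximation of f(θ|X) ∝ f(X|θ)f(θ):

```python
alpha, beta_ = 2.0, 2.0   # illustrative Beta(alpha, beta) prior
n, X = 20, 14             # illustrative data: X successes in n trials

# Closed-form conjugate update: posterior is Beta(alpha + X, beta_ + n - X),
# whose mean is (alpha + X) / (alpha + beta_ + n).
closed_mean = (alpha + X) / (alpha + beta_ + n)

# Grid approximation of the posterior kernel theta^(X+alpha-1) (1-theta)^(n-X+beta-1)
N = 20001
thetas = [(i + 0.5) / N for i in range(N)]
w = [t ** (X + alpha - 1) * (1 - t) ** (n - X + beta_ - 1) for t in thetas]
grid_mean = sum(t * wi for t, wi in zip(thetas, w)) / sum(w)

print(round(closed_mean, 4), round(grid_mean, 4))  # both near 0.6667
```

The normalizing constant never has to be computed: it cancels when the weighted average is formed.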
Non-conjugate Case
• Suppose we are interested in the posterior mean:

  E(Θ|X) = ∫ Θ f(Θ|X) dΘ
         = [∫ Θ f(X|Θ) f(Θ) dΘ] / [∫ f(X|Θ) f(Θ) dΘ]

• How do we compute this integral if it is intractable?
◦ Numerical Integration (Quadrature): may not work if there are many parameters.
◦ Monte Carlo integration
Markov Chain Monte Carlo: Monte Carlo Integration
• Monte Carlo integration: estimate integrals by randomly drawing samples from the required distribution.

  E(Θ|X) = ∫ Θ f(Θ|X) dΘ ≈ (1/n) Σ_{t=1}^{n} Θ_t,  where Θ_t ∼ f(Θ|X)

• We still need a method for drawing samples from the posterior distribution:
◦ Rejection Sampling
◦ Importance Sampling
◦ Markov Chain
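A minimal Monte Carlo integration sketch. Assuming we can already draw from the posterior (here a Beta(17, 8) chosen purely for illustration, with known mean 17/25 = 0.68, sampled with the standard library's `betavariate`), the sample average approximates the posterior mean:

```python
import random

random.seed(0)
# Draws standing in for samples Theta_t ~ f(Theta|X); the true mean of
# this illustrative Beta(17, 8) "posterior" is 17/25 = 0.68.
n = 100_000
draws = [random.betavariate(17, 8) for _ in range(n)]

mc_mean = sum(draws) / n   # Monte Carlo estimate of E(Theta|X)
print(round(mc_mean, 3))   # near 0.68
```

The approximation error shrinks like 1/sqrt(n), regardless of the dimension of Θ, which is why this beats quadrature in high-dimensional problems.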
Markov Chain Monte Carlo: Markov Chains
• Markov Chain: method to draw samples from a desired stationary distribution.
• Steps:
1. Obtain starting values Θ_0.
2. Sample Θ_1 from a suitably chosen transition kernel P(Θ_1|Θ_0).
3. Repeat the second step n times to obtain the chain {Θ_0, Θ_1, ..., Θ_n}.
• Theorems show that, under certain regularity conditions, the chain will converge to a particular stationary distribution after a suitable burn-in period.
• End result: a (correlated) sample from the stationary distribution.
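The convergence claim can be seen in a toy example. The sketch below uses an assumed two-state chain (not from the talk): repeatedly applying the transition kernel and discarding a burn-in, the long-run occupancy matches the stationary distribution π = (2/3, 1/3):

```python
import random

random.seed(1)
# Two-state chain: P(next = 0 | state 0) = 0.9, P(next = 0 | state 1) = 0.2.
# Solving pi = pi * P gives the stationary distribution pi = (2/3, 1/3).
p_to_zero = [0.9, 0.2]

state, count0, kept = 0, 0, 0
for t in range(200_000):
    state = 0 if random.random() < p_to_zero[state] else 1
    if t >= 1_000:          # discard a burn-in period
        kept += 1
        count0 += (state == 0)

frac0 = count0 / kept
print(round(frac0, 2))      # long-run fraction of time in state 0, near 2/3
```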
Markov Chain Monte Carlo
• Given a Markov chain {Θ_0, Θ_1, ..., Θ_n} with stationary distribution f(Θ|X) and burn-in m, we can estimate the posterior mean using Monte Carlo integration:

  E(Θ|X) ≈ (1/(n−m)) Σ_{t=m+1}^{n} Θ_t.

• Other quantities can also be computed from the Markov chain:
◦ Standard errors
◦ Quantiles
◦ Density estimates
• Samples can be used to perform any Bayesian inference of interest.
• How do we generate the Markov chain?
Gibbs Sampler
• Gibbs sampler (Geman and Geman, 1984): the Markov transition kernel consists of drawing from full conditional distributions.
• Suppose Θ = {θ_1, θ_2, ..., θ_p}^T.
  Full conditional distribution for parameter i: f(θ_i | X, Θ_{−i})
  Conditions on:
◦ The data X
◦ The values of all other parameters Θ_{−i}.
Gibbs Sampler
• Steps of the Gibbs sampler:
1. Choose a set of starting values Θ^(0).
2. Generate (Θ^(1)|Θ^(0)) by sampling:
   θ_1^(1) from f(θ_1^(1) | X, Θ_{−1}^(0))
   θ_2^(1) from f(θ_2^(1) | X, Θ_{−2}^(0))
   ...
   θ_p^(1) from f(θ_p^(1) | X, Θ_{−p}^(0))
3. Repeat step two to get a chain of length n: {Θ^(0), Θ^(1), ..., Θ^(n)}.
4. Assuming convergence by iteration m, compute posterior means, quantiles, etc. using samples m through n.
• Many variations possible:
◦ Which parameters to update each iteration, order of updating
◦ 'Blocking' parameters together, working with marginalized distributions
• If conjugate priors are used for all parameters, the full conditionals are available in closed form.
• What if we don't have closed-form distributions for the full conditionals?
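A minimal Gibbs sampler sketch for the conjugate case, using only the Python standard library. The model (X_i ∼ Normal(μ, 1/τ), a vague normal prior on μ, and a Gamma(0.001, 0.001) prior on the precision τ) and the simulated data are illustrative assumptions; both full conditionals are then available in closed form, so each step is a direct draw:

```python
import math
import random
import statistics

random.seed(2)
# Illustrative data: 200 draws from Normal(mean 5, sd 2); the sampler should
# produce a posterior for mu concentrated near 5.
data = [random.gauss(5.0, 2.0) for _ in range(200)]
n, xbar = len(data), statistics.fmean(data)
a0, b0, prior_prec = 0.001, 0.001, 1e-6   # vague conjugate hyperparameters

mu, tau = 0.0, 1.0                        # starting values
mus = []
for it in range(5_000):
    # Full conditional for mu: Normal with precision n*tau + prior_prec
    prec = n * tau + prior_prec
    mu = random.gauss(n * tau * xbar / prec, 1.0 / math.sqrt(prec))
    # Full conditional for tau: Gamma(a0 + n/2, rate = b0 + RSS/2)
    rss = sum((x - mu) ** 2 for x in data)
    tau = random.gammavariate(a0 + n / 2, 1.0 / (b0 + rss / 2))  # scale = 1/rate
    if it >= 1_000:                       # burn-in
        mus.append(mu)

post_mean_mu = statistics.fmean(mus)
print(round(post_mean_mu, 1))             # near 5
```

Note that `random.gammavariate` takes a scale parameter, so the rate of the full conditional is inverted before drawing.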
Metropolis-Hastings Algorithm
• Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970): method to construct a Markov chain for θ even if a closed-form expression for its distribution is not available.
  π(θ): kernel of the distribution of interest for θ, e.g. f(θ_i^(t) | X, Θ_{−i}^(t−1)).
• Steps:
1. Get θ^(0) = starting value for θ.
2. Get θ* = proposed value for θ^(1), by sampling from the proposal density q(θ | X, θ^(0)).
3. Compute α(θ^(0), θ*) = min{ 1, [π(θ*) q(θ^(0) | θ*)] / [π(θ^(0)) q(θ* | θ^(0))] }.
4. Generate u ∼ Uniform(0,1).
   If u < α, let θ^(1) = θ*; else let θ^(1) = θ^(0).
• Types of proposals: Random Walk, Independence, Symmetric
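A random-walk Metropolis-Hastings sketch in plain Python. The target kernel π(θ) ∝ θ^16 (1−θ)^7 (an unnormalized Beta(17, 8), chosen for illustration; its mean is 0.68) is all that is needed: the normalizing constant cancels in the acceptance ratio, and the symmetric Normal proposal makes the q terms cancel as well:

```python
import math
import random
import statistics

random.seed(3)

def log_pi(t):
    # Log of the unnormalized target kernel theta^16 (1 - theta)^7 on (0, 1)
    return 16 * math.log(t) + 7 * math.log(1 - t) if 0 < t < 1 else -math.inf

theta, draws = 0.5, []
for it in range(50_000):
    prop = theta + random.gauss(0, 0.1)       # symmetric random-walk proposal
    log_a = log_pi(prop) - log_pi(theta)      # log acceptance ratio
    if log_a >= 0 or random.random() < math.exp(log_a):
        theta = prop                          # accept; otherwise keep theta
    if it >= 2_000:                           # burn-in
        draws.append(theta)

mh_mean = statistics.fmean(draws)
print(round(mh_mean, 2))                      # near the true mean 0.68
```

Proposals outside (0, 1) get log π = −∞ and are always rejected, which keeps the chain on the support of the target.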
Assessing Convergence of Markov Chains
• The Markov chain is known to converge to the stationary distribution of interest, but how do I know when convergence has been achieved?
  i.e., how do I decide how long the burn-in should be?
1. Look at time series plots for the parameters.
2. Run multiple chains with divergent starting values.
3. Run formal diagnostics (Gelman and Rubin 1992; Geweke 1992).
• Other issues:
◦ Length of chain
◦ Thinning to decrease autocorrelation
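Item 3 can be sketched numerically. Below is a simplified version of the Gelman-Rubin potential scale reduction factor (R-hat), which compares between-chain and within-chain variability across multiple chains; values near 1 suggest convergence. The "chains" here are independent draws standing in for well-mixed MCMC output, an illustrative assumption:

```python
import random
import statistics

random.seed(4)

def r_hat(chains):
    # Simplified Gelman-Rubin diagnostic for m chains of length n each.
    m, n = len(chains), len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    W = statistics.fmean([statistics.variance(c) for c in chains])  # within-chain
    B = n * statistics.variance(means)                              # between-chain
    var_hat = (n - 1) / n * W + B / n     # pooled variance estimate
    return (var_hat / W) ** 0.5

# Four well-mixed "chains": independent Normal(0, 1) draws, so R-hat is near 1.
chains = [[random.gauss(0, 1) for _ in range(2_000)] for _ in range(4)]
val = r_hat(chains)
print(round(val, 2))   # near 1.00
```

Chains stuck in different regions would inflate B relative to W and push R-hat well above 1.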
Example: Hierarchical Models
• Example: growth curves for rats.
• Data Y_ij consist of weights for 30 rats over 5 weeks.

[Figure: "Rat Growth Model Data": weight (g) versus time (days), one growth curve per rat, roughly 10-35 days and 100-350 g.]

• Can estimate the mean growth curve by linear regression, but growth curve models are necessary to get the standard errors right.
Example: Hierarchical Models
• Model: Y_ij ∼ Normal(μ_ij, τ_c)
  μ_ij = α_i + β_i (x_j − x̄)
  α_i ∼ Normal(α_c, τ_α)
  β_i ∼ Normal(β_c, τ_β)
  (Here the second Normal argument is a precision, following the WinBUGS parameterization.)
• Model could be fit using a linear mixed model or a Bayesian hierarchical model.
• Priors (conjugate and vague):
  α_c, β_c ∼ Normal(0, 10^−6)
  τ_c, τ_α, τ_β ∼ Gamma(0.001, 0.001)
• Gibbs sampler: since conjugate priors were used, the full conditionals are all available in closed form and can be derived using some algebra.
• WinBUGS: statistical software to perform MCMC in general problems.
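To illustrate the "closed-form full conditionals" claim, the sketch below derives and draws from one of them: the population intercept α_c given the individual intercepts α_i. With α_i ∼ Normal(α_c, τ_α) and the vague prior α_c ∼ Normal(0, 10^−6) (precisions), normal-normal conjugacy gives a Normal full conditional. The α_i values and τ_α below are illustrative stand-ins for the current state of the Gibbs sampler, not values from the talk:

```python
import math
import random
import statistics

random.seed(5)
# Full conditional for alpha_c:
#   alpha_c | rest ~ Normal( tau_alpha * sum(alpha_i) / prec, sd = prec**-0.5 ),
#   where prec = p * tau_alpha + 1e-6  (prior precision 1e-6).
p, tau_alpha, prior_prec = 30, 0.01, 1e-6
alphas = [random.gauss(240.0, 10.0) for _ in range(p)]   # 30 rat intercepts

prec = p * tau_alpha + prior_prec
cond_mean = tau_alpha * sum(alphas) / prec
draw = random.gauss(cond_mean, 1.0 / math.sqrt(prec))    # one Gibbs update

print(round(cond_mean), round(statistics.fmean(alphas)))  # both near 240
```

Because the prior precision is tiny, the conditional mean is essentially the average of the α_i, which is what the vague prior is meant to achieve.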
Conclusions
• Why use MCMC?
◦ Flexible computing tool with the ability to fit complex models.
◦ No need to make simplified modeling assumptions out of convenience.
◦ Given posterior samples, can get all the benefits of Bayesian inference.
• Words of caution:
◦ Monitor convergence!
∗ Unfortunately, the most complex models tend to converge very slowly.
∗ Can try blocking and marginalization to decrease the correlation of model parameters in MCMC.
◦ Check whether your answers make sense: compare with plots and simple methods.
◦ Perform sensitivity analysis on priors.
• Other book: Gelman, Carlin, Stern, and Rubin (1995), Bayesian Data Analysis.
References
Gelman A and Rubin DB (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7, 457-472.
Geman S and Geman D (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattn. Anal. Mach. Intel. 6, 721-741.
Geweke J (1992). Evaluation of accuracy of sampling-based approaches to the calculation of posterior moments. In Bayesian Statistics 4 (ed. JM Bernardo, J Berger, AP Dawid and AFM Smith), pp. 169-193. Oxford University Press.
Gilks WR, Richardson S, and Spiegelhalter DJ (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall.
Hastings WK (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97-109.
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH and Teller E (1953). Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087-1092.