Markov Chain Monte Carlo Methods
Olivier Cappe
LTCI, Telecom ParisTech & CNRS, http://perso.telecom-paristech.fr/~cappe
November 30, 2010
Textbook: Monte Carlo Statistical Methods by Christian P. Robert and George Casella
Slides: Adapted from (mostly) http://www.ceremade.dauphine.fr/~xian/coursBC.pdf
available at http://perso.telecom-paristech.fr/~cappe/cours/mido-tsi.pdf
Other suggested reading...
Outline
Motivations, Random Variable Generation Chapters 1 & 2
Monte Carlo Integration Chapter 3
Notions on Markov Chains Chapter 6
The Metropolis-Hastings Algorithm Chapter 7
The Gibbs Sampler Chapters 8–10
Case Study (with a glimpse of Chapter 11∗)
Corresponding homeworks
Date: Oct. 27 · Nov. 3 · Nov. 10 · Nov. 24 · Dec. 1
Assignment: I (§1 & 2) · II (§3) · III (§6) · IV (§7) · V (§8–10)
Motivation and leading example
Introduction
Likelihood methods
Missing variable models
Bayesian Methods
Bayesian troubles
Random variable generation
Monte Carlo Integration
Notions on Markov Chains for MCMC
The Metropolis-Hastings Algorithm
The Gibbs Sampler
Case Study
Introduction
Latent structures make life harder!
Even simple models may lead to computational complications, as in latent variable models

f(x|θ) = ∫ f*(x, x*|θ) dx*

If (x, x*) is observed: fine!
If only x is observed: trouble!
Example (Mixture models)
Models of mixtures of distributions:
X ∼ fj with probability pj ,
for j = 1, 2, . . . , k, with overall density
X ∼ p1f1(x) + · · ·+ pkfk(x) .
For a sample of independent random variables (X1, · · · , Xn), the sample density is

∏_{i=1}^n { p1 f1(xi) + · · · + pk fk(xi) } .

Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.
[Figure: case of the 0.3 N(µ1, 1) + 0.7 N(µ2, 1) likelihood, plotted over (µ1, µ2)]
Likelihood methods
Maximum likelihood methods
◦ For an iid sample X1, . . . , Xn from a population with density f(x|θ1, . . . , θk), the likelihood function is

L(x|θ) = L(x1, . . . , xn|θ1, . . . , θk) = ∏_{i=1}^n f(xi|θ1, . . . , θk),

and the maximum likelihood estimator is

θ̂n = arg max_θ L(x|θ)
◦ Global justifications from asymptotics
◦ Computational difficulty depends on structure, e.g. latent variables
Example (Mixtures again)
For a mixture of two normal distributions,

p N(µ, τ²) + (1 − p) N(θ, σ²) ,

the likelihood is proportional to

∏_{i=1}^n [ p τ⁻¹ φ((xi − µ)/τ) + (1 − p) σ⁻¹ φ((xi − θ)/σ) ]

containing 2^n terms.
Standard maximization techniques often fail to find the global maximum because of multimodality or undesirable behavior (usually at the frontier of the domain) of the likelihood function.
Example
In the special case
f(x|µ, σ) = (1 − ε) exp{−x²/2} + (ε/σ) exp{−(x − µ)²/2σ²}   (1)

with ε > 0 known, whatever n, the likelihood is unbounded:

lim_{σ→0} L(x1, . . . , xn|µ = x1, σ) = ∞
Missing variable models
The special case of missing variable models
Consider again a latent variable representation
g(x|θ) = ∫_Z f(x, z|θ) dz

Define the completed (but unobserved) likelihood

L^c(x, z|θ) = f(x, z|θ)
Useful for optimisation algorithm
The EM Algorithm
Algorithm (Expectation–Maximisation)
Iterate (in m)
1. (E step) Compute
Q(θ; θ(m), x) = E[ log L^c(x, Z|θ) | θ(m), x ] ,

2. (M step) Maximise Q(θ; θ(m), x) in θ and take

θ(m+1) = arg max_θ Q(θ; θ(m), x).
until a fixed point [of Q] is reached
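As an illustration, the E and M steps above can be sketched for a two-component normal mixture with known weights and unit variances; this is a minimal sketch under simplifying assumptions, and the function name, initial values, and data are illustrative, not from the course:

```python
import math
import random

def em_two_normal_means(xs, p=0.7, iters=50, mu1=0.0, mu2=1.0):
    """EM for p N(mu1,1) + (1-p) N(mu2,1) with unknown means only."""
    for _ in range(iters):
        # E step: posterior probability (responsibility) of component 1
        r1 = []
        for x in xs:
            a = p * math.exp(-0.5 * (x - mu1) ** 2)
            b = (1 - p) * math.exp(-0.5 * (x - mu2) ** 2)
            r1.append(a / (a + b))
        # M step: Q is maximised in closed form by weighted averages
        s1 = sum(r1)
        mu1 = sum(r * x for r, x in zip(r1, xs)) / s1
        mu2 = sum((1 - r) * x for r, x in zip(r1, xs)) / (len(xs) - s1)
    return mu1, mu2

random.seed(0)
# synthetic sample from 0.7 N(0,1) + 0.3 N(3,1)
xs = [random.gauss(0.0, 1.0) if random.random() < 0.7 else random.gauss(3.0, 1.0)
      for _ in range(2000)]
mu1_hat, mu2_hat = em_two_normal_means(xs, p=0.7, mu1=-1.0, mu2=1.0)
```

Each iteration increases the observed likelihood, so with well-separated components the fixed point lands near the true means.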
[Figure: sample from the mixture model]
[Figure: likelihood of .7 N(µ1, 1) + .3 N(µ2, 1) and EM steps]
Bayesian Methods
The Bayesian Perspective
In the Bayesian paradigm, the information brought by the data x, realization of
X ∼ f(x|θ),
is combined with prior information specified by prior distributionwith density
π(θ)
Central tool
Summary in a probability distribution, π(θ|x), called the posterior distribution.
Derived from the joint distribution f(x|θ)π(θ), according to

π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ ,

[Bayes Theorem]

where

Z(x) = ∫ f(x|θ)π(θ) dθ

is the marginal density of X, also called the (Bayesian) evidence
Central tool... central to Bayesian inference
Posterior defined up to a constant as
π(θ|x) ∝ f(x|θ)π(θ)
• Operates conditional upon the observations
• Integrates simultaneously the prior information and the information brought by x
• Avoids averaging over the unobserved values of x
• Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected
• Provides a complete inferential scope and a unique motor of inference
Bayesian troubles
Conjugate bonanza...
Example (Binomial)
For an observation X ∼ B(n, p), the so-called conjugate prior is the family of beta Be(a, b) distributions.
The classical Bayes estimator δπ is the posterior mean

δπ = [ Γ(a + b + n) / (Γ(a + x) Γ(n − x + b)) ] ∫₀¹ p · p^{x+a−1} (1 − p)^{n−x+b−1} dp = (x + a)/(a + b + n) .
Conjugate Prior
Conjugacy
Given a likelihood function L(y|θ), the family Π of priors π0 on Θ is conjugate if the posterior π(θ|y) also belongs to Π

In this case, posterior inference is tractable and reduces to updating the hyperparameters* of the prior

*The hyperparameters are parameters of the priors; they are most often not treated as random variables
Discrete/Multinomial & Dirichlet
If the observations consist of positive counts Y1, . . . , Yd modelled by a Multinomial distribution

L(y|θ, n) = (n! / ∏_{i=1}^d yi!) ∏_{i=1}^d θi^{yi}

The conjugate family is the D(α1, . . . , αd) distribution

π(θ|α) = ( Γ(∑_{i=1}^d αi) / ∏_{i=1}^d Γ(αi) ) ∏_{i=1}^d θi^{αi−1}

defined on the probability simplex (θi ≥ 0, ∑_{i=1}^d θi = 1), where Γ is the gamma function Γ(α) = ∫₀^∞ t^{α−1} e^{−t} dt (Γ(k) = (k − 1)! for integers k)
Figure: Dirichlet: 1D marginals
Figure: Dirichlet: 3D examples (projected on two dimensions)
Multinomial Posterior
Posterior

π(θ|y) = D(y1 + α1, . . . , yd + αd)

Posterior Mean†

( (yi + αi) / ∑_{j=1}^d (yj + αj) )_{1≤i≤d}

MAP

( (yi + αi − 1) / ∑_{j=1}^d (yj + αj − 1) )_{1≤i≤d}

if yi + αi > 1 for i = 1, . . . , d

Evidence

Z(y) = Γ(∑_{i=1}^d αi) ∏_{i=1}^d Γ(yi + αi) / [ ∏_{i=1}^d Γ(αi) Γ(∑_{i=1}^d (yi + αi)) ]

†Also known as Laplace smoothing when αi = 1
Conjugate Priors for the Normal I
Conjugate Prior for the Normal Mean
For the N(y|µ, w) distribution with iid observations y1, . . . , yn, the conjugate prior for the mean µ is Gaussian N(µ|m0, v0):

π(µ|y1:n) ∝ exp[−(µ − m0)²/2v0] ∏_{k=1}^n exp[−(yk − µ)²/2w]
∝ exp{ −(1/2)[ µ²(1/v0 + n/w) − 2µ(m0/v0 + sn/w) ] }
= N( µ | (sn + m0 w/v0)/(n + w/v0) , w/(n + w/v0) )

where sn = ∑_{k=1}^n yk and y1:n denotes the collection y1, . . . , yn
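A minimal sketch of this conjugate update, directly transcribing the posterior mean and variance formulas above (the helper name is illustrative):

```python
def normal_mean_posterior(ys, w, m0, v0):
    """Posterior of mu for iid N(mu, w) data under a N(m0, v0) prior:
    N((s_n + m0*w/v0)/(n + w/v0), w/(n + w/v0)) with s_n = sum(ys)."""
    n, sn = len(ys), sum(ys)
    denom = n + w / v0
    return (sn + m0 * w / v0) / denom, w / denom

# three observations, unit data variance, standard normal prior on mu
post_mean, post_var = normal_mean_posterior([1.0, 2.0, 3.0], w=1.0, m0=0.0, v0=1.0)
```

With n = 3, sn = 6 and w/v0 = 1 the posterior is N(6/4, 1/4): the prior acts like one extra pseudo-observation at m0.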
Conjugate Priors for the Normal II
Conjugate Priors for the Normal Variance
If w is to be estimated and µ is known, the conjugate prior for w is the inverse Gamma distribution IG(w|α0, β0):

π0(w|α0, β0) = (β0^{α0}/Γ(α0)) w^{−(α0+1)} e^{−β0/w}

and

π(w|y1:n) ∝ w^{−(α0+1)} e^{−β0/w} ∏_{k=1}^n (1/√w) exp[−(yk − µ)²/2w]
= w^{−(n/2+α0+1)} exp[ −(s_n^{(2)}/2 + β0)/w ]

where s_n^{(2)} = ∑_{k=1}^n (yk − µ)².
The Gamma, Chi-Square and Inverses
The Gamma Distribution*

Ga(θ|α, β) = (β^α/Γ(α)) θ^{α−1} e^{−βθ}

where α is the shape and β the inverse scale parameter (E(θ) = α/β, Var(θ) = α/β²)

• θ ∼ IG(θ|α, β): 1/θ ∼ Ga(θ|α, β)
• θ ∼ χ²(θ|ν): θ ∼ Ga(θ|ν/2, 1/2)

*A different convention is to use Gam*(a, b), where b = 1/β is the scale parameter
Figure: Gamma pdf (k = α, θ = 1/β)
Conjugate Priors for the Normal IV
Example (Normal)
In the normal N(µ, w) case, with both µ and w unknown, the conjugate prior on θ = (µ, w) is of the form

w^{−λw} exp{ −[λµ(µ − ξ)² + α]/w }

since

π((µ, w)|x1, . . . , xn) ∝ w^{−λw} exp{−[λµ(µ − ξ)² + α]/w} × w^{−n} exp{−[n(µ − x̄)² + s²x]/w}
∝ w^{−(λw+n)} exp{ −[ (λµ + n)(µ − ξx)² + α + s²x + (nλµ/(n + λµ))(x̄ − ξ)² ]/w }
Conjugate Priors for the Normal III
Conjugate Priors are However Available Only in Simple Cases
In the previous example the conjugate prior when both µ and w are unknown is not particularly useful.
• Hence, it is very common to resort to independent marginally conjugate priors: e.g., in the Gaussian case, take N(µ|m0, v0) IG(w|α0, β0) as prior; then π(µ|w, y) is Gaussian and π(w|µ, y) is inverse-gamma, but π(µ, w|y) does not belong to a known family‡
• There nonetheless exist some important multivariate extensions: the Bayesian normal linear model, the inverse-Wishart distribution for covariance matrices
‡Although closed-form expressions for π(µ|y) and π(w|y) are available
...and conjugate curse
Conjugate priors are very limited in scope
In addition, the use of conjugate priors only for computational reasons
• implies a restriction on the modeling of the available prior information
• may be detrimental to the usefulness of the Bayesian approach
• gives an impression of subjective manipulation of the prior information disconnected from reality.
A typology of Bayes computational problems
(i) latent variable models in general;
(ii) use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;
(iii) use of a complex sampling model with an intractable likelihood, as for instance in some graphical models;
(iv) use of a huge dataset;
(v) use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample).
Random variable generation
Motivation and leading example
Random variable generation
Basic methods
Uniform pseudo-random generator
Beyond Uniform distributions
Transformation methods
Accept-Reject Methods
Fundamental theorem of simulation
Log-concave densities
Monte Carlo Integration
Notions on Markov Chains for MCMC
The Metropolis-Hastings Algorithm
The Gibbs Sampler
Case Study
• Rely on the possibility of producing (computer-wise) an endless flow of random variables (usually iid) from well-known distributions
• Given a uniform random number generator, illustration of methods that produce random variables from both standard and nonstandard distributions
Basic methods
The inverse transform method
For a function F on R, the generalized inverse of F, F⁻, is defined by

F⁻(u) = inf{x; F(x) ≥ u} .

Definition (Probability Integral Transform)
If U ∼ U[0,1], then the random variable F⁻(U) has the distribution F.
The inverse transform method (2)
To generate a random variable X ∼ F , simply generate
U ∼ U[0,1]
and then make the transform
x = F−(u)
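A hedged sketch of this transform for the exponential distribution, where F⁻(u) = −log(1 − u)/λ is available in closed form (function and variable names are illustrative):

```python
import math
import random

def exp_inverse_transform(lam, u):
    """F^-(u) for Exp(lam): F(x) = 1 - exp(-lam*x), so F^-(u) = -log(1-u)/lam."""
    return -math.log(1.0 - u) / lam   # 1 - u is in (0, 1], so log is safe

random.seed(1)
sample = [exp_inverse_transform(2.0, random.random()) for _ in range(100_000)]
sample_mean = sum(sample) / len(sample)   # should approach 1/lam = 0.5
```

The empirical mean of the draws converges to 1/λ, a quick sanity check on the transform.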
Uniform pseudo-random generator
Desiderata and limitations
• Production of a deterministic sequence of values in [0, 1] which imitates a sequence of iid uniform random variables U[0,1].
• Can't use the physical imitation of a "random draw" [no guarantee of uniformity, no reproducibility]
• Random sequence in the sense: having generated (X1, · · · , Xn), knowledge of Xn [or of (X1, · · · , Xn)] imparts no discernible knowledge of the value of Xn+1.
• Deterministic: given the initial value X0, the sample (X1, · · · , Xn) is always the same
• Validity of a random number generator is based on a single sample X1, · · · , Xn when n tends to +∞, not on replications

(X11, · · · , X1n), (X21, · · · , X2n), . . . , (Xk1, · · · , Xkn)

where n is fixed and k tends to infinity.
Uniform pseudo-random generator
Algorithm starting from an initial value 0 ≤ u0 ≤ 1 and a transformation D, which produces a sequence

(ui) = (D^i(u0))

in [0, 1]. For all n,

(u1, · · · , un)

reproduces the behavior of an iid U[0,1] sample (V1, · · · , Vn) when compared through the usual tests
Uniform pseudo-random generator (2)
• Validity means the sequence U1, · · · , Un leads to accepting the hypothesis

H : U1, · · · , Un are iid U[0,1].

• The set of tests used is generally of some consequence
◦ Kolmogorov–Smirnov and other nonparametric tests
◦ Time series methods, for correlation between Ui and (Ui−1, · · · , Ui−k)
◦ Marsaglia's battery of tests called Die Hard (!)
Usual generators
In R and S-plus: procedure runif()

The Uniform Distribution

Description:
     'runif' generates random deviates.

Example:
     u <- runif(20)

'.Random.seed' is an integer vector, containing the random number generator state for random number generation in R. It can be saved and restored, but should not be altered by users.
[Figure: trace and histogram of a uniform sample]
Usual generators (2)
In C: procedure rand() or random()

SYNOPSIS
     #include <stdlib.h>
     long int random(void);

DESCRIPTION
     The random() function uses a non-linear additive feedback random number generator employing a default table of size 31 long integers to return successive pseudo-random numbers in the range from 0 to RAND_MAX. The period of this random generator is very large, approximately 16*((2**31)-1).

RETURN VALUE
     random() returns a value between 0 and RAND_MAX.
Usual generators (3)
In Matlab and Octave: procedure rand()

RAND Uniformly distributed pseudorandom numbers.
     R = RAND(M,N) returns an M-by-N matrix containing pseudorandom values drawn from the standard uniform distribution on the open interval (0,1).
     The sequence of numbers produced by RAND is determined by the internal state of the uniform pseudorandom number generator that underlies RAND, RANDI, and RANDN.
Usual generators (4)
In Scilab: procedure rand()

rand() : with no arguments gives a scalar whose value changes each time it is referenced. By default, random numbers are uniformly distributed in the interval (0,1). rand('normal') switches to a normal distribution with mean 0 and variance 1.

EXAMPLE
     x = rand(10,10,'uniform')
Beyond Uniform distributions
Beyond Uniform generators
• Generation of any sequence of random variables can be formally implemented through a uniform generator
◦ Distributions with explicit F⁻ (for instance, exponential and Weibull distributions): use the probability integral transform
◦ Case-specific methods rely on properties of the distribution (for instance, normal distribution, Poisson distribution)
◦ More generic methods (for instance, accept-reject)
• Simulation of the standard distributions is accomplished quite efficiently by many numerical and statistical programming packages.
Transformation methods
Transformation methods
Case where a distribution F is linked in a simple way to another distribution that is easy to simulate.

Example (Exponential variables)
If U ∼ U[0,1], the random variable

X = − log U/λ

has distribution

P(X ≤ x) = P(− log U ≤ λx) = P(U ≥ e^{−λx}) = 1 − e^{−λx},

the exponential distribution Exp(λ).
Other random variables that can be generated starting from an exponential include

Y = −2 ∑_{j=1}^{ν} log(Uj) ∼ χ²_{2ν}

Y = −(1/β) ∑_{j=1}^{a} log(Uj) ∼ Ga(a, β)

Y = ∑_{j=1}^{a} log(Uj) / ∑_{j=1}^{a+b} log(Uj) ∼ Be(a, b)
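These log-of-uniform representations translate directly into code; the sketch below draws Ga(a, β) for an integer shape a (names are illustrative, stdlib only):

```python
import math
import random

def gamma_int_shape(a, beta, rng):
    """Ga(a, beta) for integer shape a: minus the sum of a log-uniforms,
    scaled by 1/beta (each -log U is one Exp(1) variable)."""
    return -sum(math.log(1.0 - rng.random()) for _ in range(a)) / beta

rng = random.Random(2)
g = [gamma_int_shape(3, 2.0, rng) for _ in range(50_000)]
mean_g = sum(g) / len(g)   # expectation is a/beta = 1.5
```

Using 1 − U instead of U avoids log 0 while leaving the distribution unchanged.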
Points to note
◦ Transformations are quite simple to use
◦ There are more efficient algorithms for gamma and beta random variables
◦ Cannot generate gamma random variables with a non-integer shape parameter
◦ For instance, cannot get a χ²₁ variable, which would get us a N(0, 1) variable.
Box-Muller Algorithm
Example (Normal variables)
If r, θ are the polar coordinates of (X1, X2), then

r² = X1² + X2² ∼ χ²₂ = Exp(1/2) and θ ∼ U[0, 2π]

Consequence: if U1, U2 are iid U[0,1],

X1 = √(−2 log U1) cos(2πU2)
X2 = √(−2 log U1) sin(2πU2)

are iid N(0, 1).
Markov Chain Monte Carlo Methods
Random variable generation
Transformation methods
Box-Muller Algorithm (2)
1. Generate U1, U2 iid U[0,1] ;
2. Define

x1 = √(−2 log u1) cos(2πu2) ,
x2 = √(−2 log u1) sin(2πu2) ;
3. Take x1 and x2 as two independent draws from N (0, 1).
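A minimal sketch of the three steps above (u1 is drawn as 1 − U so that log 0 cannot occur; names are illustrative):

```python
import math
import random

def box_muller(rng):
    """One Box-Muller step: two iid N(0,1) draws from two iid uniforms."""
    u1, u2 = 1.0 - rng.random(), rng.random()   # u1 in (0, 1]
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2 * math.pi * u2), r * math.sin(2 * math.pi * u2)

rng = random.Random(3)
zs = [z for _ in range(20_000) for z in box_muller(rng)]
mean_z = sum(zs) / len(zs)
var_z = sum(z * z for z in zs) / len(zs)
```

The empirical mean and variance of the 40 000 draws should be close to 0 and 1.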
Box-Muller Algorithm (3)
• Unlike algorithms based on the CLT, this algorithm is exact
• Get two normals for the price of two uniforms
• Drawback (in speed): calculating log, cos and sin.
More transforms
Example (Poisson generation)
Poisson–exponential connection: if N ∼ P(λ) and Xi ∼ Exp(λ), i ∈ N*,

Pλ(N = k) = Pλ(X1 + · · · + Xk ≤ 1 < X1 + · · · + Xk+1) .
More Poisson
• A Poisson variable can be simulated by generating Exp(λ) variables until their sum exceeds 1.
• This method is simple, but is really practical only for smaller values of λ.
• On average, the number of exponential variables required is λ.
• Other approaches are more suitable for large λ's.
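A sketch of this construction in its equivalent product-of-uniforms form: multiplying uniforms until the product falls below e^{−λ} counts the same number of arrivals without computing any logs (names are illustrative):

```python
import math
import random

def poisson_via_exponentials(lam, rng):
    """N = number of Exp(lam) arrivals with total time at most 1;
    sum of -log(U_i)/lam exceeding 1 is the same event as the
    running product of the U_i dropping below exp(-lam)."""
    k, prod, thresh = 0, 1.0, math.exp(-lam)
    while True:
        prod *= rng.random()
        if prod < thresh:
            return k
        k += 1

rng = random.Random(4)
draws = [poisson_via_exponentials(3.0, rng) for _ in range(20_000)]
mean_n = sum(draws) / len(draws)   # should be close to lam = 3
```

The expected number of uniforms consumed per draw is λ + 1, which is why the method degrades for large λ.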
Negative extension
• A generator of Poisson random variables can produce negative binomial random variables since

Y ∼ Ga(n, (1 − p)/p), X|y ∼ P(y)

implies

X ∼ Neg(n, p)
Mixture representation
• The representation of the negative binomial is a particular case of a mixture distribution
• The principle of a mixture representation is to represent a density f as the marginal of another distribution, for example

f(x) = ∑_{i∈Y} pi fi(x) ,

• If the component distributions fi(x) can be easily generated, X can be obtained by first choosing fi with probability pi and then generating an observation from fi.
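The two-stage recipe (choose a component with probability pi, then draw from fi) can be sketched as follows (names and the two-component normal example are illustrative):

```python
import random

def sample_mixture(weights, samplers, rng):
    """Draw from sum_i p_i f_i: pick component i with probability p_i,
    then draw one observation from f_i."""
    u, acc = rng.random(), 0.0
    for p, draw in zip(weights, samplers):
        acc += p
        if u <= acc:
            return draw(rng)
    return samplers[-1](rng)   # guard against floating-point rounding

rng = random.Random(5)
draws = [sample_mixture([0.3, 0.7],
                        [lambda r: r.gauss(0.0, 1.0),
                         lambda r: r.gauss(4.0, 1.0)], rng)
         for _ in range(20_000)]
mean_x = sum(draws) / len(draws)   # E[X] = 0.3*0 + 0.7*4 = 2.8
```

Only one uniform and one component draw are needed per observation, whatever the number of components.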
Partitioned sampling
Special case of mixture sampling when
fi(x) = f(x) I_{Ai}(x) / ∫_{Ai} f(x) dx

and

pi = P(X ∈ Ai)

for a partition (Ai)i
Accept-Reject Methods
Accept-Reject algorithm
• There are many distributions from which it is difficult, or even impossible, to directly simulate.
• Another class of methods only requires us to know the functional form of the density f of interest up to a multiplicative constant.
• The key to this method is to use a simpler (simulation-wise) density g, the instrumental density, from which the simulation from the target density f is actually done.
Fundamental theorem of simulation
Fundamental theorem of simulation
Lemma
Simulating
X ∼ f(x)
is equivalent to simulating

(X, U) ∼ U{(x, u) : 0 < u < f(x)}
The Accept-Reject algorithm
Given a density of interest f, find a density g and a constant M such that

f(x) ≤ M g(x)

on the support of f.

Accept-Reject Algorithm

1. Generate X ∼ g, U ∼ U[0,1] ;
2. Accept Y = X if U ≤ f(X)/(M g(X)) ;
3. Return to 1. otherwise.
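A generic sketch of the algorithm above; f need only be known up to the constant folded into M (names and the Be(2, 2) usage example are illustrative):

```python
import random

def accept_reject(f, g_sample, g_pdf, M, rng):
    """Draw X ~ g repeatedly until U <= f(X) / (M g(X))."""
    while True:
        x = g_sample(rng)
        if rng.random() <= f(x) / (M * g_pdf(x)):
            return x

rng = random.Random(6)
f = lambda x: 6.0 * x * (1.0 - x)   # Be(2, 2) density, max value 1.5 at x = 1/2
draws = [accept_reject(f, lambda r: r.random(), lambda x: 1.0, 1.5, rng)
         for _ in range(20_000)]
mean_x = sum(draws) / len(draws)    # E[X] = 1/2 for Be(2, 2)
```

With a uniform proposal and M = 1.5, each draw costs 1.5 proposals on average, matching the 1/M acceptance rate discussed below.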
Validation of the Accept-Reject method
Warranty:
This algorithm produces a variable Y distributed according to f
Two interesting properties
◦ First, it provides a generic method to simulate from any density f that is known up to a multiplicative factor. This property is particularly important in Bayesian calculations, where the posterior distribution

π(θ|x) ∝ π(θ) f(x|θ)

is specified up to a normalizing constant
◦ Second, the probability of acceptance in the algorithm is 1/M, i.e., the expected number of trials until a variable is accepted is M
More interesting properties
◦ In cases where f and g are both probability densities, the constant M is necessarily larger than 1.
◦ The size of M, and thus the efficiency of the algorithm, is a function of how closely g can imitate f, especially in the tails
◦ For f/g to remain bounded, it is necessary for g to have tails thicker than those of f. It is therefore impossible to use the A-R algorithm to simulate a Cauchy distribution f using a normal distribution g; however, the reverse works quite well.
No Cauchy!
Example (Normal from a Cauchy)
Take
f(x) = (1/√(2π)) exp(−x²/2)

and

g(x) = (1/π) · 1/(1 + x²) ,

the densities of the normal and Cauchy distributions. Then

f(x)/g(x) = √(π/2) (1 + x²) e^{−x²/2} ≤ √(2π/e) = 1.52

attained at x = ±1.
Example (Normal from a Cauchy (2))
So probability of acceptance
1/1.52 = 0.66,
and, on the average, one out of every three simulated Cauchy variables is rejected.
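This example can be sketched directly, with the Cauchy proposal drawn by inverse transform and M = √(2π/e) (names are illustrative):

```python
import math
import random

M = math.sqrt(2.0 * math.pi / math.e)   # the bound 1.52 on f/g

def normal_from_cauchy(rng):
    """Accept-Reject: N(0,1) target, C(0,1) proposal."""
    while True:
        x = math.tan(math.pi * (rng.random() - 0.5))   # Cauchy via inverse cdf
        ratio = math.sqrt(math.pi / 2.0) * (1 + x * x) * math.exp(-x * x / 2) / M
        if rng.random() <= ratio:
            return x

rng = random.Random(7)
zs = [normal_from_cauchy(rng) for _ in range(20_000)]
mean_z = sum(zs) / len(zs)
var_z = sum(z * z for z in zs) / len(zs)
```

The empirical acceptance frequency hovers near 1/1.52 ≈ 0.66, in line with the slide.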
No Double!
Example (Normal/Double Exponential)
Generate a N(0, 1) by using a double-exponential distribution with density

g(x|α) = (α/2) exp(−α|x|)

Then

f(x)/g(x|α) ≤ √(2/π) α⁻¹ e^{α²/2}

and the minimum of this bound (in α) is attained for

α* = 1
Example (Normal/Double Exponential (2))
Probability of acceptance: √(π/2e) = .76
To produce one normal random variable requires on the average 1/.76 ≈ 1.3 uniform variables.
Example (Gamma generation)
Illustrates a real advantage of the Accept-Reject algorithm: the gamma distribution Ga(α, β) can be represented as the sum of α exponential random variables only if α is an integer
Example (Gamma generation (2))
Can use the Accept-Reject algorithm with instrumental distribution
Ga(a, b), with a = [α], α ≥ 1.

(Without loss of generality, β = 1.)
Up to a normalizing constant,

f/g_b = b^{−a} x^{α−a} exp{−(1 − b)x} ≤ b^{−a} ( (α − a)/((1 − b)e) )^{α−a}

for b ≤ 1. The maximum is attained at b = a/α.
Truncated Normal simulation
Example (Truncated Normal distributions)
The constraint x ≥ µ̲ produces a density proportional to

e^{−(x−µ)²/2σ²} I_{x≥µ̲}

for a bound µ̲ large compared with µ.
There exist alternatives far superior to the naive method of generating N(µ, σ²) draws until one exceeds µ̲, which requires an average number of

1/Φ((µ − µ̲)/σ)

simulations from N(µ, σ²) for a single acceptance.
Example (Truncated Normal distributions (2))
Instrumental distribution: the translated exponential distribution E(α, µ̲), with density

gα(z) = α e^{−α(z−µ̲)} I_{z≥µ̲} .

The ratio f/gα is bounded by

f/gα ≤ 1/α · exp(α²/2 − αµ̲)  if α > µ̲ ,
f/gα ≤ 1/α · exp(−µ̲²/2)    otherwise.
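A sketch of this translated-exponential Accept-Reject sampler for a standard normal truncated to [a, ∞); the rate α = (a + √(a² + 4))/2 used here is the choice that minimises the bound in the literature on truncated normal simulation, and all names are illustrative:

```python
import math
import random

def truncated_normal(a, rng):
    """N(0,1) restricted to [a, inf) via a translated exponential proposal.
    With alpha >= a, f/(M g) simplifies to exp(-(z - alpha)^2 / 2)."""
    alpha = (a + math.sqrt(a * a + 4.0)) / 2.0
    while True:
        z = a - math.log(1.0 - rng.random()) / alpha   # Exp(alpha) shifted to a
        if rng.random() <= math.exp(-0.5 * (z - alpha) ** 2):
            return z

rng = random.Random(8)
zs = [truncated_normal(2.0, rng) for _ in range(5_000)]
mean_z = sum(zs) / len(zs)   # phi(2)/(1 - Phi(2)), roughly 2.37
```

Unlike the naive method, whose cost explodes as 1/Φ(µ − µ̲) for a far-out bound, this sampler keeps a high acceptance rate for any a.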
Log-concave densities
Log-concave densities (1)
Densities f whose logarithm is concave, for instance Bayesian posterior distributions such that

log π(θ|x) = log π(θ) + log f(x|θ) + c

is concave
Log-concave densities (2)
Take

Sn = {xi, i = 0, 1, . . . , n + 1} ⊂ supp(f)

such that h(xi) = log f(xi) is known up to the same constant.
By concavity of h, the line Li,i+1 through (xi, h(xi)) and (xi+1, h(xi+1)) is

• below h in [xi, xi+1] and
• above this graph outside this interval
Log-concave densities (3)
For x ∈ [xi, xi+1], if

h̄n(x) = min{Li−1,i(x), Li+1,i+2(x)} and h̲n(x) = Li,i+1(x) ,

the envelopes are

h̲n(x) ≤ h(x) ≤ h̄n(x)

uniformly on the support of f, with

h̲n(x) = −∞ and h̄n(x) = min(L0,1(x), Ln,n+1(x))

on [x0, xn+1]^c.
Log-concave densities (4)
Therefore, if

f̲n(x) = exp h̲n(x) and f̄n(x) = exp h̄n(x)

then

f̲n(x) ≤ f(x) ≤ f̄n(x) = ϖn gn(x) ,

where ϖn is the normalizing constant of f̄n
ARS Algorithm
1. Initialize n and Sn.
2. Generate X ∼ gn(x), U ∼ U[0,1].
3. If U ≤ f̲n(X)/ϖn gn(X), accept X;
otherwise, if U ≤ f(X)/ϖn gn(X), accept X
Example (Northern Pintail ducks)
Ducks are captured at time i with probability pi; both the pi and the size N of the population are unknown.
Dataset

(n1, . . . , n11) = (32, 20, 8, 5, 1, 2, 0, 2, 1, 1, 0)

Number of recoveries over the years 1957–1968 of 1612 Northern Pintail ducks banded in 1956
Example (Northern Pintail ducks (2))
Corresponding conditional likelihood
L(n1, . . . , nI |N, p1, . . . , pI) ∝ ∏_{i=1}^I pi^{ni} (1 − pi)^{N−ni} ,

where I is the number of captures, ni the number of captured animals during the ith capture, and r the total number of different captured animals.
Example (Northern Pintail ducks (3))
Prior selection:
If

N ∼ P(λ)

and

αi = log( pi/(1 − pi) ) ∼ N(µi, σ²),

[Normal logistic]
Example (Northern Pintail ducks (4))
Posterior distribution
π(α, N |n1, . . . , nI) ∝ [N!/(N − r)!] (λ^N/N!) ∏_{i=1}^I (1 + e^{αi})^{−N} ∏_{i=1}^I exp{ αi ni − (αi − µi)²/2σ² }
Example (Northern Pintail ducks (5))
For the conditional posterior distribution
For the conditional posterior distribution

π(αi|N, n1, . . . , nI) ∝ exp{ αi ni − (αi − µi)²/2σ² } / (1 + e^{αi})^N ,

the ARS algorithm can be implemented since

αi ni − (αi − µi)²/2σ² − N log(1 + e^{αi})

is concave in αi.
[Figure: posterior distributions of capture log-odds ratios for the years 1957–1965]
[Figure (1960): true distribution versus histogram of simulated sample]
Monte Carlo Integration
Motivation and leading example
Random variable generation
Monte Carlo Integration
Introduction
Monte Carlo Integration
Importance Sampling
Self-Normalised Importance Sampling
Acceleration Methods
Application to Bayesian Model Choice
Notions on Markov Chains for MCMC
The Metropolis-Hastings Algorithm
The Gibbs Sampler
Case Study
Introduction
Quick reminder
Two major classes of numerical problems arise in statistical inference:
◦ Optimization: generally associated with the likelihood approach
◦ Integration: generally associated with the Bayesian approach
Example (Bayesian decision theory)
Bayes estimators are not always posterior expectations, but rather solutions of the minimization problem

min_δ ∫_Θ L(θ, δ) π(θ) f(x|θ) dθ .

Proper loss: for L(θ, δ) = (θ − δ)², the Bayes estimator is the posterior mean
Absolute error loss: for L(θ, δ) = |θ − δ|, the Bayes estimator is the posterior median
Without a loss function: use the maximum a posteriori (MAP) estimator

arg max_θ f(x|θ)π(θ)
Monte Carlo integration
Theme: the generic problem of evaluating the integral

I = Ef[h(X)] = ∫_X h(x) f(x) dx

where X is uni- or multidimensional, f is a closed form, partly closed form, or implicit density, and h is a function
Monte Carlo integration (2)
Monte Carlo solution: first use a sample (X1, . . . , Xm) from the density f to approximate the integral I by the empirical average

h̄m = (1/m) ∑_{j=1}^m h(Xj)

which converges,

h̄m −→ Ef[h(X)] ,

by the Strong Law of Large Numbers
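The empirical-average estimator is a few lines of code; the sketch below uses E[X²] = 1 under N(0, 1) as a check, and all names are illustrative:

```python
import random

def mc_integrate(h, sampler, m, rng):
    """Approximate E_f[h(X)] by the empirical average over m draws from f."""
    return sum(h(sampler(rng)) for _ in range(m)) / m

rng = random.Random(9)
# E[X^2] under N(0, 1) is exactly 1
est = mc_integrate(lambda x: x * x, lambda r: r.gauss(0.0, 1.0), 100_000, rng)
```

The error decreases as O(1/√m) regardless of the dimension of X, which is the method's main selling point over quadrature.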
Monte Carlo precision
Estimate the variance with

vm = (1/(m − 1)) ∑_{j=1}^m [h(Xj) − h̄m]² ,

and for m large,

√m (h̄m − Ef[h(X)]) / √vm ∼ N(0, 1),

(assuming Ef[h²(X)] < ∞).

Note: this can lead to the construction of a convergence test and of confidence bounds on the approximation of Ef[h(X)].
Example (Cauchy prior/normal sample)
For estimating a normal mean, a robust prior is a Cauchy prior:

X ∼ N(θ, 1), θ ∼ C(0, 1).

Under squared error loss, the posterior mean is

δπ(x) = ∫_{−∞}^{∞} [θ/(1 + θ²)] e^{−(x−θ)²/2} dθ / ∫_{−∞}^{∞} [1/(1 + θ²)] e^{−(x−θ)²/2} dθ
Example (Cauchy prior/normal sample (2))
Form of δπ suggests simulating iid variables
θ1, · · · , θm ∼ N (x, 1)
and calculating
δπm(x) =m∑i=1
θi1 + θ2
i
/ m∑i=1
11 + θ2
i
.
The Law of Large Numbers implies
δπm(x) −→ δπ(x) as m −→∞.
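A minimal sketch of this ratio estimator (Python rather than the course's R, purely for illustration; x = 10 as in the figure that follows):

```python
import random

random.seed(0)

def delta_pi(x, m=100_000):
    """Estimate the posterior mean: theta_i ~ N(x, 1), weighted by the
    Cauchy prior factor 1/(1 + theta^2)."""
    num = den = 0.0
    for _ in range(m):
        t = random.gauss(x, 1.0)
        w = 1.0 / (1.0 + t * t)
        num += t * w
        den += w
    return num / den

print(delta_pi(10.0))  # slightly shrunk from x = 10 toward the prior median 0
```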
[Figure: Range of estimators δ^π_m over 1000 iterations for 100 runs and x = 10; the estimates spread over roughly 9.6 to 10.6.]
Importance Sampling

Importance sampling

Paradox: simulation from f (the true density) is not necessarily optimal

An alternative to direct sampling from f is importance sampling, based on the alternative representation

E_f[h(X)] = ∫_X [h(x) f(x)/g(x)] g(x) dx ,

which allows us to use distributions other than f
Importance sampling (2)

Importance sampling algorithm

To evaluate

E_f[h(X)] = ∫_X h(x) f(x) dx ,

do

1. Generate a sample X_1, . . . , X_m from a distribution g

2. Use the weighted approximation

(1/m) ∑_{j=1}^m [f(X_j)/g(X_j)] h(X_j)
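A sketch of the algorithm in Python, under the assumed toy choices f = N(0, 1), g = N(0, 2²) and h(x) = I(x > 2), so that the target value is the tail probability P(X > 2) ≈ 0.0228:

```python
import math
import random

random.seed(2)

SQRT2PI = math.sqrt(2 * math.pi)

def norm_pdf(x, mu=0.0, sd=1.0):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * SQRT2PI)

def is_estimate(h, m=200_000):
    """(1/m) sum_j f(X_j)/g(X_j) h(X_j) with X_j ~ g = N(0, 2^2), f = N(0, 1)."""
    total = 0.0
    for _ in range(m):
        x = random.gauss(0.0, 2.0)
        total += norm_pdf(x) / norm_pdf(x, 0.0, 2.0) * h(x)
    return total / m

print(is_estimate(lambda x: 1.0 if x > 2.0 else 0.0))  # near P(X > 2) ~ 0.0228
```

Because g has heavier tails than f, the weights are bounded and the estimator has finite variance, in line with the conditions discussed below.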
Consistency of importance sampling

The estimator

(1/m) ∑_{j=1}^m [f(X_j)/g(X_j)] h(X_j) −→ ∫_X h(x) f(x) dx

converges almost surely for any choice of the distribution g [as long as supp(g) ⊃ supp(f)]
Important details

◦ The instrumental distribution g is chosen among distributions that are easy to simulate

◦ The same sample (generated from g) can be used repeatedly, not only for different functions h, but also for different densities f

◦ Even dependent proposals can be used, as seen later in the PMC chapter
Although g can be any density, some choices are better than others:

◦ Finite variance only when

E_f[h²(X) f(X)/g(X)] = ∫_X h²(x) f²(x)/g(x) dx < ∞ .

◦ Instrumental distributions with tails lighter than those of f (that is, with sup f/g = ∞) are not appropriate.

◦ If sup f/g = ∞, the weights f(x_j)/g(x_j) vary widely, giving too much importance to a few values x_j.

◦ If sup f/g = M < ∞, the accept-reject algorithm can be used as well to simulate from f directly.
Example (Counter-Example: Cauchy target)

Case of a Cauchy C(0, 1) target when the importance function is Gaussian N(0, 1). The ratio of the densities,

ω(x) = p⋆(x)/p_0(x) = √(2π) exp(x²/2) / (π (1 + x²)) ,

is very badly behaved: e.g.,

∫_{−∞}^{∞} ω(x)² p_0(x) dx = ∞ .

Poor performance of the associated importance sampling estimator
[Figure: Range and average of 500 replications of the IS estimate of E[exp(−X)] over 10,000 iterations.]
Optimal importance function

The choice of g that minimizes§ the variance of the importance sampling estimator is

g*(x) = |h(x)| f(x) / ∫_X |h(x′)| f(x′) dx′ .

A rather formal optimality result, since the optimal choice of g*(x) requires the knowledge of I, the integral of interest!

§This choice ensures that, for a positive function h, ĥ_m = I, even for a single draw!
Self-Normalised Importance Sampling

Self-Normalized Importance Sampling (Hammersley & Handscomb, 1964)

In self-normalized (or Bayesian) importance sampling, expectations under the target pdf f,

E_f[h(X)] ,

are estimated as

h̄^(g)_m = ∑_{i=1}^m [ω_i / ∑_{j=1}^m ω_j] h(X_i) = [(1/m) ∑_{i=1}^m ω_i h(X_i)] / [(1/m) ∑_{j=1}^m ω_j] ,

where
I X_i ∼ iid g, where g is an instrumental pdf
I ω_i = (f/g)(X_i)

and W_i = ω_i / ∑_{j=1}^m ω_j denote the normalized weights.

This form of importance sampling does not require f to be properly normalized.
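A Python sketch, where the target is deliberately left unnormalised (exp(−x²/2), i.e. N(0, 1) without its constant) to illustrate that only f/g up to a constant is needed; the Uniform(−5, 5) instrumental is an arbitrary assumption of this example:

```python
import math
import random

random.seed(3)

def snis(h, log_f_unnorm, sampler, g_pdf, m=100_000):
    """Self-normalised IS: the target need only be known up to a constant."""
    num = den = 0.0
    for _ in range(m):
        x = sampler()
        w = math.exp(log_f_unnorm(x)) / g_pdf(x)  # unnormalised weight omega_i
        num += w * h(x)
        den += w
    return num / den

# unnormalised N(0, 1) target, Uniform(-5, 5) instrumental
est = snis(lambda x: x * x,
           lambda x: -0.5 * x * x,
           lambda: random.uniform(-5.0, 5.0),
           lambda x: 0.1)
print(est)  # approximates E[X^2] = 1 under N(0, 1)
```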
Performance of self-normalized IS

Assuming that E_f[(f/g)(X)(1 + h²(X))] < ∞, h̄^(g)_m is consistent and asymptotically normal, with asymptotic variance given by

υ(g) = E_f[(f/g)(X) (h(X) − E_f(h))²] .

The asymptotic variance can be estimated from the IS sample by

υ^(g)_m = m ∑_{i=1}^m (W_i)² {h(X_i) − h̄^(g)_m}² ,

where W_i = ω_i / ∑_{j=1}^m ω_j are the normalized weights.
Elements of proof

Consistency: If f̄ = f/c, where c = ∫ f(x)dx, is a pdf, then

E_g[ω_i h(X_i)] = E_g[(f/g)(X) h(X)] = c E_f̄[h(X)] .

CLT:

√m (h̄^(g)_m − E_f(h)) = [(1/√m) ∑_{i=1}^m ω_i (h(X_i) − E_f(h))] / [(1/m) ∑_{j=1}^m ω_j]

and var_g[ω_i (h(X_i) − E_f(h))] = c² E_f[(f/g)(X)(h(X) − E_f(h))²].

Empirical Estimate:

υ^(g)_m(h) = ∑_{i=1}^m W_i [(f/g)(X_i) / ((1/m) ∑_{j=1}^m (f/g)(X_j))] {h(X_i) − h̄^(g)_m}²
Empirical Diagnostic Tool for IS

Effective Sample Size

ESS^(g)_m = [∑_{i=1}^m (W_i)²]^{−1}

1. 1 ≤ ESS^(g)_m ≤ m

2. m/ESS^(g)_m is an estimate of E_f[(f/g)(X)], which is the maximal IS asymptotic variance for functions h such that |h(x) − E_f(h)| ≤ 1 (note: these functions have maximal Monte Carlo variance of 1 under f).

3. m/ESS^(g)_m − 1 is an estimator of the χ² divergence

∫ (f(x) − g(x))² / g(x) dx

(√(m/ESS^(g)_m − 1) is the empirical coefficient of variation of the weights).
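The effective sample size is a one-liner on the unnormalised weights; a Python sketch:

```python
def ess(weights):
    """ESS from unnormalised weights: (sum w)^2 / sum w^2, i.e. [sum W_i^2]^{-1}."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2

assert ess([1.0] * 100) == 100.0                       # equal weights: ESS = m
assert abs(ess([5.0, 5.0]) - ess([1.0, 1.0])) < 1e-12  # invariant to rescaling
assert ess([1.0, 0.0, 0.0]) == 1.0                     # one dominant weight: ESS -> 1
```

The two extreme cases above match property 1: ESS ranges from 1 (a single weight carries all the mass) to m (perfectly balanced weights).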
Optimal IS Instrumental Density

When considering large classes of functions h, IS generally appears to perform worse than Monte Carlo under f: e.g., for functions such that |h(x) − E_f(h)| ≤ 1, the maximal Monte Carlo asymptotic variance is 1 whereas the maximal IS asymptotic variance is E_f[(f/g)(X)], which is strictly larger than 1 by Jensen's inequality, unless g = f.

However, for a fixed function h, the optimal IS density is not f but

|h(x) − E_f(h)| f(x) / ∫ |h(x′) − E_f(h)| f(x′) dx′ ,

as the Cauchy–Schwarz inequality implies that

E_f[(f/g)(X) (h(X) − E_f(h))²] · E_f[(g/f)(X)] ≥ E_f²[|h(X) − E_f(h)|] ,

where E_f[(g/f)(X)] = 1.
Example (Bayesian Posterior)

Consider a simple model in which

θ ∼ N(1, 2²) ,
X|θ ∼ N(θ², σ²) ,

and apply IS to compute expectations under the posterior for X = 2, using the prior as instrumental pdf.
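A sketch of this example in Python (the model and the values x = 2, σ = 2 are taken from the slide; since the prior is the instrumental, the self-normalised weights reduce to the likelihood):

```python
import math
import random

random.seed(4)

def posterior_mean_and_ess(x=2.0, sigma=2.0, m=100_000):
    """theta ~ N(1, 2^2) prior as instrumental; weight = likelihood of x."""
    num = den = s2 = 0.0
    for _ in range(m):
        th = random.gauss(1.0, 2.0)                        # draw from the prior
        w = math.exp(-0.5 * ((x - th * th) / sigma) ** 2)  # N(theta^2, sigma^2) likelihood
        num += w * th
        den += w
        s2 += w * w
    return num / den, (den * den / s2) / m                 # SNIS mean, ESS/m

mean, ess_frac = posterior_mean_and_ess()
print(mean, ess_frac)  # the slide reports lim ESS/m ~ 0.68 for sigma = 2
```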
[Figure: instrumental (prior) and target (posterior) densities for two values of σ.]

σ = 2: lim ESS^(g)_m/m ≈ 0.68, υ(g)[h(θ) = θ] = 2.0, υ(f) = 1.4
σ = 0.5: lim ESS^(g)_m/m ≈ 0.19, υ(g)[h(θ) = θ] = 9.0, υ(f) = 1.7
IS suffers from the curse of dimensionality

As the dimension increases, the discrepancy between the importance and target distributions worsens.

To illustrate this issue, consider the simple model where both the target and the instrumental are d-dimensional product distributions:

f_d(x_{1:d}) = ∏_{i=1}^d f(x_i) ,  g_d(x_{1:d}) = ∏_{i=1}^d g(x_i) .

Then the unnormalized importance weights

ω = ∏_{i=1}^d (f/g)(X_i)

are a product of iid terms, each with expectation 1.
IS suffers from the curse of dimensionality (Contd.)

For a function of interest h that only depends on the first coordinate x_1, the asymptotic variance of the self-normalized IS approximation to E_f(h) is given by

υ_d(h) = ∫· · ·∫ (∏_{i=1}^d (f/g)(x_i))² (h(x_1) − E_f(h))² ∏_{i=1}^d g(x_i) dx_1 . . . dx_d

= (∫ (f/g)(x) f(x) dx)^{d−1} ∫ (f/g)(x_1) (h(x_1) − E_f(h))² f(x_1) dx_1 ,

where the first factor, ∫ (f/g)(x) f(x) dx, is strictly larger than 1 and hence grows geometrically with d.

In practice, this situation can usually be detected by monitoring the effective sample size criterion, which becomes abnormally small.
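The geometric degradation can be observed directly on the ESS; a Python sketch with the assumed toy choices f = N(0, 1) and g = N(0, 1.5²) per coordinate:

```python
import math
import random

random.seed(5)

def ess_fraction(d, m=5_000):
    """ESS/m for product weights with f = N(0,1) and g = N(0, 1.5^2) per coordinate."""
    log15 = math.log(1.5)
    ws = []
    for _ in range(m):
        logw = 0.0
        for _ in range(d):
            x = random.gauss(0.0, 1.5)
            # log f(x) - log g(x) for the two mean-zero normals
            logw += -0.5 * x * x + 0.5 * (x / 1.5) ** 2 + log15
        ws.append(math.exp(logw))
    s1 = sum(ws)
    s2 = sum(w * w for w in ws)
    return s1 * s1 / (s2 * m)

print(ess_fraction(1), ess_fraction(50))  # collapses as d grows
```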
Importance Sampling Scaling Example

Figure: Importance sampling (m = 1000 samples) for a multivariate Normal distribution truncated at ±4σ with uniform proposals (2D marginal for increasing values of d ∈ {2, 5, 10, 20}; the radius of each point increases with its importance weight; the circle shows the 99% region for the 2D marginal).

Here E_f[f(X)/g(X)] = 2.26
Example (Stochastic volatility model)

y_k = β exp(x_k/2) ε_k ,  ε_k ∼ N(0, 1) ,

with AR(1) log-variance process (or volatility)

x_{k+1} = φ x_k + σ u_k ,  u_k ∼ N(0, 1) .
[Figure: Evolution of IBM stocks over 500 days (corrected for trend and log-ratio-ed).]
Example (Stochastic volatility model (2))

The observed likelihood is unavailable in closed form. The joint posterior distribution of the hidden state sequence {X_k}_{1≤k≤n} can be evaluated explicitly as

∏_{k=2}^n exp[−{σ^{−2}(x_k − φx_{k−1})² + β^{−2} exp(−x_k) y_k² + x_k}/2] ,   (2)

up to a normalizing constant.
Computational problems

Example (Stochastic volatility model (3))

Direct simulation from this distribution is impossible because of

(a) the dependence among the X_k's,

(b) the dimension of the sequence {X_k}_{1≤k≤n}, and

(c) the exponential term exp(−x_k) y_k² within (2).
Importance sampling

Example (Stochastic volatility model (4))

We illustrate below the behavior of (self-normalized) importance sampling on trajectories simulated under the model with φ = 0.98, σ = 0.17 and β = 0.64, when proposing whole trajectories {X_k}_{1≤k≤n} from the AR(1) prior, using m = 1000 simulations.

The importance weight associated with x_1, . . . , x_n reduces to

∏_{k=1}^n exp[−{β^{−2} exp(−x_k) y_k² + x_k}/2] .
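A Python sketch of this experiment (the model parameters are from the slide; the trajectory length and starting the chain at 0 are assumptions of this sketch):

```python
import math
import random

random.seed(6)

phi, sigma, beta = 0.98, 0.17, 0.64   # slide values

def simulate(n):
    """Simulate observations y_1..y_n from the stochastic volatility model."""
    x, ys = 0.0, []
    for _ in range(n):
        x = phi * x + sigma * random.gauss(0.0, 1.0)
        ys.append(beta * math.exp(x / 2.0) * random.gauss(0.0, 1.0))
    return ys

def ess_prior_proposal(ys, m=1000):
    """Propose whole trajectories from the AR(1) prior; the log-weight is the
    log-likelihood sum_k -(beta^-2 exp(-x_k) y_k^2 + x_k)/2."""
    logws = []
    for _ in range(m):
        x, logw = 0.0, 0.0
        for y in ys:
            x = phi * x + sigma * random.gauss(0.0, 1.0)
            logw += -0.5 * (math.exp(-x) * y * y / beta ** 2 + x)
        logws.append(logw)
    mx = max(logws)                         # stabilise before exponentiating
    ws = [math.exp(lw - mx) for lw in logws]
    s1 = sum(ws)
    s2 = sum(w * w for w in ws)
    return s1 * s1 / s2

ys = simulate(50)
print(ess_prior_proposal(ys[:1]), ess_prior_proposal(ys))  # ESS degrades with n
```

The degeneracy with the trajectory length n mirrors the curse-of-dimensionality discussion above and is what the ESS figures that follow display.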
[Figure: Typical evolution of the ESS (500 independent replications of the particles, m = 1000) as a function of the time index, for a single observation trajectory of length 50.]
[Figure: Evolution of the ESS (500 independent simulations) as a function of the time index, when averaging over observation trajectories.]
Typical Evolution of the Weight Distribution

Figure: Histograms of the base 10 logarithm of the normalized importance weights for subtrajectories of length (from top to bottom) 1, 10 and 100 samples.
Acceleration Methods

Correlated simulations

Negative correlation reduces variance

A special technique, but efficient when it applies

Use two samples (X_1, . . . , X_m) and (Y_1, . . . , Y_m) from f to estimate

I = ∫_R h(x) f(x) dx

by

Î_1 = (1/m) ∑_{i=1}^m h(X_i)  and  Î_2 = (1/m) ∑_{i=1}^m h(Y_i) ,

each with mean I and variance σ²
Variance reduction

Variance of the average:

var((Î_1 + Î_2)/2) = σ²/2 + (1/2) cov(Î_1, Î_2) .

If the two samples are negatively correlated,

cov(Î_1, Î_2) ≤ 0 ,

they improve on two independent samples of the same size
Antithetic variables

◦ If f is symmetric about µ, take Y_i = 2µ − X_i

◦ If X_i = F^{−1}(U_i), take Y_i = F^{−1}(1 − U_i)

◦ If (A_i)_i is a partition of X, use partitioned sampling by sampling the X_j's in each A_i (requires knowing Pr(A_i))
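A Python sketch of the first antithetic rule (f = N(0, 1), symmetric about µ = 0, with the assumed test function h(x) = exp(x), which is monotone, so h(X) and h(−X) are negatively correlated):

```python
import math
import random

random.seed(7)

def antithetic_pairs(h, m=50_000):
    """Compare per-draw variances of h(X) and of the antithetic average
    (h(X) + h(-X))/2 for f = N(0, 1), symmetric about mu = 0."""
    plain, anti = [], []
    for _ in range(m):
        x = random.gauss(0.0, 1.0)
        plain.append(h(x))
        anti.append(0.5 * (h(x) + h(-x)))
    def var(v):
        mean = sum(v) / len(v)
        return sum((u - mean) ** 2 for u in v) / (len(v) - 1)
    return var(plain), var(anti)

v_plain, v_anti = antithetic_pairs(lambda x: math.exp(x))
print(v_plain, v_anti)  # the antithetic version has smaller variance
```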
Control variates

For

I = ∫ h(x) f(x) dx

unknown and

I_0 = ∫ h_0(x) f(x) dx

known, I_0 is estimated by Î_0 and I is estimated by Î
Control variates (2)

Combined estimator:

Î* = Î + β(Î_0 − I_0)

Î* is unbiased for I and

var(Î*) = var(Î) + β² var(Î_0) + 2β cov(Î, Î_0)
Optimal control

Optimal choice of β:

β⋆ = − cov(Î, Î_0) / var(Î_0) ,

with

var(Î⋆) = (1 − ρ²) var(Î) ,

where ρ is the correlation between Î and Î_0.

Usual solution: the regression coefficient of h(x_i) over h_0(x_i)
Example (Quantile Approximation)

Evaluate

ϱ = P(X > a) = ∫_a^∞ f(x) dx

by

ϱ̂ = (1/n) ∑_{i=1}^n I(X_i > a) ,

with X_i iid from f, assuming that the median µ (P(X > µ) = 1/2) is known.
Example (Quantile Approximation (2))

The control variate estimator

ϱ̂* = (1/n) ∑_{i=1}^n I(X_i > a) + β ((1/n) ∑_{i=1}^n I(X_i > µ) − P(X > µ))

improves upon ϱ̂ if

β < 0 and |β| < 2 cov(ϱ̂, ϱ̂_0)/var(ϱ̂_0) = 2 P(X > a)/P(X > µ) .
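A Python sketch of this control variate, for the assumed target f = N(0, 1), a = 1 (so ϱ ≈ 0.1587 and µ = 0) and the admissible choice β = −0.2:

```python
import random

random.seed(8)

def tail_prob(a=1.0, beta=-0.2, n=100_000):
    """Estimate P(X > a), X ~ N(0, 1), with I(X > mu) as control variate;
    the median mu = 0 is known, so P(X > mu) = 1/2.  Here
    |beta| = 0.2 < 2 P(X > 1)/P(X > 0) ~ 0.63, so the bound is satisfied."""
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    basic = sum(x > a for x in xs) / n
    ctrl = basic + beta * (sum(x > 0.0 for x in xs) / n - 0.5)
    return basic, ctrl

basic, ctrl = tail_prob()
print(basic, ctrl)  # both near P(X > 1) ~ 0.1587
```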
Integration by conditioning

Use the Rao–Blackwell Theorem:

var(E[δ(X)|Y]) ≤ var(δ(X))
Consequence

If Î is an unbiased estimator of I = E_f[h(X)], with X simulated from a joint density f(x, y), where

∫ f(x, y) dy = f(x) ,

the estimator

Î* = E_f[Î|Y_1, . . . , Y_n]

dominates Î(X_1, . . . , X_n) variance-wise (and is unbiased)
Example (Student's t expectation)

For

E[h(X)] = E[exp(−X²)] with X ∼ T(ν, 0, σ²) ,

a Student's t variable can be simulated as

X|y ∼ N(µ, σ²y) and Y^{−1} ∼ χ²_ν .
Example (Student's t expectation (2))

The empirical average

(1/m) ∑_{j=1}^m exp(−X_j²)

can be improved upon from the joint sample

((X_1, Y_1), . . . , (X_m, Y_m)) ,

since

(1/m) ∑_{j=1}^m E[exp(−X²)|Y_j] = (1/m) ∑_{j=1}^m 1/√(2σ²Y_j + 1)

is the conditional expectation. In this example, the precision is ten times better.
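A Python sketch comparing the two estimators; note that the mixture representation is normalised here as Y^{−1} ∼ χ²_ν/ν (an assumption of this sketch, chosen so that X is a standard T(ν, 0, σ²) variable):

```python
import math
import random

random.seed(9)

def student_t_expectation(nu=4.6, sigma=1.0, m=20_000):
    """Empirical average of exp(-X^2) vs. its Rao-Blackwellised version
    E[exp(-X^2)|Y] = 1/sqrt(2 sigma^2 Y + 1), using the scale mixture
    X|y ~ N(0, sigma^2 y) with 1/Y ~ chi^2_nu / nu (assumed normalisation)."""
    emp = rb = 0.0
    for _ in range(m):
        inv_y = random.gammavariate(nu / 2.0, 2.0 / nu)  # 1/Y ~ chi^2_nu / nu
        y = 1.0 / inv_y
        x = random.gauss(0.0, sigma * math.sqrt(y))
        emp += math.exp(-x * x)
        rb += 1.0 / math.sqrt(2.0 * sigma * sigma * y + 1.0)
    return emp / m, rb / m

print(student_t_expectation())  # the two estimates agree; the second is less variable
```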
[Figure: Estimators of E[exp(−X²)] over 10,000 iterations: empirical average (full) and conditional expectation (dotted) for (ν, µ, σ) = (4.6, 0, 1); both stabilize between 0.50 and 0.60.]
Application to Bayesian Model Choice

Bayesian model choice

Probabilise the entire model/parameter space:

I allocate probabilities p_i to all models M_i

I define priors π_i(θ_i) for each parameter space Θ_i

I compute

π(M_i|x) = p_i ∫_{Θ_i} f_i(x|θ_i) π_i(θ_i) dθ_i / ∑_j p_j ∫_{Θ_j} f_j(x|θ_j) π_j(θ_j) dθ_j

I take the largest π(M_i|x) to determine the "best" model
Bayes factor

Definition (Bayes factors)

For testing hypotheses H_0 : θ ∈ Θ_0 vs. H_a : θ ∉ Θ_0, under the prior

π(Θ_0)π_0(θ) + π(Θ_0^c)π_1(θ) ,

the central quantity is

B_01 = [π(Θ_0|x)/π(Θ_0^c|x)] / [π(Θ_0)/π(Θ_0^c)] = ∫_{Θ_0} f(x|θ)π_0(θ) dθ / ∫_{Θ_0^c} f(x|θ)π_1(θ) dθ

[Jeffreys, 1939]
Evidence

A similar quantity, the evidence

E_k = ∫_{Θ_k} π_k(θ_k) L_k(θ_k) dθ_k ,

a.k.a. the marginal likelihood. [Jeffreys, 1939]
Bayes factor approximation

When approximating the Bayes factor

B_01 = ∫_{Θ_0} f_0(x|θ_0)π_0(θ_0) dθ_0 / ∫_{Θ_1} f_1(x|θ_1)π_1(θ_1) dθ_1 ,

use importance functions ϖ_0 and ϖ_1 and

B̂_01 = [n_0^{−1} ∑_{i=1}^{n_0} f_0(x|θ_0^i)π_0(θ_0^i)/ϖ_0(θ_0^i)] / [n_1^{−1} ∑_{i=1}^{n_1} f_1(x|θ_1^i)π_1(θ_1^i)/ϖ_1(θ_1^i)]
Diabetes in Pima Indian women

Example (R benchmark)

"A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix (AZ), was tested for diabetes according to WHO criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases."

200 Pima Indian women with observed variables:

I plasma glucose concentration in an oral glucose tolerance test

I diastolic blood pressure

I diabetes pedigree function

I presence/absence of diabetes
Probit modelling on Pima Indian women

Probability of diabetes as a function of the above variables:

P(y = 1|x) = Φ(x_1β_1 + x_2β_2 + x_3β_3) .

Test of H_0 : β_3 = 0 for the 200 observations of Pima.tr, based on a g-prior modelling:

β ∼ N_3(0, n(X^T X)^{−1})
Importance sampling for the Pima Indian dataset

Use of an importance function inspired from the MLE estimate distribution,

β ∼ N(β̂, Σ̂)

R importance sampling code:

model1=summary(glm(y~-1+X1,family=binomial(link="probit")))
is1=rmvnorm(Niter,mean=model1$coeff[,1],sigma=2*model1$cov.unscaled)
is2=rmvnorm(Niter,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)
bfis=mean(exp(probitlpost(is1,y,X1)-dmvlnorm(is1,mean=model1$coeff[,1],
    sigma=2*model1$cov.unscaled))) /
  mean(exp(probitlpost(is2,y,X2)-
    dmvlnorm(is2,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)))
Diabetes in Pima Indian women

Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations from the prior and the above MLE importance sampler.

[Boxplots: basic Monte Carlo vs. importance sampling; the approximations range over roughly 2 to 5.]
Bridge sampling

Special case: If

π_1(θ|x) ∝ π̃_1(θ|x) and π_2(θ|x) ∝ π̃_2(θ|x)

live on the same space (Θ_1 = Θ_2), then

B̂_12 = (1/n) ∑_{i=1}^n π̃_1(θ_i|x)/π̃_2(θ_i|x) , with θ_i ∼ π_2(θ|x)

[Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]

This requires simulating from π_2(θ|x), which will be done (approximately) using MCMC.
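A Python sketch of this special-case estimator on two assumed unnormalised Gaussian densities with equal normalising constants, so that the true value is B_12 = Z_1/Z_2 = 1:

```python
import math
import random

random.seed(10)

# two unnormalised densities with equal constants Z1 = Z2 = sqrt(2 pi),
# so the true ratio of normalising constants is B12 = Z1/Z2 = 1
def pi1_t(x):
    return math.exp(-0.5 * x * x)            # unnormalised N(0, 1)

def pi2_t(x):
    return math.exp(-0.5 * (x - 0.5) ** 2)   # unnormalised N(0.5, 1)

m = 200_000
draws = (random.gauss(0.5, 1.0) for _ in range(m))   # theta_i ~ pi2
b12 = sum(pi1_t(x) / pi2_t(x) for x in draws) / m
print(b12)  # close to 1
```

Here exact samples from π_2 are available; in the Bayes-factor setting they would come from an MCMC run instead.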
Bridge sampling variance

The bridge sampling estimator does poorly if

var(B̂_12)/B_12² = (1/n) E_{π_2}[(π_1(θ)/π_2(θ))² − 1]
= (1/n) E_{π_2}[((π_1(θ) − π_2(θ))/π_2(θ))²] = (1/n) ∫ (π_1(θ) − π_2(θ))²/π_2(θ) dθ

is large, i.e. if the posteriors π_1 and π_2 have little overlap...
(Further) bridge sampling

General identity:

B_12 = ∫ π̃_1(θ|x) α(θ) π_2(θ|x) dθ / ∫ π̃_2(θ|x) α(θ) π_1(θ|x) dθ  ∀ α(·)

≈ [(1/n_2) ∑_{i=1}^{n_2} π̃_1(θ_{2i}|x) α(θ_{2i})] / [(1/n_1) ∑_{i=1}^{n_1} π̃_2(θ_{1i}|x) α(θ_{1i})] ,  θ_{ji} ∼ π_j(θ|x)
Optimal bridge sampling

The optimal choice of auxiliary function is

α⋆(θ) = (n_1 + n_2) / (n_1 π_1(θ|x) + n_2 π_2(θ|x)) ,

leading to

B̂_12 ≈ [(1/n_2) ∑_{i=1}^{n_2} π_1(θ_{2i}|x) / (n_1 π_1(θ_{2i}|x) + n_2 π_2(θ_{2i}|x))] / [(1/n_1) ∑_{i=1}^{n_1} π_2(θ_{1i}|x) / (n_1 π_1(θ_{1i}|x) + n_2 π_2(θ_{1i}|x))]
Optimal bridge sampling (2)

Reason:

Var(B̂_12)/B_12² ≈ (1/(n_1 n_2)) { ∫ π_1(θ)π_2(θ)[n_1π_1(θ) + n_2π_2(θ)] α(θ)² dθ / (∫ π_1(θ)π_2(θ)α(θ) dθ)² − 1 }

(by the δ method)

Drawback: the dependence on the unknown normalising constants is solved iteratively
Extension to varying dimensions

When dim(Θ_1) ≠ dim(Θ_2), e.g. θ_2 = (θ_1, ψ), introduce a pseudo-posterior density ω(ψ|θ_1, x), augmenting π_1(θ_1|x) into the joint distribution

π_1(θ_1|x) × ω(ψ|θ_1, x)

on Θ_2, so that

B_12 = ∫ π̃_1(θ_1|x) ω(ψ|θ_1, x) α(θ_1, ψ) π_2(θ_1, ψ|x) dθ_1 dψ / ∫ π̃_2(θ_1, ψ|x) α(θ_1, ψ) π_1(θ_1|x) ω(ψ|θ_1, x) dθ_1 dψ

= E_{π_2}[π̃_1(θ_1) ω(ψ|θ_1) / π̃_2(θ_1, ψ)] = E_ϕ[π̃_1(θ_1) ω(ψ|θ_1)/ϕ(θ_1, ψ)] / E_ϕ[π̃_2(θ_1, ψ)/ϕ(θ_1, ψ)]

for any conditional density ω(ψ|θ_1) and any joint density ϕ.
Illustration for the Pima Indian dataset

Use of the MLE induced conditional of β_3 given (β_1, β_2) as a pseudo-posterior, and a mixture of both MLE approximations on β_3 in the bridge sampling estimate.

R bridge sampling code:

cova=model2$cov.unscaled
expecta=model2$coeff[,1]
covw=cova[3,3]-t(cova[1:2,3])%*%ginv(cova[1:2,1:2])%*%cova[1:2,3]
probit1=hmprobit(Niter,y,X1)
probit2=hmprobit(Niter,y,X2)
pseudo=rnorm(Niter,meanw(probit1),sqrt(covw))
probit1p=cbind(probit1,pseudo)
bfbs=mean(exp(probitlpost(probit2[,1:2],y,X1)+
    dnorm(probit2[,3],meanw(probit2[,1:2]),sqrt(covw),log=T))/
  (dmvnorm(probit2,expecta,cova)+dnorm(probit2[,3],expecta[3],cova[3,3]))) /
  mean(exp(probitlpost(probit1p,y,X2))/
  (dmvnorm(probit1p,expecta,cova)+dnorm(pseudo,expecta[3],cova[3,3])))
Diabetes in Pima Indian women (cont'd)

Comparison of the variation of the Bayes factor approximations based on 100 × 20,000 simulations from the prior (MC), the above bridge sampler and the above importance sampler.
The original harmonic mean estimator

When θ_k^(t) ∼ π_k(θ|x),

(1/T) ∑_{t=1}^T 1/L_k(θ_k^(t))

is an unbiased estimator of 1/Z_k [Newton & Raftery, 1994]

Highly dangerous: most often leads to an infinite variance! (It results from an application of IS targeting the prior and using the posterior as instrumental density.)
Approximating Z_k from a posterior sample

Use of the [harmonic mean] identity

E_{π_k}[ϕ(θ_k)/(π_k(θ_k)L_k(θ_k)) | x] = ∫ [ϕ(θ_k)/(π_k(θ_k)L_k(θ_k))] [π_k(θ_k)L_k(θ_k)/Z_k] dθ_k = 1/Z_k ,

no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006]
Comparison with regular importance sampling

Harmonic mean: the constraint is opposed to the usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than π_k(θ_k)L_k(θ_k) for the approximation

Ẑ_1k = 1 / [(1/T) ∑_{t=1}^T ϕ(θ_k^(t)) / (π_k(θ_k^(t)) L_k(θ_k^(t)))]

to have a finite variance. E.g., use finite-support kernels (like Epanechnikov's kernel) for ϕ
Comparison with regular importance sampling (cont'd)

Compare Ẑ_1k with a standard importance sampling approximation

Ẑ_2k = (1/T) ∑_{t=1}^T π_k(θ_k^(t)) L_k(θ_k^(t)) / ϕ(θ_k^(t)) ,

where the θ_k^(t)'s are generated from the density ϕ(·) (with fatter tails, like t's)
HPD indicator as ϕ

Use the convex hull of the MCMC simulations corresponding to the 10% HPD region (easily derived!) and ϕ as the indicator

ϕ(θ) = (10/T) ∑_{t∈HPD} I_{d(θ, θ^(t)) ≤ ε}
Diabetes in Pima Indian women (cont'd)

Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations, for the above harmonic mean sampler and importance samplers.

[Boxplots: harmonic mean vs. importance sampling; both concentrate between roughly 3.102 and 3.116.]
Notions on Markov Chains for MCMC

Notions on Markov Chains for MCMC

Basics
Invariant Measures
Ergodicity: (1) The Discrete Case
Ergodicity: (2) The Continuous Case
Limit Theorems
Basics

A Markov chain is a (discrete-time) stochastic process such that the future evolution of the process is independent of its past given only its current value.

The distribution of the chain is defined through its transition kernel, a function K defined on X × B(X) such that

I ∀x ∈ X, K(x, ·) is a probability measure;

I ∀A ∈ B(X), K(·, A) is measurable.
• When X is a discrete (finite or countable) set, the transition kernel may be represented as a transition matrix K = (K_xy) whose elements satisfy

K_xy = P(X_n = y|X_{n−1} = x) ,  x, y ∈ X .

Since, for all x ∈ X, K(x, ·) is a probability, we must have

K_xy ≥ 0 and K(x, X) = ∑_{y∈X} K_xy = 1 .

The matrix K is then said to be a stochastic matrix.
• In the continuous case, if all probabilities K(x, ·) are absolutely continuous wrt. Lebesgue measure, the kernel can be represented by a transition density function k(x, x′):

P(X_n ∈ A|X_{n−1}) = ∫_A k(X_{n−1}, x′) dx′ .

The transition density function satisfies k(x, x′) ≥ 0 and ∫_X k(x, x′) dx′ = 1.
Kh

For functions h such that ∫_X K(x, dy)h(y) < ∞ for all x, define the function Kh by

Kh(x) = K(x, h) = ∫_X K(x, dy) h(y) .

Then Kh(X_n) = E(h(X_{n+1})|X_n).

µK

For probability measures µ, define the measure µK by

µK(A) = ∫_X µ(dx) K(x, A) .

µK corresponds to the law of X_{n+1}, had the chain been started at time n with distribution X_n ∼ µ.
Composition of Kernels

More generally, if K_1 and K_2 are two transition kernels,

K_1K_2(x, A) = ∫_{x′∈X} K_1(x, dx′) K_2(x′, A)

defines the law of X_{n+2} when the chain is started in X_n = x and two successive transitions according to, respectively, K_1 and K_2 are performed.
On a finite state-space X = {s_1, . . . , s_k},

I A function h is uniquely defined by the (column) vector h = (h(s_1), . . . , h(s_k))^T and

(Kh)_s = ∑_{s′∈X} K_{ss′} h_{s′}

can be interpreted as the sth component of the product of the transition matrix K and of the vector h.

I A probability distribution on X is defined as a (row) vector µ = (µ(s_1), . . . , µ(s_k)) and the probability distribution µK is defined, for each s ∈ X, as

(µK)_s = ∑_{s′∈X} µ_{s′} K_{s′s} ,

that is, the sth component of the product of the (row) vector µ and of the transition matrix K.
Markov Chain

Definition (Markov Chain)

(X_n)_{n≥0} forms a Markov chain if there exist a transition kernel K and an initial probability µ such that, for any n,

P_µ(X_0 ∈ A_0, X_1 ∈ A_1, . . . , X_n ∈ A_n) = ∫_{x_0∈A_0} · · · ∫_{x_{n−1}∈A_{n−1}} µ(dx_0) K(x_0, dx_1) · · · K(x_{n−1}, A_n) .

In particular,

P_µ(X_n ∈ A) = ∫_X · · · ∫_X µ(dx_0) K(x_0, dx_1) · · · K(x_{n−1}, A) ,

which is usually denoted more concisely by P_µ(X_n ∈ A) = µK^n(A).
Example (Normal Random walk)

X_n = X_{n−1} + τε_n ,

where (ε_n)_{n≥1} is an iid sequence of normal variables independent of X_0. Then k(x, x′) = N(x′; x, τ²I) and

P_µ(X_n ∈ A) = ∫_{x_0∈X} ∫_{x_n∈A} µ(dx_0) N(x_n; x_0, nτ²I) dx_n ,

where N(x_n; x_0, nτ²I) = k^n(x_0, x_n) is the n-step transition density.
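The n-step kernel k^n(x_0, ·) = N(·; x_0, nτ²I) can be checked by simulation; a one-dimensional Python sketch:

```python
import random

random.seed(11)

tau, n, m = 1.0, 25, 20_000
finals = []
for _ in range(m):
    x = 0.0
    for _ in range(n):
        x += tau * random.gauss(0.0, 1.0)   # one kernel step
    finals.append(x)

mean = sum(finals) / m
var = sum((x - mean) ** 2 for x in finals) / (m - 1)
print(mean, var)  # mean near 0, variance near n tau^2 = 25
```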
[Figure: A trajectory of length 100 of the random walk in R² with τ = 1.]
The Strong Markov Property

Let τ denote a random stopping time such that the event {τ = n} is σ(X_0, . . . , X_n)-measurable for all n. Assuming that τ is P_x-a.s. finite, we have

P_x(X_{τ+1} ∈ A_1, . . . , X_{τ+n} ∈ A_n|X_0, . . . , X_τ) = ∫_{x_1∈A_1} · · · ∫_{x_n∈A_n} K(X_τ, dx_1) · · · K(x_{n−1}, dx_n) .

Typical stopping times of interest for Markov chains include return times

τ_C = inf{n ≥ 1 : X_n ∈ C} .
Invariant Measures

Definition (Invariant Measure)

A measure π is invariant for the transition kernel K if πK = π, that is,

π(A) = ∫_X π(dx) K(x, A) ,  ∀A ∈ B(X) .

I If π is an invariant measure, so is cπ for any scale factor c > 0

I There can be several invariant measures (that differ by more than just a scale factor)

The latter can be avoided by assuming that the chain is irreducible.
Stability Properties and Invariance

Assuming that π is unique (up to the normalization issue), there are two very different cases:

Null chains: π is not a finite measure

Positive chains: π is a probability measure; in this case π is also called the stationary distribution, since X_0 ∼ π implies that X_n ∼ π for every n. More generally, the process is then stationary in the sense that

P_π(X_k ∈ A_0, X_{k+1} ∈ A_1, . . . , X_{k+n} ∈ A_n) = ∫_{x_0∈A_0} · · · ∫_{x_n∈A_n} π(dx_0) K(x_0, dx_1) · · · K(x_{n−1}, dx_n) ,

for any lag k ≥ 0¶.

¶This restriction can be removed, but the corresponding construction is omitted here.
Stationary Distribution in MCMC
In MCMC, we construct kernels K that admit the target πas invariant measure
Hence, if we can show that π is the unique invariant measure, wewill always deal with positive chains.
Reversible Chains

π-Reversibility

K is said to be π-reversible if π(dx)K(x, dx′) is a symmetric measure on X², that is,

∫_{x∈A} ∫_{x′∈A′} π(dx) K(x, dx′) = ∫_{x′∈A′} ∫_{x∈A} π(dx′) K(x′, dx) .

π-reversibility implies that K is π-invariant, as

∫_{x∈X} ∫_{x′∈A} π(dx) K(x, dx′) = ∫_{x′∈A} ∫_{x∈X} π(dx′) K(x′, dx) = ∫_{x′∈A} π(dx′) K(x′, X) = π(A) ,

using K(x′, X) = 1.
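Reversibility is easy to verify numerically; the following Python sketch builds a π-reversible kernel on a three-state space via Metropolis-type acceptance ratios (a construction detailed later in the course) and checks that πK = π:

```python
# Discrete check that pi-reversibility (detailed balance) implies invariance:
# pi_x K_xy = pi_y K_yx for all x, y entails pi K = pi.
pi = [0.2, 0.3, 0.5]

# Build a reversible kernel on {0, 1, 2}: propose one of the other two states
# uniformly, accept with probability min(1, pi_y/pi_x)
K = [[0.0] * 3 for _ in range(3)]
for x in range(3):
    for y in range(3):
        if x != y:
            K[x][y] = 0.5 * min(1.0, pi[y] / pi[x])
    K[x][x] = 1.0 - sum(K[x])

# detailed balance holds ...
for x in range(3):
    for y in range(3):
        assert abs(pi[x] * K[x][y] - pi[y] * K[y][x]) < 1e-12

# ... hence pi is invariant: (pi K)_y = sum_x pi_x K_xy = pi_y
piK = [sum(pi[x] * K[x][y] for x in range(3)) for y in range(3)]
print(piK)
```

Note that only the ratios π_y/π_x enter the construction, which is why the normalizing constant of π is never needed.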
π-Reversibility and Time Reversal

If K is π-reversible, the stationary chain may be run backwards in time to obtain a backward chain that has the same distribution as the (usual) forward chain:

P_π(X_{n−k} ∈ A_k, . . . , X_{n−1} ∈ A_1, X_n ∈ A_0) = ∫_{x_n∈A_0} · · · ∫_{x_{n−k}∈A_k} π(dx_n) K(x_n, dx_{n−1}) · · · K(x_{n−k+1}, dx_{n−k}) .
π-Reversible Chains in MCMC
MCMC algorithms are all built from elementary transitionkernels that are π-reversible
It is indeed very easy to check that a kernel is π-reversible (and itdoes not require that the normalizing constant of π be known)
Ergodicity

For MCMC, a minimum requirement is that the chain (X_n) be ergodic in the sense that

P_x(X_n ∈ A) → π(A) for all x ,

ensuring that, for large enough simulation runs, we are approximately simulating from π, for any choice of the starting point X_0.
Total Variation Norm

Distances between probability measures will be assessed through the following criterion‖.

Total Variation Norm: For probability measures µ_1 and µ_2,

‖µ_1 − µ_2‖TV = sup_A |µ_1(A) − µ_2(A)|

Other equivalent definitions:

(i) For a signed measure ξ on X, ‖ξ‖TV = (1/2){ξ⁺(X) + ξ⁻(X)}

(ii) ‖ξ‖TV = (1/2) sup{|ξ(h)| : ‖h‖∞ = 1}

‖An alternative convention is ‖µ_1 − µ_2‖TV = 2 sup_A |µ_1(A) − µ_2(A)|.
A First Step Towards Ergodicity

Proposition (6.52)

If K is π-invariant (with π a probability),

‖µK^n − π‖TV

is non-increasing in n, for any choice of the initial distribution µ.

A weak statement, which can be made stronger under restrictive assumptions: if there exists a probability ν such that K(x, A) ≥ εν(A) for all x and A, with ε > 0, the chain is uniformly ergodic in the sense that

‖µK^n − π‖TV ≤ (1 − ε)^n ‖µ − π‖TV .
Simple counter-examples show that additional conditions are required to ensure

Recurrence: infinitely many visits to all sets A such that π(A) > 0, from any starting point. Counter-example (reducible chain):

K = ( 1    0    0
      0  1−α   α
      0   α   1−α )

Aperiodicity: the chain does not cycle through disconnected regions of the state-space. Counter-example (periodic chain):

K = ( 0  1
      1  0 )
Ergodicity: (1) The Discrete Case

Irreducibility

Definition (Irreducibility)

In the discrete case, the chain is irreducible if all states communicate, in the sense that

P_x(τ_y < ∞) > 0 ,  ∀x, y ∈ X ,

τ_y being the first (strictly positive) time when y is visited (also called the return time).

This is equivalent to requiring that, for all x, y ∈ X, there exists n such that K^n(x, y) > 0.
Recurrence and Transience

Define by η_y the total number of visits of the chain to state y; then E_x[η_y] = ∑_{n≥0} K^n(x, y).

The Recurrence/Transience Dichotomy

An irreducible Markov chain is either

I transient, and

P_x(τ_y < ∞) < 1 ,  E_x[η_y] < ∞ for all x, y ∈ X ,

I or recurrent, and

P_x(τ_y < ∞) = 1 ,  E_x[η_y] = ∞ for all x, y ∈ X .
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (1) The Discrete Case
Recurrent Chain
Kac’s Theorem
A chain on a discrete state-space X that is both π-invariant (withπ a probability, such that π(x) > 0 for all x) and irreducible
I is recurrent
I and
Ex[τx] = 1/π(x) .
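Kac's theorem is easy to check numerically: simulate return times to a state and compare their empirical mean with 1/π(x). A minimal Python sketch, not from the course — the two-state kernel K below is an arbitrary illustrative choice whose stationary distribution is π = (4/7, 3/7):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-state kernel; its stationary distribution solves pi K = pi,
# which gives pi = (4/7, 3/7) here.
K = np.array([[0.7, 0.3],
              [0.4, 0.6]])

def mean_return_time(K, x, n_trials, rng):
    """Empirical estimate of E_x[tau_x], the mean return time to state x."""
    total = 0
    for _ in range(n_trials):
        state, t = x, 0
        while True:
            state = rng.choice(K.shape[0], p=K[state])
            t += 1
            if state == x:
                break
        total += t
    return total / n_trials

est = mean_return_time(K, 0, 20000, rng)
# Kac's theorem predicts E_0[tau_0] = 1/pi(0) = 7/4 = 1.75
```

The simulated mean return time should be close to 7/4, matching the theorem.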
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (1) The Discrete Case
Aperiodicity
The chain is said to be aperiodic if, for all states x, the greatestcommon divisor of the set {n ≥ 1 : Kn(x, x) > 0} is equal to 1.
Strong Aperiodicity (Sufficient condition for aperiodicity)
If the chain is irreducible, the existence of a state x such that
K(x, x) > 0
implies that the chain is strongly aperiodic.
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (1) The Discrete Case
Ergodic Theorem
An irreducible and aperiodic chain on a discrete state-space X with invariant probability π satisfies
‖Px(Xn ∈ ·)− π‖TV → 0 for all x ∈X
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (2) The Continuous Case
The General Case
For general chains the situation is more complex
I First, individual points of the state-space will generally not be reachable, and we must adapt our definitions of irreducibility, recurrence and aperiodicity to deal with measurable sets rather than points of the state-space.
I Second, we need a strengthened notion of recurrence.
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (2) The Continuous Case
Phi-Irreducibility
The chain is said to be ϕ-irreducible for some measure ϕ if
I for all x ∈X ,
I for all A ∈ B(X ) such that ϕ(A) > 0,
there exists n ≥ 1 such that
Kn(x,A) > 0 . (3)
If the chain is π-invariant, with π a probability measure (i.e., as in MCMC), one should choose ϕ = π, as π is a maximal irreducibility measure (it dominates all measures ϕ such that (3) holds).
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (2) The Continuous Case
Harris Recurrence
Unfortunately (and in contrast to the discrete case), ϕ-irreducibility in itself is not sufficient to ensure that Px(τA < ∞) = 1; hence we will assume a stronger form of recurrence.
Definition (Harris Recurrence)
A phi-irreducible chain is said to be Harris recurrent if for anyx ∈X and all sets A such that ϕ(A) > 0,
Px(τA <∞) = 1 .
Px(τA < ∞) = 1 for all x implies that Ex[ηA] = ∞, but the converse is not true.
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (2) The Continuous Case
Small Sets
To ensure (strong) aperiodicity, a sufficient condition is the existence of a ν-small set C such that ν(C) > 0.
Definition (Small Set)
If there exist C ∈ B(X ), with ϕ(C) > 0, a measure ν and ε > 0such that, for all x ∈ C and all A ∈ B(X ),
K(x,A) ≥ εν(A)
C is called a ν-small set.
Small sets play a role somewhat similar to points (or atoms) in thediscrete case.
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (2) The Continuous Case
General Ergodicity
Theorem (6.51)
If the chain (Xn) is positive Harris recurrent and aperiodic, then
    lim_{n→∞} ‖Kn(x, ·) − π‖TV = 0   for all x ∈ X .
Convergence in total variation implies

    lim_{n→∞} Ex[h(Xn)] = Eπ[h]

for every bounded function h (and, hence, convergence in distribution).
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Limit Theorems
Limit Theorems for MCMC

I The ergodic theorem (and particularly quantitative refinements of it not discussed here) is important to guarantee that the algorithm is forgetting its initial (arbitrary) state.
I However, the usual practice in MCMC is not only to exploit a single sampled point Xn for n sufficiently large, but to rely on empirical trajectory-wise averages

    (1/n) Σ_{k=1}^n h(Xk)

to estimate the expectation of functions of interest h under π.
I Hence, we also need LLNs and CLTs.

Classical LLNs and CLTs are not directly applicable due to:

◦ Markovian dependence structure
◦ Non-stationarity of the chain
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Limit Theorems
LLN for Markov Chains
Theorem (6.63)
If the Markov chain (Xn) is Harris recurrent with stationarydistribution π, then for any function h such that Eπ |h| <∞,
    lim_{n→∞} (1/n) Σ_{k=1}^n h(Xk) = Eπ[h] .
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Limit Theorems
CLT for Markov Chains
Theorem (6.65)
For a Harris recurrent π-reversible Markov chain (Xn), let h̄ = h − Eπ[h] and assume that

    0 < γ²_h = Eπ[h̄²(X0)] + 2 Σ_{k=1}^∞ Eπ[h̄(X0) h̄(Xk)] < ∞ .

Then

    (1/√n) Σ_{k=1}^n (h(Xk) − Eπ[h])  →L  N(0, γ²_h) .

[Kipnis & Varadhan, 1986]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
Motivation and leading example
Random variable generation
Monte Carlo Integration
Notions on Markov Chains for MCMC
The Metropolis-Hastings AlgorithmMonte Carlo Methods based on Markov ChainsThe Metropolis-Hastings AlgorithmThe Independent Metropolis-Hastings AlgorithmThe Random Walk Metropolis-Hastings AlgorithmExtensions
The Gibbs Sampler
Case Study
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains
It is not necessary to use an i.i.d. sample from the distribution f to approximate the integral

    I = ∫ h(x) f(x) dx .

One can approximate I from a simulation of an ergodic Markov chain with stationary distribution f.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (2)
Idea
For an arbitrary starting value x(0), an ergodic chain (X(t)) isgenerated using a transition kernel with stationary distribution f
I Ensures the convergence in distribution of (X(t)) to a random variable from f.
I For a "large enough" T0, X(T0) can be considered as distributed from f.
I Produces a dependent sample X(t), X(t+1), . . ., which can be used to approximate integrals under f.
Problem: How can one build a Markov chain with a givenstationary distribution?
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
Basics

The algorithm requires knowledge of the objective (target) density f and uses a conditional density q(y|x), called the instrumental (or proposal) distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
The MH algorithm
Algorithm (Metropolis-Hastings)
Given X(t),
1. Generate Yt ∼ q(y|X(t)).
2. Take

    X(t+1) = { Yt    with prob. ρ(X(t), Yt),
             { X(t)  with prob. 1 − ρ(X(t), Yt),

where

    ρ(x, y) = min{ [f(y)/f(x)] · [q(x|y)/q(y|x)], 1 } .
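The two steps above translate directly into code. A minimal Python (NumPy) sketch of the generic algorithm, instantiated here on a standard normal target with a symmetric Gaussian random-walk proposal — the helper names and numerical settings are mine, not from the course:

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_hastings(log_f, propose, log_q, x0, n_iter, rng):
    """Generic MH: log_f is the (unnormalized) log target, propose(x) draws
    from q(.|x), and log_q(a, b) returns log q(a|b)."""
    x = x0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = propose(x)
        # log rho(x, y) = min{0, log f(y) - log f(x) + log q(x|y) - log q(y|x)}
        log_rho = min(0.0, log_f(y) - log_f(x) + log_q(x, y) - log_q(y, x))
        if np.log(rng.uniform()) < log_rho:
            x = y
        chain[t] = x
    return chain

# Instance: N(0,1) target, Gaussian random-walk proposal (symmetric,
# so the q-ratio cancels and log_q can be taken constant).
log_f = lambda x: -0.5 * x**2
propose = lambda x: x + rng.normal(scale=1.0)
log_q = lambda a, b: 0.0
chain = metropolis_hastings(log_f, propose, log_q, 0.0, 20000, rng)
```

Working on the log scale avoids overflow when f is evaluated far in the tails; the q-ratio term only matters for asymmetric proposals.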
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
Features
I Does not require the knowledge of normalizing constants of either f or q(·|x)
I Never moves to values for which f(y) = 0
I The chain (X(t)) may take the same value several times in a row, even though f is a density w.r.t. Lebesgue measure
I The sequence (Yt) is usually not a Markov chain
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
Convergence properties
1. The MH Markov chain is reversible, with invariant/stationarydensity f since it satisfies the detailed balance condition
f(y)dyK(y, dx) = f(x)dxK(x, dy)
2. As f is a probability measure, the chain is positive recurrent
3. If

    P[ f(Yt) q(X(t)|Yt) / {f(X(t)) q(Yt|X(t))} ≥ 1 ] < 1 ,

that is, the event {X(t+1) = X(t)} is possible, then the chain is aperiodic
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
Convergence properties (2)
4. If q(y|x) > 0 for every (x, y), the chain is f-irreducible
5. For MH, f -irreducibility implies Harris recurrence
6. Thus, for MH chains satisfying 3 and 4:
(i) for h with Ef|h(X)| < ∞,

    lim_{T→∞} (1/T) Σ_{t=1}^T h(X(t)) = ∫ h(x) f(x) dx    Px(0)-a.s.

(ii) and

    lim_{n→∞} ‖Kn(x(0), ·) − f‖TV = 0

for every initial starting point x(0).
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
The Independent Case
The instrumental distribution q does not depend on X(t), and isdenoted g by analogy with Accept-Reject.
Algorithm (Independent Metropolis-Hastings)
Given X(t),
a. Generate Yt ∼ g(y)
b. Take

    X(t+1) = { Yt    with prob. min{ f(Yt) g(X(t)) / [f(X(t)) g(Yt)], 1 },
             { X(t)  otherwise.
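A minimal Python sketch of the independent sampler, not from the course: the target is N(0,1) and the proposal g is a Student-t with 3 degrees of freedom, chosen so that g has heavier tails than f and the dominance condition f(x) ≤ M g(x) holds:

```python
import numpy as np

rng = np.random.default_rng(2)

# Target f = N(0,1); proposal g = Student-t(3), both in log scale up to constants.
log_f = lambda x: -0.5 * x**2
log_g = lambda x: -2.0 * np.log1p(x**2 / 3.0)  # t_3 density ∝ (1 + x²/3)^{-2}

x, chain = 0.0, np.empty(20000)
for t in range(chain.size):
    y = rng.standard_t(3)                         # independent proposal
    # acceptance ratio f(y)g(x) / [f(x)g(y)], capped at 1, on the log scale
    log_rho = min(0.0, (log_f(y) + log_g(x)) - (log_f(x) + log_g(y)))
    if np.log(rng.uniform()) < log_rho:
        x = y
    chain[t] = x
```

Swapping the roles (Cauchy target, normal proposal, as in the example below) would break the dominance condition and ruin the tail behaviour.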
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
Properties

The resulting sample is not i.i.d., but there exist strong convergence properties:
Theorem (Ergodicity)
The algorithm produces a uniformly ergodic chain if there exists aconstant M such that
f(x) ≤Mg(x) ,
In this case,
    ‖Kn(x, ·) − f‖TV ≤ (1 − 1/M)^n .
[Mengersen & Tweedie, 1996]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
Example (Noisy AR(1))
Hidden Markov chain from a regular AR(1) model,

    x_{t+1} = ϕ x_t + ε_{t+1},   ε_t ∼ N(0, τ²),

and observables

    y_t | x_t ∼ N(x_t², σ²) .

The distribution of x_t given x_{t−1}, x_{t+1} and y_t is proportional to

    exp − (1/2τ²) { (x_t − ϕ x_{t−1})² + (x_{t+1} − ϕ x_t)² + (τ²/σ²)(y_t − x_t²)² } .
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
Example (Noisy AR(1) Contd.)
Use for proposal the N(µ_t, ω_t²) distribution, with

    µ_t = ϕ(x_{t−1} + x_{t+1}) / (1 + ϕ²)   and   ω_t² = τ² / (1 + ϕ²) .

The ratio

    π(x)/q_ind(x) = exp −(y_t − x_t²)²/2σ²

is bounded.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
(top) Last 500 samples of the chain (X(t)) run for 10,000 iterations; (bottom) histogram of the chain, compared with the target distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
Example (Cauchy by normal)
Given a Cauchy C(0, 1) target, consider a normal N(0, 1) proposal. The Metropolis–Hastings acceptance ratio is

    [π(ξ′)/ν(ξ′)] / [π(ξ)/ν(ξ)] = exp[{ξ² − (ξ′)²}/2] · (1 + (ξ′)²)/(1 + ξ²) .

Poor performance: the proposal distribution has lighter tails than the target Cauchy, and convergence to the stationary distribution is not even geometric!
[Mengersen & Tweedie, 1996]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm

Left: histogram of the Markov chain (ξ(t))_{1≤t≤5000} against the target C(0, 1) distribution. Right: range and average of 1000 parallel runs initialized with a normal N(0, 100²) distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
Performance of the Algorithm
Average Acceptance Rate
If f(x) ≤ M g(x), the average acceptance rate (under stationarity) satisfies

    ρ̄ = E_{f⊗g}[ min{ f(Y) g(X) / [f(X) g(Y)], 1 } ]
      = 2 P_{f⊗g}( f(Y)/g(Y) ≥ f(X)/g(X) )
      ≥ (2/M) P_{f⊗f}( f(Y)/g(Y) ≥ f(X)/g(X) ) = 1/M .
Hence, maximizing the acceptance rate is related to theoptimization of the speed of convergence in the ergodic theorem.
When different proposals g are available, the one that leads to thehighest acceptance rate should be used.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
Practical approach

Choose a parameterized instrumental distribution g(·|θ) and adjust the corresponding parameters θ based on the empirical estimate of the average acceptance rate computed on pilot runs∗∗

    ρ̄(θ) = (2/m) Σ_{t=1}^m I{ f(Yt) g(X(t)|θ) > f(X(t)) g(Yt|θ) } .

∗∗Trying to modify θ continuously as the algorithm runs would result in an adaptive MCMC approach, whose convergence is harder to establish (see later).
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Random walk Metropolis–Hastings
Use of a local perturbation as proposal
Yt = X(t) + εt,
where εt ∼ g is independent of X(t).
The instrumental density is now of the form g(y − x), and the Markov chain is a random walk if we take g to be symmetric: g(x) = g(−x).
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Algorithm (Random walk Metropolis-Hastings)
Given X(t)
1. Generate Yt ∼ g(y − X(t))
2. Take

    X(t+1) = { Yt    with prob. min{ 1, f(Yt)/f(X(t)) },
             { X(t)  otherwise.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Example (Random walk and normal target)
Generate N(0, 1) based on the uniform proposal on [−δ, δ] [Hastings, 1970].

The probability of acceptance is then

    ρ(x(t), yt) = exp{ (x(t)² − yt²)/2 } ∧ 1 .
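This historical example is short enough to run as-is. A minimal Python sketch of Hastings' sampler (the seed and δ = 1 are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

delta = 1.0
x, chain = 0.0, np.empty(20000)
for t in range(chain.size):
    y = x + rng.uniform(-delta, delta)      # uniform random-walk proposal
    # log acceptance probability: (x^2 - y^2)/2, implicitly capped at 0
    if np.log(rng.uniform()) < (x**2 - y**2) / 2:
        x = y
    chain[t] = x
```

Re-running with δ = 0.1 or δ = 0.5 reproduces the effect illustrated in the figure below: smaller δ means higher acceptance but much slower exploration.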
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Three samples based on U[−δ, δ] with (a) δ = 0.1, (b) δ = 0.5 and (c) δ = 1.0, superimposed with the convergence of the means (15,000 simulations).
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Example (Mixture models (again!))
    π(θ|x) ∝ ∏_{j=1}^n ( Σ_{ℓ=1}^k p_ℓ f(x_j | µ_ℓ, σ_ℓ) ) π(θ)

Metropolis-Hastings proposal:

    θ(t+1) = { θ(t) + ωε(t)   if u(t) < ρ(t)
             { θ(t)           otherwise

where

    ρ(t) = π(θ(t) + ωε(t) | x) / π(θ(t) | x) ∧ 1

and ω is scaled for a good acceptance rate.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Random walk sampling (50,000 iterations), case of a 3-component normal mixture: pairwise plots and marginal histograms of the parameters (p, τ, θ).
[Celeux & al., 2000]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Convergence properties
Uniform ergodicity is prohibited by the random walk structure. At best, geometric ergodicity:
Theorem (Sufficient ergodicity)
For a symmetric density f , log-concave in the tails, and a positiveand symmetric density g, the chain (X(t)) is geometrically ergodic.
[Mengersen & Tweedie, 1996]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Example (Cauchy by normal continued)
Again, Cauchy C(0, 1) target and Gaussian random walk proposal, ξ′ ∼ N(ξ, σ²), with acceptance probability

    (1 + ξ²)/(1 + (ξ′)²) ∧ 1 .

Overall fit of the Cauchy density by the histogram is satisfactory, but exploration of the tails is poor: the 99% quantile of C(0, 1) equals 3, yet no simulation out of 10,000 exceeds 14!
[Roberts & Tweedie, 2004]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Again, lack of geometric ergodicity! [Mengersen & Tweedie, 1996]
Slow convergence shown by the non-stable range after 10,000 iterations.
Histogram of the 10,000 first steps of a random walk Metropolis–Hastings algorithm using a N(ξ, 1) proposal.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Range of 500 parallel runs for the same setup.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Further convergence properties
Under the assumptions:

I (A1) f is super-exponential, i.e. it is positive with positive continuous first derivative such that lim_{|x|→∞} n(x)′∇ log f(x) = −∞, where n(x) := x/|x|.
In words: exponential decay of f in every direction, with rate tending to ∞.
I (A2) lim sup_{|x|→∞} n(x)′m(x) < 0, where m(x) = ∇f(x)/|∇f(x)|.
In words: non-degeneracy of the contour manifold C_{f(x)} = {y : f(y) = f(x)}.

the chain is geometrically ergodic, and V(x) ∝ f(x)^{−1/2} verifies the drift condition.
[Jarner & Hansen, 2000]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
How to Choose the Scale of the Proposal in RW-MH? Try yourself at http://www.lbreyer.com/classic.html
From (Roberts & Rosenthal, 2001)
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Performance Tradeoff
A high acceptance rate does not indicate that the algorithm is exploring f correctly: it typically means that the random walk is moving too slowly on the surface of f.
If x(t) and yt are close, i.e. f(x(t)) ≃ f(yt), then yt is accepted with probability

    min( f(yt)/f(x(t)), 1 ) ≃ 1 .
For multimodal densities with well separated modes, the negativeeffect of limited moves on the surface of f clearly shows.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Performance Tradeoff (2)
If the average acceptance rate is very low, the algorithm is oftenproposing values of Yt for which f(Yt) is small compared tof(X(t)): the random walk tries to move too quickly and theresulting MH sampler is often “stuck”.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Example (Noisy AR(1) continued)
For a Gaussian random walk with scale ω small enough, the random walk never jumps to the other mode. But if the scale ω is sufficiently large, the Markov chain explores both modes and gives a satisfactory approximation of the target distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Markov chain based on a random walk with scale ω = .1.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Markov chain based on a random walk with scale ω = .5.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
How Does the Method Scale in Large Dimension?
The Scaling Construction (Roberts et al., 1997-2001)
Assume a d-dimensional target of product form f(d)(x1, . . . , xd) = ∏_{k=1}^d f(xk) and an isotropic (d-dimensional) Gaussian proposal distribution with variance σ²(d). Now let the dimension d tend to infinity:

I the acceptance rate does not degenerate only if σ²(d) = O(1/d)

And for a coordinate projection:

I the optimal choice is σ²(d) = 2.38²/d × varπ(X), which corresponds to an acceptance rate of 23%
I the asymptotic variance of the ergodic average is α d × varπ(X) (for a Gaussian target, α ≈ 3.7)
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
RW-MH Scaling Example
First 1,000 samples of the RW-MH algorithm started from the mode for a multivariate Gaussian distribution with optimal proposal scaling (arbitrary 2D marginal for increasing values of d: 2, 5, 10, 20).
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
The Geometry of the Target is Also Very Important
2D Gaussian RW-MH (Gaussian target) with acceptance rate 50% (left: σ²prop = 0.04 I2; right: σ²prop = 1.4 Σπ). Number of iterations: 1, 2, 3, 4, 5, 10, 25, 50, 100, 500.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Rule of thumb
Aim at an average acceptance rate of 25% when using therandom-walk MH algorithm.
This rule is to be taken with a pinch of salt!
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Extensions
There are many other families of MH algorithms
◦ Adaptive Rejection Metropolis Sampling
◦ Reversible Jump (later!)
◦ Langevin algorithms
to name just a few...
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Langevin Algorithms
Proposal based on the Langevin diffusion Lt, defined by the stochastic differential equation

    dLt = dBt + (1/2)∇ log f(Lt) dt,

where Bt is the standard Brownian motion.
Theorem
The Langevin diffusion is the only non-explosive diffusion which isreversible with respect to f .
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Discretization
Instead, consider the sequence

    x(t+1) = x(t) + (σ²/2)∇ log f(x(t)) + σεt,   εt ∼ Np(0, Ip),

where σ² corresponds to the discretization step.

Unfortunately, the discretized chain may be transient, for instance when

    lim_{x→±∞} | σ²∇ log f(x) |x|^{−1} | > 1 .
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
MH correction
Accept the new value Yt with probability

    f(Yt)/f(X(t)) · exp{ −‖Yt − X(t) − (σ²/2)∇ log f(X(t))‖² / 2σ² } / exp{ −‖X(t) − Yt − (σ²/2)∇ log f(Yt)‖² / 2σ² } ∧ 1 .

Choice of the scale factor σ

The scaling approach suggests an acceptance rate of 58% to achieve optimal convergence rates.
[Roberts & Rosenthal, 1998]
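The discretized Langevin proposal plus the MH correction above is what is usually called MALA. A minimal Python sketch, not from the course, on a N(0,1) target (so ∇ log f(x) = −x; the step size σ = 1 is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)

log_f = lambda z: -0.5 * z**2      # log target (N(0,1), up to a constant)
grad = lambda z: -z                # gradient of log target

def mala_step(x, sigma, rng):
    """One Metropolis-adjusted Langevin step."""
    mean_fwd = x + 0.5 * sigma**2 * grad(x)      # drifted proposal mean at x
    y = mean_fwd + sigma * rng.normal()
    mean_bwd = y + 0.5 * sigma**2 * grad(y)      # reverse-move proposal mean
    # log of q(x|y)/q(y|x) for the two Gaussian proposal densities
    log_q_ratio = (-(x - mean_bwd)**2 + (y - mean_fwd)**2) / (2 * sigma**2)
    log_rho = min(0.0, log_f(y) - log_f(x) + log_q_ratio)
    return y if np.log(rng.uniform()) < log_rho else x

x, chain = 0.0, np.empty(20000)
for t in range(chain.size):
    x = mala_step(x, 1.0, rng)
    chain[t] = x
```

The `log_q_ratio` term is exactly the exponential ratio in the acceptance probability above, written on the log scale.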
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Gibbs Sampler
The Gibbs Sampler
The Hammersley-Clifford theorem
General Principles
Convergence
Data Augmentation
The Slice Sampler
Rao-Blackwellization
Marginalized Gibbs Sampling
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Hammersley-Clifford theorem
Hammersley-Clifford theorem
An illustration that conditionals determine the joint distribution
Theorem
If the joint density g(y1, y2) has conditional distributions g1(y1|y2)and g2(y2|y1), then
    g(y1, y2) = g2(y2|y1) / ∫ [g2(v|y1)/g1(y1|v)] dv .
[Hammersley & Clifford, circa 1970]
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Hammersley-Clifford theorem
Positivity condition
g(i)(yi) > 0 for every i = 1, · · · , p, implies that g(y1, . . . , yp) > 0, where g(i) denotes the marginal distribution of Yi.

This implies that the support of g is the Cartesian product of the supports of the marginals, and that the conditionals do not restrict the possible range of Yi.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Hammersley-Clifford theorem
General HC decomposition
Under the positivity condition, the joint distribution g satisfies
    g(y1, . . . , yp) ∝ ∏_{j=1}^p  g_{ℓj}(y_{ℓj} | y_{ℓ1}, . . . , y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, . . . , y′_{ℓp}) / g_{ℓj}(y′_{ℓj} | y_{ℓ1}, . . . , y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, . . . , y′_{ℓp})
for every permutation ` of {1, 2, . . . , p} and every y′ ∈ Y .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
General Principles
A very specific simulation algorithm based on the structure of the target distribution f:

1. Uses the conditional densities f1, . . . , fp of f
2. Starts with the random variable X = (X1, . . . , Xp)
3. Simulates from the conditional densities,

    Xi | x1, x2, . . . , x_{i−1}, x_{i+1}, . . . , xp ∼ fi(xi | x1, x2, . . . , x_{i−1}, x_{i+1}, . . . , xp)

for i = 1, 2, . . . , p.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Algorithm (Gibbs sampler)
Given X(t) = (x1(t), . . . , xp(t)), generate

1. X1(t+1) ∼ f1(x1 | x2(t), . . . , xp(t));
2. X2(t+1) ∼ f2(x2 | x1(t+1), x3(t), . . . , xp(t));
. . .
p. Xp(t+1) ∼ fp(xp | x1(t+1), . . . , x_{p−1}(t+1)).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Properties
The full conditional densities f1, . . . , fp are the only densities used for simulation. Thus, even in a high-dimensional problem, all of the simulations may be univariate.
The Gibbs sampler is not reversible with respect to f . However,each of its p coordinate-wise updates is.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Random Scan Gibbs sampler
Modification of the above Gibbs sampler where, with probability 1/p, the i-th component is drawn from fi(xi|X−i), i.e. the components are chosen at random.
Motivation
The Random Scan Gibbs sampler is f -reversible.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example (Bivariate Gibbs sampler)
For a target of the form

    (X, Y) ∼ f(x, y)

Set X(0) = x(0). For t = 1, 2, . . . , generate

    Y(t) ∼ fY|X(· | X(t−1))
    X(t) ∼ fX|Y(· | Y(t))

where fY|X and fX|Y are the conditional distributions.
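A minimal Python sketch of this bivariate scheme, not from the course: the target is a bivariate normal with unit variances and correlation ρ, for which both full conditionals are Gaussian (X | y ∼ N(ρy, 1 − ρ²) and symmetrically for Y); ρ = 0.7 and the seed are my illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)

rho = 0.7
sd = np.sqrt(1 - rho**2)   # conditional standard deviation

x = 0.0
xs = np.empty(20000)
for t in range(xs.size):
    y = rng.normal(rho * x, sd)   # Y(t) ~ f_{Y|X}(. | X(t-1))
    x = rng.normal(rho * y, sd)   # X(t) ~ f_{X|Y}(. | Y(t))
    xs[t] = x
```

Each marginal of the chain is N(0, 1) at stationarity, which the sample mean and variance of `xs` should reflect.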
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
A Very Simple Example: Independent N(µ, σ²) Observations
When Y1, . . . , Yn ∼ i.i.d. N(y|µ, σ²) with both µ and σ unknown, the posterior in (µ, σ²) does not belong to a standard family (even when choosing conjugate priors).

But...

    µ | Y1:n, σ² ∼ N( µ | (1/n) Σ_{i=1}^n Yi , σ²/n )
    σ² | Y1:n, µ ∼ IG( σ² | n/2 − 1 , (1/2) Σ_{i=1}^n (Yi − µ)² )

assuming constant (improper) priors on both µ and σ².

I Hence we may use the Gibbs sampler for simulating from the posterior of (µ, σ²).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
R Gibbs Sampler for Gaussian posterior
n <- length(Y); S <- sum(Y); mu <- S/n
for (i in 1:500) {
  S2     <- sum((Y - mu)^2)
  sigma2 <- 1/rgamma(1, shape = n/2 - 1, rate = S2/2)
  mu     <- S/n + sqrt(sigma2/n)*rnorm(1)
}
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with, left, n = 10 observations and, right, n = 100 observations from the N(0, 1) distribution.
Scatter plot in the (µ, σ²) plane; number of iterations: 1, 2, 3, 4, 5, 10, 25, 50, 100, 500.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Limitations of the Gibbs sampler
The Gibbs sampler
1. limits the choice of instrumental distributions
2. requires some knowledge of f
3. does not apply to problems where the number of parametersvaries.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Properties of the Gibbs sampler
Theorem (Convergence)
Under the positivity condition

I f(i)(yi) > 0 for every i = 1, · · · , p, implies that f(y1, . . . , yp) > 0, where f(i) denotes the marginal distribution of Yi under f,
the chain is irreducible and positive Harris recurrent.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Properties of the Gibbs sampler (2)
Consequences
(i) If ∫ h(y) f(y) dy < ∞, then

    lim_{T→∞} (1/T) Σ_{t=1}^T h(Y(t)) = ∫ h(y) f(y) dy    a.s.

(ii) If, in addition, (Y(t)) is aperiodic, then

    lim_{n→∞} ‖Kn(y(0), ·) − f‖TV = 0

for every initial starting point y(0).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Latent variables are back
The Gibbs sampler is particularly adapted for cases where thetarget probability is defined on an augmented space.
A density g is a completion of f if

    ∫_Z g(x, z) dz = f(x)
Note
The variable z may be meaningless for the problem
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Purpose
g should have full conditionals that are easy to simulate for aGibbs sampler to be implemented with g rather than f
For p > 1, write y = (x, z) and denote the conditional densities ofg(y) = g(y1, . . . , yp) by
Y1|y2, . . . , yp ∼ g1(y1|y2, . . . , yp),Y2|y1, y3, . . . , yp ∼ g2(y2|y1, y3, . . . , yp),
. . . ,
Yp|y1, . . . , yp−1 ∼ gp(yp|y1, . . . , yp−1).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
The move from Y (t) to Y (t+1) is defined as follows:
Algorithm (Completion Gibbs sampler)
Given Y(t) = (y1(t), . . . , yp(t)), simulate

1. Y1(t+1) ∼ g1(y1 | y2(t), . . . , yp(t)),
2. Y2(t+1) ∼ g2(y2 | y1(t+1), y3(t), . . . , yp(t)),
. . .
p. Yp(t+1) ∼ gp(yp | y1(t+1), . . . , y_{p−1}(t+1)).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Example (Mixtures all over again)
Hierarchical missing data structure: if

    X1, . . . , Xn ∼ Σ_{i=1}^k pi f(x|θi),

then

    X | Z ∼ f(x|θZ),   Z ∼ p1 I(z = 1) + . . . + pk I(z = k),

where Z is the component indicator associated with observation x.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Example (Mixtures (2))
Conditionally on (Z1, . . . , Zn) = (z1, . . . , zn) :
    π(p1, . . . , pk, θ1, . . . , θk | x1, . . . , xn, z1, . . . , zn)
      ∝ p1^{α1+n1−1} · · · pk^{αk+nk−1}
        × π(θ1 | y1 + n1 x̄1, λ1 + n1) · · · π(θk | yk + nk x̄k, λk + nk),

with

    ni = Σ_j I(zj = i)   and   x̄i = Σ_{j: zj=i} xj / ni .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Algorithm (Mixture Gibbs sampler)
1. Simulate

    θi ∼ π(θi | yi + ni x̄i, λi + ni)   (i = 1, . . . , k)
    (p1, . . . , pk) ∼ D(α1 + n1, . . . , αk + nk)

2. Simulate (j = 1, . . . , n)

    Zj | xj, p1, . . . , pk, θ1, . . . , θk ∼ Σ_{i=1}^k pij I(zj = i)

with (i = 1, . . . , k)

    pij ∝ pi f(xj | θi)

and update ni and x̄i (i = 1, . . . , k).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Slice Sampler
Recall the Fundamental Observation that Underpins the Accept-Reject Method
Lemma
Simulating

    X ∼ f(x)

is equivalent to simulating

    (X, U) ∼ U{(x, u) : 0 < u < f(x)} .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Slice Sampler
Slice sampler as generic Gibbs
Algorithm (Slice sampler)
Simulate
1. U(t+1) ∼ U[0, f(X(t))];
2. X(t+1) ∼ U_{A(t+1)}, with

    A(t+1) = {x : f(x) ≥ U(t+1)} .
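The two uniform draws are trivial to code whenever A(t+1) is an interval with closed-form endpoints. A minimal Python sketch, not from the course, for the target f(x) = exp(−x) on [0, ∞), where A = {x ≥ 0 : e^{−x} ≥ u} = [0, −log u]:

```python
import numpy as np

rng = np.random.default_rng(7)

x = 1.0
chain = np.empty(20000)
for t in range(chain.size):
    u = rng.uniform(0.0, np.exp(-x))    # vertical step: U ~ U[0, f(x)]
    x = rng.uniform(0.0, -np.log(u))    # horizontal step: X ~ U on A = [0, -log u]
    chain[t] = x
```

At stationarity the chain is Exp(1)-distributed (mean 1, variance 1); for less tractable f the horizontal step requires solving f(x) ≥ u numerically, as in the stochastic volatility example below.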
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Slice Sampler
Example of results with a truncated N (−3, 1) distribution
Number of iterations: 2, 3, 4, 5, 10, 50, 100.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Slice Sampler
Example (Stochastic volatility conditional distribution)
In the stochastic volatility model, the single-site conditional distribution is of the form

    π(x) ∝ exp −{ σ²(x − µ)² + β² exp(−x) y² + x }/2 ,

simplified into exp −{ x² + α exp(−x) }. Slice sampling means simulation from a uniform distribution on

    A = { x : exp −{ x² + α exp(−x) }/2 ≥ u } = { x : x² + α exp(−x) ≤ ω }

if we set ω = −2 log u.

Note: the inversion of x² + α exp(−x) = ω needs to be done numerically (e.g., using Newton's algorithm).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Slice Sampler
Autocorrelation by lag, and histogram of a Markov chain produced by a slice sampler with the target distribution in overlay.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Slice Sampler
Convergence of the Slice Sampler
Theorem (Uniform ergodicity)
If f is bounded and supp f is bounded, the simple slice sampler is uniformly ergodic.
[Mira & Tierney, 1997]
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Slice Sampler
Good slices
The slice sampler usually enjoys very good theoretical properties(geometric ergodicity and even uniform ergodicity) in onedimensional cases.
As the dimension increases, the practical determination of the setA(t+1) may get increasingly complex (and the performanceguarantees of the method are usually not as strong).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Rao-Blackwellization
Rao-Blackwellization
If (Y1, Y2, . . . , Yp)(t), t = 1, 2, . . . , T, is the output from a Gibbs sampler,

    δ0 = (1/T) Σ_{t=1}^T h(Y1(t)) → ∫ h(y1) f(1)(y1) dy1 .

Rao-Blackwellization replaces δ0 with its conditional expectation

    δrb = (1/T) Σ_{t=1}^T Ef[ h(Y1) | Y2(t), . . . , Yp(t) ] .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Rao-Blackwellization
Rao-Blackwellization (2)
Then

◦ both estimators converge to E[h(Y1)],
◦ the Rao-Blackwellized estimator is unbiased,
◦ and

    var( E[ h(Y1) | Y2(t), . . . , Yp(t) ] ) ≤ var(h(Y1)) .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Rao-Blackwellization
Examples of Rao-Blackwellization
Example
Bivariate normal Gibbs sampler:

    X | y ∼ N(ρy, 1 − ρ²)
    Y | x ∼ N(ρx, 1 − ρ²) .

Then

    δ0 = (1/T) Σ_{i=1}^T X(i)   and   δ1 = (1/T) Σ_{i=1}^T E[X(i)|Y(i)] = (1/T) Σ_{i=1}^T ρ Y(i)

both estimate E[X], and σ²_{δ0} / σ²_{δ1} = 1/ρ² > 1.
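The variance reduction is easy to observe by running the Gibbs chain many times and comparing the empirical variances of δ0 and δ1. A minimal Python sketch, not from the course (ρ = 0.5, chain length and number of replications are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)

rho = 0.5
sd = np.sqrt(1 - rho**2)

def run_chain(T, rng):
    """One Gibbs run; returns (delta0, delta1) = (mean of X, mean of rho*Y)."""
    x, d0, d1 = 0.0, 0.0, 0.0
    for _ in range(T):
        y = rng.normal(rho * x, sd)
        x = rng.normal(rho * y, sd)
        d0 += x
        d1 += rho * y            # E[X | Y = y] = rho * y
    return d0 / T, d1 / T

ests = np.array([run_chain(500, rng) for _ in range(200)])
var0, var1 = ests.var(axis=0)    # empirical variances of the two estimators
```

With ρ = 0.5 the theoretical variance ratio is 1/ρ² = 4, so `var1` should come out clearly below `var0`.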
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Rao-Blackwellization
Examples of Rao-Blackwellization (2)
Example (Grouped counting data)
360 consecutive records of the number of passages per unit time
    Number of passages        0     1     2    3    4 or more
    Number of observations    139   128   55   25   13
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Rao-Blackwellization
Example (Grouped counting data (2))
Particularity: observations with 4 passages or more are grouped.

If observations are Poisson P(λ), the likelihood is

    ℓ(x1, . . . , x5 | λ) ∝ e^{−347λ} λ^{128+55×2+25×3} ( 1 − e^{−λ} Σ_{i=0}^3 λ^i/i! )^{13} ,

which can be difficult to work with.

Idea: with a prior π(λ) = 1/λ, complete the vector (y1, . . . , y13) of the 13 units with 4 passages or more.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Rao-Blackwellization
Algorithm (Poisson-Gamma Gibbs)
a. Simulate Yi(t) ∼ P(λ(t−1)) I_{y≥4},  i = 1, . . . , 13
b. Simulate

    λ(t) ∼ Ga( 313 + Σ_{i=1}^{13} yi(t), 360 ) .

The Bayes estimator

    δπ = (1/360T) Σ_{t=1}^T ( 313 + Σ_{i=1}^{13} yi(t) )
converges quite rapidly
[Figure: running average of δ^π, stabilizing around 1.02 within a few hundred iterations, and histogram of the simulated λ values.]
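The Poisson-Gamma Gibbs sampler above can be sketched in a few lines of Python (illustrative code, not the authors'; the truncated-Poisson step uses a simple rejection sampler, and T is an arbitrary run length):

```python
import numpy as np

rng = np.random.default_rng(1)

def truncated_poisson(rng, lam, low=4):
    # rejection sampler for Y ~ P(lam) conditioned on Y >= low
    while True:
        y = rng.poisson(lam)
        if y >= low:
            return y

T = 1000
lam = 1.0
draws = np.empty(T)
for t in range(T):
    # step a: complete the 13 grouped observations ">= 4"
    ys = [truncated_poisson(rng, lam) for _ in range(13)]
    # step b: draw lambda ~ Ga(313 + sum(ys), 360)  (rate parameterization)
    lam = rng.gamma(313 + sum(ys), 1.0 / 360.0)
    # Rao-Blackwellized term E[lambda | completed data]
    draws[t] = (313 + sum(ys)) / 360.0
delta_pi = draws.mean()
```

The running average of `draws` reproduces the convergence of δ^π to a value near 1.02.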
Rao-Blackwellization in Poisson-Gamma Gibbs
Naïve estimate

δ_0 = (1/T) ∑_{t=1}^{T} λ^{(t)}

and Rao-Blackwellized version

δ^π = (1/T) ∑_{t=1}^{T} E[ λ^{(t)} | x_1, x_2, …, x_5, y_1^{(t)}, y_2^{(t)}, …, y_{13}^{(t)} ]
    = (1/(360 T)) ∑_{t=1}^{T} ( 313 + ∑_{i=1}^{13} y_i^{(t)} ) .
Rao-Blackwellization for “Non-Parametric” Estimation
Another substantial benefit of Rao-Blackwellization is the approximation of the densities of the different components of y without resorting to nonparametric density estimation methods.
Lemma

The estimator

(1/T) ∑_{t=1}^{T} f_i( y_i | Y_j^{(t)}, j ≠ i )

converges to f^{(i)}(y_i) .
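In the bivariate normal example, this lemma says that averaging the full conditional densities N(·; ρY^{(t)}, 1 − ρ²) recovers the N(0, 1) marginal of X. A small numerical check (illustrative code with arbitrary ρ, T, and evaluation points):

```python
import numpy as np

rng = np.random.default_rng(2)
rho, T = 0.8, 4000
sd = np.sqrt(1.0 - rho ** 2)

# run the bivariate normal Gibbs sampler, keeping the Y draws
x = y = 0.0
ys = np.empty(T)
for t in range(T):
    x = rng.normal(rho * y, sd)
    y = rng.normal(rho * x, sd)
    ys[t] = y

def normal_pdf(z, m, s):
    return np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# Rao-Blackwellized density estimate: average of the full conditionals
grid = np.array([0.0, 1.0])
rb_density = np.array([normal_pdf(g, rho * ys, sd).mean() for g in grid])
```

The estimates at 0 and 1 should be close to the standard normal density values 0.3989 and 0.2420, with no kernel bandwidth to tune.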
Marginalized Gibbs Sampling
Another option is to take advantage of Rao-Blackwellization during the simulations themselves, resulting in marginalized (or collapsed) Gibbs samplers.
In a Bayesian latent variable model, a typical scheme for the Gibbs sampler is to alternate

x | y, θ
θ | x, y

where y denotes the observations, x the latent variables, and θ the parameters.
For an exponential family complete-data model with conjugate prior, the pdf f(θ | x, y) is available in closed form, thus allowing for Rao-Blackwellization.
Exponential Family Distributions
ℓ(y | θ) = h(y) exp[ ⟨s(y), ψ(θ)⟩ − B(θ) ]

where s(y) is the vector of sufficient statistics.

If ψ is an invertible mapping, it is possible, through the reparameterization η = ψ(θ), to rewrite the above in canonical (or natural) form

ℓ(y | η) = h(y) exp[ ⟨s(y), η⟩ − A(η) ] ,

where A is called the log-partition function.
Conjugacy in Exponential Families
The conjugate distribution for

ℓ(y | θ) = h(y) exp[ ⟨s(y), ψ(θ)⟩ − B(θ) ]

is

π(θ | µ_0, λ_0) = Z^{−1}(µ_0, λ_0) exp[ ⟨µ_0, ψ(θ)⟩ − λ_0 B(θ) ] ,

where µ_0 has the same dimension as s(y) and λ_0 ∈ R₊.
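As a concrete instance (a worked example added here, not from the slides): for the Poisson likelihood, s(y) = y, ψ(θ) = log θ, and B(θ) = θ, so the conjugate prior exp[µ_0 log θ − λ_0 θ] is a Gamma(µ_0 + 1, λ_0) density, and n observations update it to Gamma(µ_0 + ∑y_i + 1, λ_0 + n). A quick numerical cross-check against brute-force normalization:

```python
import numpy as np

mu0, lam0 = 2.0, 1.0                 # hyperparameters (arbitrary choices)
y = np.array([3, 1, 4, 2])           # toy Poisson observations

# conjugate update: posterior is Gamma(shape = mu0 + sum(y) + 1, rate = lam0 + n)
a_post = mu0 + y.sum() + 1
b_post = lam0 + len(y)
post_mean = a_post / b_post

# cross-check: normalize prior x likelihood on a fine grid
theta = np.linspace(1e-3, 20.0, 20000)
unnorm = theta ** (mu0 + y.sum()) * np.exp(-(lam0 + len(y)) * theta)
grid_mean = (theta * unnorm).sum() / unnorm.sum()
```

The grid-based posterior mean agrees with the closed-form conjugate answer, which is what makes the Rao-Blackwellization of the next slides tractable.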
The Bayesian Posterior
Assume a complete-data exponential family model for (y_i, x_i) and a conjugate prior.

After observing n independent observations y = (y_1, …, y_n), the joint posterior is

f(x, θ | y) ∝ h(x, y) exp[ ⟨s(x, y), ψ(θ)⟩ − nB(θ) ] × Z^{−1}(µ_0, λ_0) exp[ ⟨µ_0, ψ(θ)⟩ − λ_0 B(θ) ] .
Hence, the full conditional is given by

f(θ | x, y) = Z^{−1}( s(x, y) + µ_0, n + λ_0 ) exp[ ⟨s(x, y) + µ_0, ψ(θ)⟩ − (n + λ_0) B(θ) ] .

Thus a Rao-Blackwellized estimator of the posterior mean of θ can be computed as

(1/m) ∑_{j=1}^{m} E( θ | X^{(j)}, Y ) .
But as the normalizing constant Z is known explicitly, it is also possible to integrate out θ to obtain a closed-form expression of the marginal posterior f(x | y):

π(x | y) ∝ h(x, y) Z( s(x, y) + µ_0, n + λ_0 ) / Z(µ_0, λ_0) .

The resulting marginal posterior is usually complex but amenable to single-site Gibbs sampling on the components of x = (x_1, …, x_n), especially when these are discrete variables.

This is the preferred sampling method for the Latent Dirichlet Allocation (LDA) model and related models.
Mixture of Gaussian Example

Assume that we observe Y_1, …, Y_n from the Gaussian mixture model ∑_{i=1}^{r} α_i N(y | θ_i, v_i). To make the derivations simpler, α and v are treated as fixed parameters and we use an improper flat prior on the vector of means θ. Then

π(x_k, θ | y_k) ∝ exp[ ∑_{i=1}^{r} ( log α_i − (y_k − θ_i)²/(2v_i) ) I{x_k = i} ]

and hence

π(x_{1:n}, θ | y_{1:n}) ∝ exp[ ∑_{i=1}^{r} ( n_i log α_i − θ_i² n_i/(2v_i) + θ_i s_i/v_i ) ]

where

n_i = ∑_{k=1}^{n} I{x_k = i}
s_i = ∑_{k=1}^{n} y_k I{x_k = i} .
Upon completing the square,

f(x_{1:n}, θ | y_{1:n}) ∝ exp[ ∑_{i=1}^{r} ( n_i log α_i + s_i²/(2 n_i v_i) ) ] exp[ ∑_{i=1}^{r} −(1/(2 v_i/n_i)) ( θ_i − s_i/n_i )² ]

and

f(x_{1:n} | y_{1:n}) ∝ ∏_{i=1}^{r} α_i^{n_i} √(v_i/n_i) exp[ s_i²/(2 n_i v_i) ] .

Finally,

P_f( X_k = i | x_{−k}, y_{1:n} ) ∝ α_i^{n_{i,−k}+1} √( v_i/(n_{i,−k}+1) ) exp[ (s_{i,−k} + y_k)² / (2 (n_{i,−k}+1) v_i) ]

where

n_{i,−k} = ∑_{j≠k} I{x_j = i} = n_i − I{x_k = i}
s_{i,−k} = ∑_{j≠k} y_j I{x_j = i} = s_i − y_k I{x_k = i} .
To run the collapsed (or marginalized) single-site Gibbs sampler in this example:

I repeatedly simulate from the conditionals f(x_k | x_{−k}, y_{1:n}),
I keeping track of the accumulated component statistics (n_i, s_i)_{1≤i≤r}.

The idea can be extended to mixture models with an unknown number of components.
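A collapsed single-site sweep of this kind can be sketched as follows. To avoid the degenerate case of an empty component under the improper flat prior, this illustration (not the slides' code) uses a proper N(0, τ²) prior on each mean, for which the collapsed conditional is α_i times the Gaussian posterior-predictive density of y_k:

```python
import numpy as np

rng = np.random.default_rng(4)
# toy data: two well-separated clusters
y = np.concatenate([rng.normal(-3.0, 1.0, 30), rng.normal(3.0, 1.0, 30)])
n, r = len(y), 2
alpha = np.array([0.5, 0.5])   # fixed mixture weights
v = np.array([1.0, 1.0])       # fixed component variances
tau2 = 100.0                   # prior variance of each mean (assumed, proper prior)

x = rng.integers(0, r, n)      # initial allocations
cnt = np.array([(x == i).sum() for i in range(r)], float)
s = np.array([y[x == i].sum() for i in range(r)])

for sweep in range(50):
    for k in range(n):
        # remove y_k from the accumulated statistics (n_{i,-k}, s_{i,-k})
        cnt[x[k]] -= 1; s[x[k]] -= y[k]
        # posterior of theta_i given the remaining members, then predictive of y_k
        prec = 1.0 / tau2 + cnt / v
        m = (s / v) / prec
        pred_var = v + 1.0 / prec
        logp = np.log(alpha) - 0.5 * np.log(pred_var) \
               - 0.5 * (y[k] - m) ** 2 / pred_var
        p = np.exp(logp - logp.max()); p /= p.sum()
        x[k] = rng.choice(r, p=p)
        cnt[x[k]] += 1; s[x[k]] += y[k]
```

Only the statistics (n_i, s_i) are updated during the sweep; the means θ are never simulated.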
Case Study
In this last course, we consider a very simple example and discuss the application of the methods presented so far.
Case Study
Independent Sampler
(Almost) Random-Walk Sampler
Gibbs Sampling
(Baby) Reversible Jump
Contaminated Counts
Assume we have observed a vector y_1, …, y_n of count data which we want to model as a mixture of Poisson distributions

L(y_i | α, λ) = (1 − α) (λ_0^{y_i}/y_i!) e^{−λ_0} + α (λ^{y_i}/y_i!) e^{−λ} ,

where λ_0 is known.
In the following, we assume w.l.o.g. that λ_0 = 1 and that the prior on (α, λ) is chosen as

π(α, λ) = e^{−λ} .
Independent Sampler
I (α*, λ*) ∼ U[0, 1] ⊗ Exp ;
I ℓ* = ∑_{i=1}^{n} log{ (1 − α*) + α* λ*^{y_i} e^{−(λ*−1)} } ;
I u ∼ U[0, 1] ;
I If u < exp(ℓ* − ℓ) ,
    α ← α* ; λ ← λ* ; ℓ ← ℓ* ;
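This independence sampler is a few lines of Python (an illustrative sketch, not the authors' code; T and the starting point are arbitrary, and the acceptance test is done on the log scale to avoid overflow):

```python
import numpy as np

rng = np.random.default_rng(5)
y = np.array([1, 1, 0, 3, 0, 5])

def loglik(a, lam):
    # log-likelihood up to a constant, with lambda_0 = 1
    return np.sum(np.log((1.0 - a) + a * lam ** y * np.exp(-(lam - 1.0))))

T = 20000
alpha, lam = 0.5, 1.0
ll = loglik(alpha, lam)
chain = np.empty((T, 2))
acc = 0
for t in range(T):
    # proposal = prior: alpha* ~ U[0,1], lambda* ~ Exp(1), so the
    # prior/proposal ratio cancels and only the likelihoods remain
    a_star, l_star = rng.uniform(), rng.exponential()
    ll_star = loglik(a_star, l_star)
    if np.log(rng.uniform()) < ll_star - ll:
        alpha, lam, ll = a_star, l_star, ll_star
        acc += 1
    chain[t] = alpha, lam
```

On the small dataset y = (1, 1, 0, 3, 0, 5) this reproduces an acceptance rate of roughly 46% and posterior means near λ ≈ 1.8 and α ≈ 0.45.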
A Small-Scale Example: y = (1, 1, 0, 3, 0, 5)
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.8, std. 1.1); histogram of α (mean 0.45, std. 0.27); autocorrelations; acceptance rate 46% (α), 46% (λ).]
A Larger Example: y consisting of 100 values (1/6 0's, 1/3 1's, 1/3 2's, 1/6 3's)
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.6, std. 0.18); histogram of α (mean 0.8, std. 0.16); autocorrelations; acceptance rate 4.8% (α), 4.8% (λ).]
(Almost) Random-Walk Sampler

Random-Walk Sampler
I v ∼ N(0, I_2) ;
I (α*, λ*) ← (α, λ) + s v ;
I If (α*, λ*) ∈ [0, 1] × R₊ ,
    I ℓ* = −λ* + ∑_{i=1}^{n} log{ (1 − α*) + α* λ*^{y_i} e^{−(λ*−1)} } ;
    I u ∼ U[0, 1] ;
    I If u < exp(ℓ* − ℓ) × [Φ((1−α)/s) − Φ(−α/s)] / [Φ((1−α*)/s) − Φ(−α*/s)] × [1 − Φ(−λ/s)] / [1 − Φ(−λ*/s)] ,
        α ← α* ; λ ← λ* ; ℓ ← ℓ* ;

Φ denotes the standard Gaussian cdf.
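A direct transcription of this sampler (an illustrative sketch, not the authors' code; the step size s = 1 follows the 6-observation figure, T is arbitrary):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)
y = np.array([1, 1, 0, 3, 0, 5])
s = 1.0   # step size, tuned as in the slides' small example

def logpost(a, lam):
    # log-posterior: prior e^{-lambda} plus the log-likelihood (lambda_0 = 1)
    return -lam + np.sum(np.log((1.0 - a) + a * lam ** y * np.exp(-(lam - 1.0))))

def Phi(z):
    # standard normal cdf
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

T = 20000
alpha, lam = 0.5, 1.0
lp = logpost(alpha, lam)
chain = np.empty((T, 2))
acc = 0
for t in range(T):
    a_star, l_star = alpha + s * rng.normal(), lam + s * rng.normal()
    if 0.0 < a_star < 1.0 and l_star > 0.0:
        lp_star = logpost(a_star, l_star)
        # Hastings correction with the truncation normalizers from the slides
        corr = ((Phi((1 - alpha) / s) - Phi(-alpha / s)) /
                (Phi((1 - a_star) / s) - Phi(-a_star / s)) *
                (1 - Phi(-lam / s)) / (1 - Phi(-l_star / s)))
        if np.log(rng.uniform()) < lp_star - lp + np.log(corr):
            alpha, lam, lp = a_star, l_star, lp_star
            acc += 1
    chain[t] = alpha, lam
```

With s = 1 on the 6-observation dataset, the acceptance rate is around 22% and the posterior summaries match those of the independence sampler.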
6 Observation Case
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.7, std. 1.1); histogram of α (mean 0.45, std. 0.27); autocorrelations; acceptance rate 22% (α), 22% (λ).]
Figure: With s tuned to 1
100 Observation Case
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.6, std. 0.16); histogram of α (mean 0.84, std. 0.14); autocorrelations; acceptance rate 21% (α), 21% (λ).]
Figure: With s tuned to 0.3
Gibbs Sampling
Gibbs Sampling With Data Augmentation
I For i = 1, …, n ,
    u_i ∼ U[0, 1] ;
    x_i ← I{ u_i < α λ^{y_i} e^{−(λ−1)} / [ (1 − α) + α λ^{y_i} e^{−(λ−1)} ] } ;
I c = ∑_{i=1}^{n} I(x_i = 1) ;
I s = ∑_{i=1}^{n} y_i I(x_i = 1) ;
I α ∼ Beta(c + 1, n − c + 1) ;
I λ ∼ Gamma(s + 1, c + 1) ;
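The data-augmentation Gibbs sampler vectorizes naturally (an illustrative sketch, not the authors' code; T and the initial values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
y = np.array([1, 1, 0, 3, 0, 5])
n = len(y)

T = 20000
alpha, lam = 0.5, 1.0
chain = np.empty((T, 2))
for t in range(T):
    # allocation step: x_i = 1 if y_i is attributed to the contaminating P(lambda)
    p1 = alpha * lam ** y * np.exp(-(lam - 1.0))
    x = rng.uniform(size=n) < p1 / ((1.0 - alpha) + p1)
    c = x.sum()
    s = y[x].sum()
    # conjugate updates of alpha and lambda given the completed data
    alpha = rng.beta(c + 1, n - c + 1)
    lam = rng.gamma(s + 1, 1.0 / (c + 1))   # rate c+1, numpy uses scale
    chain[t] = alpha, lam
```

All moves are accepted by construction, which is the "100% acceptance rate" reported in the figures; mixing must instead be judged from the autocorrelations.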
6 Observation Case
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.6, std. 0.49); histogram of α (mean 0.87, std. 0.11); autocorrelations; acceptance rate 100% (α), 100% (λ).]
Rao-Blackwellization could be used for estimating the marginal distributions.
100 Observation Case
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.5, std. 0.12); histogram of α (mean 0.99, std. 0.0097); autocorrelations; acceptance rate 100% (α), 100% (λ).]
Collapsed Gibbs Sampler
I For i = 1, …, n ,
    c_{−i} = c − I(x_i = 1) ;
    s_{−i} = s − y_i I(x_i = 1) ;
    u_i ∼ U[0, 1] ;
    x_i ← I{ u_i < A_1 / (A_1 + A_0) } ,

    where A_1 = (c_{−i} + 1) e [ (s_{−i} + y_i)! / s_{−i}! ] ( (c_{−i}+1)/(c_{−i}+2) )^{s_{−i}+1} ( 1/(c_{−i}+2) )^{y_i}
    and A_0 = n − c_{−i} ;

    c ← c_{−i} + I(x_i = 1) ;
    s ← s_{−i} + y_i I(x_i = 1) ;
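Here both α and λ have been integrated out, so only the allocations x are simulated. A sketch of this collapsed sampler (illustrative code, not the authors'; T and the all-ones initialization are arbitrary):

```python
import numpy as np
from math import factorial, e

rng = np.random.default_rng(8)
y = np.array([1, 1, 0, 3, 0, 5])
n = len(y)

x = np.ones(n, dtype=int)            # start with everything contaminated
c, s = int(x.sum()), int((y * x).sum())
T = 20000
alloc = np.empty((T, n))
for t in range(T):
    for i in range(n):
        c_m = c - x[i]               # c_{-i}
        s_m = s - y[i] * x[i]        # s_{-i}
        A1 = ((c_m + 1) * e * factorial(s_m + y[i]) / factorial(s_m)
              * ((c_m + 1) / (c_m + 2)) ** (s_m + 1)
              * (1.0 / (c_m + 2)) ** y[i])
        A0 = n - c_m
        x[i] = int(rng.uniform() < A1 / (A1 + A0))
        c = c_m + x[i]
        s = s_m + y[i] * x[i]
    alloc[t] = x
```

The allocation frequencies show that the observation y = 5 is attributed to the contaminating Poisson component much more often than the zeros, as expected.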
(Baby) Reversible Jump
Gibbs Output on a 100 Observation Poisson Dataset Simulated from λ_0 = 1
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 0.97, std. 0.098); histogram of α (mean 0.99, std. 0.0097); autocorrelations; acceptance rate 100% (α), 100% (λ).]
We reformulate the model to allow for model comparison by

I giving a prior π_1 to the model M_1, which postulates that

L(y_i | α, λ) = (1 − α) (λ_0^{y_i}/y_i!) e^{−λ_0} + α (λ^{y_i}/y_i!) e^{−λ} ;

I giving a prior π_0 = (1 − π_1) to the baseline model M_0, with

L_0(y_i) = (λ_0^{y_i}/y_i!) e^{−λ_0} .

Under M_1, we keep the same prior for α and λ.

We further introduce a latent variable k ∈ {0, 1} which indicates the (unobservable) model in (M_0, M_1) that is assumed to generate the data.
The Posterior
The posterior f is now defined on {0} ∪ ( {1} × [0, 1] × R₊ ) and such that

f(k, α, λ | y) ∝ π_0 δ_0(k) L_0(y) + π_1 δ_1(k) L(y | α, λ) (1/λ_0) e^{−λ/λ_0} .

The Bayes factor for model M_1 compared to model M_0 is given by

[ ∫_{α∈[0,1]} ∫_{λ∈R₊} L(y | α, λ) (1/λ_0) e^{−λ/λ_0} dα dλ ] / L_0(y) .

Instead of trying to compute the Bayes factor directly, the Reversible Jump Metropolis-Hastings algorithm is a sampler that targets f on {0} ∪ ( {1} × [0, 1] × R₊ ).
I If k = 0 ,  # Birth move
    I (α*, λ*) ∼ U[0, 1] ⊗ Exp(λ_0) ;
    I u ∼ U[0, 1] ;
    I If u < [ π_1 L(y | α*, λ*) / (π_0 L_0(y)) ] p ,
        k ← 1 ; α ← α* ; λ ← λ* ;
I Else if k = 1 ,
    I u ∼ U[0, 1] ;
    I If u < p ,  # Death move
        I v ∼ U[0, 1] ;
        I If v < π_0 L_0(y) / [ π_1 L(y | α, λ) p ] ,
            k ← 0 ;
    I Else ,  # Fixed dimension move (. . . )
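This baby reversible-jump sampler can be sketched as follows (illustrative code, not the authors'; the death-move probability p = 1/2 is an assumed tuning, and the fixed-dimension move left open in the slides is filled in here with an independence Metropolis-Hastings step from the prior):

```python
import numpy as np

rng = np.random.default_rng(9)
y = np.array([1, 1, 0, 3, 0, 5])
pi1 = 0.5      # prior probability of M1
p = 0.5        # probability of attempting the death move (assumed tuning)

def log_ratio(a, lam):
    # log of L(y | a, lam) / L0(y), with lambda_0 = 1
    return np.sum(np.log((1.0 - a) + a * lam ** y * np.exp(-(lam - 1.0))))

T = 20000
k, alpha, lam = 0, 0.5, 1.0
ks = np.empty(T, dtype=int)
for t in range(T):
    if k == 0:
        # birth move: draw (alpha*, lambda*) from the prior
        a_star, l_star = rng.uniform(), rng.exponential()
        if np.log(rng.uniform()) < (np.log(p) + np.log(pi1 / (1 - pi1))
                                    + log_ratio(a_star, l_star)):
            k, alpha, lam = 1, a_star, l_star
    elif rng.uniform() < p:
        # death move: drop back to the baseline model M0
        if np.log(rng.uniform()) < (-np.log(p) + np.log((1 - pi1) / pi1)
                                    - log_ratio(alpha, lam)):
            k = 0
    else:
        # fixed-dimension move (assumed here: independence MH from the prior)
        a_star, l_star = rng.uniform(), rng.exponential()
        if np.log(rng.uniform()) < log_ratio(a_star, l_star) - log_ratio(alpha, lam):
            alpha, lam = a_star, l_star
    ks[t] = k
p_m1 = ks.mean()   # estimated posterior probability of M1
```

Because the birth proposal equals the prior on (α, λ), the Jacobian and proposal terms cancel and only the likelihood ratio, the model prior odds, and the move probability p remain in the acceptance ratio.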
In the following, we assume that λ_0 = 1 and π_1 = 1/2, and consider the same datasets as before.
For the 100 Observation Poisson case
[Figure: trace of the model indicator k ∈ {0, 1} over 10,000 iterations.]
The estimated posterior probability for M0 is 0.79
100 Observation Poisson Case: Results conditional on k = 1
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.1, std. 0.43); histogram of α (mean 0.49, std. 0.37); autocorrelations.]
For the other 100 observation dataset, the estimated posterior probability in favor of M_1 is 1.
In the case of y = (1, 1, 0, 3, 0, 5), the posterior probability in favor of M_1 is estimated at 0.56.
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.7, std. 0.97); histogram of α (mean 0.59, std. 0.29); autocorrelations.]
Figure: Results conditional on k = 1