Markov Chain Monte Carlo Methods
Olivier Cappe
LTCI, Telecom ParisTech & CNRS, http://perso.telecom-paristech.fr/~cappe
November 30, 2010
Textbook: Monte Carlo Statistical Methods by Christian P. Robert and George Casella
Slides: Adapted from (mostly) http://www.ceremade.dauphine.fr/~xian/coursBC.pdf
available at http://perso.telecom-paristech.fr/~cappe/cours/mido-tsi.pdf
Other suggested reading...
Outline
Motivations, Random Variable Generation Chapters 1 & 2
Monte Carlo Integration Chapter 3
Notions on Markov Chains Chapter 6
The Metropolis-Hastings Algorithm Chapter 7
The Gibbs Sampler Chapters 8–10
Case Study (with a glimpse of Chapter 11∗)
Corresponding homeworks
Date: Oct. 27 · Nov. 3 · Nov. 10 · Nov. 24 · Dec. 1
Assignment: I (§1 & 2) · II (§3) · III (§6) · IV (§7) · V (§8–10)
Motivation and leading example
Introduction
Likelihood methods
Missing variable models
Bayesian Methods
Bayesian troubles
Random variable generation
Monte Carlo Integration
Notions on Markov Chains for MCMC
The Metropolis-Hastings Algorithm
The Gibbs Sampler
Case Study
Introduction
Latent structures make life harder!
Even simple models may lead to computational complications, as in latent variable models

f(x|θ) = ∫ f*(x, x*|θ) dx*

If (x, x*) is observed: fine!
If only x is observed: trouble!
Example (Mixture models)
Models of mixtures of distributions:
X ∼ fj with probability pj ,
for j = 1, 2, . . . , k, with overall density
X ∼ p1f1(x) + · · ·+ pkfk(x) .
For a sample of independent random variables (X1, · · · , Xn), the sample density is

∏_{i=1}^n { p1 f1(xi) + · · · + pk fk(xi) } .

Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.
[Figure: case of the 0.3 N(µ1, 1) + 0.7 N(µ2, 1) likelihood, plotted over (µ1, µ2)]
Likelihood methods
Maximum likelihood methods
◦ For an iid sample X1, . . . , Xn from a population with density f(x|θ1, . . . , θk), the likelihood function is

L(x|θ) = L(x1, . . . , xn|θ1, . . . , θk) = ∏_{i=1}^n f(xi|θ1, . . . , θk),

and the maximum likelihood estimator is

θ̂n = arg max_θ L(x|θ)
◦ Global justifications from asymptotics
◦ Computational difficulty depends on structure, e.g. latent variables
Example (Mixtures again)
For a mixture of two normal distributions,

p N(µ, τ²) + (1 − p) N(θ, σ²) ,

the likelihood is proportional to

∏_{i=1}^n [ p τ⁻¹ φ((xi − µ)/τ) + (1 − p) σ⁻¹ φ((xi − θ)/σ) ]

containing 2^n terms.
Standard maximization techniques often fail to find the global maximum because of multimodality or undesirable behavior (usually at the frontier of the domain) of the likelihood function.
Example
In the special case
f(x|µ, σ) = (1 − ε) exp{−x²/2} + (ε/σ) exp{−(x − µ)²/2σ²}   (1)

with ε > 0 known, whatever n, the likelihood is unbounded:

lim_{σ→0} L(x1, . . . , xn|µ = x1, σ) = ∞
Missing variable models
The special case of missing variable models
Consider again a latent variable representation
g(x|θ) = ∫_Z f(x, z|θ) dz

Define the completed (but unobserved) likelihood

L^c(x, z|θ) = f(x, z|θ)
Useful for optimisation algorithm
The EM Algorithm
Algorithm (Expectation–Maximisation)
Iterate (in m)
1. (E step) Compute
Q(θ; θ(m), x) = E[ log L^c(x, Z|θ) | θ(m), x ] ,

2. (M step) Maximise Q(θ; θ(m), x) in θ and take

θ(m+1) = arg max_θ Q(θ; θ(m), x).
until a fixed point [of Q] is reached
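As an illustration, the E and M steps above can be sketched for a two-component normal mixture with known weights and unit variances; this is a minimal sketch under simplifying assumptions, and the function name, initial values, and data are illustrative, not from the course:

```python
import math
import random

def em_two_normal_means(xs, p=0.7, iters=50, mu1=0.0, mu2=1.0):
    """EM for p N(mu1,1) + (1-p) N(mu2,1) with unknown means only."""
    for _ in range(iters):
        # E step: posterior probability (responsibility) of component 1
        r1 = []
        for x in xs:
            a = p * math.exp(-0.5 * (x - mu1) ** 2)
            b = (1 - p) * math.exp(-0.5 * (x - mu2) ** 2)
            r1.append(a / (a + b))
        # M step: Q is maximised in closed form by weighted averages
        s1 = sum(r1)
        mu1 = sum(r * x for r, x in zip(r1, xs)) / s1
        mu2 = sum((1 - r) * x for r, x in zip(r1, xs)) / (len(xs) - s1)
    return mu1, mu2

random.seed(0)
# synthetic sample from 0.7 N(0,1) + 0.3 N(3,1)
xs = [random.gauss(0.0, 1.0) if random.random() < 0.7 else random.gauss(3.0, 1.0)
      for _ in range(2000)]
mu1_hat, mu2_hat = em_two_normal_means(xs, p=0.7, mu1=-1.0, mu2=1.0)
```

Each iteration increases the observed likelihood, so with well-separated components the fixed point lands near the true means.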
[Figure: sample from the mixture model]
[Figure: likelihood of .7 N(µ1, 1) + .3 N(µ2, 1) and EM steps]
Bayesian Methods
The Bayesian Perspective
In the Bayesian paradigm, the information brought by the data x, realization of
X ∼ f(x|θ),
is combined with prior information specified by prior distributionwith density
π(θ)
Central tool
Summary in a probability distribution, π(θ|x), called the posterior distribution.
Derived from the joint distribution f(x|θ)π(θ), according to

π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ ,

[Bayes Theorem]

where

Z(x) = ∫ f(x|θ)π(θ) dθ

is the marginal density of X, also called the (Bayesian) evidence
Central tool... central to Bayesian inference
Posterior defined up to a constant as
π(θ|x) ∝ f(x|θ)π(θ)
• Operates conditional upon the observations
• Integrates simultaneously the prior information and the information brought by x
• Avoids averaging over the unobserved values of x
• Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected
• Provides a complete inferential scope and a unique motor of inference
Bayesian troubles
Conjugate bonanza...
Example (Binomial)
For an observation X ∼ B(n, p), the so-called conjugate prior is the family of beta Be(a, b) distributions.
The classical Bayes estimator δπ is the posterior mean

δπ = [ Γ(a + b + n) / (Γ(a + x) Γ(n − x + b)) ] ∫₀¹ p · p^{x+a−1} (1 − p)^{n−x+b−1} dp = (x + a)/(a + b + n) .
Conjugate Prior
Conjugacy
Given a likelihood function L(y|θ), the family Π of priors π0 on Θ is conjugate if the posterior π(θ|y) also belongs to Π

In this case, posterior inference is tractable and reduces to updating the hyperparameters* of the prior

*The hyperparameters are parameters of the priors; they are most often not treated as random variables
Discrete/Multinomial & Dirichlet
If the observations consist of positive counts Y1, . . . , Yd modelled by a Multinomial distribution

L(y|θ, n) = (n! / ∏_{i=1}^d yi!) ∏_{i=1}^d θi^{yi}

The conjugate family is the D(α1, . . . , αd) distribution

π(θ|α) = ( Γ(∑_{i=1}^d αi) / ∏_{i=1}^d Γ(αi) ) ∏_{i=1}^d θi^{αi−1}

defined on the probability simplex (θi ≥ 0, ∑_{i=1}^d θi = 1), where Γ is the gamma function Γ(α) = ∫₀^∞ t^{α−1} e^{−t} dt (Γ(k) = (k − 1)! for integers k)
Figure: Dirichlet: 1D marginals
Figure: Dirichlet: 3D examples (projected on two dimensions)
Multinomial Posterior
Posterior

π(θ|y) = D(y1 + α1, . . . , yd + αd)

Posterior Mean†

( (yi + αi) / ∑_{j=1}^d (yj + αj) )_{1≤i≤d}

MAP

( (yi + αi − 1) / ∑_{j=1}^d (yj + αj − 1) )_{1≤i≤d}

if yi + αi > 1 for i = 1, . . . , d

Evidence

Z(y) = Γ(∑_{i=1}^d αi) ∏_{i=1}^d Γ(yi + αi) / [ ∏_{i=1}^d Γ(αi) Γ(∑_{i=1}^d (yi + αi)) ]

†Also known as Laplace smoothing when αi = 1
Conjugate Priors for the Normal I
Conjugate Prior for the Normal Mean
For the N(y|µ, w) distribution with iid observations y1, . . . , yn, the conjugate prior for the mean µ is Gaussian N(µ|m0, v0):

π(µ|y1:n) ∝ exp[−(µ − m0)²/2v0] ∏_{k=1}^n exp[−(yk − µ)²/2w]
∝ exp{ −(1/2)[ µ²(1/v0 + n/w) − 2µ(m0/v0 + sn/w) ] }
= N( µ | (sn + m0 w/v0)/(n + w/v0) , w/(n + w/v0) )

where sn = ∑_{k=1}^n yk and y1:n denotes the collection y1, . . . , yn
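A minimal sketch of this conjugate update, directly transcribing the posterior mean and variance formulas above (the helper name is illustrative):

```python
def normal_mean_posterior(ys, w, m0, v0):
    """Posterior of mu for iid N(mu, w) data under a N(m0, v0) prior:
    N((s_n + m0*w/v0)/(n + w/v0), w/(n + w/v0)) with s_n = sum(ys)."""
    n, sn = len(ys), sum(ys)
    denom = n + w / v0
    return (sn + m0 * w / v0) / denom, w / denom

# three observations, unit data variance, standard normal prior on mu
post_mean, post_var = normal_mean_posterior([1.0, 2.0, 3.0], w=1.0, m0=0.0, v0=1.0)
```

With n = 3, sn = 6 and w/v0 = 1 the posterior is N(6/4, 1/4): the prior acts like one extra pseudo-observation at m0.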
Conjugate Priors for the Normal II
Conjugate Priors for the Normal Variance
If w is to be estimated and µ is known, the conjugate prior for w is the inverse Gamma distribution IG(w|α0, β0):

π0(w|α0, β0) = (β0^{α0}/Γ(α0)) w^{−(α0+1)} e^{−β0/w}

and

π(w|y1:n) ∝ w^{−(α0+1)} e^{−β0/w} ∏_{k=1}^n (1/√w) exp[−(yk − µ)²/2w]
= w^{−(n/2+α0+1)} exp[ −(s_n^{(2)}/2 + β0)/w ]

where s_n^{(2)} = ∑_{k=1}^n (yk − µ)².
The Gamma, Chi-Square and Inverses
The Gamma Distribution*

Ga(θ|α, β) = (β^α/Γ(α)) θ^{α−1} e^{−βθ}

where α is the shape and β the inverse scale parameter (E(θ) = α/β, Var(θ) = α/β²)

• θ ∼ IG(θ|α, β): 1/θ ∼ Ga(θ|α, β)
• θ ∼ χ²(θ|ν): θ ∼ Ga(θ|ν/2, 1/2)

*A different convention is to use Gam*(a, b), where b = 1/β is the scale parameter
Figure: Gamma pdf (k = α, θ = 1/β)
Conjugate Priors for the Normal IV
Example (Normal)
In the normal N(µ, w) case, with both µ and w unknown, the conjugate prior on θ = (µ, w) is of the form

w^{−λw} exp{ −[λµ(µ − ξ)² + α]/w }

since

π((µ, w)|x1, . . . , xn) ∝ w^{−λw} exp{−[λµ(µ − ξ)² + α]/w} × w^{−n} exp{−[n(µ − x̄)² + s²x]/w}
∝ w^{−(λw+n)} exp{ −[ (λµ + n)(µ − ξx)² + α + s²x + (nλµ/(n + λµ))(x̄ − ξ)² ]/w }
Conjugate Priors for the Normal III
Conjugate Priors are However Available Only in Simple Cases
In the previous example the conjugate prior when both µ and w are unknown is not particularly useful.
• Hence, it is very common to resort to independent marginally conjugate priors: e.g., in the Gaussian case, take N(µ|m0, v0) IG(w|α0, β0) as prior; then π(µ|w, y) is Gaussian and π(w|µ, y) is inverse-gamma, but π(µ, w|y) does not belong to a known family‡
• There nonetheless exist some important multivariate extensions: the Bayesian normal linear model, the inverse-Wishart distribution for covariance matrices
‡Although closed-form expressions for π(µ|y) and π(w|y) are available
...and conjugate curse
Conjugate priors are very limited in scope
In addition, the use of conjugate priors only for computational reasons
• implies a restriction on the modeling of the available prior information
• may be detrimental to the usefulness of the Bayesian approach
• gives an impression of subjective manipulation of the prior information disconnected from reality.
A typology of Bayes computational problems
(i) latent variable models in general;
(ii) use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;
(iii) use of a complex sampling model with an intractable likelihood, as for instance in some graphical models;
(iv) use of a huge dataset;
(v) use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample).
Random variable generation
Motivation and leading example
Random variable generation
Basic methods
Uniform pseudo-random generator
Beyond Uniform distributions
Transformation methods
Accept-Reject Methods
Fundamental theorem of simulation
Log-concave densities
Monte Carlo Integration
Notions on Markov Chains for MCMC
The Metropolis-Hastings Algorithm
The Gibbs Sampler
Case Study
• Rely on the possibility of producing (computer-wise) an endless flow of random variables (usually iid) from well-known distributions
• Given a uniform random number generator, illustration of methods that produce random variables from both standard and nonstandard distributions
Basic methods
The inverse transform method
For a function F on R, the generalized inverse of F, F⁻, is defined by

F⁻(u) = inf{x; F(x) ≥ u} .

Definition (Probability Integral Transform)
If U ∼ U[0,1], then the random variable F⁻(U) has the distribution F.
The inverse transform method (2)
To generate a random variable X ∼ F , simply generate
U ∼ U[0,1]
and then make the transform
x = F−(u)
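A hedged sketch of this transform for the exponential distribution, where F⁻(u) = −log(1 − u)/λ is available in closed form (function and variable names are illustrative):

```python
import math
import random

def exp_inverse_transform(lam, u):
    """F^-(u) for Exp(lam): F(x) = 1 - exp(-lam*x), so F^-(u) = -log(1-u)/lam."""
    return -math.log(1.0 - u) / lam   # 1 - u is in (0, 1], so log is safe

random.seed(1)
sample = [exp_inverse_transform(2.0, random.random()) for _ in range(100_000)]
sample_mean = sum(sample) / len(sample)   # should approach 1/lam = 0.5
```

The empirical mean of the draws converges to 1/λ, a quick sanity check on the transform.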
Uniform pseudo-random generator
Desiderata and limitations
• Production of a deterministic sequence of values in [0, 1] which imitates a sequence of iid uniform random variables U[0,1].
• Can't use the physical imitation of a "random draw" [no guarantee of uniformity, no reproducibility]
• Random sequence in the sense: having generated (X1, · · · , Xn), knowledge of Xn [or of (X1, · · · , Xn)] imparts no discernible knowledge of the value of Xn+1.
• Deterministic: given the initial value X0, the sample (X1, · · · , Xn) is always the same
• Validity of a random number generator is based on a single sample X1, · · · , Xn when n tends to +∞, not on replications

(X11, · · · , X1n), (X21, · · · , X2n), . . . , (Xk1, · · · , Xkn)

where n is fixed and k tends to infinity.
Uniform pseudo-random generator
Algorithm starting from an initial value 0 ≤ u0 ≤ 1 and a transformation D, which produces a sequence

(ui) = (D^i(u0))

in [0, 1]. For all n,

(u1, · · · , un)

reproduces the behavior of an iid U[0,1] sample (V1, · · · , Vn) when compared through the usual tests
Uniform pseudo-random generator (2)
• Validity means the sequence U1, · · · , Un leads to accepting the hypothesis

H : U1, · · · , Un are iid U[0,1].

• The set of tests used is generally of some consequence
◦ Kolmogorov–Smirnov and other nonparametric tests
◦ Time series methods, for correlation between Ui and (Ui−1, · · · , Ui−k)
◦ Marsaglia's battery of tests called Die Hard (!)
Usual generators
In R and S-plus: procedure runif()

The Uniform Distribution

Description:
     'runif' generates random deviates.

Example:
     u <- runif(20)

'.Random.seed' is an integer vector, containing the random number generator state for random number generation in R. It can be saved and restored, but should not be altered by users.
[Figure: trace and histogram of a uniform sample]
Usual generators (2)
In C: procedure rand() or random()

SYNOPSIS
     #include <stdlib.h>
     long int random(void);

DESCRIPTION
     The random() function uses a non-linear additive feedback random number generator employing a default table of size 31 long integers to return successive pseudo-random numbers in the range from 0 to RAND_MAX. The period of this random generator is very large, approximately 16*((2**31)-1).

RETURN VALUE
     random() returns a value between 0 and RAND_MAX.
Usual generators (3)
In Matlab and Octave: procedure rand()

RAND Uniformly distributed pseudorandom numbers.
     R = RAND(M,N) returns an M-by-N matrix containing pseudorandom values drawn from the standard uniform distribution on the open interval (0,1).
     The sequence of numbers produced by RAND is determined by the internal state of the uniform pseudorandom number generator that underlies RAND, RANDI, and RANDN.
Usual generators (4)
In Scilab: procedure rand()

rand() : with no arguments gives a scalar whose value changes each time it is referenced. By default, random numbers are uniformly distributed in the interval (0,1). rand('normal') switches to a normal distribution with mean 0 and variance 1.

EXAMPLE
     x = rand(10,10,'uniform')
Beyond Uniform distributions
Beyond Uniform generators
• Generation of any sequence of random variables can be formally implemented through a uniform generator
◦ Distributions with explicit F⁻ (for instance, exponential and Weibull distributions): use the probability integral transform
◦ Case-specific methods rely on properties of the distribution (for instance, normal distribution, Poisson distribution)
◦ More generic methods (for instance, accept-reject)
• Simulation of the standard distributions is accomplished quite efficiently by many numerical and statistical programming packages.
Transformation methods
Transformation methods
Case where a distribution F is linked in a simple way to another distribution that is easy to simulate.

Example (Exponential variables)
If U ∼ U[0,1], the random variable

X = − log U/λ

has distribution

P(X ≤ x) = P(− log U ≤ λx) = P(U ≥ e^{−λx}) = 1 − e^{−λx},

the exponential distribution Exp(λ).
Other random variables that can be generated starting from an exponential include

Y = −2 ∑_{j=1}^{ν} log(Uj) ∼ χ²_{2ν}

Y = −(1/β) ∑_{j=1}^{a} log(Uj) ∼ Ga(a, β)

Y = ∑_{j=1}^{a} log(Uj) / ∑_{j=1}^{a+b} log(Uj) ∼ Be(a, b)
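These log-of-uniform representations translate directly into code; the sketch below draws Ga(a, β) for an integer shape a (names are illustrative, stdlib only):

```python
import math
import random

def gamma_int_shape(a, beta, rng):
    """Ga(a, beta) for integer shape a: minus the sum of a log-uniforms,
    scaled by 1/beta (each -log U is one Exp(1) variable)."""
    return -sum(math.log(1.0 - rng.random()) for _ in range(a)) / beta

rng = random.Random(2)
g = [gamma_int_shape(3, 2.0, rng) for _ in range(50_000)]
mean_g = sum(g) / len(g)   # expectation is a/beta = 1.5
```

Using 1 − U instead of U avoids log 0 while leaving the distribution unchanged.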
Points to note
◦ Transformations are quite simple to use
◦ There are more efficient algorithms for gamma and beta random variables
◦ Cannot generate gamma random variables with a non-integer shape parameter
◦ For instance, cannot get a χ²₁ variable, which would get us a N(0, 1) variable.
Box-Muller Algorithm
Example (Normal variables)
If r, θ are the polar coordinates of (X1, X2), then

r² = X1² + X2² ∼ χ²₂ = Exp(1/2) and θ ∼ U[0, 2π]

Consequence: if U1, U2 are iid U[0,1],

X1 = √(−2 log U1) cos(2πU2)
X2 = √(−2 log U1) sin(2πU2)

are iid N(0, 1).
Markov Chain Monte Carlo Methods
Random variable generation
Transformation methods
Box-Muller Algorithm (2)
1. Generate U1, U2 iid U[0,1] ;
2. Define

x1 = √(−2 log u1) cos(2πu2) ,
x2 = √(−2 log u1) sin(2πu2) ;
3. Take x1 and x2 as two independent draws from N (0, 1).
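A minimal sketch of the three steps above (u1 is drawn as 1 − U so that log 0 cannot occur; names are illustrative):

```python
import math
import random

def box_muller(rng):
    """One Box-Muller step: two iid N(0,1) draws from two iid uniforms."""
    u1, u2 = 1.0 - rng.random(), rng.random()   # u1 in (0, 1]
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2 * math.pi * u2), r * math.sin(2 * math.pi * u2)

rng = random.Random(3)
zs = [z for _ in range(20_000) for z in box_muller(rng)]
mean_z = sum(zs) / len(zs)
var_z = sum(z * z for z in zs) / len(zs)
```

The empirical mean and variance of the 40 000 draws should be close to 0 and 1.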
Box-Muller Algorithm (3)
• Unlike algorithms based on the CLT, this algorithm is exact
• Get two normals for the price of two uniforms
• Drawback (in speed): calculating log, cos and sin.
More transforms
Example (Poisson generation)
Poisson–exponential connection: if N ∼ P(λ) and Xi ∼ Exp(λ), i ∈ N*,

Pλ(N = k) = Pλ(X1 + · · · + Xk ≤ 1 < X1 + · · · + Xk+1) .
More Poisson
• A Poisson variable can be simulated by generating Exp(λ) variables until their sum exceeds 1.
• This method is simple, but is really practical only for smaller values of λ.
• On average, the number of exponential variables required is λ.
• Other approaches are more suitable for large λ's.
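A sketch of this construction in its equivalent product-of-uniforms form: multiplying uniforms until the product falls below e^{−λ} counts the same number of arrivals without computing any logs (names are illustrative):

```python
import math
import random

def poisson_via_exponentials(lam, rng):
    """N = number of Exp(lam) arrivals with total time at most 1;
    sum of -log(U_i)/lam exceeding 1 is the same event as the
    running product of the U_i dropping below exp(-lam)."""
    k, prod, thresh = 0, 1.0, math.exp(-lam)
    while True:
        prod *= rng.random()
        if prod < thresh:
            return k
        k += 1

rng = random.Random(4)
draws = [poisson_via_exponentials(3.0, rng) for _ in range(20_000)]
mean_n = sum(draws) / len(draws)   # should be close to lam = 3
```

The expected number of uniforms consumed per draw is λ + 1, which is why the method degrades for large λ.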
Negative extension
• A generator of Poisson random variables can produce negative binomial random variables since

Y ∼ Ga(n, (1 − p)/p), X|y ∼ P(y)

implies

X ∼ Neg(n, p)
Mixture representation
• The representation of the negative binomial is a particular case of a mixture distribution
• The principle of a mixture representation is to represent a density f as the marginal of another distribution, for example

f(x) = ∑_{i∈Y} pi fi(x) ,

• If the component distributions fi(x) can be easily generated, X can be obtained by first choosing fi with probability pi and then generating an observation from fi.
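The two-stage recipe (choose a component with probability pi, then draw from fi) can be sketched as follows (names and the two-component normal example are illustrative):

```python
import random

def sample_mixture(weights, samplers, rng):
    """Draw from sum_i p_i f_i: pick component i with probability p_i,
    then draw one observation from f_i."""
    u, acc = rng.random(), 0.0
    for p, draw in zip(weights, samplers):
        acc += p
        if u <= acc:
            return draw(rng)
    return samplers[-1](rng)   # guard against floating-point rounding

rng = random.Random(5)
draws = [sample_mixture([0.3, 0.7],
                        [lambda r: r.gauss(0.0, 1.0),
                         lambda r: r.gauss(4.0, 1.0)], rng)
         for _ in range(20_000)]
mean_x = sum(draws) / len(draws)   # E[X] = 0.3*0 + 0.7*4 = 2.8
```

Only one uniform and one component draw are needed per observation, whatever the number of components.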
Partitioned sampling
Special case of mixture sampling when
fi(x) = f(x) I_{Ai}(x) / ∫_{Ai} f(x) dx

and

pi = P(X ∈ Ai)

for a partition (Ai)i
Accept-Reject Methods
Accept-Reject algorithm
• There are many distributions from which it is difficult, or even impossible, to directly simulate.
• Another class of methods only requires us to know the functional form of the density f of interest up to a multiplicative constant.
• The key to this method is to use a simpler (simulation-wise) density g, the instrumental density, from which the simulation from the target density f is actually done.
Fundamental theorem of simulation
Fundamental theorem of simulation
Lemma
Simulating
X ∼ f(x)
is equivalent to simulating

(X, U) ∼ U{(x, u) : 0 < u < f(x)}
The Accept-Reject algorithm
Given a density of interest f, find a density g and a constant M such that

f(x) ≤ M g(x)

on the support of f.

Accept-Reject Algorithm

1. Generate X ∼ g, U ∼ U[0,1] ;
2. Accept Y = X if U ≤ f(X)/(M g(X)) ;
3. Return to 1. otherwise.
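A generic sketch of the algorithm above; f need only be known up to the constant folded into M (names and the Be(2, 2) usage example are illustrative):

```python
import random

def accept_reject(f, g_sample, g_pdf, M, rng):
    """Draw X ~ g repeatedly until U <= f(X) / (M g(X))."""
    while True:
        x = g_sample(rng)
        if rng.random() <= f(x) / (M * g_pdf(x)):
            return x

rng = random.Random(6)
f = lambda x: 6.0 * x * (1.0 - x)   # Be(2, 2) density, max value 1.5 at x = 1/2
draws = [accept_reject(f, lambda r: r.random(), lambda x: 1.0, 1.5, rng)
         for _ in range(20_000)]
mean_x = sum(draws) / len(draws)    # E[X] = 1/2 for Be(2, 2)
```

With a uniform proposal and M = 1.5, each draw costs 1.5 proposals on average, matching the 1/M acceptance rate discussed below.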
Validation of the Accept-Reject method
Warranty:
This algorithm produces a variable Y distributed according to f
Two interesting properties
◦ First, it provides a generic method to simulate from any density f that is known up to a multiplicative factor. This property is particularly important in Bayesian calculations, where the posterior distribution

π(θ|x) ∝ π(θ) f(x|θ)

is specified up to a normalizing constant
◦ Second, the probability of acceptance in the algorithm is 1/M, i.e., the expected number of trials until a variable is accepted is M
More interesting properties
◦ In cases where f and g are both probability densities, the constant M is necessarily larger than 1.
◦ The size of M, and thus the efficiency of the algorithm, is a function of how closely g can imitate f, especially in the tails
◦ For f/g to remain bounded, it is necessary for g to have tails thicker than those of f. It is therefore impossible to use the A-R algorithm to simulate a Cauchy distribution f using a normal distribution g; however, the reverse works quite well.
No Cauchy!
Example (Normal from a Cauchy)
Take
f(x) = (1/√(2π)) exp(−x²/2)

and

g(x) = (1/π) · 1/(1 + x²) ,

the densities of the normal and Cauchy distributions. Then

f(x)/g(x) = √(π/2) (1 + x²) e^{−x²/2} ≤ √(2π/e) = 1.52

attained at x = ±1.
Example (Normal from a Cauchy (2))
So probability of acceptance
1/1.52 = 0.66,
and, on the average, one out of every three simulated Cauchy variables is rejected.
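This example can be sketched directly, with the Cauchy proposal drawn by inverse transform and M = √(2π/e) (names are illustrative):

```python
import math
import random

M = math.sqrt(2.0 * math.pi / math.e)   # the bound 1.52 on f/g

def normal_from_cauchy(rng):
    """Accept-Reject: N(0,1) target, C(0,1) proposal."""
    while True:
        x = math.tan(math.pi * (rng.random() - 0.5))   # Cauchy via inverse cdf
        ratio = math.sqrt(math.pi / 2.0) * (1 + x * x) * math.exp(-x * x / 2) / M
        if rng.random() <= ratio:
            return x

rng = random.Random(7)
zs = [normal_from_cauchy(rng) for _ in range(20_000)]
mean_z = sum(zs) / len(zs)
var_z = sum(z * z for z in zs) / len(zs)
```

The empirical acceptance frequency hovers near 1/1.52 ≈ 0.66, in line with the slide.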
No Double!
Example (Normal/Double Exponential)
Generate a N(0, 1) by using a double-exponential distribution with density

g(x|α) = (α/2) exp(−α|x|)

Then

f(x)/g(x|α) ≤ √(2/π) α⁻¹ e^{α²/2}

and the minimum of this bound (in α) is attained for

α* = 1
Example (Normal/Double Exponential (2))
Probability of acceptance: √(π/2e) = .76
To produce one normal random variable requires on the average 1/.76 ≈ 1.3 uniform variables.
Example (Gamma generation)
Illustrates a real advantage of the Accept-Reject algorithm: the gamma distribution Ga(α, β) can be represented as the sum of α exponential random variables only if α is an integer
Example (Gamma generation (2))
Can use the Accept-Reject algorithm with instrumental distribution
Ga(a, b), with a = [α], α ≥ 1.

(Without loss of generality, β = 1.)
Up to a normalizing constant,

f/g_b = b^{−a} x^{α−a} exp{−(1 − b)x} ≤ b^{−a} ( (α − a)/((1 − b)e) )^{α−a}

for b ≤ 1. The maximum is attained at b = a/α.
Truncated Normal simulation
Example (Truncated Normal distributions)
The constraint x ≥ µ̲ produces a density proportional to

e^{−(x−µ)²/2σ²} I_{x≥µ̲}

for a bound µ̲ large compared with µ.
There exist alternatives far superior to the naive method of generating N(µ, σ²) draws until one exceeds µ̲, which requires an average number of

1/Φ((µ − µ̲)/σ)

simulations from N(µ, σ²) for a single acceptance.
Example (Truncated Normal distributions (2))
Instrumental distribution: the translated exponential distribution E(α, µ̲), with density

gα(z) = α e^{−α(z−µ̲)} I_{z≥µ̲} .

The ratio f/gα is bounded by

f/gα ≤ 1/α · exp(α²/2 − αµ̲)  if α > µ̲ ,
f/gα ≤ 1/α · exp(−µ̲²/2)    otherwise.
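A sketch of this translated-exponential Accept-Reject sampler for a standard normal truncated to [a, ∞); the rate α = (a + √(a² + 4))/2 used here is the choice that minimises the bound in the literature on truncated normal simulation, and all names are illustrative:

```python
import math
import random

def truncated_normal(a, rng):
    """N(0,1) restricted to [a, inf) via a translated exponential proposal.
    With alpha >= a, f/(M g) simplifies to exp(-(z - alpha)^2 / 2)."""
    alpha = (a + math.sqrt(a * a + 4.0)) / 2.0
    while True:
        z = a - math.log(1.0 - rng.random()) / alpha   # Exp(alpha) shifted to a
        if rng.random() <= math.exp(-0.5 * (z - alpha) ** 2):
            return z

rng = random.Random(8)
zs = [truncated_normal(2.0, rng) for _ in range(5_000)]
mean_z = sum(zs) / len(zs)   # phi(2)/(1 - Phi(2)), roughly 2.37
```

Unlike the naive method, whose cost explodes as 1/Φ(µ − µ̲) for a far-out bound, this sampler keeps a high acceptance rate for any a.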
Log-concave densities
Log-concave densities (1)
Densities f whose logarithm is concave, for instance Bayesian posterior distributions such that

log π(θ|x) = log π(θ) + log f(x|θ) + c

is concave
Log-concave densities (2)
Take

Sn = {xi, i = 0, 1, . . . , n + 1} ⊂ supp(f)

such that h(xi) = log f(xi) is known up to the same constant.
By concavity of h, the line Li,i+1 through (xi, h(xi)) and (xi+1, h(xi+1)) is

• below h in [xi, xi+1] and
• above this graph outside this interval
Log-concave densities (3)
For x ∈ [xi, xi+1], if

h̄n(x) = min{Li−1,i(x), Li+1,i+2(x)} and h̲n(x) = Li,i+1(x) ,

the envelopes are

h̲n(x) ≤ h(x) ≤ h̄n(x)

uniformly on the support of f, with

h̲n(x) = −∞ and h̄n(x) = min(L0,1(x), Ln,n+1(x))

on [x0, xn+1]^c.
Log-concave densities (4)
Therefore, if

f̲n(x) = exp h̲n(x) and f̄n(x) = exp h̄n(x)

then

f̲n(x) ≤ f(x) ≤ f̄n(x) = ϖn gn(x) ,

where ϖn is the normalizing constant of f̄n
ARS Algorithm
1. Initialize n and Sn.
2. Generate X ∼ gn(x), U ∼ U[0,1].
3. If U ≤ f̲n(X)/ϖn gn(X), accept X;
otherwise, if U ≤ f(X)/ϖn gn(X), accept X
Example (Northern Pintail ducks)
Ducks are captured at time i with probability pi; both the pi and the size N of the population are unknown.
Dataset

(n1, . . . , n11) = (32, 20, 8, 5, 1, 2, 0, 2, 1, 1, 0)

Number of recoveries over the years 1957–1968 of 1612 Northern Pintail ducks banded in 1956
Example (Northern Pintail ducks (2))
Corresponding conditional likelihood
L(n1, . . . , nI |N, p1, . . . , pI) ∝ ∏_{i=1}^I pi^{ni} (1 − pi)^{N−ni} ,

where I is the number of captures, ni the number of captured animals during the ith capture, and r the total number of different captured animals.
Example (Northern Pintail ducks (3))
Prior selection:
If

N ∼ P(λ)

and

αi = log( pi/(1 − pi) ) ∼ N(µi, σ²),

[Normal logistic]
Example (Northern Pintail ducks (4))
Posterior distribution
π(α, N |n1, . . . , nI) ∝ [N!/(N − r)!] (λ^N/N!) ∏_{i=1}^I (1 + e^{αi})^{−N} ∏_{i=1}^I exp{ αi ni − (αi − µi)²/2σ² }
Example (Northern Pintail ducks (5))
For the conditional posterior distribution
For the conditional posterior distribution

π(αi|N, n1, . . . , nI) ∝ exp{ αi ni − (αi − µi)²/2σ² } / (1 + e^{αi})^N ,

the ARS algorithm can be implemented since

αi ni − (αi − µi)²/2σ² − N log(1 + e^{αi})

is concave in αi.
[Figure: posterior distributions of capture log-odds ratios for the years 1957–1965]
[Figure (1960): true distribution versus histogram of simulated sample]
Monte Carlo Integration
Motivation and leading example
Random variable generation
Monte Carlo Integration
Introduction
Monte Carlo Integration
Importance Sampling
Self-Normalised Importance Sampling
Acceleration Methods
Application to Bayesian Model Choice
Notions on Markov Chains for MCMC
The Metropolis-Hastings Algorithm
The Gibbs Sampler
Case Study
Introduction
Quick reminder
Two major classes of numerical problems arise in statistical inference:
◦ Optimization: generally associated with the likelihood approach
◦ Integration: generally associated with the Bayesian approach
Example (Bayesian decision theory)
Bayes estimators are not always posterior expectations, but rather solutions of the minimization problem

min_δ ∫_Θ L(θ, δ) π(θ) f(x|θ) dθ .

Proper loss: for L(θ, δ) = (θ − δ)², the Bayes estimator is the posterior mean
Absolute error loss: for L(θ, δ) = |θ − δ|, the Bayes estimator is the posterior median
Without a loss function: use the maximum a posteriori (MAP) estimator

arg max_θ f(x|θ)π(θ)
Monte Carlo integration
Theme: the generic problem of evaluating the integral

I = Ef[h(X)] = ∫_X h(x) f(x) dx

where X is uni- or multidimensional, f is a closed form, partly closed form, or implicit density, and h is a function
Monte Carlo integration (2)
Monte Carlo solution: first use a sample (X1, . . . , Xm) from the density f to approximate the integral I by the empirical average

h̄m = (1/m) ∑_{j=1}^m h(Xj)

which converges,

h̄m −→ Ef[h(X)] ,

by the Strong Law of Large Numbers
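The empirical-average estimator is a few lines of code; the sketch below uses E[X²] = 1 under N(0, 1) as a check, and all names are illustrative:

```python
import random

def mc_integrate(h, sampler, m, rng):
    """Approximate E_f[h(X)] by the empirical average over m draws from f."""
    return sum(h(sampler(rng)) for _ in range(m)) / m

rng = random.Random(9)
# E[X^2] under N(0, 1) is exactly 1
est = mc_integrate(lambda x: x * x, lambda r: r.gauss(0.0, 1.0), 100_000, rng)
```

The error decreases as O(1/√m) regardless of the dimension of X, which is the method's main selling point over quadrature.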
Monte Carlo precision
Estimate the variance with

vm = (1/(m − 1)) ∑_{j=1}^m [h(Xj) − h̄m]² ,

and for m large,

√m (h̄m − Ef[h(X)]) / √vm ∼ N(0, 1),

(assuming Ef[h²(X)] < ∞).

Note: this can lead to the construction of a convergence test and of confidence bounds on the approximation of Ef[h(X)].
Example (Cauchy prior/normal sample)
For estimating a normal mean, a robust prior is a Cauchy prior:

X ∼ N(θ, 1), θ ∼ C(0, 1).

Under squared error loss, the posterior mean is

δπ(x) = ∫_{−∞}^{∞} [θ/(1 + θ²)] e^{−(x−θ)²/2} dθ / ∫_{−∞}^{∞} [1/(1 + θ²)] e^{−(x−θ)²/2} dθ
Example (Cauchy prior/normal sample (2))
Form of δπ suggests simulating iid variables
θ1, · · · , θm ∼ N (x, 1)
and calculating
δπm(x) =m∑i=1
θi1 + θ2
i
/ m∑i=1
11 + θ2
i
.
The Law of Large Numbers implies
δπm(x) −→ δπ(x) as m −→∞.
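A minimal sketch of this ratio estimator (Python rather than the course's R, purely for illustration; x = 10 as in the figure that follows):

```python
import random

random.seed(0)

def delta_pi(x, m=100_000):
    """Estimate the posterior mean: theta_i ~ N(x, 1), weighted by the
    Cauchy prior factor 1/(1 + theta^2)."""
    num = den = 0.0
    for _ in range(m):
        t = random.gauss(x, 1.0)
        w = 1.0 / (1.0 + t * t)
        num += t * w
        den += w
    return num / den

print(delta_pi(10.0))  # slightly shrunk from x = 10 toward the prior median 0
```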
[Figure: Range of estimators δ^π_m over 1000 iterations for 100 runs and x = 10; the estimates spread over roughly 9.6 to 10.6.]
Importance Sampling

Importance sampling

Paradox: simulation from f (the true density) is not necessarily optimal

An alternative to direct sampling from f is importance sampling, based on the alternative representation

E_f[h(X)] = ∫_X [h(x) f(x)/g(x)] g(x) dx ,

which allows us to use distributions other than f
Importance sampling (2)

Importance sampling algorithm

To evaluate

E_f[h(X)] = ∫_X h(x) f(x) dx ,

do

1. Generate a sample X_1, . . . , X_m from a distribution g

2. Use the weighted approximation

(1/m) ∑_{j=1}^m [f(X_j)/g(X_j)] h(X_j)
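A sketch of the algorithm in Python, under the assumed toy choices f = N(0, 1), g = N(0, 2²) and h(x) = I(x > 2), so that the target value is the tail probability P(X > 2) ≈ 0.0228:

```python
import math
import random

random.seed(2)

SQRT2PI = math.sqrt(2 * math.pi)

def norm_pdf(x, mu=0.0, sd=1.0):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * SQRT2PI)

def is_estimate(h, m=200_000):
    """(1/m) sum_j f(X_j)/g(X_j) h(X_j) with X_j ~ g = N(0, 2^2), f = N(0, 1)."""
    total = 0.0
    for _ in range(m):
        x = random.gauss(0.0, 2.0)
        total += norm_pdf(x) / norm_pdf(x, 0.0, 2.0) * h(x)
    return total / m

print(is_estimate(lambda x: 1.0 if x > 2.0 else 0.0))  # near P(X > 2) ~ 0.0228
```

Because g has heavier tails than f, the weights are bounded and the estimator has finite variance, in line with the conditions discussed below.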
Consistency of importance sampling

The estimator

(1/m) ∑_{j=1}^m [f(X_j)/g(X_j)] h(X_j) −→ ∫_X h(x) f(x) dx

converges almost surely for any choice of the distribution g [as long as supp(g) ⊃ supp(f)]
Important details

◦ The instrumental distribution g is chosen among distributions that are easy to simulate

◦ The same sample (generated from g) can be used repeatedly, not only for different functions h, but also for different densities f

◦ Even dependent proposals can be used, as seen later in the PMC chapter
Although g can be any density, some choices are better than others:

◦ Finite variance only when

E_f[h²(X) f(X)/g(X)] = ∫_X h²(x) f²(x)/g(x) dx < ∞ .

◦ Instrumental distributions with tails lighter than those of f (that is, with sup f/g = ∞) are not appropriate.

◦ If sup f/g = ∞, the weights f(x_j)/g(x_j) vary widely, giving too much importance to a few values x_j.

◦ If sup f/g = M < ∞, the accept-reject algorithm can be used as well to simulate from f directly.
Example (Counter-Example: Cauchy target)

Case of a Cauchy C(0, 1) target when the importance function is Gaussian N(0, 1). The ratio of the densities,

ω(x) = p⋆(x)/p_0(x) = √(2π) exp(x²/2) / (π (1 + x²)) ,

is very badly behaved: e.g.,

∫_{−∞}^{∞} ω(x)² p_0(x) dx = ∞ .

Poor performance of the associated importance sampling estimator
[Figure: Range and average of 500 replications of the IS estimate of E[exp(−X)] over 10,000 iterations.]
Optimal importance function

The choice of g that minimizes§ the variance of the importance sampling estimator is

g*(x) = |h(x)| f(x) / ∫_X |h(x′)| f(x′) dx′ .

A rather formal optimality result, since the optimal choice of g*(x) requires the knowledge of I, the integral of interest!

§This choice ensures that, for a positive function h, ĥ_m = I, even for a single draw!
Self-Normalised Importance Sampling

Self-Normalized Importance Sampling (Hammersley & Handscomb, 1964)

In self-normalized (or Bayesian) importance sampling, expectations under the target pdf f,

E_f[h(X)] ,

are estimated as

h̄^(g)_m = ∑_{i=1}^m [ω_i / ∑_{j=1}^m ω_j] h(X_i) = [(1/m) ∑_{i=1}^m ω_i h(X_i)] / [(1/m) ∑_{j=1}^m ω_j] ,

where
I X_i ∼ iid g, where g is an instrumental pdf
I ω_i = (f/g)(X_i)

and W_i = ω_i / ∑_{j=1}^m ω_j denote the normalized weights.

This form of importance sampling does not require f to be properly normalized.
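A Python sketch, where the target is deliberately left unnormalised (exp(−x²/2), i.e. N(0, 1) without its constant) to illustrate that only f/g up to a constant is needed; the Uniform(−5, 5) instrumental is an arbitrary assumption of this example:

```python
import math
import random

random.seed(3)

def snis(h, log_f_unnorm, sampler, g_pdf, m=100_000):
    """Self-normalised IS: the target need only be known up to a constant."""
    num = den = 0.0
    for _ in range(m):
        x = sampler()
        w = math.exp(log_f_unnorm(x)) / g_pdf(x)  # unnormalised weight omega_i
        num += w * h(x)
        den += w
    return num / den

# unnormalised N(0, 1) target, Uniform(-5, 5) instrumental
est = snis(lambda x: x * x,
           lambda x: -0.5 * x * x,
           lambda: random.uniform(-5.0, 5.0),
           lambda x: 0.1)
print(est)  # approximates E[X^2] = 1 under N(0, 1)
```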
Performance of self-normalized IS

Assuming that E_f[(f/g)(X)(1 + h²(X))] < ∞, h̄^(g)_m is consistent and asymptotically normal, with asymptotic variance given by

υ(g) = E_f[(f/g)(X) (h(X) − E_f(h))²] .

The asymptotic variance can be estimated from the IS sample by

υ^(g)_m = m ∑_{i=1}^m (W_i)² {h(X_i) − h̄^(g)_m}² ,

where W_i = ω_i / ∑_{j=1}^m ω_j are the normalized weights.
Elements of proof

Consistency: If f̄ = f/c, where c = ∫ f(x)dx, is a pdf, then

E_g[ω_i h(X_i)] = E_g[(f/g)(X) h(X)] = c E_f̄[h(X)] .

CLT:

√m (h̄^(g)_m − E_f(h)) = [(1/√m) ∑_{i=1}^m ω_i (h(X_i) − E_f(h))] / [(1/m) ∑_{j=1}^m ω_j]

and var_g[ω_i (h(X_i) − E_f(h))] = c² E_f[(f/g)(X)(h(X) − E_f(h))²].

Empirical Estimate:

υ^(g)_m(h) = ∑_{i=1}^m W_i [(f/g)(X_i) / ((1/m) ∑_{j=1}^m (f/g)(X_j))] {h(X_i) − h̄^(g)_m}²
Empirical Diagnostic Tool for IS

Effective Sample Size

ESS^(g)_m = [∑_{i=1}^m (W_i)²]^{−1}

1. 1 ≤ ESS^(g)_m ≤ m

2. m/ESS^(g)_m is an estimate of E_f[(f/g)(X)], which is the maximal IS asymptotic variance for functions h such that |h(x) − E_f(h)| ≤ 1 (note: these functions have maximal Monte Carlo variance of 1 under f).

3. m/ESS^(g)_m − 1 is an estimator of the χ² divergence

∫ (f(x) − g(x))² / g(x) dx

(√(m/ESS^(g)_m − 1) is the empirical coefficient of variation of the weights).
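The effective sample size is a one-liner on the unnormalised weights; a Python sketch:

```python
def ess(weights):
    """ESS from unnormalised weights: (sum w)^2 / sum w^2, i.e. [sum W_i^2]^{-1}."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2

assert ess([1.0] * 100) == 100.0                       # equal weights: ESS = m
assert abs(ess([5.0, 5.0]) - ess([1.0, 1.0])) < 1e-12  # invariant to rescaling
assert ess([1.0, 0.0, 0.0]) == 1.0                     # one dominant weight: ESS -> 1
```

The two extreme cases above match property 1: ESS ranges from 1 (a single weight carries all the mass) to m (perfectly balanced weights).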
Optimal IS Instrumental Density

When considering large classes of functions h, IS generally appears to perform worse than Monte Carlo under f: e.g., for functions such that |h(x) − E_f(h)| ≤ 1, the maximal Monte Carlo asymptotic variance is 1 whereas the maximal IS asymptotic variance is E_f[(f/g)(X)], which is strictly larger than 1 by Jensen's inequality, unless g = f.

However, for a fixed function h, the optimal IS density is not f but

|h(x) − E_f(h)| f(x) / ∫ |h(x′) − E_f(h)| f(x′) dx′ ,

as the Cauchy–Schwarz inequality implies that

E_f[(f/g)(X) (h(X) − E_f(h))²] · E_f[(g/f)(X)] ≥ E_f²[|h(X) − E_f(h)|] ,

where E_f[(g/f)(X)] = 1.
Example (Bayesian Posterior)

Consider a simple model in which

θ ∼ N(1, 2²) ,
X|θ ∼ N(θ², σ²) ,

and apply IS to compute expectations under the posterior for X = 2, using the prior as instrumental pdf.
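A sketch of this example in Python (the model and the values x = 2, σ = 2 are taken from the slide; since the prior is the instrumental, the self-normalised weights reduce to the likelihood):

```python
import math
import random

random.seed(4)

def posterior_mean_and_ess(x=2.0, sigma=2.0, m=100_000):
    """theta ~ N(1, 2^2) prior as instrumental; weight = likelihood of x."""
    num = den = s2 = 0.0
    for _ in range(m):
        th = random.gauss(1.0, 2.0)                        # draw from the prior
        w = math.exp(-0.5 * ((x - th * th) / sigma) ** 2)  # N(theta^2, sigma^2) likelihood
        num += w * th
        den += w
        s2 += w * w
    return num / den, (den * den / s2) / m                 # SNIS mean, ESS/m

mean, ess_frac = posterior_mean_and_ess()
print(mean, ess_frac)  # the slide reports lim ESS/m ~ 0.68 for sigma = 2
```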
[Figure: instrumental (prior) and target (posterior) densities for two values of σ.]

σ = 2: lim ESS^(g)_m/m ≈ 0.68, υ(g)[h(θ) = θ] = 2.0, υ(f) = 1.4
σ = 0.5: lim ESS^(g)_m/m ≈ 0.19, υ(g)[h(θ) = θ] = 9.0, υ(f) = 1.7
IS suffers from the curse of dimensionality

As the dimension increases, the discrepancy between the importance and target distributions worsens.

To illustrate this issue, consider the simple model where both the target and the instrumental are d-dimensional product distributions:

f_d(x_{1:d}) = ∏_{i=1}^d f(x_i) ,  g_d(x_{1:d}) = ∏_{i=1}^d g(x_i) .

Then the unnormalized importance weights

ω = ∏_{i=1}^d (f/g)(X_i)

are a product of iid terms, each with expectation 1.
IS suffers from the curse of dimensionality (Contd.)

For a function of interest h that only depends on the first coordinate x_1, the asymptotic variance of the self-normalized IS approximation to E_f(h) is given by

υ_d(h) = ∫· · ·∫ (∏_{i=1}^d (f/g)(x_i))² (h(x_1) − E_f(h))² ∏_{i=1}^d g(x_i) dx_1 . . . dx_d

= (∫ (f/g)(x) f(x) dx)^{d−1} ∫ (f/g)(x_1) (h(x_1) − E_f(h))² f(x_1) dx_1 ,

where the first factor, ∫ (f/g)(x) f(x) dx, is strictly larger than 1 and hence grows geometrically with d.

In practice, this situation can usually be detected by monitoring the effective sample size criterion, which becomes abnormally small.
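The geometric degradation can be observed directly on the ESS; a Python sketch with the assumed toy choices f = N(0, 1) and g = N(0, 1.5²) per coordinate:

```python
import math
import random

random.seed(5)

def ess_fraction(d, m=5_000):
    """ESS/m for product weights with f = N(0,1) and g = N(0, 1.5^2) per coordinate."""
    log15 = math.log(1.5)
    ws = []
    for _ in range(m):
        logw = 0.0
        for _ in range(d):
            x = random.gauss(0.0, 1.5)
            # log f(x) - log g(x) for the two mean-zero normals
            logw += -0.5 * x * x + 0.5 * (x / 1.5) ** 2 + log15
        ws.append(math.exp(logw))
    s1 = sum(ws)
    s2 = sum(w * w for w in ws)
    return s1 * s1 / (s2 * m)

print(ess_fraction(1), ess_fraction(50))  # collapses as d grows
```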
Importance Sampling Scaling Example

Figure: Importance sampling (m = 1000 samples) for a multivariate Normal distribution truncated at ±4σ with uniform proposals (2D marginal for increasing values of d ∈ {2, 5, 10, 20}; the radius of each point increases with its importance weight; the circle shows the 99% region for the 2D marginal).

Here E_f[f(X)/g(X)] = 2.26
Example (Stochastic volatility model)

y_k = β exp(x_k/2) ε_k ,  ε_k ∼ N(0, 1) ,

with AR(1) log-variance process (or volatility)

x_{k+1} = φ x_k + σ u_k ,  u_k ∼ N(0, 1) .
[Figure: Evolution of IBM stocks over 500 days (corrected for trend and log-ratio-ed).]
Example (Stochastic volatility model (2))

The observed likelihood is unavailable in closed form. The joint posterior distribution of the hidden state sequence {X_k}_{1≤k≤n} can be evaluated explicitly as

∏_{k=2}^n exp[−{σ^{−2}(x_k − φx_{k−1})² + β^{−2} exp(−x_k) y_k² + x_k}/2] ,   (2)

up to a normalizing constant.
Computational problems

Example (Stochastic volatility model (3))

Direct simulation from this distribution is impossible because of

(a) the dependence among the X_k's,

(b) the dimension of the sequence {X_k}_{1≤k≤n}, and

(c) the exponential term exp(−x_k) y_k² within (2).
Importance sampling

Example (Stochastic volatility model (4))

We illustrate below the behavior of (self-normalized) importance sampling on trajectories simulated under the model with φ = 0.98, σ = 0.17 and β = 0.64, when proposing whole trajectories {X_k}_{1≤k≤n} from the AR(1) prior, using m = 1000 simulations.

The importance weight associated with x_1, . . . , x_n reduces to

∏_{k=1}^n exp[−{β^{−2} exp(−x_k) y_k² + x_k}/2] .
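A Python sketch of this experiment (the model parameters are from the slide; the trajectory length and starting the chain at 0 are assumptions of this sketch):

```python
import math
import random

random.seed(6)

phi, sigma, beta = 0.98, 0.17, 0.64   # slide values

def simulate(n):
    """Simulate observations y_1..y_n from the stochastic volatility model."""
    x, ys = 0.0, []
    for _ in range(n):
        x = phi * x + sigma * random.gauss(0.0, 1.0)
        ys.append(beta * math.exp(x / 2.0) * random.gauss(0.0, 1.0))
    return ys

def ess_prior_proposal(ys, m=1000):
    """Propose whole trajectories from the AR(1) prior; the log-weight is the
    log-likelihood sum_k -(beta^-2 exp(-x_k) y_k^2 + x_k)/2."""
    logws = []
    for _ in range(m):
        x, logw = 0.0, 0.0
        for y in ys:
            x = phi * x + sigma * random.gauss(0.0, 1.0)
            logw += -0.5 * (math.exp(-x) * y * y / beta ** 2 + x)
        logws.append(logw)
    mx = max(logws)                         # stabilise before exponentiating
    ws = [math.exp(lw - mx) for lw in logws]
    s1 = sum(ws)
    s2 = sum(w * w for w in ws)
    return s1 * s1 / s2

ys = simulate(50)
print(ess_prior_proposal(ys[:1]), ess_prior_proposal(ys))  # ESS degrades with n
```

The degeneracy with the trajectory length n mirrors the curse-of-dimensionality discussion above and is what the ESS figures that follow display.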
[Figure: Typical evolution of the ESS (500 independent replications of the particles, m = 1000) as a function of the time index, for a single observation trajectory of length 50.]
[Figure: Evolution of the ESS (500 independent simulations) as a function of the time index, when averaging over observation trajectories.]
Typical Evolution of the Weight Distribution

Figure: Histograms of the base 10 logarithm of the normalized importance weights for subtrajectories of length (from top to bottom) 1, 10 and 100 samples.
Acceleration Methods

Correlated simulations

Negative correlation reduces variance

A special technique, but efficient when it applies

Use two samples (X_1, . . . , X_m) and (Y_1, . . . , Y_m) from f to estimate

I = ∫_R h(x) f(x) dx

by

Î_1 = (1/m) ∑_{i=1}^m h(X_i)  and  Î_2 = (1/m) ∑_{i=1}^m h(Y_i) ,

each with mean I and variance σ²
Variance reduction

Variance of the average:

var((Î_1 + Î_2)/2) = σ²/2 + (1/2) cov(Î_1, Î_2) .

If the two samples are negatively correlated,

cov(Î_1, Î_2) ≤ 0 ,

they improve on two independent samples of the same size
Antithetic variables

◦ If f is symmetric about µ, take Y_i = 2µ − X_i

◦ If X_i = F^{−1}(U_i), take Y_i = F^{−1}(1 − U_i)

◦ If (A_i)_i is a partition of X, use partitioned sampling by sampling the X_j's in each A_i (requires knowing Pr(A_i))
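A Python sketch of the first antithetic rule (f = N(0, 1), symmetric about µ = 0, with the assumed test function h(x) = exp(x), which is monotone, so h(X) and h(−X) are negatively correlated):

```python
import math
import random

random.seed(7)

def antithetic_pairs(h, m=50_000):
    """Compare per-draw variances of h(X) and of the antithetic average
    (h(X) + h(-X))/2 for f = N(0, 1), symmetric about mu = 0."""
    plain, anti = [], []
    for _ in range(m):
        x = random.gauss(0.0, 1.0)
        plain.append(h(x))
        anti.append(0.5 * (h(x) + h(-x)))
    def var(v):
        mean = sum(v) / len(v)
        return sum((u - mean) ** 2 for u in v) / (len(v) - 1)
    return var(plain), var(anti)

v_plain, v_anti = antithetic_pairs(lambda x: math.exp(x))
print(v_plain, v_anti)  # the antithetic version has smaller variance
```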
Control variates

For

I = ∫ h(x) f(x) dx

unknown and

I_0 = ∫ h_0(x) f(x) dx

known, I_0 is estimated by Î_0 and I is estimated by Î
Control variates (2)

Combined estimator:

Î* = Î + β(Î_0 − I_0)

Î* is unbiased for I and

var(Î*) = var(Î) + β² var(Î_0) + 2β cov(Î, Î_0)
Optimal control

Optimal choice of β:

β⋆ = − cov(Î, Î_0) / var(Î_0) ,

with

var(Î⋆) = (1 − ρ²) var(Î) ,

where ρ is the correlation between Î and Î_0.

Usual solution: the regression coefficient of h(x_i) over h_0(x_i)
Example (Quantile Approximation)

Evaluate

ϱ = P(X > a) = ∫_a^∞ f(x) dx

by

ϱ̂ = (1/n) ∑_{i=1}^n I(X_i > a) ,

with X_i iid from f, assuming that the median µ (P(X > µ) = 1/2) is known.
Example (Quantile Approximation (2))

The control variate estimator

ϱ̂* = (1/n) ∑_{i=1}^n I(X_i > a) + β ((1/n) ∑_{i=1}^n I(X_i > µ) − P(X > µ))

improves upon ϱ̂ if

β < 0 and |β| < 2 cov(ϱ̂, ϱ̂_0)/var(ϱ̂_0) = 2 P(X > a)/P(X > µ) .
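A Python sketch of this control variate, for the assumed target f = N(0, 1), a = 1 (so ϱ ≈ 0.1587 and µ = 0) and the admissible choice β = −0.2:

```python
import random

random.seed(8)

def tail_prob(a=1.0, beta=-0.2, n=100_000):
    """Estimate P(X > a), X ~ N(0, 1), with I(X > mu) as control variate;
    the median mu = 0 is known, so P(X > mu) = 1/2.  Here
    |beta| = 0.2 < 2 P(X > 1)/P(X > 0) ~ 0.63, so the bound is satisfied."""
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    basic = sum(x > a for x in xs) / n
    ctrl = basic + beta * (sum(x > 0.0 for x in xs) / n - 0.5)
    return basic, ctrl

basic, ctrl = tail_prob()
print(basic, ctrl)  # both near P(X > 1) ~ 0.1587
```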
Integration by conditioning

Use the Rao–Blackwell Theorem:

var(E[δ(X)|Y]) ≤ var(δ(X))
Consequence

If Î is an unbiased estimator of I = E_f[h(X)], with X simulated from a joint density f(x, y), where

∫ f(x, y) dy = f(x) ,

the estimator

Î* = E_f[Î|Y_1, . . . , Y_n]

dominates Î(X_1, . . . , X_n) variance-wise (and is unbiased)
Example (Student's t expectation)

For

E[h(X)] = E[exp(−X²)] with X ∼ T(ν, 0, σ²) ,

a Student's t variable can be simulated as

X|y ∼ N(µ, σ²y) and Y^{−1} ∼ χ²_ν .
Example (Student's t expectation (2))

The empirical average

(1/m) ∑_{j=1}^m exp(−X_j²)

can be improved upon from the joint sample

((X_1, Y_1), . . . , (X_m, Y_m)) ,

since

(1/m) ∑_{j=1}^m E[exp(−X²)|Y_j] = (1/m) ∑_{j=1}^m 1/√(2σ²Y_j + 1)

is the conditional expectation. In this example, the precision is ten times better.
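A Python sketch comparing the two estimators; note that the mixture representation is normalised here as Y^{−1} ∼ χ²_ν/ν (an assumption of this sketch, chosen so that X is a standard T(ν, 0, σ²) variable):

```python
import math
import random

random.seed(9)

def student_t_expectation(nu=4.6, sigma=1.0, m=20_000):
    """Empirical average of exp(-X^2) vs. its Rao-Blackwellised version
    E[exp(-X^2)|Y] = 1/sqrt(2 sigma^2 Y + 1), using the scale mixture
    X|y ~ N(0, sigma^2 y) with 1/Y ~ chi^2_nu / nu (assumed normalisation)."""
    emp = rb = 0.0
    for _ in range(m):
        inv_y = random.gammavariate(nu / 2.0, 2.0 / nu)  # 1/Y ~ chi^2_nu / nu
        y = 1.0 / inv_y
        x = random.gauss(0.0, sigma * math.sqrt(y))
        emp += math.exp(-x * x)
        rb += 1.0 / math.sqrt(2.0 * sigma * sigma * y + 1.0)
    return emp / m, rb / m

print(student_t_expectation())  # the two estimates agree; the second is less variable
```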
[Figure: Estimators of E[exp(−X²)] over 10,000 iterations: empirical average (full) and conditional expectation (dotted) for (ν, µ, σ) = (4.6, 0, 1); both stabilize between 0.50 and 0.60.]
Application to Bayesian Model Choice

Bayesian model choice

Probabilise the entire model/parameter space:

I allocate probabilities p_i to all models M_i

I define priors π_i(θ_i) for each parameter space Θ_i

I compute

π(M_i|x) = p_i ∫_{Θ_i} f_i(x|θ_i) π_i(θ_i) dθ_i / ∑_j p_j ∫_{Θ_j} f_j(x|θ_j) π_j(θ_j) dθ_j

I take the largest π(M_i|x) to determine the "best" model
Bayes factor

Definition (Bayes factors)

For testing hypotheses H_0 : θ ∈ Θ_0 vs. H_a : θ ∉ Θ_0, under the prior

π(Θ_0)π_0(θ) + π(Θ_0^c)π_1(θ) ,

the central quantity is

B_01 = [π(Θ_0|x)/π(Θ_0^c|x)] / [π(Θ_0)/π(Θ_0^c)] = ∫_{Θ_0} f(x|θ)π_0(θ) dθ / ∫_{Θ_0^c} f(x|θ)π_1(θ) dθ

[Jeffreys, 1939]
Evidence

A similar quantity, the evidence

E_k = ∫_{Θ_k} π_k(θ_k) L_k(θ_k) dθ_k ,

a.k.a. the marginal likelihood. [Jeffreys, 1939]
Bayes factor approximation

When approximating the Bayes factor

B_01 = ∫_{Θ_0} f_0(x|θ_0)π_0(θ_0) dθ_0 / ∫_{Θ_1} f_1(x|θ_1)π_1(θ_1) dθ_1 ,

use importance functions ϖ_0 and ϖ_1 and

B̂_01 = [n_0^{−1} ∑_{i=1}^{n_0} f_0(x|θ_0^i)π_0(θ_0^i)/ϖ_0(θ_0^i)] / [n_1^{−1} ∑_{i=1}^{n_1} f_1(x|θ_1^i)π_1(θ_1^i)/ϖ_1(θ_1^i)]
Diabetes in Pima Indian women

Example (R benchmark)

"A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix (AZ), was tested for diabetes according to WHO criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases."

200 Pima Indian women with observed variables:

I plasma glucose concentration in an oral glucose tolerance test

I diastolic blood pressure

I diabetes pedigree function

I presence/absence of diabetes
Probit modelling on Pima Indian women

Probability of diabetes as a function of the above variables:

P(y = 1|x) = Φ(x_1β_1 + x_2β_2 + x_3β_3) .

Test of H_0 : β_3 = 0 for the 200 observations of Pima.tr, based on a g-prior modelling:

β ∼ N_3(0, n(X^T X)^{−1})
Importance sampling for the Pima Indian dataset

Use of an importance function inspired from the MLE estimate distribution,

β ∼ N(β̂, Σ̂)

R importance sampling code:

model1=summary(glm(y~-1+X1,family=binomial(link="probit")))
is1=rmvnorm(Niter,mean=model1$coeff[,1],sigma=2*model1$cov.unscaled)
is2=rmvnorm(Niter,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)
bfis=mean(exp(probitlpost(is1,y,X1)-dmvlnorm(is1,mean=model1$coeff[,1],
    sigma=2*model1$cov.unscaled))) /
  mean(exp(probitlpost(is2,y,X2)-
    dmvlnorm(is2,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)))
Diabetes in Pima Indian women

Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations from the prior and the above MLE importance sampler.

[Boxplots: basic Monte Carlo vs. importance sampling; the approximations range over roughly 2 to 5.]
Bridge sampling

Special case: If

π_1(θ|x) ∝ π̃_1(θ|x) and π_2(θ|x) ∝ π̃_2(θ|x)

live on the same space (Θ_1 = Θ_2), then

B̂_12 = (1/n) ∑_{i=1}^n π̃_1(θ_i|x)/π̃_2(θ_i|x) , with θ_i ∼ π_2(θ|x)

[Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]

This requires simulating from π_2(θ|x), which will be done (approximately) using MCMC.
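A Python sketch of this special-case estimator on two assumed unnormalised Gaussian densities with equal normalising constants, so that the true value is B_12 = Z_1/Z_2 = 1:

```python
import math
import random

random.seed(10)

# two unnormalised densities with equal constants Z1 = Z2 = sqrt(2 pi),
# so the true ratio of normalising constants is B12 = Z1/Z2 = 1
def pi1_t(x):
    return math.exp(-0.5 * x * x)            # unnormalised N(0, 1)

def pi2_t(x):
    return math.exp(-0.5 * (x - 0.5) ** 2)   # unnormalised N(0.5, 1)

m = 200_000
draws = (random.gauss(0.5, 1.0) for _ in range(m))   # theta_i ~ pi2
b12 = sum(pi1_t(x) / pi2_t(x) for x in draws) / m
print(b12)  # close to 1
```

Here exact samples from π_2 are available; in the Bayes-factor setting they would come from an MCMC run instead.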
Bridge sampling variance

The bridge sampling estimator does poorly if

var(B̂_12)/B_12² = (1/n) E_{π_2}[(π_1(θ)/π_2(θ))² − 1]
= (1/n) E_{π_2}[((π_1(θ) − π_2(θ))/π_2(θ))²] = (1/n) ∫ (π_1(θ) − π_2(θ))²/π_2(θ) dθ

is large, i.e. if the posteriors π_1 and π_2 have little overlap...
(Further) bridge sampling

General identity:

B_12 = ∫ π̃_1(θ|x) α(θ) π_2(θ|x) dθ / ∫ π̃_2(θ|x) α(θ) π_1(θ|x) dθ  ∀ α(·)

≈ [(1/n_2) ∑_{i=1}^{n_2} π̃_1(θ_{2i}|x) α(θ_{2i})] / [(1/n_1) ∑_{i=1}^{n_1} π̃_2(θ_{1i}|x) α(θ_{1i})] ,  θ_{ji} ∼ π_j(θ|x)
Optimal bridge sampling

The optimal choice of auxiliary function is

α⋆(θ) = (n_1 + n_2) / (n_1 π_1(θ|x) + n_2 π_2(θ|x)) ,

leading to

B̂_12 ≈ [(1/n_2) ∑_{i=1}^{n_2} π_1(θ_{2i}|x) / (n_1 π_1(θ_{2i}|x) + n_2 π_2(θ_{2i}|x))] / [(1/n_1) ∑_{i=1}^{n_1} π_2(θ_{1i}|x) / (n_1 π_1(θ_{1i}|x) + n_2 π_2(θ_{1i}|x))]
Optimal bridge sampling (2)

Reason:

Var(B̂_12)/B_12² ≈ (1/(n_1 n_2)) { ∫ π_1(θ)π_2(θ)[n_1π_1(θ) + n_2π_2(θ)] α(θ)² dθ / (∫ π_1(θ)π_2(θ)α(θ) dθ)² − 1 }

(by the δ method)

Drawback: the dependence on the unknown normalising constants is solved iteratively
Extension to varying dimensions

When dim(Θ_1) ≠ dim(Θ_2), e.g. θ_2 = (θ_1, ψ), introduce a pseudo-posterior density ω(ψ|θ_1, x), augmenting π_1(θ_1|x) into the joint distribution

π_1(θ_1|x) × ω(ψ|θ_1, x)

on Θ_2, so that

B_12 = ∫ π̃_1(θ_1|x) ω(ψ|θ_1, x) α(θ_1, ψ) π_2(θ_1, ψ|x) dθ_1 dψ / ∫ π̃_2(θ_1, ψ|x) α(θ_1, ψ) π_1(θ_1|x) ω(ψ|θ_1, x) dθ_1 dψ

= E_{π_2}[π̃_1(θ_1) ω(ψ|θ_1) / π̃_2(θ_1, ψ)] = E_ϕ[π̃_1(θ_1) ω(ψ|θ_1)/ϕ(θ_1, ψ)] / E_ϕ[π̃_2(θ_1, ψ)/ϕ(θ_1, ψ)]

for any conditional density ω(ψ|θ_1) and any joint density ϕ.
Illustration for the Pima Indian dataset

Use of the MLE induced conditional of β_3 given (β_1, β_2) as a pseudo-posterior, and a mixture of both MLE approximations on β_3 in the bridge sampling estimate.

R bridge sampling code:

cova=model2$cov.unscaled
expecta=model2$coeff[,1]
covw=cova[3,3]-t(cova[1:2,3])%*%ginv(cova[1:2,1:2])%*%cova[1:2,3]
probit1=hmprobit(Niter,y,X1)
probit2=hmprobit(Niter,y,X2)
pseudo=rnorm(Niter,meanw(probit1),sqrt(covw))
probit1p=cbind(probit1,pseudo)
bfbs=mean(exp(probitlpost(probit2[,1:2],y,X1)+
    dnorm(probit2[,3],meanw(probit2[,1:2]),sqrt(covw),log=T))/
  (dmvnorm(probit2,expecta,cova)+dnorm(probit2[,3],expecta[3],cova[3,3]))) /
  mean(exp(probitlpost(probit1p,y,X2))/
  (dmvnorm(probit1p,expecta,cova)+dnorm(pseudo,expecta[3],cova[3,3])))
Diabetes in Pima Indian women (cont'd)

Comparison of the variation of the Bayes factor approximations based on 100 × 20,000 simulations from the prior (MC), the above bridge sampler and the above importance sampler.
The original harmonic mean estimator

When θ_k^(t) ∼ π_k(θ|x),

(1/T) ∑_{t=1}^T 1/L_k(θ_k^(t))

is an unbiased estimator of 1/Z_k [Newton & Raftery, 1994]

Highly dangerous: most often leads to an infinite variance! (It results from an application of IS targeting the prior and using the posterior as instrumental density.)
Approximating Z_k from a posterior sample

Use of the [harmonic mean] identity

E_{π_k}[ϕ(θ_k)/(π_k(θ_k)L_k(θ_k)) | x] = ∫ [ϕ(θ_k)/(π_k(θ_k)L_k(θ_k))] [π_k(θ_k)L_k(θ_k)/Z_k] dθ_k = 1/Z_k ,

no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006]
Comparison with regular importance sampling

Harmonic mean: the constraint is opposed to the usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than π_k(θ_k)L_k(θ_k) for the approximation

Ẑ_1k = 1 / [(1/T) ∑_{t=1}^T ϕ(θ_k^(t)) / (π_k(θ_k^(t)) L_k(θ_k^(t)))]

to have a finite variance. E.g., use finite-support kernels (like Epanechnikov's kernel) for ϕ
Comparison with regular importance sampling (cont'd)

Compare Ẑ_1k with a standard importance sampling approximation

Ẑ_2k = (1/T) ∑_{t=1}^T π_k(θ_k^(t)) L_k(θ_k^(t)) / ϕ(θ_k^(t)) ,

where the θ_k^(t)'s are generated from the density ϕ(·) (with fatter tails, like t's)
HPD indicator as ϕ

Use the convex hull of the MCMC simulations corresponding to the 10% HPD region (easily derived!) and ϕ as the indicator

ϕ(θ) = (10/T) ∑_{t∈HPD} I_{d(θ, θ^(t)) ≤ ε}
Diabetes in Pima Indian women (cont'd)

Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations, for the above harmonic mean sampler and importance samplers.

[Boxplots: harmonic mean vs. importance sampling; both concentrate between roughly 3.102 and 3.116.]
Notions on Markov Chains for MCMC

Notions on Markov Chains for MCMC

Basics
Invariant Measures
Ergodicity: (1) The Discrete Case
Ergodicity: (2) The Continuous Case
Limit Theorems
Basics

A Markov chain is a (discrete-time) stochastic process such that the future evolution of the process is independent of its past given only its current value.

The distribution of the chain is defined through its transition kernel, a function K defined on X × B(X) such that

I ∀x ∈ X, K(x, ·) is a probability measure;

I ∀A ∈ B(X), K(·, A) is measurable.
• When X is a discrete (finite or countable) set, the transition kernel may be represented as a transition matrix K = (K_xy) whose elements satisfy

K_xy = P(X_n = y|X_{n−1} = x) ,  x, y ∈ X .

Since, for all x ∈ X, K(x, ·) is a probability, we must have

K_xy ≥ 0 and K(x, X) = ∑_{y∈X} K_xy = 1 .

The matrix K is then said to be a stochastic matrix.
• In the continuous case, if all probabilities K(x, ·) are absolutely continuous wrt. Lebesgue measure, the kernel can be represented by a transition density function k(x, x′):

P(X_n ∈ A|X_{n−1}) = ∫_A k(X_{n−1}, x′) dx′ .

The transition density function satisfies k(x, x′) ≥ 0 and ∫_X k(x, x′) dx′ = 1.
Kh

For functions h such that ∫_X K(x, dy)h(y) < ∞ for all x, define the function Kh by

Kh(x) = K(x, h) = ∫_X K(x, dy) h(y) .

Then Kh(X_n) = E(h(X_{n+1})|X_n).

µK

For probability measures µ, define the measure µK by

µK(A) = ∫_X µ(dx) K(x, A) .

µK corresponds to the law of X_{n+1}, had the chain been started at time n with distribution X_n ∼ µ.
Composition of Kernels

More generally, if K_1 and K_2 are two transition kernels,

K_1K_2(x, A) = ∫_{x′∈X} K_1(x, dx′) K_2(x′, A)

defines the law of X_{n+2} when the chain is started in X_n = x and two successive transitions according to, respectively, K_1 and K_2 are performed.
On a finite state-space X = {s_1, . . . , s_k},

I A function h is uniquely defined by the (column) vector h = (h(s_1), . . . , h(s_k))^T and

(Kh)_s = ∑_{s′∈X} K_{ss′} h_{s′}

can be interpreted as the sth component of the product of the transition matrix K and of the vector h.

I A probability distribution on X is defined as a (row) vector µ = (µ(s_1), . . . , µ(s_k)) and the probability distribution µK is defined, for each s ∈ X, as

(µK)_s = ∑_{s′∈X} µ_{s′} K_{s′s} ,

that is, the sth component of the product of the (row) vector µ and of the transition matrix K.
Markov Chain

Definition (Markov Chain)

(X_n)_{n≥0} forms a Markov chain if there exist a transition kernel K and an initial probability µ such that, for any n,

P_µ(X_0 ∈ A_0, X_1 ∈ A_1, . . . , X_n ∈ A_n) = ∫_{x_0∈A_0} · · · ∫_{x_{n−1}∈A_{n−1}} µ(dx_0) K(x_0, dx_1) · · · K(x_{n−1}, A_n) .

In particular,

P_µ(X_n ∈ A) = ∫_X · · · ∫_X µ(dx_0) K(x_0, dx_1) · · · K(x_{n−1}, A) ,

which is usually denoted more concisely by P_µ(X_n ∈ A) = µK^n(A).
Example (Normal Random walk)

X_n = X_{n−1} + τε_n ,

where (ε_n)_{n≥1} is an iid sequence of normal variables independent of X_0. Then k(x, x′) = N(x′; x, τ²I) and

P_µ(X_n ∈ A) = ∫_{x_0∈X} ∫_{x_n∈A} µ(dx_0) N(x_n; x_0, nτ²I) dx_n ,

where N(x_n; x_0, nτ²I) = k^n(x_0, x_n) is the n-step transition density.
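The n-step kernel k^n(x_0, ·) = N(·; x_0, nτ²I) can be checked by simulation; a one-dimensional Python sketch:

```python
import random

random.seed(11)

tau, n, m = 1.0, 25, 20_000
finals = []
for _ in range(m):
    x = 0.0
    for _ in range(n):
        x += tau * random.gauss(0.0, 1.0)   # one kernel step
    finals.append(x)

mean = sum(finals) / m
var = sum((x - mean) ** 2 for x in finals) / (m - 1)
print(mean, var)  # mean near 0, variance near n tau^2 = 25
```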
[Figure: A trajectory of length 100 of the random walk in R² with τ = 1.]
The Strong Markov Property

Let τ denote a random stopping time such that the event {τ = n} is σ(X_0, . . . , X_n)-measurable for all n. Assuming that τ is P_x-a.s. finite, we have

P_x(X_{τ+1} ∈ A_1, . . . , X_{τ+n} ∈ A_n|X_0, . . . , X_τ) = ∫_{x_1∈A_1} · · · ∫_{x_n∈A_n} K(X_τ, dx_1) · · · K(x_{n−1}, dx_n) .

Typical stopping times of interest for Markov chains include return times

τ_C = inf{n ≥ 1 : X_n ∈ C} .
Invariant Measures

Definition (Invariant Measure)

A measure π is invariant for the transition kernel K if πK = π, that is,

π(A) = ∫_X π(dx) K(x, A) ,  ∀A ∈ B(X) .

I If π is an invariant measure, so is cπ for any scale factor c > 0

I There can be several invariant measures (that differ by more than just a scale factor)

The latter can be avoided by assuming that the chain is irreducible.
Stability Properties and Invariance

Assuming that π is unique (up to the normalization issue), there are two very different cases:

Null chains: π is not a finite measure

Positive chains: π is a probability measure; in this case π is also called the stationary distribution, since X_0 ∼ π implies that X_n ∼ π for every n. More generally, the process is then stationary in the sense that

P_π(X_k ∈ A_0, X_{k+1} ∈ A_1, . . . , X_{k+n} ∈ A_n) = ∫_{x_0∈A_0} · · · ∫_{x_n∈A_n} π(dx_0) K(x_0, dx_1) · · · K(x_{n−1}, dx_n) ,

for any lag k ≥ 0¶.

¶This restriction can be removed, but the corresponding construction is omitted here.
Stationary Distribution in MCMC
In MCMC, we construct kernels K that admit the target πas invariant measure
Hence, if we can show that π is the unique invariant measure, wewill always deal with positive chains.
Reversible Chains

π-Reversibility

K is said to be π-reversible if π(dx)K(x, dx′) is a symmetric measure on X², that is,

∫_{x∈A} ∫_{x′∈A′} π(dx) K(x, dx′) = ∫_{x′∈A′} ∫_{x∈A} π(dx′) K(x′, dx) .

π-reversibility implies that K is π-invariant, as

∫_{x∈X} ∫_{x′∈A} π(dx) K(x, dx′) = ∫_{x′∈A} ∫_{x∈X} π(dx′) K(x′, dx) = ∫_{x′∈A} π(dx′) K(x′, X) = π(A) ,

using K(x′, X) = 1.
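Reversibility is easy to verify numerically; the following Python sketch builds a π-reversible kernel on a three-state space via Metropolis-type acceptance ratios (a construction detailed later in the course) and checks that πK = π:

```python
# Discrete check that pi-reversibility (detailed balance) implies invariance:
# pi_x K_xy = pi_y K_yx for all x, y entails pi K = pi.
pi = [0.2, 0.3, 0.5]

# Build a reversible kernel on {0, 1, 2}: propose one of the other two states
# uniformly, accept with probability min(1, pi_y/pi_x)
K = [[0.0] * 3 for _ in range(3)]
for x in range(3):
    for y in range(3):
        if x != y:
            K[x][y] = 0.5 * min(1.0, pi[y] / pi[x])
    K[x][x] = 1.0 - sum(K[x])

# detailed balance holds ...
for x in range(3):
    for y in range(3):
        assert abs(pi[x] * K[x][y] - pi[y] * K[y][x]) < 1e-12

# ... hence pi is invariant: (pi K)_y = sum_x pi_x K_xy = pi_y
piK = [sum(pi[x] * K[x][y] for x in range(3)) for y in range(3)]
print(piK)
```

Note that only the ratios π_y/π_x enter the construction, which is why the normalizing constant of π is never needed.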
π-Reversibility and Time Reversal

If K is π-reversible, the stationary chain may be run backwards in time to obtain a backward chain that has the same distribution as the (usual) forward chain:

P_π(X_{n−k} ∈ A_k, . . . , X_{n−1} ∈ A_1, X_n ∈ A_0) = ∫_{x_n∈A_0} · · · ∫_{x_{n−k}∈A_k} π(dx_n) K(x_n, dx_{n−1}) · · · K(x_{n−k+1}, dx_{n−k}) .
π-Reversible Chains in MCMC
MCMC algorithms are all built from elementary transitionkernels that are π-reversible
It is indeed very easy to check that a kernel is π-reversible (and itdoes not require that the normalizing constant of π be known)
Ergodicity

For MCMC, a minimum requirement is that the chain (X_n) be ergodic in the sense that

P_x(X_n ∈ A) → π(A) for all x ,

ensuring that, for large enough simulation runs, we are approximately simulating from π, for any choice of the starting point X_0.
Total Variation Norm

Distances between probability measures will be assessed through the following criterion‖.

Total Variation Norm: For probability measures µ_1 and µ_2,

‖µ_1 − µ_2‖TV = sup_A |µ_1(A) − µ_2(A)|

Other equivalent definitions:

(i) For a signed measure ξ on X, ‖ξ‖TV = (1/2){ξ⁺(X) + ξ⁻(X)}

(ii) ‖ξ‖TV = (1/2) sup{|ξ(h)| : ‖h‖∞ = 1}

‖An alternative convention is ‖µ_1 − µ_2‖TV = 2 sup_A |µ_1(A) − µ_2(A)|.
A First Step Towards Ergodicity

Proposition (6.52)

If K is π-invariant (with π a probability),

‖µK^n − π‖TV

is non-increasing in n, for any choice of the initial distribution µ.

A weak statement, which can be made stronger under restrictive assumptions: if there exists a probability ν such that K(x, A) ≥ εν(A) for all x and A, with ε > 0, the chain is uniformly ergodic in the sense that

‖µK^n − π‖TV ≤ (1 − ε)^n ‖µ − π‖TV .
Simple counter-examples show that additional conditions are required to ensure

Recurrence: infinitely many visits to all sets A such that π(A) > 0, from any starting point. Counter-example (reducible chain):

K = ( 1    0    0
      0  1−α   α
      0   α   1−α )

Aperiodicity: the chain does not cycle through disconnected regions of the state-space. Counter-example (periodic chain):

K = ( 0  1
      1  0 )
Ergodicity: (1) The Discrete Case

Irreducibility

Definition (Irreducibility)

In the discrete case, the chain is irreducible if all states communicate, in the sense that

P_x(τ_y < ∞) > 0 ,  ∀x, y ∈ X ,

τ_y being the first (strictly positive) time when y is visited (also called the return time).

This is equivalent to requiring that, for all x, y ∈ X, there exists n such that K^n(x, y) > 0.
Recurrence and Transience

Define by η_y the total number of visits of the chain to state y; then E_x[η_y] = ∑_{n≥0} K^n(x, y).

The Recurrence/Transience Dichotomy

An irreducible Markov chain is either

I transient, and

P_x(τ_y < ∞) < 1 ,  E_x[η_y] < ∞ for all x, y ∈ X ,

I or recurrent, and

P_x(τ_y < ∞) = 1 ,  E_x[η_y] = ∞ for all x, y ∈ X .
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (1) The Discrete Case
Recurrent Chain
Kac’s Theorem
A chain on a discrete state-space X that is both π-invariant (withπ a probability, such that π(x) > 0 for all x) and irreducible
I is recurrent
I and
Ex[τx] = 1/π(x) .
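Kac's theorem is easy to check numerically: simulate return times to a state and compare their empirical mean with 1/π(x). A minimal Python sketch, not from the course — the two-state kernel K below is an arbitrary illustrative choice whose stationary distribution is π = (4/7, 3/7):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-state kernel; its stationary distribution solves pi K = pi,
# which gives pi = (4/7, 3/7) here.
K = np.array([[0.7, 0.3],
              [0.4, 0.6]])

def mean_return_time(K, x, n_trials, rng):
    """Empirical estimate of E_x[tau_x], the mean return time to state x."""
    total = 0
    for _ in range(n_trials):
        state, t = x, 0
        while True:
            state = rng.choice(K.shape[0], p=K[state])
            t += 1
            if state == x:
                break
        total += t
    return total / n_trials

est = mean_return_time(K, 0, 20000, rng)
# Kac's theorem predicts E_0[tau_0] = 1/pi(0) = 7/4 = 1.75
```

The simulated mean return time should be close to 7/4, matching the theorem.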
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (1) The Discrete Case
Aperiodicity
The chain is said to be aperiodic if, for all states x, the greatestcommon divisor of the set {n ≥ 1 : Kn(x, x) > 0} is equal to 1.
Strong Aperiodicity (Sufficient condition for aperiodicity)
If the chain is irreducible, the existence of a state x such that
K(x, x) > 0
implies that the chain is strongly aperiodic.
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (1) The Discrete Case
Ergodic Theorem
An irreducible and aperiodic chain on a discrete state-space X with invariant probability π satisfies
‖Px(Xn ∈ ·)− π‖TV → 0 for all x ∈X
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (2) The Continuous Case
The General Case
For general chains the situation is more complex
I First, individual points of the state-space will generally not be reachable, and we must adapt our definitions of irreducibility, recurrence and aperiodicity to deal with measurable sets rather than points of the state-space.
I Second, we need a strengthened notion of recurrence.
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (2) The Continuous Case
Phi-Irreducibility
The chain is said to be ϕ-irreducible for some measure ϕ if
I for all x ∈X ,
I for all A ∈ B(X ) such that ϕ(A) > 0,
there exists n ≥ 1 such that
Kn(x,A) > 0 . (3)
If the chain is π-invariant, with π a probability measure (i.e., as in MCMC), one should choose ϕ = π, as π is a maximal irreducibility measure (it dominates all measures ϕ such that (3) holds).
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (2) The Continuous Case
Harris Recurrence
Unfortunately (and in contrast to the discrete case), ϕ-irreducibility in itself is not sufficient to ensure that Px(τA < ∞) = 1; hence we will assume a stronger form of recurrence.
Definition (Harris Recurrence)
A phi-irreducible chain is said to be Harris recurrent if for anyx ∈X and all sets A such that ϕ(A) > 0,
Px(τA <∞) = 1 .
Px(τA < ∞) = 1 for all x implies that Ex[ηA] = ∞, but the converse is not true.
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (2) The Continuous Case
Small Sets
To ensure (strong) aperiodicity, a sufficient condition is the existence of a ν-small set C such that ν(C) > 0.
Definition (Small Set)
If there exist C ∈ B(X ), with ϕ(C) > 0, a measure ν and ε > 0such that, for all x ∈ C and all A ∈ B(X ),
K(x,A) ≥ εν(A)
C is called a ν-small set.
Small sets play a role somewhat similar to points (or atoms) in thediscrete case.
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Ergodicity: (2) The Continuous Case
General Ergodicity
Theorem (6.51)
If the chain (Xn) is positive Harris recurrent and aperiodic, then
    lim_{n→∞} ‖Kn(x, ·) − π‖TV = 0   for all x ∈ X .
Convergence in total variation implies

    lim_{n→∞} Ex[h(Xn)] = Eπ[h]

for every bounded function h (and, hence, convergence in distribution).
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Limit Theorems
Limit Theorems for MCMC

I The ergodic theorem (and particularly quantitative refinements of it not discussed here) is important to guarantee that the algorithm is forgetting its initial (arbitrary) state.
I However, the usual practice in MCMC is not only to exploit a single sampled point Xn for n sufficiently large, but to rely on empirical trajectory-wise averages

    (1/n) Σ_{k=1}^n h(Xk)

to estimate the expectation of functions of interest h under π.
I Hence, we also need LLNs and CLTs.

Classical LLNs and CLTs are not directly applicable due to:

◦ Markovian dependence structure
◦ Non-stationarity of the chain
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Limit Theorems
LLN for Markov Chains
Theorem (6.63)
If the Markov chain (Xn) is Harris recurrent with stationarydistribution π, then for any function h such that Eπ |h| <∞,
    lim_{n→∞} (1/n) Σ_{k=1}^n h(Xk) = Eπ[h] .
Markov Chain Monte Carlo Methods
Notions on Markov Chains for MCMC
Limit Theorems
CLT for Markov Chains
Theorem (6.65)
For a Harris recurrent π-reversible Markov chain (Xn), let h̄ = h − Eπ[h] and assume that

    0 < γ²_h = Eπ[h̄²(X0)] + 2 Σ_{k=1}^∞ Eπ[h̄(X0) h̄(Xk)] < ∞ .

Then

    (1/√n) Σ_{k=1}^n (h(Xk) − Eπ[h])  →L  N(0, γ²_h) .

[Kipnis & Varadhan, 1986]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
Motivation and leading example
Random variable generation
Monte Carlo Integration
Notions on Markov Chains for MCMC
The Metropolis-Hastings AlgorithmMonte Carlo Methods based on Markov ChainsThe Metropolis-Hastings AlgorithmThe Independent Metropolis-Hastings AlgorithmThe Random Walk Metropolis-Hastings AlgorithmExtensions
The Gibbs Sampler
Case Study
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains
It is not necessary to use an i.i.d. sample from the distribution f to approximate the integral

    I = ∫ h(x) f(x) dx .

One can approximate I from a simulation of an ergodic Markov chain with stationary distribution f.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (2)
Idea
For an arbitrary starting value x(0), an ergodic chain (X(t)) isgenerated using a transition kernel with stationary distribution f
I Ensures the convergence in distribution of (X(t)) to a random variable from f.
I For a "large enough" T0, X(T0) can be considered as distributed from f.
I Produces a dependent sample X(t), X(t+1), . . ., which can be used to approximate integrals under f.
Problem: How can one build a Markov chain with a givenstationary distribution?
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
Basics

The algorithm requires knowledge of the objective (target) density f and uses a conditional density q(y|x), called the instrumental (or proposal) distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
The MH algorithm
Algorithm (Metropolis-Hastings)
Given X(t),
1. Generate Yt ∼ q(y|X(t)).
2. Take

    X(t+1) = { Yt    with prob. ρ(X(t), Yt),
             { X(t)  with prob. 1 − ρ(X(t), Yt),

where

    ρ(x, y) = min{ [f(y)/f(x)] · [q(x|y)/q(y|x)], 1 } .
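The two steps above translate directly into code. A minimal Python (NumPy) sketch of the generic algorithm, instantiated here on a standard normal target with a symmetric Gaussian random-walk proposal — the helper names and numerical settings are mine, not from the course:

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_hastings(log_f, propose, log_q, x0, n_iter, rng):
    """Generic MH: log_f is the (unnormalized) log target, propose(x) draws
    from q(.|x), and log_q(a, b) returns log q(a|b)."""
    x = x0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = propose(x)
        # log rho(x, y) = min{0, log f(y) - log f(x) + log q(x|y) - log q(y|x)}
        log_rho = min(0.0, log_f(y) - log_f(x) + log_q(x, y) - log_q(y, x))
        if np.log(rng.uniform()) < log_rho:
            x = y
        chain[t] = x
    return chain

# Instance: N(0,1) target, Gaussian random-walk proposal (symmetric,
# so the q-ratio cancels and log_q can be taken constant).
log_f = lambda x: -0.5 * x**2
propose = lambda x: x + rng.normal(scale=1.0)
log_q = lambda a, b: 0.0
chain = metropolis_hastings(log_f, propose, log_q, 0.0, 20000, rng)
```

Working on the log scale avoids overflow when f is evaluated far in the tails; the q-ratio term only matters for asymmetric proposals.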
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
Features
I Does not require the knowledge of normalizing constants of either f or q(·|x)
I Never moves to values for which f(y) = 0
I The chain (X(t)) may take the same value several times in a row, even though f is a density w.r.t. Lebesgue measure
I The sequence (Yt) is usually not a Markov chain
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
Convergence properties
1. The MH Markov chain is reversible, with invariant/stationarydensity f since it satisfies the detailed balance condition
f(y)dyK(y, dx) = f(x)dxK(x, dy)
2. As f is a probability measure, the chain is positive recurrent
3. If

    P[ f(Yt) q(X(t)|Yt) / {f(X(t)) q(Yt|X(t))} ≥ 1 ] < 1 ,

that is, the event {X(t+1) = X(t)} is possible, then the chain is aperiodic
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
Convergence properties (2)
4. If q(y|x) > 0 for every (x, y), the chain is f-irreducible
5. For MH, f -irreducibility implies Harris recurrence
6. Thus, for MH chains satisfying 3 and 4:
(i) for h with Ef|h(X)| < ∞,

    lim_{T→∞} (1/T) Σ_{t=1}^T h(X(t)) = ∫ h(x) f(x) dx    Px(0)-a.s.

(ii) and

    lim_{n→∞} ‖Kn(x(0), ·) − f‖TV = 0

for every initial starting point x(0).
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
The Independent Case
The instrumental distribution q does not depend on X(t), and isdenoted g by analogy with Accept-Reject.
Algorithm (Independent Metropolis-Hastings)
Given X(t),
a. Generate Yt ∼ g(y)
b. Take

    X(t+1) = { Yt    with prob. min{ f(Yt) g(X(t)) / [f(X(t)) g(Yt)], 1 },
             { X(t)  otherwise.
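A minimal Python sketch of the independent sampler, not from the course: the target is N(0,1) and the proposal g is a Student-t with 3 degrees of freedom, chosen so that g has heavier tails than f and the dominance condition f(x) ≤ M g(x) holds:

```python
import numpy as np

rng = np.random.default_rng(2)

# Target f = N(0,1); proposal g = Student-t(3), both in log scale up to constants.
log_f = lambda x: -0.5 * x**2
log_g = lambda x: -2.0 * np.log1p(x**2 / 3.0)  # t_3 density ∝ (1 + x²/3)^{-2}

x, chain = 0.0, np.empty(20000)
for t in range(chain.size):
    y = rng.standard_t(3)                         # independent proposal
    # acceptance ratio f(y)g(x) / [f(x)g(y)], capped at 1, on the log scale
    log_rho = min(0.0, (log_f(y) + log_g(x)) - (log_f(x) + log_g(y)))
    if np.log(rng.uniform()) < log_rho:
        x = y
    chain[t] = x
```

Swapping the roles (Cauchy target, normal proposal, as in the example below) would break the dominance condition and ruin the tail behaviour.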
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
Properties

The resulting sample is not i.i.d., but there exist strong convergence properties:
Theorem (Ergodicity)
The algorithm produces a uniformly ergodic chain if there exists aconstant M such that
f(x) ≤Mg(x) ,
In this case,
    ‖Kn(x, ·) − f‖TV ≤ (1 − 1/M)^n .
[Mengersen & Tweedie, 1996]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
Example (Noisy AR(1))
Hidden Markov chain from a regular AR(1) model,

    x_{t+1} = ϕ x_t + ε_{t+1},   ε_t ∼ N(0, τ²),

and observables

    y_t | x_t ∼ N(x_t², σ²) .

The distribution of x_t given x_{t−1}, x_{t+1} and y_t is proportional to

    exp − (1/2τ²) { (x_t − ϕ x_{t−1})² + (x_{t+1} − ϕ x_t)² + (τ²/σ²)(y_t − x_t²)² } .
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
Example (Noisy AR(1) Contd.)
Use for proposal the N(µ_t, ω_t²) distribution, with

    µ_t = ϕ(x_{t−1} + x_{t+1}) / (1 + ϕ²)   and   ω_t² = τ² / (1 + ϕ²) .

The ratio

    π(x)/q_ind(x) = exp −(y_t − x_t²)²/2σ²

is bounded.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
(top) Last 500 samples of the chain (X(t)) run for 10,000 iterations; (bottom) histogram of the chain, compared with the target distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
Example (Cauchy by normal)
Given a Cauchy C(0, 1) target, consider a normal N(0, 1) proposal. The Metropolis–Hastings acceptance ratio is

    [π(ξ′)/ν(ξ′)] / [π(ξ)/ν(ξ)] = exp[{ξ² − (ξ′)²}/2] · (1 + (ξ′)²)/(1 + ξ²) .

Poor performance: the proposal distribution has lighter tails than the target Cauchy, and convergence to the stationary distribution is not even geometric!
[Mengersen & Tweedie, 1996]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm

Left: histogram of the Markov chain (ξ(t))_{1≤t≤5000} against the target C(0, 1) distribution. Right: range and average of 1000 parallel runs initialized with a normal N(0, 100²) distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
Performance of the Algorithm
Average Acceptance Rate
If f(x) ≤ M g(x), the average acceptance rate (under stationarity) satisfies

    ρ̄ = E_{f⊗g}[ min{ f(Y) g(X) / [f(X) g(Y)], 1 } ]
      = 2 P_{f⊗g}( f(Y)/g(Y) ≥ f(X)/g(X) )
      ≥ (2/M) P_{f⊗f}( f(Y)/g(Y) ≥ f(X)/g(X) ) = 1/M .
Hence, maximizing the acceptance rate is related to theoptimization of the speed of convergence in the ergodic theorem.
When different proposals g are available, the one that leads to thehighest acceptance rate should be used.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Independent Metropolis-Hastings Algorithm
Practical approach

Choose a parameterized instrumental distribution g(·|θ) and adjust the corresponding parameters θ based on the empirical estimate of the average acceptance rate computed on pilot runs∗∗

    ρ̄(θ) = (2/m) Σ_{t=1}^m I{ f(Yt) g(X(t)|θ) > f(X(t)) g(Yt|θ) } .

∗∗Trying to modify θ continuously as the algorithm runs would result in an adaptive MCMC approach, whose convergence is harder to establish (see later).
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Random walk Metropolis–Hastings
Use of a local perturbation as proposal
Yt = X(t) + εt,
where εt ∼ g is independent of X(t).
The instrumental density is now of the form g(y − x), and the Markov chain is a random walk if we take g to be symmetric: g(x) = g(−x).
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Algorithm (Random walk Metropolis-Hastings)
Given X(t)
1. Generate Yt ∼ g(y − X(t))
2. Take

    X(t+1) = { Yt    with prob. min{ 1, f(Yt)/f(X(t)) },
             { X(t)  otherwise.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Example (Random walk and normal target)
Generate N(0, 1) based on the uniform proposal on [−δ, δ] [Hastings, 1970].

The probability of acceptance is then

    ρ(x(t), yt) = exp{ (x(t)² − yt²)/2 } ∧ 1 .
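This historical example is short enough to run as-is. A minimal Python sketch of Hastings' sampler (the seed and δ = 1 are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

delta = 1.0
x, chain = 0.0, np.empty(20000)
for t in range(chain.size):
    y = x + rng.uniform(-delta, delta)      # uniform random-walk proposal
    # log acceptance probability: (x^2 - y^2)/2, implicitly capped at 0
    if np.log(rng.uniform()) < (x**2 - y**2) / 2:
        x = y
    chain[t] = x
```

Re-running with δ = 0.1 or δ = 0.5 reproduces the effect illustrated in the figure below: smaller δ means higher acceptance but much slower exploration.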
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Three samples based on U[−δ, δ] with (a) δ = 0.1, (b) δ = 0.5 and (c) δ = 1.0, superimposed with the convergence of the means (15,000 simulations).
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Example (Mixture models (again!))
    π(θ|x) ∝ ∏_{j=1}^n ( Σ_{ℓ=1}^k p_ℓ f(x_j | µ_ℓ, σ_ℓ) ) π(θ)

Metropolis-Hastings proposal:

    θ(t+1) = { θ(t) + ωε(t)   if u(t) < ρ(t)
             { θ(t)           otherwise

where

    ρ(t) = π(θ(t) + ωε(t) | x) / π(θ(t) | x) ∧ 1

and ω is scaled for a good acceptance rate.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Random walk sampling (50,000 iterations), case of a 3-component normal mixture: pairwise plots and marginal histograms of the parameters (p, τ, θ).
[Celeux & al., 2000]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Convergence properties
Uniform ergodicity is prohibited by the random walk structure. At best, geometric ergodicity:
Theorem (Sufficient ergodicity)
For a symmetric density f , log-concave in the tails, and a positiveand symmetric density g, the chain (X(t)) is geometrically ergodic.
[Mengersen & Tweedie, 1996]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Example (Cauchy by normal continued)
Again, Cauchy C(0, 1) target and Gaussian random walk proposal, ξ′ ∼ N(ξ, σ²), with acceptance probability

    (1 + ξ²)/(1 + (ξ′)²) ∧ 1 .

Overall fit of the Cauchy density by the histogram is satisfactory, but exploration of the tails is poor: the 99% quantile of C(0, 1) equals 3, yet no simulation out of 10,000 exceeds 14!
[Roberts & Tweedie, 2004]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Again, lack of geometric ergodicity! [Mengersen & Tweedie, 1996]
Slow convergence shown by the non-stable range after 10,000 iterations.
Histogram of the 10,000 first steps of a random walk Metropolis–Hastings algorithm using a N(ξ, 1) proposal.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Range of 500 parallel runs for the same setup.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Further convergence properties
Under the assumptions:

I (A1) f is super-exponential, i.e. it is positive with positive continuous first derivative such that lim_{|x|→∞} n(x)′∇ log f(x) = −∞, where n(x) := x/|x|.
In words: exponential decay of f in every direction, with rate tending to ∞.
I (A2) lim sup_{|x|→∞} n(x)′m(x) < 0, where m(x) = ∇f(x)/|∇f(x)|.
In words: non-degeneracy of the contour manifold C_{f(x)} = {y : f(y) = f(x)}.

the chain is geometrically ergodic, and V(x) ∝ f(x)^{−1/2} verifies the drift condition.
[Jarner & Hansen, 2000]
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
How to Choose the Scale of the Proposal in RW-MH? Try yourself at http://www.lbreyer.com/classic.html
From (Roberts & Rosenthal, 2001)
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Performance Tradeoff
A high acceptance rate does not indicate that the algorithm is exploring f correctly: it typically means that the random walk is moving too slowly on the surface of f.
If x(t) and yt are close, i.e. f(x(t)) ≃ f(yt), then yt is accepted with probability

    min( f(yt)/f(x(t)), 1 ) ≃ 1 .
For multimodal densities with well separated modes, the negativeeffect of limited moves on the surface of f clearly shows.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Performance Tradeoff (2)
If the average acceptance rate is very low, the algorithm is oftenproposing values of Yt for which f(Yt) is small compared tof(X(t)): the random walk tries to move too quickly and theresulting MH sampler is often “stuck”.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Example (Noisy AR(1) continued)
For a Gaussian random walk with scale ω small enough, the random walk never jumps to the other mode. But if the scale ω is sufficiently large, the Markov chain explores both modes and gives a satisfactory approximation of the target distribution.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Markov chain based on a random walk with scale ω = .1.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Markov chain based on a random walk with scale ω = .5.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
How Does the Method Scale in Large Dimension?
The Scaling Construction (Roberts et al., 1997-2001)
Assume a d-dimensional target of product form f(d)(x1, . . . , xd) = ∏_{k=1}^d f(xk) and an isotropic (d-dimensional) Gaussian proposal distribution with variance σ²(d). Now let the dimension d tend to infinity:

I the acceptance rate does not degenerate only if σ²(d) = O(1/d)

And for a coordinate projection:

I the optimal choice is σ²(d) = 2.38²/d × varπ(X), which corresponds to an acceptance rate of 23%
I the asymptotic variance of the ergodic average is α d × varπ(X) (for a Gaussian target, α ≈ 3.7)
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
RW-MH Scaling Example
First 1,000 samples of the RW-MH algorithm started from the mode for a multivariate Gaussian distribution with optimal proposal scaling (arbitrary 2D marginal for increasing values of d: 2, 5, 10, 20).
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
The Geometry of the Target is Also Very Important
2D Gaussian RW-MH (Gaussian target) with acceptance rate 50% (left: σ²prop = 0.04 I2; right: σ²prop = 1.4 Σπ). Number of iterations: 1, 2, 3, 4, 5, 10, 25, 50, 100, 500.
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
The Random Walk Metropolis-Hastings Algorithm
Rule of thumb
Aim at an average acceptance rate of 25% when using therandom-walk MH algorithm.
This rule is to be taken with a pinch of salt!
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Extensions
There are many other families of MH algorithms
◦ Adaptive Rejection Metropolis Sampling
◦ Reversible Jump (later!)
◦ Langevin algorithms
to name just a few...
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Langevin Algorithms
Proposal based on the Langevin diffusion Lt, defined by the stochastic differential equation

    dLt = dBt + (1/2)∇ log f(Lt) dt,

where Bt is the standard Brownian motion.
Theorem
The Langevin diffusion is the only non-explosive diffusion which isreversible with respect to f .
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
Discretization
Instead, consider the sequence

    x(t+1) = x(t) + (σ²/2)∇ log f(x(t)) + σεt,   εt ∼ Np(0, Ip),

where σ² corresponds to the discretization step.

Unfortunately, the discretized chain may be transient, for instance when

    lim_{x→±∞} | σ²∇ log f(x) |x|^{−1} | > 1 .
Markov Chain Monte Carlo Methods
The Metropolis-Hastings Algorithm
Extensions
MH correction
Accept the new value Yt with probability

    f(Yt)/f(X(t)) · exp{ −‖Yt − X(t) − (σ²/2)∇ log f(X(t))‖² / 2σ² } / exp{ −‖X(t) − Yt − (σ²/2)∇ log f(Yt)‖² / 2σ² } ∧ 1 .

Choice of the scale factor σ

The scaling approach suggests an acceptance rate of 58% to achieve optimal convergence rates.
[Roberts & Rosenthal, 1998]
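The discretized Langevin proposal plus the MH correction above is what is usually called MALA. A minimal Python sketch, not from the course, on a N(0,1) target (so ∇ log f(x) = −x; the step size σ = 1 is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)

log_f = lambda z: -0.5 * z**2      # log target (N(0,1), up to a constant)
grad = lambda z: -z                # gradient of log target

def mala_step(x, sigma, rng):
    """One Metropolis-adjusted Langevin step."""
    mean_fwd = x + 0.5 * sigma**2 * grad(x)      # drifted proposal mean at x
    y = mean_fwd + sigma * rng.normal()
    mean_bwd = y + 0.5 * sigma**2 * grad(y)      # reverse-move proposal mean
    # log of q(x|y)/q(y|x) for the two Gaussian proposal densities
    log_q_ratio = (-(x - mean_bwd)**2 + (y - mean_fwd)**2) / (2 * sigma**2)
    log_rho = min(0.0, log_f(y) - log_f(x) + log_q_ratio)
    return y if np.log(rng.uniform()) < log_rho else x

x, chain = 0.0, np.empty(20000)
for t in range(chain.size):
    x = mala_step(x, 1.0, rng)
    chain[t] = x
```

The `log_q_ratio` term is exactly the exponential ratio in the acceptance probability above, written on the log scale.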
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Gibbs Sampler
The Gibbs Sampler
The Hammersley-Clifford theorem
General Principles
Convergence
Data Augmentation
The Slice Sampler
Rao-Blackwellization
Marginalized Gibbs Sampling
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Hammersley-Clifford theorem
Hammersley-Clifford theorem
An illustration that conditionals determine the joint distribution
Theorem
If the joint density g(y1, y2) has conditional distributions g1(y1|y2)and g2(y2|y1), then
    g(y1, y2) = g2(y2|y1) / ∫ [g2(v|y1)/g1(y1|v)] dv .
[Hammersley & Clifford, circa 1970]
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Hammersley-Clifford theorem
Positivity condition
g(i)(yi) > 0 for every i = 1, · · · , p, implies that g(y1, . . . , yp) > 0, where g(i) denotes the marginal distribution of Yi.

This implies that the support of g is the Cartesian product of the supports of the marginals, and that the conditionals do not restrict the possible range of Yi.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Hammersley-Clifford theorem
General HC decomposition
Under the positivity condition, the joint distribution g satisfies
    g(y1, . . . , yp) ∝ ∏_{j=1}^p  g_{ℓj}(y_{ℓj} | y_{ℓ1}, . . . , y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, . . . , y′_{ℓp}) / g_{ℓj}(y′_{ℓj} | y_{ℓ1}, . . . , y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, . . . , y′_{ℓp})
for every permutation ` of {1, 2, . . . , p} and every y′ ∈ Y .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
General Principles
A very specific simulation algorithm based on the structure of the target distribution f:

1. Uses the conditional densities f1, . . . , fp of f
2. Starts with the random variable X = (X1, . . . , Xp)
3. Simulates from the conditional densities,

    Xi | x1, x2, . . . , x_{i−1}, x_{i+1}, . . . , xp ∼ fi(xi | x1, x2, . . . , x_{i−1}, x_{i+1}, . . . , xp)

for i = 1, 2, . . . , p.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Algorithm (Gibbs sampler)
Given X(t) = (x1(t), . . . , xp(t)), generate

1. X1(t+1) ∼ f1(x1 | x2(t), . . . , xp(t));
2. X2(t+1) ∼ f2(x2 | x1(t+1), x3(t), . . . , xp(t));
. . .
p. Xp(t+1) ∼ fp(xp | x1(t+1), . . . , x_{p−1}(t+1)).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Properties
The full conditional densities f1, . . . , fp are the only densities used for simulation. Thus, even in a high-dimensional problem, all of the simulations may be univariate.
The Gibbs sampler is not reversible with respect to f . However,each of its p coordinate-wise updates is.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Random Scan Gibbs sampler
Modification of the above Gibbs sampler where, with probability 1/p, the i-th component is drawn from fi(xi|X−i), i.e. the components are chosen at random.
Motivation
The Random Scan Gibbs sampler is f -reversible.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example (Bivariate Gibbs sampler)
For a target of the form

    (X, Y) ∼ f(x, y)

Set X(0) = x(0). For t = 1, 2, . . . , generate

    Y(t) ∼ fY|X(· | X(t−1))
    X(t) ∼ fX|Y(· | Y(t))

where fY|X and fX|Y are the conditional distributions.
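A minimal Python sketch of this bivariate scheme, not from the course: the target is a bivariate normal with unit variances and correlation ρ, for which both full conditionals are Gaussian (X | y ∼ N(ρy, 1 − ρ²) and symmetrically for Y); ρ = 0.7 and the seed are my illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)

rho = 0.7
sd = np.sqrt(1 - rho**2)   # conditional standard deviation

x = 0.0
xs = np.empty(20000)
for t in range(xs.size):
    y = rng.normal(rho * x, sd)   # Y(t) ~ f_{Y|X}(. | X(t-1))
    x = rng.normal(rho * y, sd)   # X(t) ~ f_{X|Y}(. | Y(t))
    xs[t] = x
```

Each marginal of the chain is N(0, 1) at stationarity, which the sample mean and variance of `xs` should reflect.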
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
A Very Simple Example: Independent N(µ, σ²) Observations
When Y1, . . . , Yn ∼ i.i.d. N(y|µ, σ²) with both µ and σ unknown, the posterior in (µ, σ²) does not belong to a standard family (even when choosing conjugate priors).

But...

    µ | Y1:n, σ² ∼ N( µ | (1/n) Σ_{i=1}^n Yi , σ²/n )
    σ² | Y1:n, µ ∼ IG( σ² | n/2 − 1 , (1/2) Σ_{i=1}^n (Yi − µ)² )

assuming constant (improper) priors on both µ and σ².

I Hence we may use the Gibbs sampler for simulating from the posterior of (µ, σ²).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
R Gibbs Sampler for Gaussian posterior
n <- length(Y); S <- sum(Y); mu <- S/n
for (i in 1:500) {
  S2     <- sum((Y - mu)^2)
  sigma2 <- 1/rgamma(1, shape = n/2 - 1, rate = S2/2)
  mu     <- S/n + sqrt(sigma2/n)*rnorm(1)
}
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Example of results with, left, n = 10 observations and, right, n = 100 observations from the N(0, 1) distribution.
Scatter plot in the (µ, σ²) plane; number of iterations: 1, 2, 3, 4, 5, 10, 25, 50, 100, 500.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
General Principles
Limitations of the Gibbs sampler
The Gibbs sampler
1. limits the choice of instrumental distributions
2. requires some knowledge of f
3. does not apply to problems where the number of parametersvaries.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Properties of the Gibbs sampler
Theorem (Convergence)
Under the positivity condition

I f(i)(yi) > 0 for every i = 1, · · · , p, implies that f(y1, . . . , yp) > 0, where f(i) denotes the marginal distribution of Yi under f,
the chain is irreducible and positive Harris recurrent.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Convergence
Properties of the Gibbs sampler (2)
Consequences
(i) If ∫ h(y) f(y) dy < ∞, then

    lim_{T→∞} (1/T) Σ_{t=1}^T h(Y(t)) = ∫ h(y) f(y) dy    a.s.

(ii) If, in addition, (Y(t)) is aperiodic, then

    lim_{n→∞} ‖Kn(y(0), ·) − f‖TV = 0

for every initial starting point y(0).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Latent variables are back
The Gibbs sampler is particularly adapted for cases where thetarget probability is defined on an augmented space.
A density g is a completion of f if

    ∫_Z g(x, z) dz = f(x)
Note
The variable z may be meaningless for the problem
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Purpose
g should have full conditionals that are easy to simulate for aGibbs sampler to be implemented with g rather than f
For p > 1, write y = (x, z) and denote the conditional densities ofg(y) = g(y1, . . . , yp) by
Y1|y2, . . . , yp ∼ g1(y1|y2, . . . , yp),Y2|y1, y3, . . . , yp ∼ g2(y2|y1, y3, . . . , yp),
. . . ,
Yp|y1, . . . , yp−1 ∼ gp(yp|y1, . . . , yp−1).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
The move from Y (t) to Y (t+1) is defined as follows:
Algorithm (Completion Gibbs sampler)
Given Y(t) = (y1(t), . . . , yp(t)), simulate

1. Y1(t+1) ∼ g1(y1 | y2(t), . . . , yp(t)),
2. Y2(t+1) ∼ g2(y2 | y1(t+1), y3(t), . . . , yp(t)),
. . .
p. Yp(t+1) ∼ gp(yp | y1(t+1), . . . , y_{p−1}(t+1)).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Example (Mixtures all over again)
Hierarchical missing data structure: if

    X1, . . . , Xn ∼ Σ_{i=1}^k pi f(x|θi),

then

    X | Z ∼ f(x|θZ),   Z ∼ p1 I(z = 1) + . . . + pk I(z = k),

where Z is the component indicator associated with observation x.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Example (Mixtures (2))
Conditionally on (Z1, . . . , Zn) = (z1, . . . , zn) :
    π(p1, . . . , pk, θ1, . . . , θk | x1, . . . , xn, z1, . . . , zn)
      ∝ p1^{α1+n1−1} · · · pk^{αk+nk−1}
        × π(θ1 | y1 + n1 x̄1, λ1 + n1) · · · π(θk | yk + nk x̄k, λk + nk),

with

    ni = Σ_j I(zj = i)   and   x̄i = Σ_{j: zj=i} xj / ni .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Data Augmentation
Algorithm (Mixture Gibbs sampler)
1. Simulate

    θi ∼ π(θi | yi + ni x̄i, λi + ni)   (i = 1, . . . , k)
    (p1, . . . , pk) ∼ D(α1 + n1, . . . , αk + nk)

2. Simulate (j = 1, . . . , n)

    Zj | xj, p1, . . . , pk, θ1, . . . , θk ∼ Σ_{i=1}^k pij I(zj = i)

with (i = 1, . . . , k)

    pij ∝ pi f(xj | θi)

and update ni and x̄i (i = 1, . . . , k).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Slice Sampler
Recall the Fundamental Observation that Underpins the Accept-Reject Method
Lemma
Simulating

    X ∼ f(x)

is equivalent to simulating

    (X, U) ∼ U{(x, u) : 0 < u < f(x)} .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Slice Sampler
Slice sampler as generic Gibbs
Algorithm (Slice sampler)
Simulate
1. U(t+1) ∼ U[0, f(X(t))];
2. X(t+1) ∼ U_{A(t+1)}, with

    A(t+1) = {x : f(x) ≥ U(t+1)} .
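The two uniform draws are trivial to code whenever A(t+1) is an interval with closed-form endpoints. A minimal Python sketch, not from the course, for the target f(x) = exp(−x) on [0, ∞), where A = {x ≥ 0 : e^{−x} ≥ u} = [0, −log u]:

```python
import numpy as np

rng = np.random.default_rng(7)

x = 1.0
chain = np.empty(20000)
for t in range(chain.size):
    u = rng.uniform(0.0, np.exp(-x))    # vertical step: U ~ U[0, f(x)]
    x = rng.uniform(0.0, -np.log(u))    # horizontal step: X ~ U on A = [0, -log u]
    chain[t] = x
```

At stationarity the chain is Exp(1)-distributed (mean 1, variance 1); for less tractable f the horizontal step requires solving f(x) ≥ u numerically, as in the stochastic volatility example below.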
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Slice Sampler
Example of results with a truncated N (−3, 1) distribution
Number of iterations: 2, 3, 4, 5, 10, 50, 100.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Slice Sampler
Example (Stochastic volatility conditional distribution)
In the stochastic volatility model, the single-site conditional distribution is of the form

    π(x) ∝ exp −{ σ²(x − µ)² + β² exp(−x) y² + x }/2 ,

simplified into exp −{ x² + α exp(−x) }. Slice sampling means simulation from a uniform distribution on

    A = { x : exp −{ x² + α exp(−x) }/2 ≥ u } = { x : x² + α exp(−x) ≤ ω }

if we set ω = −2 log u.

Note: the inversion of x² + α exp(−x) = ω needs to be done numerically (e.g., using Newton's algorithm).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Slice Sampler
Autocorrelation by lag, and histogram of a Markov chain produced by a slice sampler with the target distribution in overlay.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Slice Sampler
Convergence of the Slice Sampler
Theorem (Uniform ergodicity)
If f is bounded and supp f is bounded, the simple slice sampler is uniformly ergodic.
[Mira & Tierney, 1997]
Markov Chain Monte Carlo Methods
The Gibbs Sampler
The Slice Sampler
Good slices
The slice sampler usually enjoys very good theoretical properties(geometric ergodicity and even uniform ergodicity) in onedimensional cases.
As the dimension increases, the practical determination of the setA(t+1) may get increasingly complex (and the performanceguarantees of the method are usually not as strong).
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Rao-Blackwellization
Rao-Blackwellization
If (Y1, Y2, . . . , Yp)(t), t = 1, 2, . . . , T, is the output from a Gibbs sampler,

    δ0 = (1/T) Σ_{t=1}^T h(Y1(t)) → ∫ h(y1) f(1)(y1) dy1 .

Rao-Blackwellization replaces δ0 with its conditional expectation

    δrb = (1/T) Σ_{t=1}^T Ef[ h(Y1) | Y2(t), . . . , Yp(t) ] .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Rao-Blackwellization
Rao-Blackwellization (2)
Then

◦ both estimators converge to E[h(Y1)],
◦ the Rao-Blackwellized estimator is unbiased,
◦ and

    var( E[ h(Y1) | Y2(t), . . . , Yp(t) ] ) ≤ var(h(Y1)) .
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Rao-Blackwellization
Examples of Rao-Blackwellization
Example
Bivariate normal Gibbs sampler:

    X | y ∼ N(ρy, 1 − ρ²)
    Y | x ∼ N(ρx, 1 − ρ²) .

Then

    δ0 = (1/T) Σ_{i=1}^T X(i)   and   δ1 = (1/T) Σ_{i=1}^T E[X(i)|Y(i)] = (1/T) Σ_{i=1}^T ρ Y(i)

both estimate E[X], and σ²_{δ0} / σ²_{δ1} = 1/ρ² > 1.
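The variance reduction is easy to observe by running the Gibbs chain many times and comparing the empirical variances of δ0 and δ1. A minimal Python sketch, not from the course (ρ = 0.5, chain length and number of replications are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)

rho = 0.5
sd = np.sqrt(1 - rho**2)

def run_chain(T, rng):
    """One Gibbs run; returns (delta0, delta1) = (mean of X, mean of rho*Y)."""
    x, d0, d1 = 0.0, 0.0, 0.0
    for _ in range(T):
        y = rng.normal(rho * x, sd)
        x = rng.normal(rho * y, sd)
        d0 += x
        d1 += rho * y            # E[X | Y = y] = rho * y
    return d0 / T, d1 / T

ests = np.array([run_chain(500, rng) for _ in range(200)])
var0, var1 = ests.var(axis=0)    # empirical variances of the two estimators
```

With ρ = 0.5 the theoretical variance ratio is 1/ρ² = 4, so `var1` should come out clearly below `var0`.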
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Rao-Blackwellization
Examples of Rao-Blackwellization (2)
Example (Grouped counting data)
360 consecutive records of the number of passages per unit time
    Number of passages        0     1     2    3    4 or more
    Number of observations    139   128   55   25   13
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Rao-Blackwellization
Example (Grouped counting data (2))
Particularity: observations with 4 passages or more are grouped.

If observations are Poisson P(λ), the likelihood is

    ℓ(x1, . . . , x5 | λ) ∝ e^{−347λ} λ^{128+55×2+25×3} ( 1 − e^{−λ} Σ_{i=0}^3 λ^i/i! )^{13} ,

which can be difficult to work with.

Idea: with a prior π(λ) = 1/λ, complete the vector (y1, . . . , y13) of the 13 units with 4 passages or more.
Markov Chain Monte Carlo Methods
The Gibbs Sampler
Rao-Blackwellization
Algorithm (Poisson-Gamma Gibbs)
a. Simulate Yi(t) ∼ P(λ(t−1)) I_{y≥4},  i = 1, . . . , 13
b. Simulate

    λ(t) ∼ Ga( 313 + Σ_{i=1}^{13} yi(t), 360 ) .

The Bayes estimator

    δπ = (1/360T) Σ_{t=1}^T ( 313 + Σ_{i=1}^{13} yi(t) )
converges quite rapidly
[Figure: running average of δ^π, stabilizing around 1.02 within a few hundred iterations, and histogram of the simulated λ values.]
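The Poisson-Gamma Gibbs sampler above can be sketched in a few lines of Python (illustrative code, not the authors'; the truncated-Poisson step uses a simple rejection sampler, and T is an arbitrary run length):

```python
import numpy as np

rng = np.random.default_rng(1)

def truncated_poisson(rng, lam, low=4):
    # rejection sampler for Y ~ P(lam) conditioned on Y >= low
    while True:
        y = rng.poisson(lam)
        if y >= low:
            return y

T = 1000
lam = 1.0
draws = np.empty(T)
for t in range(T):
    # step a: complete the 13 grouped observations ">= 4"
    ys = [truncated_poisson(rng, lam) for _ in range(13)]
    # step b: draw lambda ~ Ga(313 + sum(ys), 360)  (rate parameterization)
    lam = rng.gamma(313 + sum(ys), 1.0 / 360.0)
    # Rao-Blackwellized term E[lambda | completed data]
    draws[t] = (313 + sum(ys)) / 360.0
delta_pi = draws.mean()
```

The running average of `draws` reproduces the convergence of δ^π to a value near 1.02.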
Rao-Blackwellization in Poisson-Gamma Gibbs
Naïve estimate

δ_0 = (1/T) ∑_{t=1}^{T} λ^{(t)}

and Rao-Blackwellized version

δ^π = (1/T) ∑_{t=1}^{T} E[ λ^{(t)} | x_1, x_2, …, x_5, y_1^{(t)}, y_2^{(t)}, …, y_{13}^{(t)} ]
    = (1/(360 T)) ∑_{t=1}^{T} ( 313 + ∑_{i=1}^{13} y_i^{(t)} ) .
Rao-Blackwellization for “Non-Parametric” Estimation
Another substantial benefit of Rao-Blackwellization is the approximation of the densities of the different components of y without resorting to nonparametric density estimation methods.
Lemma

The estimator

(1/T) ∑_{t=1}^{T} f_i( y_i | Y_j^{(t)}, j ≠ i )

converges to f^{(i)}(y_i) .
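In the bivariate normal example, this lemma says that averaging the full conditional densities N(·; ρY^{(t)}, 1 − ρ²) recovers the N(0, 1) marginal of X. A small numerical check (illustrative code with arbitrary ρ, T, and evaluation points):

```python
import numpy as np

rng = np.random.default_rng(2)
rho, T = 0.8, 4000
sd = np.sqrt(1.0 - rho ** 2)

# run the bivariate normal Gibbs sampler, keeping the Y draws
x = y = 0.0
ys = np.empty(T)
for t in range(T):
    x = rng.normal(rho * y, sd)
    y = rng.normal(rho * x, sd)
    ys[t] = y

def normal_pdf(z, m, s):
    return np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# Rao-Blackwellized density estimate: average of the full conditionals
grid = np.array([0.0, 1.0])
rb_density = np.array([normal_pdf(g, rho * ys, sd).mean() for g in grid])
```

The estimates at 0 and 1 should be close to the standard normal density values 0.3989 and 0.2420, with no kernel bandwidth to tune.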
Marginalized Gibbs Sampling
Another option is to take advantage of Rao-Blackwellization during the simulations themselves, resulting in marginalized (or collapsed) Gibbs samplers.
In a Bayesian latent variable model, a typical scheme for the Gibbs sampler is to alternate

x | y, θ
θ | x, y

where y denotes the observations, x the latent variables, and θ the parameters.
For an exponential family complete-data model with conjugate prior, the pdf f(θ | x, y) is available in closed form, thus allowing for Rao-Blackwellization.
Exponential Family Distributions
ℓ(y | θ) = h(y) exp[ ⟨s(y), ψ(θ)⟩ − B(θ) ]

where s(y) is the vector of sufficient statistics.

If ψ is an invertible mapping, it is possible, through the reparameterization η = ψ(θ), to rewrite the above in canonical (or natural) form

ℓ(y | η) = h(y) exp[ ⟨s(y), η⟩ − A(η) ] ,

where A is called the log-partition function.
Conjugacy in Exponential Families
The conjugate distribution for

ℓ(y | θ) = h(y) exp[ ⟨s(y), ψ(θ)⟩ − B(θ) ]

is

π(θ | µ_0, λ_0) = Z^{−1}(µ_0, λ_0) exp[ ⟨µ_0, ψ(θ)⟩ − λ_0 B(θ) ] ,

where µ_0 has the same dimension as s(y) and λ_0 ∈ R₊.
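As a concrete instance (a worked example added here, not from the slides): for the Poisson likelihood, s(y) = y, ψ(θ) = log θ, and B(θ) = θ, so the conjugate prior exp[µ_0 log θ − λ_0 θ] is a Gamma(µ_0 + 1, λ_0) density, and n observations update it to Gamma(µ_0 + ∑y_i + 1, λ_0 + n). A quick numerical cross-check against brute-force normalization:

```python
import numpy as np

mu0, lam0 = 2.0, 1.0                 # hyperparameters (arbitrary choices)
y = np.array([3, 1, 4, 2])           # toy Poisson observations

# conjugate update: posterior is Gamma(shape = mu0 + sum(y) + 1, rate = lam0 + n)
a_post = mu0 + y.sum() + 1
b_post = lam0 + len(y)
post_mean = a_post / b_post

# cross-check: normalize prior x likelihood on a fine grid
theta = np.linspace(1e-3, 20.0, 20000)
unnorm = theta ** (mu0 + y.sum()) * np.exp(-(lam0 + len(y)) * theta)
grid_mean = (theta * unnorm).sum() / unnorm.sum()
```

The grid-based posterior mean agrees with the closed-form conjugate answer, which is what makes the Rao-Blackwellization of the next slides tractable.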
The Bayesian Posterior
Assume a complete-data exponential family model for (y_i, x_i) and a conjugate prior.

After observing n independent observations y = (y_1, …, y_n), the joint posterior is

f(x, θ | y) ∝ h(x, y) exp[ ⟨s(x, y), ψ(θ)⟩ − nB(θ) ] × Z^{−1}(µ_0, λ_0) exp[ ⟨µ_0, ψ(θ)⟩ − λ_0 B(θ) ] .
Hence, the full conditional is given by

f(θ | x, y) = Z^{−1}( s(x, y) + µ_0, n + λ_0 ) exp[ ⟨s(x, y) + µ_0, ψ(θ)⟩ − (n + λ_0) B(θ) ] .

Thus a Rao-Blackwellized estimator of the posterior mean of θ can be computed as

(1/m) ∑_{j=1}^{m} E( θ | X^{(j)}, Y ) .
But as the normalizing constant Z is known explicitly, it is also possible to integrate out θ to obtain a closed-form expression of the marginal posterior f(x | y):

π(x | y) ∝ h(x, y) Z( s(x, y) + µ_0, n + λ_0 ) / Z(µ_0, λ_0) .

The resulting marginal posterior is usually complex but amenable to single-site Gibbs sampling on the components of x = (x_1, …, x_n), especially when these are discrete variables.

This is the preferred sampling method for the Latent Dirichlet Allocation (LDA) model and related models.
Mixture of Gaussian Example

Assume that we observe Y_1, …, Y_n from the Gaussian mixture model ∑_{i=1}^{r} α_i N(y | θ_i, v_i). To make the derivations simpler, α and v are treated as fixed parameters and we use an improper flat prior on the vector of means θ. Then

π(x_k, θ | y_k) ∝ exp[ ∑_{i=1}^{r} ( log α_i − (y_k − θ_i)²/(2v_i) ) I{x_k = i} ]

and hence

π(x_{1:n}, θ | y_{1:n}) ∝ exp[ ∑_{i=1}^{r} ( n_i log α_i − θ_i² n_i/(2v_i) + θ_i s_i/v_i ) ]

where

n_i = ∑_{k=1}^{n} I{x_k = i}
s_i = ∑_{k=1}^{n} y_k I{x_k = i} .
Upon completing the square,

f(x_{1:n}, θ | y_{1:n}) ∝ exp[ ∑_{i=1}^{r} ( n_i log α_i + s_i²/(2 n_i v_i) ) ] exp[ ∑_{i=1}^{r} −(1/(2 v_i/n_i)) ( θ_i − s_i/n_i )² ]

and

f(x_{1:n} | y_{1:n}) ∝ ∏_{i=1}^{r} α_i^{n_i} √(v_i/n_i) exp[ s_i²/(2 n_i v_i) ] .

Finally,

P_f( X_k = i | x_{−k}, y_{1:n} ) ∝ α_i^{n_{i,−k}+1} √( v_i/(n_{i,−k}+1) ) exp[ (s_{i,−k} + y_k)² / (2 (n_{i,−k}+1) v_i) ]

where

n_{i,−k} = ∑_{j≠k} I{x_j = i} = n_i − I{x_k = i}
s_{i,−k} = ∑_{j≠k} y_j I{x_j = i} = s_i − y_k I{x_k = i} .
To run the collapsed (or marginalized) single-site Gibbs sampler in this example:

I repeatedly simulate from the conditionals f(x_k | x_{−k}, y_{1:n}),
I keeping track of the accumulated component statistics (n_i, s_i)_{1≤i≤r}.

The idea can be extended to mixture models with an unknown number of components.
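A collapsed single-site sweep of this kind can be sketched as follows. To avoid the degenerate case of an empty component under the improper flat prior, this illustration (not the slides' code) uses a proper N(0, τ²) prior on each mean, for which the collapsed conditional is α_i times the Gaussian posterior-predictive density of y_k:

```python
import numpy as np

rng = np.random.default_rng(4)
# toy data: two well-separated clusters
y = np.concatenate([rng.normal(-3.0, 1.0, 30), rng.normal(3.0, 1.0, 30)])
n, r = len(y), 2
alpha = np.array([0.5, 0.5])   # fixed mixture weights
v = np.array([1.0, 1.0])       # fixed component variances
tau2 = 100.0                   # prior variance of each mean (assumed, proper prior)

x = rng.integers(0, r, n)      # initial allocations
cnt = np.array([(x == i).sum() for i in range(r)], float)
s = np.array([y[x == i].sum() for i in range(r)])

for sweep in range(50):
    for k in range(n):
        # remove y_k from the accumulated statistics (n_{i,-k}, s_{i,-k})
        cnt[x[k]] -= 1; s[x[k]] -= y[k]
        # posterior of theta_i given the remaining members, then predictive of y_k
        prec = 1.0 / tau2 + cnt / v
        m = (s / v) / prec
        pred_var = v + 1.0 / prec
        logp = np.log(alpha) - 0.5 * np.log(pred_var) \
               - 0.5 * (y[k] - m) ** 2 / pred_var
        p = np.exp(logp - logp.max()); p /= p.sum()
        x[k] = rng.choice(r, p=p)
        cnt[x[k]] += 1; s[x[k]] += y[k]
```

Only the statistics (n_i, s_i) are updated during the sweep; the means θ are never simulated.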
Case Study
In this last course, we consider a very simple example and discuss the application of the methods presented so far.
Case Study
Independent Sampler
(Almost) Random-Walk Sampler
Gibbs Sampling
(Baby) Reversible Jump
Contaminated Counts
Assume we have observed a vector y_1, …, y_n of count data which we want to model as a mixture of Poisson distributions

L(y_i | α, λ) = (1 − α) (λ_0^{y_i}/y_i!) e^{−λ_0} + α (λ^{y_i}/y_i!) e^{−λ} ,

where λ_0 is known.
In the following, we assume w.l.o.g. that λ_0 = 1 and that the prior on (α, λ) is chosen as

π(α, λ) = e^{−λ} .
Independent Sampler
I (α*, λ*) ∼ U[0, 1] ⊗ Exp ;
I ℓ* = ∑_{i=1}^{n} log{ (1 − α*) + α* λ*^{y_i} e^{−(λ*−1)} } ;
I u ∼ U[0, 1] ;
I If u < exp(ℓ* − ℓ) ,
    α ← α* ; λ ← λ* ; ℓ ← ℓ* ;
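This independence sampler is a few lines of Python (an illustrative sketch, not the authors' code; T and the starting point are arbitrary, and the acceptance test is done on the log scale to avoid overflow):

```python
import numpy as np

rng = np.random.default_rng(5)
y = np.array([1, 1, 0, 3, 0, 5])

def loglik(a, lam):
    # log-likelihood up to a constant, with lambda_0 = 1
    return np.sum(np.log((1.0 - a) + a * lam ** y * np.exp(-(lam - 1.0))))

T = 20000
alpha, lam = 0.5, 1.0
ll = loglik(alpha, lam)
chain = np.empty((T, 2))
acc = 0
for t in range(T):
    # proposal = prior: alpha* ~ U[0,1], lambda* ~ Exp(1), so the
    # prior/proposal ratio cancels and only the likelihoods remain
    a_star, l_star = rng.uniform(), rng.exponential()
    ll_star = loglik(a_star, l_star)
    if np.log(rng.uniform()) < ll_star - ll:
        alpha, lam, ll = a_star, l_star, ll_star
        acc += 1
    chain[t] = alpha, lam
```

On the small dataset y = (1, 1, 0, 3, 0, 5) this reproduces an acceptance rate of roughly 46% and posterior means near λ ≈ 1.8 and α ≈ 0.45.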
A Small-Scale Example: y = (1, 1, 0, 3, 0, 5)
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.8, std. 1.1); histogram of α (mean 0.45, std. 0.27); autocorrelations; acceptance rate 46% (α), 46% (λ).]
A Larger Example: y consisting of 100 values (1/6 0's, 1/3 1's, 1/3 2's, 1/6 3's)
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.6, std. 0.18); histogram of α (mean 0.8, std. 0.16); autocorrelations; acceptance rate 4.8% (α), 4.8% (λ).]
(Almost) Random-Walk Sampler

Random-Walk Sampler
I v ∼ N(0, I_2) ;
I (α*, λ*) ← (α, λ) + s v ;
I If (α*, λ*) ∈ [0, 1] × R₊ ,
    I ℓ* = −λ* + ∑_{i=1}^{n} log{ (1 − α*) + α* λ*^{y_i} e^{−(λ*−1)} } ;
    I u ∼ U[0, 1] ;
    I If u < exp(ℓ* − ℓ) × [Φ((1−α)/s) − Φ(−α/s)] / [Φ((1−α*)/s) − Φ(−α*/s)] × [1 − Φ(−λ/s)] / [1 − Φ(−λ*/s)] ,
        α ← α* ; λ ← λ* ; ℓ ← ℓ* ;

Φ denotes the standard Gaussian cdf.
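A direct transcription of this sampler (an illustrative sketch, not the authors' code; the step size s = 1 follows the 6-observation figure, T is arbitrary):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)
y = np.array([1, 1, 0, 3, 0, 5])
s = 1.0   # step size, tuned as in the slides' small example

def logpost(a, lam):
    # log-posterior: prior e^{-lambda} plus the log-likelihood (lambda_0 = 1)
    return -lam + np.sum(np.log((1.0 - a) + a * lam ** y * np.exp(-(lam - 1.0))))

def Phi(z):
    # standard normal cdf
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

T = 20000
alpha, lam = 0.5, 1.0
lp = logpost(alpha, lam)
chain = np.empty((T, 2))
acc = 0
for t in range(T):
    a_star, l_star = alpha + s * rng.normal(), lam + s * rng.normal()
    if 0.0 < a_star < 1.0 and l_star > 0.0:
        lp_star = logpost(a_star, l_star)
        # Hastings correction with the truncation normalizers from the slides
        corr = ((Phi((1 - alpha) / s) - Phi(-alpha / s)) /
                (Phi((1 - a_star) / s) - Phi(-a_star / s)) *
                (1 - Phi(-lam / s)) / (1 - Phi(-l_star / s)))
        if np.log(rng.uniform()) < lp_star - lp + np.log(corr):
            alpha, lam, lp = a_star, l_star, lp_star
            acc += 1
    chain[t] = alpha, lam
```

With s = 1 on the 6-observation dataset, the acceptance rate is around 22% and the posterior summaries match those of the independence sampler.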
6 Observation Case
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.7, std. 1.1); histogram of α (mean 0.45, std. 0.27); autocorrelations; acceptance rate 22% (α), 22% (λ).]
Figure: With s tuned to 1
100 Observation Case
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.6, std. 0.16); histogram of α (mean 0.84, std. 0.14); autocorrelations; acceptance rate 21% (α), 21% (λ).]
Figure: With s tuned to 0.3
Gibbs Sampling
Gibbs Sampling With Data Augmentation
I For i = 1, …, n ,
    u_i ∼ U[0, 1] ;
    x_i ← I{ u_i < α λ^{y_i} e^{−(λ−1)} / [ (1 − α) + α λ^{y_i} e^{−(λ−1)} ] } ;
I c = ∑_{i=1}^{n} I(x_i = 1) ;
I s = ∑_{i=1}^{n} y_i I(x_i = 1) ;
I α ∼ Beta(c + 1, n − c + 1) ;
I λ ∼ Gamma(s + 1, c + 1) ;
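The data-augmentation Gibbs sampler vectorizes naturally (an illustrative sketch, not the authors' code; T and the initial values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
y = np.array([1, 1, 0, 3, 0, 5])
n = len(y)

T = 20000
alpha, lam = 0.5, 1.0
chain = np.empty((T, 2))
for t in range(T):
    # allocation step: x_i = 1 if y_i is attributed to the contaminating P(lambda)
    p1 = alpha * lam ** y * np.exp(-(lam - 1.0))
    x = rng.uniform(size=n) < p1 / ((1.0 - alpha) + p1)
    c = x.sum()
    s = y[x].sum()
    # conjugate updates of alpha and lambda given the completed data
    alpha = rng.beta(c + 1, n - c + 1)
    lam = rng.gamma(s + 1, 1.0 / (c + 1))   # rate c+1, numpy uses scale
    chain[t] = alpha, lam
```

All moves are accepted by construction, which is the "100% acceptance rate" reported in the figures; mixing must instead be judged from the autocorrelations.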
6 Observation Case
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.6, std. 0.49); histogram of α (mean 0.87, std. 0.11); autocorrelations; acceptance rate 100% (α), 100% (λ).]
Rao-Blackwellization could be used for estimating the marginal distributions.
100 Observation Case
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.5, std. 0.12); histogram of α (mean 0.99, std. 0.0097); autocorrelations; acceptance rate 100% (α), 100% (λ).]
Collapsed Gibbs Sampler
I For i = 1, …, n ,
    c_{−i} = c − I(x_i = 1) ;
    s_{−i} = s − y_i I(x_i = 1) ;
    u_i ∼ U[0, 1] ;
    x_i ← I{ u_i < A_1 / (A_1 + A_0) } ,

    where A_1 = (c_{−i} + 1) e [ (s_{−i} + y_i)! / s_{−i}! ] ( (c_{−i}+1)/(c_{−i}+2) )^{s_{−i}+1} ( 1/(c_{−i}+2) )^{y_i}
    and A_0 = n − c_{−i} ;

    c ← c_{−i} + I(x_i = 1) ;
    s ← s_{−i} + y_i I(x_i = 1) ;
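Here both α and λ have been integrated out, so only the allocations x are simulated. A sketch of this collapsed sampler (illustrative code, not the authors'; T and the all-ones initialization are arbitrary):

```python
import numpy as np
from math import factorial, e

rng = np.random.default_rng(8)
y = np.array([1, 1, 0, 3, 0, 5])
n = len(y)

x = np.ones(n, dtype=int)            # start with everything contaminated
c, s = int(x.sum()), int((y * x).sum())
T = 20000
alloc = np.empty((T, n))
for t in range(T):
    for i in range(n):
        c_m = c - x[i]               # c_{-i}
        s_m = s - y[i] * x[i]        # s_{-i}
        A1 = ((c_m + 1) * e * factorial(s_m + y[i]) / factorial(s_m)
              * ((c_m + 1) / (c_m + 2)) ** (s_m + 1)
              * (1.0 / (c_m + 2)) ** y[i])
        A0 = n - c_m
        x[i] = int(rng.uniform() < A1 / (A1 + A0))
        c = c_m + x[i]
        s = s_m + y[i] * x[i]
    alloc[t] = x
```

The allocation frequencies show that the observation y = 5 is attributed to the contaminating Poisson component much more often than the zeros, as expected.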
(Baby) Reversible Jump
Gibbs Output on a 100 Observation Poisson Dataset Simulated from λ_0 = 1
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 0.97, std. 0.098); histogram of α (mean 0.99, std. 0.0097); autocorrelations; acceptance rate 100% (α), 100% (λ).]
We reformulate the model to allow for model comparison by

I giving a prior π_1 to the model M_1, which postulates that

L(y_i | α, λ) = (1 − α) (λ_0^{y_i}/y_i!) e^{−λ_0} + α (λ^{y_i}/y_i!) e^{−λ} ;

I giving a prior π_0 = (1 − π_1) to the baseline model M_0, with

L_0(y_i) = (λ_0^{y_i}/y_i!) e^{−λ_0} .

Under M_1, we keep the same prior for α and λ.

We further introduce a latent variable k ∈ {0, 1} which indicates the (unobservable) model in (M_0, M_1) that is assumed to generate the data.
The Posterior
The posterior f is now defined on {0} ∪ ( {1} × [0, 1] × R₊ ) and such that

f(k, α, λ | y) ∝ π_0 δ_0(k) L_0(y) + π_1 δ_1(k) L(y | α, λ) (1/λ_0) e^{−λ/λ_0} .

The Bayes factor for model M_1 compared to model M_0 is given by

[ ∫_{α∈[0,1]} ∫_{λ∈R₊} L(y | α, λ) (1/λ_0) e^{−λ/λ_0} dα dλ ] / L_0(y) .

Instead of trying to compute the Bayes factor directly, the Reversible Jump Metropolis-Hastings algorithm is a sampler that targets f on {0} ∪ ( {1} × [0, 1] × R₊ ).
I If k = 0 ,  # Birth move
    I (α*, λ*) ∼ U[0, 1] ⊗ Exp(λ_0) ;
    I u ∼ U[0, 1] ;
    I If u < [ π_1 L(y | α*, λ*) / (π_0 L_0(y)) ] p ,
        k ← 1 ; α ← α* ; λ ← λ* ;
I Else if k = 1 ,
    I u ∼ U[0, 1] ;
    I If u < p ,  # Death move
        I v ∼ U[0, 1] ;
        I If v < π_0 L_0(y) / [ π_1 L(y | α, λ) p ] ,
            k ← 0 ;
    I Else ,  # Fixed dimension move (. . . )
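This baby reversible-jump sampler can be sketched as follows (illustrative code, not the authors'; the death-move probability p = 1/2 is an assumed tuning, and the fixed-dimension move left open in the slides is filled in here with an independence Metropolis-Hastings step from the prior):

```python
import numpy as np

rng = np.random.default_rng(9)
y = np.array([1, 1, 0, 3, 0, 5])
pi1 = 0.5      # prior probability of M1
p = 0.5        # probability of attempting the death move (assumed tuning)

def log_ratio(a, lam):
    # log of L(y | a, lam) / L0(y), with lambda_0 = 1
    return np.sum(np.log((1.0 - a) + a * lam ** y * np.exp(-(lam - 1.0))))

T = 20000
k, alpha, lam = 0, 0.5, 1.0
ks = np.empty(T, dtype=int)
for t in range(T):
    if k == 0:
        # birth move: draw (alpha*, lambda*) from the prior
        a_star, l_star = rng.uniform(), rng.exponential()
        if np.log(rng.uniform()) < (np.log(p) + np.log(pi1 / (1 - pi1))
                                    + log_ratio(a_star, l_star)):
            k, alpha, lam = 1, a_star, l_star
    elif rng.uniform() < p:
        # death move: drop back to the baseline model M0
        if np.log(rng.uniform()) < (-np.log(p) + np.log((1 - pi1) / pi1)
                                    - log_ratio(alpha, lam)):
            k = 0
    else:
        # fixed-dimension move (assumed here: independence MH from the prior)
        a_star, l_star = rng.uniform(), rng.exponential()
        if np.log(rng.uniform()) < log_ratio(a_star, l_star) - log_ratio(alpha, lam):
            alpha, lam = a_star, l_star
    ks[t] = k
p_m1 = ks.mean()   # estimated posterior probability of M1
```

Because the birth proposal equals the prior on (α, λ), the Jacobian and proposal terms cancel and only the likelihood ratio, the model prior odds, and the move probability p remain in the acceptance ratio.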
In the following, we assume that λ_0 = 1 and π_1 = 1/2, and consider the same datasets as before.
For the 100 Observation Poisson case
[Figure: trace of the model indicator k ∈ {0, 1} over 10,000 iterations.]
The estimated posterior probability for M0 is 0.79
100 Observation Poisson Case: Results conditional on k = 1
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.1, std. 0.43); histogram of α (mean 0.49, std. 0.37); autocorrelations.]
For the other 100 observation dataset, the estimated posterior probability in favor of M_1 is 1.
In the case of y = (1, 1, 0, 3, 0, 5), the posterior probability in favor of M_1 is estimated at 0.56.
[Figure: scatter of the (λ, α) draws; histogram of λ (mean 1.7, std. 0.97); histogram of α (mean 0.59, std. 0.29); autocorrelations.]
Figure: Results conditional on k = 1