7/26/2019 Monte Carlo Notes From Main Book
Markov Chain Monte Carlo Methods

Christian P. Robert
Université Paris Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian

3-6 May 2005
Outline
Motivation and leading example
Random variable generation
Monte Carlo Integration
Notions on Markov Chains
The Metropolis-Hastings Algorithm
The Gibbs Sampler
MCMC tools for variable dimension problems
Sequential importance sampling
New [2004] edition:
Motivation and leading example
  Introduction
  Likelihood methods
  Missing variable models
  Bayesian Methods
  Bayesian troubles
Introduction

Latent structures make life harder!

Even simple models may lead to computational complications, as in latent variable models

    f(x|θ) = ∫ f(x, x*|θ) dx*

If (x, x*) observed: fine!
If only x observed: trouble!
Example (Mixture models)

Models of mixtures of distributions:

    X ~ f_j  with probability p_j,

for j = 1, 2, ..., k, with overall density

    X ~ p_1 f_1(x) + ... + p_k f_k(x).

For a sample of independent random variables (X_1, ..., X_n), the sample density is

    ∏_{i=1}^n { p_1 f_1(x_i) + ... + p_k f_k(x_i) }.

Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.
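The k^n blow-up can be seen directly by expanding the product over all component allocations. The minimal Python sketch below (not from the slides; the sample values and parameters are arbitrary illustrations) compares the O(nk) direct evaluation of the sample density with the O(k^n) expanded sum, which agree by distributivity:

```python
import itertools
import math

def normal_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood_direct(xs, ps, mus):
    # O(n*k): product over observations of the k-term mixture density
    out = 1.0
    for x in xs:
        out *= sum(p * normal_pdf(x, mu) for p, mu in zip(ps, mus))
    return out

def likelihood_expanded(xs, ps, mus):
    # O(k^n): sum over all k^n component allocations (z_1, ..., z_n)
    k, n = len(ps), len(xs)
    total = 0.0
    for z in itertools.product(range(k), repeat=n):
        term = 1.0
        for x, zi in zip(xs, z):
            term *= ps[zi] * normal_pdf(x, mus[zi])
        total += term
    return total

xs = [-0.5, 0.3, 1.2, 2.1]
ps, mus = [0.3, 0.7], [0.0, 2.0]
a = likelihood_direct(xs, ps, mus)
b = likelihood_expanded(xs, ps, mus)   # already 2^4 = 16 terms for n = 4
print(abs(a - b) < 1e-12)
```

Already at n = 40 and k = 2, the expanded form has about 10^12 terms, while the direct product stays at 80 density evaluations.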
[Figure: case of the 0.3 N(μ_1, 1) + 0.7 N(μ_2, 1) likelihood]
Likelihood methods

Maximum likelihood methods

For an iid sample X_1, ..., X_n from a population with density f(x|θ_1, ..., θ_k), the likelihood function is

    L(θ|x) = L(θ_1, ..., θ_k|x_1, ..., x_n) = ∏_{i=1}^n f(x_i|θ_1, ..., θ_k).

Global justifications from asymptotics.
Computational difficulty depends on structure, e.g. latent variables.
Example (Mixtures again)

For a mixture of two normal distributions,

    p N(μ, τ^2) + (1 − p) N(θ, σ^2),

the likelihood is proportional to

    ∏_{i=1}^n [ p τ^{-1} φ((x_i − μ)/τ) + (1 − p) σ^{-1} φ((x_i − θ)/σ) ],

containing 2^n terms.
Standard maximization techniques often fail to find the global maximum because of multimodality of the likelihood function.

Example

In the special case

    f(x|μ, σ) ∝ (1 − ε) exp{−(1/2) x^2} + (ε/σ) exp{−(1/2σ^2)(x − μ)^2}    (1)

with ε > 0 known, whatever n, the likelihood is unbounded:

    lim_{σ→0} ℓ(μ = x_1, σ | x_1, ..., x_n) = ∞
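This divergence is easy to check numerically. The sketch below (illustrative only; the data and the choice ε = 0.1 are arbitrary, not from the slides) evaluates the log-likelihood of model (1) at μ = x_1 along a decreasing sequence of σ values — each step increases the log-likelihood without bound:

```python
import math

def loglik(mu, sigma, xs, eps=0.1):
    # log-likelihood of the contaminated-normal model (1), up to constants
    s = 0.0
    for x in xs:
        s += math.log((1 - eps) * math.exp(-0.5 * x * x)
                      + (eps / sigma) * math.exp(-0.5 * ((x - mu) / sigma) ** 2))
    return s

xs = [1.3, -0.4, 0.8, 2.1, -1.7]
mu = xs[0]                       # plant the mean on the first observation
vals = [loglik(mu, s, xs) for s in (1.0, 0.1, 0.01, 1e-4, 1e-8)]
print(vals)  # increasing: the likelihood blows up as sigma -> 0
```

The i = 1 term contributes ε/σ, which dominates as σ → 0, while every other term stays bounded below by (1 − ε) exp{−x_i^2/2} > 0.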
Missing variable models

The special case of missing variable models

Consider again a latent variable representation

    g(x|θ) = ∫_Z f(x, z|θ) dz

Define the completed (but unobserved) likelihood

    L^c(θ|x, z) = f(x, z|θ)

Useful for optimisation algorithms.
The EM Algorithm

Algorithm (Expectation-Maximisation)

Iterate (in m):

1. (E step) Compute

       Q(θ|θ^(m), x) = E[log L^c(θ|x, Z) | θ^(m), x],

2. (M step) Maximise Q(θ|θ^(m), x) in θ and take

       θ^(m+1) = arg max_θ Q(θ|θ^(m), x),

until a fixed point [of Q] is reached.
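As a concrete illustration of the E and M steps, here is a minimal EM sketch (not from the slides) for a two-component normal mixture with known weight p and unit variances, estimating only the two means; the data-generating values 0.0 and 2.5 and the starting point are arbitrary choices:

```python
import math
import random

def em_two_normal_means(xs, p=0.3, iters=50, start=(-1.0, 3.0)):
    # EM for the means of p*N(mu1,1) + (1-p)*N(mu2,1); weight and variances known
    mu1, mu2 = start
    phi = lambda x, m: math.exp(-0.5 * (x - m) ** 2)
    for _ in range(iters):
        # E step: posterior probability that each x_i came from component 1
        w = [p * phi(x, mu1) / (p * phi(x, mu1) + (1 - p) * phi(x, mu2))
             for x in xs]
        # M step: the weighted means maximise Q(.|current fit, x)
        sw = sum(w)
        mu1 = sum(wi * x for wi, x in zip(w, xs)) / sw
        mu2 = sum((1 - wi) * x for wi, x in zip(w, xs)) / (len(xs) - sw)
    return mu1, mu2

random.seed(1)
xs = [random.gauss(0.0, 1.0) if random.random() < 0.3 else random.gauss(2.5, 1.0)
      for _ in range(2000)]
m1, m2 = em_two_normal_means(xs)
print(m1, m2)  # close to the true means 0.0 and 2.5
```

Each iteration increases the observed-data likelihood, so the sequence converges to a local maximum (here, near the generating values).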
[Figure: histogram of an N(0, 1) sample ("Échantillon N(0,1)") — sample from (1)]
[Figure: likelihood of the .7 N(μ_1, 1) + .3 N(μ_2, 1) model and EM steps]
Bayesian Methods

The Bayesian Perspective

In the Bayesian paradigm, the information brought by the data x, realization of

    X ~ f(x|θ),

is combined with prior information specified by a prior distribution with density π(θ).
Central tool

Summary in a probability distribution, π(θ|x), called the posterior distribution.

Derived from the joint distribution f(x|θ) π(θ), according to

    π(θ|x) = f(x|θ) π(θ) / ∫ f(x|θ) π(θ) dθ,

[Bayes Theorem]

where

    m(x) = ∫ f(x|θ) π(θ) dθ

is the marginal density of X.
Central tool... central to Bayesian inference

Posterior defined up to a constant as

    π(θ|x) ∝ f(x|θ) π(θ)

- Operates conditional upon the observations
- Integrates simultaneously prior information and the information brought by x
- Avoids averaging over the unobserved values of x
- Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected
- Provides a complete inferential scope and a unique motor of inference
Bayesian troubles

Conjugate bonanza...

Example (Binomial)

For an observation X ~ B(n, p), the so-called conjugate prior is the family of beta Be(a, b) distributions.

The classical Bayes estimator is the posterior mean

    [Γ(a + b + n) / (Γ(a + x) Γ(n − x + b))] ∫_0^1 p · p^{x+a−1} (1 − p)^{n−x+b−1} dp = (x + a)/(a + b + n).
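The closed form can be checked against brute-force quadrature of the unnormalised posterior. A small Python sketch (the values of x, n, a, b are arbitrary illustrations):

```python
def posterior_mean_numeric(x, n, a, b, grid=20000):
    # E[p | x] under a Be(a,b) prior and B(n,p) likelihood, by Riemann sum
    num = den = 0.0
    for i in range(1, grid):
        p = i / grid
        w = p ** (x + a - 1) * (1 - p) ** (n - x + b - 1)  # unnormalised posterior
        num += p * w
        den += w
    return num / den

x, n, a, b = 7, 20, 2.0, 3.0
numeric = posterior_mean_numeric(x, n, a, b)
closed = (x + a) / (a + b + n)
print(numeric, closed)  # agree to several decimals
```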
Example (Normal)

In the normal N(μ, σ^2) case, with both μ and σ unknown, conjugate prior on θ = (μ, σ^2) of the form

    (σ^2)^{−λ_σ} exp{−(λ_μ (μ − ξ)^2 + α) / 2σ^2}

since

    π((μ, σ^2)|x_1, ..., x_n) ∝ (σ^2)^{−λ_σ} exp{−(λ_μ (μ − ξ)^2 + α) / 2σ^2}
                                × (σ^2)^{−n/2} exp{−(n(μ − x̄)^2 + s_x^2) / 2σ^2}
                                ∝ (σ^2)^{−λ_σ − n/2} exp{−((λ_μ + n)(μ − ξ_x)^2
                                  + α + s_x^2 + [n λ_μ/(n + λ_μ)] (x̄ − ξ)^2) / 2σ^2}

with ξ_x = (λ_μ ξ + n x̄)/(λ_μ + n).
...and conjugate curse

The use of conjugate priors for computational reasons

- implies a restriction on the modeling of the available prior information
- may be detrimental to the usefulness of the Bayesian approach
- gives an impression of subjective manipulation of the prior information disconnected from reality.
A typology of Bayes computational problems

(i). use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;

(ii). use of a complex sampling model with an intractable likelihood, as for instance in missing data and graphical models;

(iii). use of a huge dataset;

(iv). use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample);

(v). use of a complex inferential procedure, as for instance Bayes factors

    B_01(x) = [ P(θ ∈ Θ_0 | x) / P(θ ∈ Θ_1 | x) ] ÷ [ π(Θ_0) / π(Θ_1) ].
Example (Mixture once again)

Observations from

    x_1, ..., x_n ~ f(x|θ) = p φ(x; μ_1, σ_1) + (1 − p) φ(x; μ_2, σ_2)

Prior

    μ_i | σ_i ~ N(ξ_i, σ_i^2/n_i),   σ_i^2 ~ IG(ν_i/2, s_i^2/2),   p ~ Be(α, β)

Posterior

    π(θ|x_1, ..., x_n) ∝ ∏_{j=1}^n { p φ(x_j; μ_1, σ_1) + (1 − p) φ(x_j; μ_2, σ_2) } π(θ)
                       = Σ_{ℓ=0}^n Σ_{(k_t)} ω(k_t) π(θ|(k_t))

where the inner sum runs over the allocations (k_t) sending ℓ of the n observations to the first component.
Example (Mixture once again (contd))

For a given permutation (k_t), conditional posterior distribution

    π(θ|(k_t)) = N(ξ_1(k_t), σ_1^2/(n_1 + ℓ)) × IG((ν_1 + ℓ)/2, s_1(k_t)/2)
               × N(ξ_2(k_t), σ_2^2/(n_2 + n − ℓ)) × IG((ν_2 + n − ℓ)/2, s_2(k_t)/2)
               × Be(α + ℓ, β + n − ℓ)
Example (Mixture once again (contd))

where

    x̄_1(k_t) = (1/ℓ) Σ_{t=1}^{ℓ} x_{k_t},        ŝ_1^2(k_t) = Σ_{t=1}^{ℓ} (x_{k_t} − x̄_1(k_t))^2,
    x̄_2(k_t) = (1/(n − ℓ)) Σ_{t=ℓ+1}^{n} x_{k_t},  ŝ_2^2(k_t) = Σ_{t=ℓ+1}^{n} (x_{k_t} − x̄_2(k_t))^2

and

    ξ_1(k_t) = (n_1 ξ_1 + ℓ x̄_1(k_t)) / (n_1 + ℓ),
    ξ_2(k_t) = (n_2 ξ_2 + (n − ℓ) x̄_2(k_t)) / (n_2 + n − ℓ),

    s_1(k_t) = s_1^2 + ŝ_1^2(k_t) + [n_1 ℓ/(n_1 + ℓ)] (ξ_1 − x̄_1(k_t))^2,
    s_2(k_t) = s_2^2 + ŝ_2^2(k_t) + [n_2 (n − ℓ)/(n_2 + n − ℓ)] (ξ_2 − x̄_2(k_t))^2,

posterior updates of the hyperparameters.
Example (Mixture once again)

Bayes estimator of θ:

    δ^π(x_1, ..., x_n) = Σ_{ℓ=0}^n Σ_{(k_t)} ω(k_t) E^π[θ | x, (k_t)]

Too costly: 2^n terms.
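The 2^n-term decomposition can be made concrete in a stripped-down version of the model (means and variances known, Beta prior on the weight p only; everything below is an illustrative sketch, not the slides' full conjugate setup). Enumerating all 2^n allocations reproduces the posterior mean of p obtained by direct quadrature:

```python
import itertools
import math

def phi(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def beta_fn(a, b):
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def post_mean_p_enumeration(xs, mu1, mu2, a, b):
    # Expand prod_j [p*phi1 + (1-p)*phi2] into 2^n allocation terms; each
    # allocation z contributes (prod of component densities) * a Beta integral
    n = len(xs)
    num = den = 0.0
    for z in itertools.product((0, 1), repeat=n):
        c, l = 1.0, 0
        for x, zi in zip(xs, z):
            c *= phi(x, mu1) if zi == 0 else phi(x, mu2)
            l += zi == 0          # count of observations allocated to component 1
        den += c * beta_fn(a + l, b + n - l)
        num += c * beta_fn(a + l + 1, b + n - l)
    return num / den

def post_mean_p_grid(xs, mu1, mu2, a, b, grid=20000):
    # same quantity by brute-force quadrature over p
    num = den = 0.0
    for i in range(1, grid):
        p = i / grid
        w = p ** (a - 1) * (1 - p) ** (b - 1)
        for x in xs:
            w *= p * phi(x, mu1) + (1 - p) * phi(x, mu2)
        num += p * w
        den += w
    return num / den

xs = [-0.2, 0.5, 2.3, 1.9, 0.1]
e = post_mean_p_enumeration(xs, 0.0, 2.0, 2.0, 2.0)   # 2^5 = 32 terms
g = post_mean_p_grid(xs, 0.0, 2.0, 2.0, 2.0)
print(e, g)  # agree
```

At n = 5 the enumeration has 32 terms; at n = 50 it would have ~10^15, which is exactly the "too costly" point of the slide.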
Example (Poly-t priors)

Normal observation x ~ N(θ, 1), with conjugate prior

    θ ~ N(μ, τ^2)

Closed form expression for the posterior mean

    ∫ θ f(x|θ) π(θ) dθ / ∫ f(x|θ) π(θ) dθ = (x + τ^{-2} μ) / (1 + τ^{-2}).
Example (Poly-t priors (2))

More involved prior distribution: poly-t distribution

[Bauwens, 1985]

    π(θ) = ∏_{i=1}^k [ α_i + (θ − ξ_i)^2 ]^{−ν_i},   α_i, ν_i > 0

Computation of E[θ|x]???
Example (AR(p) model)

Auto-regressive representation of a time series,

    x_t = Σ_{i=1}^p θ_i x_{t−i} + ε_t

If the order p is unknown, the predictive distribution of x_{t+1} is given by

    π(x_{t+1}|x_t, ..., x_1) ∝ ∫ f(x_{t+1}|θ, x_t, ..., x_{t−p+1}) π(θ, p|x_t, ..., x_1) dp dθ,
Example (AR(p) model (contd))

Integration over the parameters of all models

    Σ_{p=0}^∞ ∫ f(x_{t+1}|θ, x_t, ..., x_{t−p+1}) π(θ|p, x_t, ..., x_1) dθ  π(p|x_t, ..., x_1).
Example (AR(p) model (contd))

Multiple layers of complexity:

(i). complex parameter space within each AR(p) model because of the stationarity constraint;
(ii). if p unbounded, infinity of models;
(iii). θ varies between models AR(p) and AR(p + 1), with a different stationarity constraint (except for root reparameterisation);
(iv). if prediction used sequentially, every tick/second/hour/day, the posterior distribution π(θ, p|x_t, ..., x_1) must be re-evaluated.
Random variable generation
  Basic methods
  Uniform pseudo-random generator
  Beyond Uniform distributions
  Transformation methods
  Accept-Reject Methods
  Fundamental theorem of simulation
  Log-concave densities
Random variable generation

- Rely on the possibility of producing (computer-wise) an endless flow of random variables (usually iid) from well-known distributions
- Given a uniform random number generator, illustration of methods that produce random variables from both standard and nonstandard distributions

Basic methods
The inverse transform method

For a function F on R, the generalized inverse of F, F⁻, is defined by

    F⁻(u) = inf{ x ; F(x) ≥ u }.

Definition (Probability Integral Transform)

If U ~ U[0,1], then the random variable F⁻(U) has the distribution F.

The inverse transform method (2)

To generate a random variable X ~ F, simply generate

    U ~ U[0,1]

and then make the transform

    x = F⁻(u)
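A short Python sketch of the method (illustrative; the exponential cdf is used because its generalized inverse is available in closed form):

```python
import math
import random

def exp_inverse_transform(lam, u):
    # F(x) = 1 - exp(-lam*x)  =>  F-(u) = -log(1 - u)/lam
    return -math.log(1.0 - u) / lam

random.seed(0)
lam = 2.0
xs = [exp_inverse_transform(lam, random.random()) for _ in range(100000)]
mean = sum(xs) / len(xs)
print(mean)  # should be close to 1/lam = 0.5
```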
Uniform pseudo-random generator

Desiderata and limitations

- Production of a deterministic sequence of values in [0, 1] which imitates a sequence of iid uniform random variables U[0,1].
- Can't use the physical imitation of a random draw [no guarantee of uniformity, no reproducibility]
- Random sequence in the sense: having generated (X_1, ..., X_n), knowledge of X_n [or of (X_1, ..., X_n)] imparts no discernible knowledge of the value of X_{n+1}.
- Deterministic: given the initial value X_0, the sample (X_1, ..., X_n) is always the same
- Validity of a random number generator is based on a single sample X_1, ..., X_n when n tends to +∞, not on replications (X_11, ..., X_1n), (X_21, ..., X_2n), ..., (X_k1, ..., X_kn) where n is fixed and k tends to infinity
Uniform pseudo-random generator

An algorithm starting from an initial value 0 ≤ u_0 ≤ 1 and a transformation D, which produces a sequence

    (u_i) = (D^i(u_0))

in [0, 1]. For all n,

    (u_1, ..., u_n)

reproduces the behavior of an iid U[0,1] sample (V_1, ..., V_n) when compared through usual tests.
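One classical concrete choice of the transformation D is a linear congruential generator. The sketch below is illustrative; the multiplier and increment are the common "Numerical Recipes" constants, an assumption rather than anything prescribed by the slides:

```python
def lcg(seed, n, a=1664525, c=1013904223, m=2 ** 32):
    # Minimal linear congruential generator: D(x) = (a*x + c) mod m,
    # with u_i = D^i(u_0) rescaled to [0, 1)
    out, x = [], seed
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x / m)
    return out

us = lcg(12345, 100000)
mean = sum(us) / len(us)
print(mean)  # near 0.5 for a uniform-looking stream
```

Note the two desiderata in action: the stream is fully reproducible from the seed, yet its empirical mean behaves like that of an iid U[0,1] sample.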
Uniform pseudo-random generator (2)

Validity means the sequence U_1, ..., U_n leads to accept the hypothesis

    H_0 : U_1, ..., U_n are iid U[0,1].

The set of tests used is generally of some consequence:

- Kolmogorov-Smirnov and other nonparametric tests
- Time series methods, for correlation between U_i and (U_{i−1}, ..., U_{i−k})
- Marsaglia's battery of tests called Die Hard(!)
Usual generators

In R and S-plus, procedure runif()

    The Uniform Distribution

    Description:

         runif generates random deviates.

    Example:

         u

[Figure: trace and histogram of a uniform sample]
Usual generators (2)

In C, procedure rand() or random()

    SYNOPSIS

         #include <stdlib.h>

         long int random(void);

    DESCRIPTION

         The random() function uses a non-linear additive
         feedback random number generator employing a
         default table of size 31 long integers to return
         successive pseudo-random numbers in the range
         from 0 to RAND_MAX. The period of this random
         generator is very large, approximately
         16*((2**31)-1).

    RETURN VALUE

         random() returns a value between 0 and RAND_MAX.
Usual generators (3)

In Scilab, procedure rand()

    rand() : with no arguments gives a scalar whose
    value changes each time it is referenced. By
    default, random numbers are uniformly distributed
    in the interval (0,1). rand('normal') switches to
    a normal distribution with mean 0 and variance 1.

    EXAMPLE

         x=rand(10,10,'uniform')
Random variable generation
Beyond Uniform distributions
Beyond Uniform generators
http://find/7/26/2019 Monte Carlo Notes From Main Book
80/584
Generation of any sequence of random variables can beformally implemented through a uniform generator
Distributions with explicit F (for instance, exponential, andWeibull distributions), use the probability integral
transform here
Markov Chain Monte Carlo Methods
Random variable generation
Beyond Uniform distributions
Beyond Uniform generators
http://find/7/26/2019 Monte Carlo Notes From Main Book
81/584
Generation of any sequence of random variables can beformally implemented through a uniform generator
Distributions with explicit F (for instance, exponential, andWeibull distributions), use the probability integral
transform here Case specific methods rely on properties of the distribution (forinstance, normal distribution, Poisson distribution)
Markov Chain Monte Carlo Methods
Random variable generation
Beyond Uniform distributions
Beyond Uniform generators
http://find/7/26/2019 Monte Carlo Notes From Main Book
82/584
Generation of any sequence of random variables can beformally implemented through a uniform generator
Distributions with explicit F (for instance, exponential, andWeibull distributions), use the probability integral
transform here Case specific methods rely on properties of the distribution (forinstance, normal distribution, Poisson distribution)
More generic methods (for instance, accept-reject andratio-of-uniform)
Markov Chain Monte Carlo Methods
Random variable generation
Beyond Uniform distributions
Beyond Uniform generators
http://find/7/26/2019 Monte Carlo Notes From Main Book
83/584
- Generation of any sequence of random variables can be formally implemented through a uniform generator
- Distributions with explicit F (for instance, exponential and Weibull distributions) use the probability integral transform
- Case-specific methods rely on properties of the distribution (for instance, normal or Poisson distributions)
- More generic methods (for instance, accept-reject and ratio-of-uniforms)
- Simulation of the standard distributions is accomplished quite efficiently by many numerical and statistical programming packages.
Transformation methods
Case where a distribution F is linked in a simple way to another distribution that is easy to simulate.

Example (Exponential variables)
If U ~ U[0,1], the random variable
  X = −log(U)/λ
has distribution
  P(X ≤ x) = P(−log U ≤ λx) = P(U ≥ e^{−λx}) = 1 − e^{−λx},
the exponential distribution Exp(λ).
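As a minimal sketch of the probability integral transform in code (not from the slides; the generator, sample size and λ below are arbitrary illustrative choices):

```python
import math
import random

def exp_inverse_transform(lam, n, rng):
    # X = -log(U)/lam ~ Exp(lam) when U ~ U[0,1]
    # use 1 - rng.random() so the argument of log is in (0, 1]
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

rng = random.Random(42)
sample = exp_inverse_transform(2.0, 100_000, rng)
mean = sum(sample) / len(sample)  # should be close to 1/lam = 0.5
```

With a seeded generator the empirical mean lands close to 1/λ, as the law of large numbers suggests.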
Other random variables that can be generated starting from an exponential include
  Y = −2 Σ_{j=1}^{ν} log(U_j) ~ χ²_{2ν}
  Y = −(1/β) Σ_{j=1}^{a} log(U_j) ~ Ga(a, β)
  Y = Σ_{j=1}^{a} log(U_j) / Σ_{j=1}^{a+b} log(U_j) ~ Be(a, b)
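These transformations can be sketched directly (a minimal illustration, assuming the rate parameterization Ga(a, β) with mean a/β; parameters and sample sizes are arbitrary):

```python
import math
import random

def gamma_from_exponentials(a, beta, rng):
    # Ga(a, beta) as -(1/beta) * sum of a iid log(U_j); integer a only
    return -sum(math.log(1.0 - rng.random()) for _ in range(a)) / beta

def beta_from_exponentials(a, b, rng):
    # Be(a, b) as a ratio of sums of log-uniforms; integer a, b only
    logs = [math.log(1.0 - rng.random()) for _ in range(a + b)]
    return sum(logs[:a]) / sum(logs)

rng = random.Random(0)
g = [gamma_from_exponentials(3, 2.0, rng) for _ in range(50_000)]
b = [beta_from_exponentials(2, 3, rng) for _ in range(50_000)]
g_mean = sum(g) / len(g)   # E[Ga(3, 2)] = 3/2
b_mean = sum(b) / len(b)   # E[Be(2, 3)] = 2/5
```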
Points to note
- Transformation quite simple to use
- There are more efficient algorithms for gamma and beta random variables
- Cannot generate gamma random variables with a non-integer shape parameter
- For instance, cannot get a χ²₁ variable, which would give us a N(0, 1) variable.
Box-Muller Algorithm
Example (Normal variables)
If r, θ are the polar coordinates of (X₁, X₂), then
  r² = X₁² + X₂² ~ χ²₂ = Exp(1/2)
and
  θ ~ U[0, 2π]
Consequence: If U₁, U₂ iid U[0,1],
  X₁ = √(−2 log(U₁)) cos(2πU₂)
  X₂ = √(−2 log(U₁)) sin(2πU₂)
are iid N(0, 1).
Box-Muller Algorithm (2)
1. Generate U₁, U₂ iid U[0,1] ;
2. Define
  x₁ = √(−2 log(u₁)) cos(2πu₂) ,
  x₂ = √(−2 log(u₁)) sin(2πu₂) ;
3. Take x₁ and x₂ as two independent draws from N(0, 1).
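The three steps above can be sketched as follows (a minimal illustration; seed and sample size are arbitrary):

```python
import math
import random

def box_muller(rng):
    # One pair of independent N(0,1) draws from two uniforms
    u1 = 1.0 - rng.random()  # in (0, 1], keeps log(u1) finite
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2 * math.pi * u2), r * math.sin(2 * math.pi * u2)

rng = random.Random(1)
draws = [z for _ in range(50_000) for z in box_muller(rng)]
m = sum(draws) / len(draws)               # should be near 0
v = sum(z * z for z in draws) / len(draws)  # should be near 1
```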
Box-Muller Algorithm (3)
[Figure: scatterplot and histogram of a simulated normal sample]
- Unlike algorithms based on the CLT, this algorithm is exact
- Get two normals for the price of two uniforms
- Drawback (in speed): calculating log, cos and sin.
More transforms
Example (Poisson generation)
Poisson–exponential connection: If N ~ P(λ) and Xᵢ ~ Exp(λ), i ∈ ℕ*,
  P(N = k) = P(X₁ + ··· + X_k ≤ 1 < X₁ + ··· + X_{k+1}).
More Poisson
- A Poisson can be simulated by generating Exp(λ) variables until their sum exceeds 1.
- This method is simple, but is really practical only for smaller values of λ.
- On average, the number of exponential variables required is λ.
- Other approaches are more suitable for large λ's.
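The Poisson–exponential connection above can be sketched directly (a minimal illustration; λ, seed and sample size are arbitrary):

```python
import math
import random

def poisson_via_exponentials(lam, rng):
    # Count Exp(lam) arrivals until their running sum exceeds 1
    n, total = 0, 0.0
    while True:
        total += -math.log(1.0 - rng.random()) / lam
        if total > 1.0:
            return n
        n += 1

rng = random.Random(7)
sample = [poisson_via_exponentials(4.0, rng) for _ in range(50_000)]
mean = sum(sample) / len(sample)  # E[N] = lam = 4
```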
Atkinson's Poisson
To generate N ~ P(λ):
1. Define β = π/√(3λ), α = βλ and k = log c − λ − log β ;
2. Generate U₁ ~ U[0,1] and calculate
  x = {α − log{(1 − u₁)/u₁}}/β
until x > −0.5 ;
3. Define N = ⌊x + 0.5⌋ and generate U₂ ~ U[0,1] ;
4. Accept N if
  α − βx + log (u₂/{1 + exp(α − βx)}²) ≤ k + N log λ − log N!.
Negative extension
A generator of Poisson random variables can produce negative binomial random variables since
  Y ~ Ga(n, (1 − p)/p) , X | y ~ P(y)
implies X ~ Neg(n, p)
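A sketch of this gamma-Poisson mixture, assuming the rate parameterization Ga(n, β) used above (so `gammavariate`, which takes shape and scale, is called with scale p/(1−p); parameters, seed and sample size are arbitrary):

```python
import math
import random

def neg_binomial(n, p, rng):
    # Gamma-Poisson mixture: Y ~ Ga(n, rate (1 - p)/p), then X | Y = y ~ P(y)
    y = rng.gammavariate(n, p / (1 - p))  # gammavariate takes shape and SCALE
    # Poisson(y): count standard-exponential arrivals until their sum exceeds y
    count, total = 0, 0.0
    while True:
        total += -math.log(1.0 - rng.random())
        if total > y:
            return count
        count += 1

rng = random.Random(3)
sample = [neg_binomial(4, 0.5, rng) for _ in range(20_000)]
mean = sum(sample) / len(sample)  # E[X] = E[Y] = n p/(1 - p) = 4 here
```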
Mixture representation
- The representation of the negative binomial is a particular case of a mixture distribution
- The principle of a mixture representation is to represent a density f as the marginal of another distribution, for example
  f(x) = Σ_{i ∈ Y} pᵢ fᵢ(x)
- If the component distributions fᵢ(x) can be easily generated, X can be obtained by first choosing fᵢ with probability pᵢ and then generating an observation from fᵢ.
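The two-stage recipe can be sketched as follows (a minimal illustration with a two-component normal mixture; weights, components and seed are arbitrary choices):

```python
import random

def sample_mixture(weights, samplers, rng):
    # Choose component i with probability p_i, then draw from f_i
    u, acc = rng.random(), 0.0
    for p, draw in zip(weights, samplers):
        acc += p
        if u <= acc:
            return draw(rng)
    return samplers[-1](rng)  # guard against rounding of the weights

rng = random.Random(5)
weights = [0.3, 0.7]
samplers = [lambda r: r.gauss(-2.0, 1.0), lambda r: r.gauss(3.0, 1.0)]
sample = [sample_mixture(weights, samplers, rng) for _ in range(50_000)]
mean = sum(sample) / len(sample)  # 0.3*(-2) + 0.7*3 = 1.5
```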
Partitioned sampling
Special case of mixture sampling when
  fᵢ(x) = f(x) I_{Aᵢ}(x) / ∫_{Aᵢ} f(x) dx
and
  pᵢ = Pr(X ∈ Aᵢ)
for a partition (Aᵢ)ᵢ
Accept-Reject algorithm
- There are many distributions from which it is difficult, or even impossible, to directly simulate.
- Another class of methods only requires the functional form of the density f of interest up to a multiplicative constant.
- The key to this method is to use a simpler (simulation-wise) density g, the instrumental density, from which the simulation from the target density f is actually done.
Fundamental theorem of simulation
Lemma
Simulating
  X ~ f(x)
is equivalent to simulating
  (X, U) ~ U{(x, u) : 0 < u < f(x)}
[Figure: uniform sampling of (x, u) under the graph of f(x)]
The Accept-Reject algorithm
Given a density of interest f, find a density g and a constant M such that
  f(x) ≤ M g(x)
on the support of f.
1. Generate X ~ g, U ~ U[0,1] ;
2. Accept Y = X if U ≤ f(X)/(M g(X)) ;
3. Return to 1. otherwise.
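A minimal sketch of these steps (not from the slides): the target Beta(2, 2), uniform proposal, and M = 1.5 are arbitrary illustrative choices, with M the maximum of f/g on [0, 1].

```python
import random

def accept_reject(f, g_sample, g_pdf, M, rng):
    # Generate X ~ g, U ~ U[0,1]; accept if U <= f(X)/(M g(X))
    while True:
        x = g_sample(rng)
        if rng.random() <= f(x) / (M * g_pdf(x)):
            return x

f = lambda x: 6.0 * x * (1.0 - x)   # Beta(2, 2) density on [0, 1]
g_sample = lambda r: r.random()     # uniform proposal
g_pdf = lambda x: 1.0
M = 1.5                             # max of f/g, attained at x = 1/2

rng = random.Random(11)
sample = [accept_reject(f, g_sample, g_pdf, M, rng) for _ in range(50_000)]
mean = sum(sample) / len(sample)    # Beta(2, 2) has mean 1/2
```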
Validation of the Accept-Reject method
Guarantee:
This algorithm produces a variable Y distributed according to f
Two interesting properties
First, it provides a generic method to simulate from any density f that is known up to a multiplicative factor.
Property particularly important in Bayesian calculations, where the posterior distribution
  π(θ|x) ∝ π(θ) f(x|θ)
is specified up to a normalizing constant
Second, the probability of acceptance in the algorithm is 1/M, i.e., the expected number of trials until a variable is accepted is M
More interesting properties
When f and g are both probability densities, the constant M is necessarily larger than 1.
The size of M, and thus the efficiency of the algorithm, is a function of how closely g can imitate f, especially in the tails
For f/g to remain bounded, it is necessary for g to have tails thicker than those of f.
It is therefore impossible to use the A-R algorithm to simulate a Cauchy distribution f using a normal distribution g; however, the reverse works quite well.
No Cauchy!
Example (Normal from a Cauchy)
Take
  f(x) = (1/√(2π)) exp(−x²/2)
and
  g(x) = (1/π) · 1/(1 + x²),
densities of the normal and Cauchy distributions.
Then
  f(x)/g(x) = √(π/2) (1 + x²) e^{−x²/2} ≤ √(2π/e) ≈ 1.52,
attained at x = ±1.
Example (Normal from a Cauchy (2))
So probability of acceptance
1/1.52 = 0.66,
and, on the average, one out of every three simulated Cauchyvariables is rejected.
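A sketch of this normal-from-Cauchy sampler (seed and sample size are arbitrary), counting proposals to check the ≈ 0.66 acceptance rate:

```python
import math
import random

M = math.sqrt(2 * math.pi / math.e)  # bound on f/g, about 1.52

def normal_from_cauchy(rng):
    # Cauchy C(0,1) proposal via inverse CDF; returns (draw, number of tries)
    tries = 0
    while True:
        tries += 1
        x = math.tan(math.pi * (rng.random() - 0.5))
        f = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # normal density
        g = 1.0 / (math.pi * (1 + x * x))                  # Cauchy density
        if rng.random() <= f / (M * g):
            return x, tries

rng = random.Random(2)
draws, total_tries = [], 0
for _ in range(20_000):
    x, t = normal_from_cauchy(rng)
    draws.append(x)
    total_tries += t
accept_rate = len(draws) / total_tries          # near 1/M
mean = sum(draws) / len(draws)
var = sum(x * x for x in draws) / len(draws)    # near 1 for N(0,1)
```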
No Double!
Example (Normal/Double Exponential)
Generate a N(0, 1) by using a double-exponential distribution with density
  g(x|α) = (α/2) exp(−α|x|)
Then
  f(x)/g(x|α) ≤ √(2/π) α⁻¹ e^{α²/2}
and the minimum of this bound (in α) is attained for α = 1
Example (Normal/Double Exponential (2))
Probability of acceptance
  √(π/(2e)) = 0.76
To produce one normal random variable requires on average 1/0.76 ≈ 1.3 uniform variables.
Example (Gamma generation)
Illustrates a real advantage of the Accept-Reject algorithm.
The gamma distribution Ga(α, β) can be represented as a sum of exponential random variables only if α is an integer
Example (Gamma generation (2))
Can use the Accept-Reject algorithm with instrumental distribution
  Ga(a, b), with a = ⌊α⌋, α > 0.
(Without loss of generality, β = 1.)
Up to a normalizing constant,
  f/g_b = b⁻ᵃ x^{α−a} exp{−(1 − b)x} ≤ b⁻ᵃ [(α − a)/((1 − b)e)]^{α−a}
for b ≤ 1. The maximum is attained at b = a/α.
Cheng and Feast's Gamma generator
Gamma Ga(α, 1), α > 1 distribution
1. Define c₁ = α − 1, c₂ = (α − (1/6α))/c₁, c₃ = 2/c₁, c₄ = 1 + c₃, and c₅ = 1/√α.
2. Repeat
  generate U₁, U₂
  take U₁ = U₂ + c₅(1 − 1.86 U₁) if α > 2.5
until 0 < U₁ < 1.
Example (Truncated Normal distributions)
Constraint x ≥ μ̲ produces a density proportional to
  e^{−(x−μ)²/(2σ²)} I_{x ≥ μ̲}
for a bound μ̲ large compared with μ
Truncated Normal simulation
There exist alternatives far superior to the naive method of generating a N(μ, σ²) until exceeding μ̲, which requires an average number of
  1/Φ((μ − μ̲)/σ)
simulations from N(μ, σ²) for a single acceptance.
Example (Truncated Normal distributions (2))
Instrumental distribution: translated exponential distribution, E(α, μ̲), with density
  g_α(z) = α e^{−α(z − μ̲)} I_{z ≥ μ̲}.
The ratio f/g_α is bounded by
  f/g_α ≤ (1/α) exp(α²/2 − αμ̲)  if α > μ̲,
  f/g_α ≤ (1/α) exp(−μ̲²/2)   otherwise.
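A sketch of this sampler for a standard normal truncated to [μ̲, ∞), assuming the common choice α = (μ̲ + √(μ̲² + 4))/2 > μ̲, for which the acceptance probability simplifies to exp(−(z − α)²/2); the bound μ̲ = 2 and seed are arbitrary:

```python
import math
import random

def truncated_normal(mu_lower, rng):
    # Target: standard normal restricted to [mu_lower, inf)
    # Proposal: translated exponential E(alpha, mu_lower)
    alpha = (mu_lower + math.sqrt(mu_lower ** 2 + 4)) / 2
    while True:
        z = mu_lower - math.log(1.0 - rng.random()) / alpha
        # acceptance probability exp(-(z - alpha)^2 / 2) since alpha > mu_lower
        if rng.random() <= math.exp(-0.5 * (z - alpha) ** 2):
            return z

rng = random.Random(9)
sample = [truncated_normal(2.0, rng) for _ in range(20_000)]
mean_t = sum(sample) / len(sample)  # phi(2)/(1 - Phi(2)) ~ 2.373
```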
Log-concave densities (1)
Densities f whose logarithm is concave, for instance Bayesian posterior distributions such that
  log π(θ|x) = log π(θ) + log f(x|θ) + c
is concave
Log-concave densities (2)
Take
  S_n = {x_i, i = 0, 1, . . . , n + 1} ⊂ supp(f)
such that h(x_i) = log f(x_i) is known up to the same constant.
By concavity of h, the line L_{i,i+1} through (x_i, h(x_i)) and (x_{i+1}, h(x_{i+1}))
[Figure: log f(x) with the chord L_{2,3}(x) through the points x₁, …, x₄]
lies below h in [x_i, x_{i+1}] and above this graph outside this interval
Log-concave densities (3)
For x ∈ [x_i, x_{i+1}], if
  h̲_n(x) = min{L_{i−1,i}(x), L_{i+1,i+2}(x)}  and  h̄_n(x) = L_{i,i+1}(x) ,
the envelopes are
  h̲_n(x) ≤ h(x) ≤ h̄_n(x)
uniformly on the support of f, with
  h̲_n(x) = −∞  and  h̄_n(x) = min(L_{0,1}(x), L_{n,n+1}(x))
on [x₀, x_{n+1}]ᶜ.
Log-concave densities (4)
Therefore, if
  f̲_n(x) = exp h̲_n(x)  and  f̄_n(x) = exp h̄_n(x)
then
  f̲_n(x) ≤ f(x) ≤ f̄_n(x) = ϖ_n g_n(x) ,
where ϖ_n is the normalizing constant of f̄_n
ARS Algorithm
1. Initialize n and S_n.
2. Generate X ~ g_n(x), U ~ U[0,1].
3. If U ≤ f̲_n(X)/(ϖ_n g_n(X)), accept X;
   otherwise, if U ≤ f(X)/(ϖ_n g_n(X)), accept X
Example (Northern Pintail ducks)
Ducks captured at time i with probability pᵢ, with both the pᵢ and the size N of the population unknown.
Dataset
  (n₁, . . . , n₁₁) = (32, 20, 8, 5, 1, 2, 0, 2, 1, 1, 0)
Number of recoveries over the years 1957–1968 of N = 1612 Northern Pintail ducks banded in 1956
Example (Northern Pintail ducks (2))
Corresponding conditional likelihood
  L(p₁, . . . , p_I | N, n₁, . . . , n_I) = [N!/(N − r)!] ∏_{i=1}^{I} pᵢ^{nᵢ} (1 − pᵢ)^{N−nᵢ} ,
where I is the number of captures, nᵢ the number of captured animals during the i-th capture, and r the total number of different captured animals.
Example (Northern Pintail ducks (3))
Prior selection
If
  N ~ P(λ)
and
  αᵢ = log( pᵢ / (1 − pᵢ) ) ~ N(μᵢ, σ²),
[Normal logistic]
Example (Northern Pintail ducks (4))
Posterior distribution
  π(α, N | λ, n₁, . . . , n_I) ∝ [N!/(N − r)!] (λᴺ/N!) ∏_{i=1}^{I} (1 + e^{αᵢ})^{−N} ∏_{i=1}^{I} exp{ αᵢ nᵢ − (αᵢ − μᵢ)²/(2σ²) }
Example (Northern Pintail ducks (5))
For the conditional posterior distribution
  π(αᵢ | N, n₁, . . . , n_I) ∝ exp{ αᵢ nᵢ − (αᵢ − μᵢ)²/(2σ²) } (1 + e^{αᵢ})^{−N} ,
the ARS algorithm can be implemented since
  αᵢ nᵢ − (αᵢ − μᵢ)²/(2σ²) − N log(1 + e^{αᵢ})
is concave in αᵢ.
Posterior distributions of capture log-odds ratios for the years 1957–1965.
[Figure: posterior densities of the capture log-odds ratios, one panel per year, 1957–1965]
[Figure: 1960 — true distribution versus histogram of simulated sample]
Monte Carlo integration

Motivation and leading example
Random variable generation
Monte Carlo Integration
  Introduction
  Monte Carlo integration
  Importance Sampling
  Acceleration methods
  Bayesian importance sampling
Notions on Markov Chains
The Metropolis-Hastings Algorithm
Quick reminder
Two major classes of numerical problems arise in statistical inference:
Optimization - generally associated with the likelihood approach
Integration - generally associated with the Bayesian approach
Example (Bayesian decision theory)
Bayes estimators are not always posterior expectations, but rather solutions of the minimization problem
  min_δ ∫_Θ L(δ, θ) π(θ) f(x|θ) dθ .
Proper loss: for L(δ, θ) = (δ − θ)², the Bayes estimator is the posterior mean.
Absolute error loss: for L(δ, θ) = |δ − θ|, the Bayes estimator is the posterior median.
With no loss function, use the maximum a posteriori (MAP) estimator
  arg max_θ ℓ(θ|x) π(θ)
Monte Carlo integration
Theme: Generic problem of evaluating the integral
  I = E_f[h(X)] = ∫_X h(x) f(x) dx
where X is uni- or multidimensional, f is a closed form, partly closed form, or implicit density, and h is a function
Monte Carlo integration (2)
Monte Carlo solution
First use a sample (X₁, . . . , X_m) from the density f to approximate the integral I by the empirical average
  h̄_m = (1/m) Σ_{j=1}^{m} h(x_j)
which converges
  h̄_m → E_f[h(X)]
by the Strong Law of Large Numbers
Monte Carlo precision
Estimate the variance with
v_m = (1/m) · (1/(m − 1)) Σ_{j=1}^{m} [h(x_j) − h̄_m]² ,
and for m large,
  (h̄_m − E_f[h(X)]) / √v_m ~ N(0, 1).
Note: This can lead to the construction of a convergence test and of confidence bounds on the approximation of E_f[h(X)].
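The estimator and its variance estimate can be sketched as follows (a minimal illustration, not from the slides; the choice h(x) = x² with f = N(0, 1), so that E_f[h] = 1, is arbitrary):

```python
import math
import random

rng = random.Random(4)
m = 100_000
h_vals = [rng.gauss(0.0, 1.0) ** 2 for _ in range(m)]  # h(x) = x^2 under f = N(0,1)
h_bar = sum(h_vals) / m                                # estimates E_f[h(X)] = 1
# v_m = (1/m) * (1/(m-1)) * sum of squared deviations
v_m = sum((h - h_bar) ** 2 for h in h_vals) / (m * (m - 1))
half_width = 1.96 * math.sqrt(v_m)                     # approximate 95% bound
```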
Example (Cauchy prior/normal sample)
For estimating a normal mean, a robust prior is a Cauchy prior
  X ~ N(θ, 1), θ ~ C(0, 1).
Under squared error loss, posterior mean
  δ^π(x) = [ ∫ θ/(1 + θ²) e^{−(x−θ)²/2} dθ ] / [ ∫ 1/(1 + θ²) e^{−(x−θ)²/2} dθ ]
Example (Cauchy prior/normal sample (2))
Form of δ^π suggests simulating iid variables
  θ₁, . . . , θ_m ~ N(x, 1)
and calculating
  δ̂^π_m(x) = [ Σ_{i=1}^{m} θᵢ/(1 + θᵢ²) ] / [ Σ_{i=1}^{m} 1/(1 + θᵢ²) ] .
The Law of Large Numbers implies
  δ̂^π_m(x) → δ^π(x) as m → ∞.
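This estimator is a few lines of code (a minimal sketch; x = 10, the seed and m are arbitrary illustrative choices):

```python
import random

def posterior_mean(x, m, rng):
    # theta_i ~ N(x, 1); ratio of sums of theta/(1+theta^2) and 1/(1+theta^2)
    num = den = 0.0
    for _ in range(m):
        t = rng.gauss(x, 1.0)
        w = 1.0 / (1.0 + t * t)
        num += t * w
        den += w
    return num / den

rng = random.Random(6)
est = posterior_mean(10.0, 200_000, rng)  # shrinks x = 10 slightly toward 0
```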
[Figure: Range of estimators δ̂^π_m for 100 runs and x = 10]
Importance sampling
Paradox
Simulation from f (the true density) is not necessarily optimal
Alternative to direct sampling from f is importance sampling, based on the alternative representation
  E_f[h(X)] = ∫_X h(x) [f(x)/g(x)] g(x) dx ,
which allows us to use distributions other than f
Importance sampling algorithm
Evaluation of
E_f[h(X)] = ∫_X h(x) f(x) dx
by
1. Generate a sample X₁, . . . , X_n from a distribution g
2. Use the approximation
  (1/m) Σ_{j=1}^{m} [f(X_j)/g(X_j)] h(X_j)
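The two steps can be sketched directly (a minimal illustration; the choice f = N(0, 1), h(x) = x² and a Cauchy C(0, 1) instrumental, whose tails dominate those of f, is arbitrary):

```python
import math
import random

def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def cauchy_pdf(x):
    return 1.0 / (math.pi * (1 + x * x))

rng = random.Random(8)
m = 200_000
total = 0.0
for _ in range(m):
    x = math.tan(math.pi * (rng.random() - 0.5))      # X_j ~ C(0, 1) by inversion
    total += (x * x) * normal_pdf(x) / cauchy_pdf(x)  # h(X) f(X)/g(X)
estimate = total / m  # E_f[X^2] = 1 for f = N(0, 1)
```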
Same thing as before!!!
Convergence of the estimator
  (1/m) Σ_{j=1}^{m} [f(X_j)/g(X_j)] h(X_j) → ∫_X h(x) f(x) dx
This convergence holds for any choice of the distribution g [as long as supp(g) ⊃ supp(f)]
Important details
- Instrumental distribution g chosen from distributions easy to simulate
- The same sample (generated from g) can be used repeatedly, not only for different functions h, but also for different densities f
- Even dependent proposals can be used, as seen later [PMC chapter]
Although g can be any density, some choices are better than others:
Finite variance only when
  E_f[ h²(X) f(X)/g(X) ] = ∫_X h²(x) [f²(x)/g(x)] dx < ∞
Ratio of the densities
  ρ(x) = p(x)/p₀(x) = √(2/π) e^{x²/2}/(1 + x²)
very badly behaved: e.g.,
  ∫ ρ(x)² p₀(x) dx = ∞ .
Poor performances of the associated importance sampling estimator
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Importance Sampling
200
http://goforward/http://find/http://goback/7/26/2019 Monte Carlo Notes From Main Book
156/584
0 2000 4000 6000 8000 10000
0
50
100
150
iterations
Range and average of500 replications of IS estimate ofE[exp X] over 10, 000 iterations.
Optimal importance function
The choice of g that minimizes the variance of the importance sampling estimator is
  g*(x) = |h(x)| f(x) / ∫_Z |h(z)| f(z) dz .
Practical impact
  [ Σ_{j=1}^{m} h(X_j) f(X_j)/g(X_j) ] / [ Σ_{j=1}^{m} f(X_j)/g(X_j) ] ,
where f and g are known up to constants.
- Also converges to I by the Strong Law of Large Numbers.
- Biased, but the bias is quite small
- In some settings, beats the unbiased estimator in squared error loss.
- Using the optimal solution does not always work:
  [ Σ_{j=1}^{m} h(x_j) f(x_j)/(|h(x_j)| f(x_j)) ] / [ Σ_{j=1}^{m} f(x_j)/(|h(x_j)| f(x_j)) ] = (#positive h − #negative h) / [ Σ_{j=1}^{m} 1/|h(x_j)| ]
Self-normalised importance sampling
For the ratio estimator
  δⁿ_h = [ Σ_{i=1}^{n} ωᵢ h(Xᵢ) ] / [ Σ_{i=1}^{n} ωᵢ ]
with Xᵢ ~ g(y) and ωᵢ such that
  E[ωᵢ | Xᵢ = x] = κ f(x)/g(x)
Self-normalised variance
then
  var(δⁿ_h) ≈ (1/(n²κ²)) [ var(Sⁿ_h) − 2 E^π[h] cov(Sⁿ_h, Sⁿ₁) + E^π[h]² var(Sⁿ₁) ] ,
for
  Sⁿ_h = Σ_{i=1}^{n} Wᵢ h(Xᵢ) ,  Sⁿ₁ = Σ_{i=1}^{n} Wᵢ
Rough approximation:
  var(δⁿ_h) ≈ (1/n) var_π(h(X)) { 1 + var_g(W) }
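The ratio estimator only needs f up to a constant, which can be sketched as follows (a minimal illustration; the unnormalized target exp(−x²/2), Cauchy proposal, seed and n are arbitrary choices):

```python
import math
import random

def self_normalized_is(h, log_f_unnorm, g_sample, g_pdf, n, rng):
    # Ratio estimator: sum w_i h(x_i) / sum w_i with w_i = f(x_i)/g(x_i),
    # f known only up to a multiplicative constant
    num = den = 0.0
    for _ in range(n):
        x = g_sample(rng)
        w = math.exp(log_f_unnorm(x)) / g_pdf(x)
        num += w * h(x)
        den += w
    return num / den

# Target: N(0,1) up to a constant; proposal: Cauchy C(0,1)
log_f = lambda x: -x * x / 2
g_sample = lambda r: math.tan(math.pi * (r.random() - 0.5))
g_pdf = lambda x: 1.0 / (math.pi * (1 + x * x))

rng = random.Random(12)
est = self_normalized_is(lambda x: x, log_f, g_sample, g_pdf, 100_000, rng)
# estimates E[X] = 0 without ever computing the normalizing constant
```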
Example (Student's t distribution)
X ~ T(ν, θ, σ²), with density
  f(x) = [Γ((ν + 1)/2) / (σ √(νπ) Γ(ν/2))] (1 + (x − θ)²/(νσ²))^{−(ν+1)/2} .
Without loss of generality, take θ = 0, σ = 1.
Problem: Calculate the integral
  ∫_{2.1}^{∞} (sin(x)/x)ⁿ f(x) dx.
Example (Student's t distribution (2))
Simulation possibilities:
- Directly from f, since f = N(0, 1)/√(χ²_ν/ν)
- Importance sampling using Cauchy C(0, 1)
- Importance sampling using a normal N(0, 1) (expected to be nonoptimal)
- Importance sampling using a U([0, 1/2.1]) change of variables
[Figure: Sampling from f (solid lines), importance sampling with Cauchy instrumental (short dashes), U([0, 1/2.1]) instrumental (long dashes) and normal instrumental (dots)]
IS suffers from curse of dimensionality
As dimension increases, discrepancy between importance and target worsens
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Importance Sampling
Since E[Vn+1] = 1 and Vn+1 independent fromFn,
E(Un+1| F
n) =UnE(Vn+1| F
n) =Un,
and thus{Un}n0 martingaleSi
b J i lit
7/26/2019 Monte Carlo Notes From Main Book
171/584
Sincex x concave, by Jensen s inequality,
E(Un+1 | Fn) E(Un+1 | Fn) Unand thus{Un}n0 supermartingaleAssume E(
Vn+1)< 1. Then
E(Un) =n
k=1
E(Vk) 0, n .
But {√U_n}_{n≥0} is a nonnegative supermartingale and thus √U_n converges a.s. to a random variable Z ≥ 0. By Fatou's lemma,

E(Z) = E( lim_{n→∞} √U_n ) ≤ lim inf_{n→∞} E(√U_n) = 0 .

Hence, Z = 0 and U_n → 0 a.s., which implies that the martingale {U_n}_{n≥0} is not regular.
Apply these results to V_k = (dμ/dν)(ξ_k^i), i ∈ {1, . . . , N}:

E[ √( (dμ/dν)(ξ_k^i) ) ] ≤ √( E[ (dμ/dν)(ξ_k^i) ] ) = 1 ,

with equality iff dμ/dν = 1, ν-a.e., i.e. μ = ν.
Thus all importance weights converge to 0.
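A small simulation illustrates this degeneracy. With the hypothetical choice f = N(0, 1) and g = N(0, 2²), each likelihood ratio V_k = f(X_k)/g(X_k), X_k ∼ g, has mean one under g, yet the running products U_n = V_1 ⋯ V_n collapse to zero:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_ratio(x):
    # log f(x) - log g(x) for f = N(0, 1), g = N(0, 2^2); constants cancel
    return -0.5 * x**2 + 0.125 * x**2 + np.log(2.0)

n, reps = 200, 1000
x = 2.0 * rng.standard_normal((reps, n))   # draws from g
log_U = log_ratio(x).sum(axis=1)           # log U_n = sum_k log V_k

# E_g[V_k] = 1, yet E_g[log V_k] < 0: almost every product U_n is numerically 0
print(np.median(log_U))
```

The median of log U_n is strongly negative, so virtually every weight product has underflowed to zero even though each factor averages to one.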
too volatile!

Example (Stochastic volatility model)

y_t = β exp(x_t/2) ε_t ,  ε_t ∼ N(0, 1)

with AR(1) log-variance process (or volatility)

x_{t+1} = φ x_t + σ u_t ,  u_t ∼ N(0, 1)
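The model is straightforward to simulate forward, which is useful both for generating test data and for checking the importance sampler later on. The parameter values β = 1, φ = 0.9, σ = 0.5 below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

beta, phi, sigma = 1.0, 0.9, 0.5     # illustrative parameter values (assumed)
T = 500

x = np.empty(T)
x[0] = rng.normal(0.0, sigma / np.sqrt(1 - phi**2))   # stationary initial state
for t in range(1, T):
    x[t] = phi * x[t - 1] + sigma * rng.standard_normal()
y = beta * np.exp(x / 2) * rng.standard_normal(T)     # observations
```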
[Figure: Evolution of IBM stocks (corrected from trend and log-ratio-ed).]
Example (Stochastic volatility model (2))

Observed likelihood unavailable in closed form.
Joint posterior (or conditional) distribution of the hidden state sequence {X_k}_{1≤k≤K} can be evaluated explicitly as

∏_{k=2}^{K} exp( −{ σ⁻²(x_k − φ x_{k−1})² + β⁻² exp(−x_k) y_k² + x_k }/2 ) ,  (2)

up to a normalizing constant.
Computational problems

Example (Stochastic volatility model (3))

Direct simulation from this distribution impossible because of
(a) dependence among the X_k's,
(b) dimension of the sequence {X_k}_{1≤k≤K}, and
(c) exponential term exp(−x_k) y_k² within (2).
Importance sampling

Example (Stochastic volatility model (4))

Natural candidate: replace the exponential term with a quadratic approximation to preserve Gaussianity.
E.g., expand exp(−x_k) around its conditional expectation φ x_{k−1} as

exp(−x_k) ≈ exp(−φ x_{k−1}) { 1 − (x_k − φ x_{k−1}) + ½ (x_k − φ x_{k−1})² }
Example (Stochastic volatility model (5))

Corresponding Gaussian importance distribution with mean

μ_k = [ φ x_{k−1} {σ⁻² + y_k² exp(−φ x_{k−1})/2} − {1 − y_k² exp(−φ x_{k−1})}/2 ] / [ σ⁻² + y_k² exp(−φ x_{k−1})/2 ]

and variance

σ_k² = ( σ⁻² + y_k² exp(−φ x_{k−1})/2 )⁻¹

Prior proposal on X₁,

X₁ ∼ N(0, σ²)
Example (Stochastic volatility model (6))

Simulation starts with X₁ and proceeds forward to X_n, each X_k being generated conditional on Y_k and the previously generated X_{k−1}.
Importance weight computed sequentially as the product of

exp( −{ σ⁻²(x_k − φ x_{k−1})² + exp(−x_k) y_k² + x_k }/2 ) / ( σ_k⁻¹ exp{ −σ_k⁻²(x_k − μ_k)²/2 } )  (1 ≤ k ≤ K)
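This sequential sampler can be sketched as follows, on data simulated from the model itself, with β = 1 and the assumed values φ = 0.9, σ = 0.5 (not fixed by the slides). The local proposal mean and variance follow the quadratic-approximation formulas of the previous slide, and weights are accumulated on the log scale for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(3)
phi, sigma = 0.9, 0.5          # assumed values; beta is taken equal to 1
K, N = 100, 10_000             # path length, number of importance samples

# Simulate observations y_1..y_K from the model itself
x_true = np.zeros(K)
for k in range(1, K):
    x_true[k] = phi * x_true[k - 1] + sigma * rng.standard_normal()
y = np.exp(x_true / 2) * rng.standard_normal(K)

# Forward sequential importance sampling
x = rng.normal(0.0, sigma, N)          # prior proposal X_1 ~ N(0, sigma^2)
logw = np.zeros(N)
paths = [x]
for k in range(1, K):
    # local Gaussian proposal from the quadratic expansion around phi * x_{k-1}
    prec = sigma**-2 + y[k] ** 2 * np.exp(-phi * x) / 2
    var = 1.0 / prec
    mu = phi * x + var * (y[k] ** 2 * np.exp(-phi * x) - 1) / 2
    x_new = rng.normal(mu, np.sqrt(var))
    # log of the target increment minus log of the proposal density
    log_target = -0.5 * ((x_new - phi * x) ** 2 / sigma**2
                         + y[k] ** 2 * np.exp(-x_new) + x_new)
    log_prop = -0.5 * np.log(2 * np.pi * var) - (x_new - mu) ** 2 / (2 * var)
    logw += log_target - log_prop
    x = x_new
    paths.append(x)

w = np.exp(logw - logw.max())              # stabilised importance weights
w /= w.sum()
x_hat = np.array([w @ p for p in paths])   # weighted estimate of each X_k
```

As the slides warn, the normalised weights degenerate as K grows, so only a handful of particles end up carrying the estimate for long series.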
[Figure: Histogram of the logarithms of the importance weights (left) and comparison between the true volatility and the best fit, based on 10,000 simulated importance samples.]
[Figure: Corresponding range of the simulated {X_k}_{1≤k≤100}, compared with the true value.]
Markov Chain Monte Carlo Methods
Monte Carlo Integration
Acceleration methods

Correlated simulations

Negative correlation reduces variance.
Special technique, but efficient when it applies.
Two samples (X₁, . . . , X_m) and (Y₁, . . . , Y_m) from f to estimate

I = ∫_ℝ h(x) f(x) dx

by

Î₁ = (1/m) ∑_{i=1}^{m} h(X_i)  and  Î₂ = (1/m) ∑_{i=1}^{m} h(Y_i)

with mean I and variance σ².
Variance reduction

Variance of the average:

var( (Î₁ + Î₂)/2 ) = σ²/2 + (1/2) cov(Î₁, Î₂) .

If the two samples are negatively correlated,

cov(Î₁, Î₂) ≤ 0 ,

they improve on two independent samples of the same size.
Antithetic variables

- If f symmetric about μ, take Y_i = 2μ − X_i
- If X_i = F⁻¹(U_i), take Y_i = F⁻¹(1 − U_i)
- If (A_i)_i partition of X, partitioned sampling by sampling the X_j's in each A_i (requires the knowledge of Pr(A_i))
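The symmetric case can be sketched as follows, with the illustrative choices X ∼ N(0, 1) and h(x) = eˣ (so μ = 0 and Y_i = −X_i); here cov(h(X), h(−X)) = 1 − e < 0, so the pairing reduces variance relative to independent draws:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 50_000
h = np.exp                     # illustrative h; E[exp(X)] = e^{1/2} for X ~ N(0,1)

x = rng.standard_normal(m)
pairs = (h(x) + h(-x)) / 2     # antithetic pairs: f symmetric about mu = 0, Y = -X
anti = pairs.mean()

x2 = rng.standard_normal(m)    # independent second sample, for comparison
indep = (h(x).mean() + h(x2).mean()) / 2

var_anti = pairs.var(ddof=1) / m                     # variance of antithetic average
var_indep = h(np.concatenate([x, x2])).var(ddof=1) / (2 * m)   # i.i.d. counterpart
print(anti, indep, var_anti, var_indep)
```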
Control variates

out of control!

For

I = ∫ h(x) f(x) dx

unknown and

I₀ = ∫ h₀(x) f(x) dx

known, I₀ estimated by Î₀ and I estimated by Î.
Control variates (2)

Combined estimator

Î* = Î + β (Î₀ − I₀)

Î* is unbiased for I and

var(Î*) = var(Î) + β² var(Î₀) + 2β cov(Î, Î₀)
Optimal control

Optimal choice of β:

β* = − cov(Î, Î₀) / var(Î₀) ,

with

var(Î*) = (1 − ρ²) var(Î) ,

where ρ is the correlation between Î and Î₀.
Usual solution: regression coefficient of the h(x_i)'s over the h₀(x_i)'s.
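A minimal sketch of the optimal control variate, assuming for illustration the target I = E[eˣ] with X ∼ N(0, 1) and the control h₀(x) = x, whose mean I₀ = 0 is known; β* is estimated as minus the empirical regression coefficient:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 100_000

x = rng.standard_normal(m)
hx, h0x = np.exp(x), x          # target integrand and control variate
I0 = 0.0                        # known mean of h0 under N(0,1)

I_hat = hx.mean()
I0_hat = h0x.mean()
beta = -np.cov(hx, h0x)[0, 1] / h0x.var(ddof=1)   # beta* = -cov/var, via regression
I_star = I_hat + beta * (I0_hat - I0)              # combined estimator

print(I_hat, I_star)            # both estimate e^{1/2}; I_star has smaller variance
```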
Example (Quantile approximation)

Evaluate

ϱ = Pr(X > a) = ∫_a^∞ f(x) dx

by

ϱ̂ = (1/n) ∑_{i=1}^{n} I(X_i > a) ,

with X_i iid f.
If Pr(X > μ) = ½ known
Example (Quantile approximation (2))

Control variate

ϱ̃ = (1/n) ∑_{i=1}^{n} I(X_i > a) + β [ (1/n) ∑_{i=1}^{n} I(X_i > μ) − Pr(X > μ) ]

improves upon ϱ̂ if β < 0 and |β| < 2 cov(ϱ̂, ϱ̂₀)/var(ϱ̂₀), with ϱ̂₀ = (1/n) ∑_{i=1}^{n} I(X_i > μ).
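The indicator control variate can be checked numerically. The choices X ∼ N(0, 1), a = 2 and μ = 0 (so that Pr(X > μ) = 1/2 is known exactly) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
a, mu = 2.0, 0.0               # X ~ N(0,1): Pr(X > mu) = 1/2 is known exactly

x = rng.standard_normal(n)
d = (x > a).astype(float)      # I(X_i > a)
d0 = (x > mu).astype(float)    # I(X_i > mu), the control variate

rho_hat = d.mean()
beta = -np.cov(d, d0)[0, 1] / d0.var(ddof=1)   # estimated optimal beta (negative)
rho_tilde = rho_hat + beta * (d0.mean() - 0.5)

print(rho_hat, rho_tilde)      # both estimate Pr(X > 2), about 0.0228
```

The gain is modest here because the two indicators are weakly correlated when a is far out in the tail.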
Integration by conditioning

Use the Rao-Blackwell Theorem:

var( E[δ(X) | Y] ) ≤ var( δ(X) )
Consequence

If Î is an unbiased estimator of I = E_f[h(X)], with X simulated from a joint density f(x, y), where

∫ f(x, y) dy = f(x) ,

the estimator

Î* = E_f[Î | Y₁, . . . , Y_n]

dominates Î(X₁, . . . , X_n) variance-wise (and is unbiased).
Example (Student's t expectation)

For

E[h(X)] = E[exp(−X²)] with X ∼ T(ν, 0, σ²) ,

a Student's t distribution can be simulated as

X | y ∼ N(0, σ² y) and Y⁻¹ ∼ χ²_ν/ν .
Example (Student's t expectation (2))

Empirical average

(1/m) ∑_{j=1}^{m} exp(−X_j²)

can be improved from the joint sample

((X₁, Y₁), . . . , (X_m, Y_m))

since

(1/m) ∑_{j=1}^{m} E[exp(−X²) | Y_j] = (1/m) ∑_{j=1}^{m} 1/√(2σ² Y_j + 1)

is the conditional expectation.
In this example, precision ten times better.
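The Rao-Blackwellised estimator is easy to reproduce; ν = 5 and σ = 1 below are illustrative assumptions. The conditional expectation E[exp(−X²) | Y] = (2σ²Y + 1)^(−1/2) follows from X | y ∼ N(0, σ²y):

```python
import numpy as np

rng = np.random.default_rng(7)
m = 10_000
nu, sigma = 5.0, 1.0           # illustrative values; mu = 0 as in the example

# (X, Y) with Y^{-1} ~ chi2_nu / nu and X | y ~ N(0, sigma^2 y)
y = nu / rng.chisquare(nu, m)
x = rng.normal(0.0, sigma * np.sqrt(y))

naive = np.exp(-x**2).mean()                       # plain Monte Carlo average
rb = (1 / np.sqrt(2 * sigma**2 * y + 1)).mean()    # Rao-Blackwellised average

print(naive, rb)
```

Both averages target the same expectation, but the conditioned version integrates out the Gaussian noise analytically and so fluctuates far less.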