Efficient strategy of MCMC in
high-dimension and its application to
diffusion processes
Kengo Kamatani (Osaka Univ. and CREST, JST)
Mar 2015, Le Mans
1. New algorithm:

• Markov chain Monte Carlo (MCMC) produces a Markov chain X_0, . . . , X_{M−1} with a given invariant probability measure P. If it is ergodic, we have

  M^{−1} ∑_{m=0}^{M−1} f(X_m) → P(f) = ∫ f(x)P(dx),

so we can approximate P(f) by the empirical average.

• MCMC ∋ RWM, Gibbs, MALA, slice sampler, HMC, etc.

• Almost all MCMC methods satisfy reversibility, i.e., if X_0 ∼ P(dx),

  L(X_0, X_1, . . . , X_M) = L(X_M, X_{M−1}, . . . , X_0).
1-a. RWM Algorithm: Let P(dx) = p(x)dx be the target on R^d.

1. Generate x* = x + w where w ∼ N_d(0, σ²I_d) = Γ_d.
2. Accept x* as the next state with probability α(x, x*), and otherwise discard x*, where

  α(x, x*) = min{1, p(x*)/p(x)}.

The proposal kernel [x to x*] is reversible with respect to the uniform distribution (Lebesgue measure) on R^d.
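The two steps above can be sketched in a few lines; `rwm_step` and the standard-normal test target below are illustrative names, not part of the slides.

```python
import numpy as np

def rwm_step(x, log_p, sigma, rng):
    """One random-walk Metropolis transition targeting p(x) ∝ exp(log_p(x))."""
    x_star = x + rng.normal(0.0, sigma, size=x.shape)  # step 1: Gaussian proposal
    log_alpha = log_p(x_star) - log_p(x)               # log of p(x*)/p(x)
    if np.log(rng.uniform()) < log_alpha:              # step 2: accept/reject
        return x_star
    return x
```

Note that only p (through log p) is evaluated; no derivatives of log p are needed.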
1-b. pCN Algorithm: Fix ρ ∈ (0, 1). For x = (x_1, . . . , x_d) ∈ R^d let ∥x∥ = (∑_{i=1}^d x_i²)^{1/2}.

1. Generate x* = ρ^{1/2}x + (1 − ρ)^{1/2}w where w ∼ N_d(0, I_d).
2. Accept x* with probability α(x, x*) where

  α(x, x*) = min{1, p(x*)φ(x) / (p(x)φ(x*))},

and φ is the pdf of N_d(0, I_d).

The proposal kernel [x to x*] is reversible with respect to N_d(0, I_d).
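A minimal sketch of one pCN transition (`pcn_step` is an illustrative name; the φ(x)/φ(x*) factor is folded in through log φ(x) = −∥x∥²/2 + const):

```python
import numpy as np

def pcn_step(x, log_p, rho, rng):
    """One pCN transition with N_d(0, I) reference measure."""
    w = rng.normal(size=x.shape)
    x_star = np.sqrt(rho) * x + np.sqrt(1.0 - rho) * w
    # log acceptance ratio: log p(x*) - log p(x) + log φ(x) - log φ(x*)
    log_alpha = (log_p(x_star) - log_p(x)
                 + 0.5 * (np.sum(x_star**2) - np.sum(x**2)))
    if np.log(rng.uniform()) < log_alpha:
        return x_star
    return x
```

If the target equals the reference N_d(0, I_d), the ratio is identically 1 and every proposal is accepted.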
1-c. MpCN Algorithm (new method):

1. Generate r ∼ Gamma(d/2, ∥x∥²/2).
2. Generate x* = ρ^{1/2}x + (1 − ρ)^{1/2}r^{−1/2}w where w ∼ N_d(0, I_d).
3. Accept x* with probability α(x, x*) where

  α(x, x*) = min{1, p(x*)∥x∥^{−d} / (p(x)∥x*∥^{−d})}.

The proposal kernel [x to x*] is reversible with respect to ∥x∥^{−d}dx.
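The three steps above can be sketched as follows; `mpcn_step` is an illustrative name, and Gamma(a, b) is read as shape a, rate b (an assumed convention).

```python
import numpy as np

def mpcn_step(x, log_p, rho, rng):
    """One MpCN transition; requires x ≠ 0."""
    d = x.shape[0]
    sq = np.sum(x**2)                                   # ||x||²
    r = rng.gamma(d / 2.0, 2.0 / sq)                    # step 1: r ~ Gamma(d/2, ||x||²/2)
    w = rng.normal(size=d)
    x_star = np.sqrt(rho) * x + np.sqrt(1.0 - rho) * w / np.sqrt(r)  # step 2
    # step 3: log of p(x*)||x||^{-d} / (p(x)||x*||^{-d})
    log_alpha = (log_p(x_star) - log_p(x)
                 + 0.5 * d * (np.log(np.sum(x_star**2)) - np.log(sq)))
    if np.log(rng.uniform()) < log_alpha:
        return x_star
    return x
```

As with RWM and pCN, only log p is evaluated, so no derivative of the target is required.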
Note for application to Bayesian inference for complicated models

• x = θ and P(dx) = P(dθ|X^n) = p(θ|X^n)dθ.
• For many advanced MCMC methods, we need to calculate (log p(x))′ (≈ the score function) at each iteration (e.g. 10^6 times!). Sometimes we also need to calculate (log p(x))′′.
• The previous three methods, which need no derivatives, are attractive from this point of view as long as their performance is good.
2. Application
2-a. Toy examples; P (dx) = standard normal distribution
[Figure: trajectories of (∥x∥² − d)/√(2d) over 1000 iterations, with autocorrelation plots up to lag 100, for Gaussian RWM, pCN and MpCN.]
2-a. Toy examples; t-distribution
[Figure: trajectories of ∥x∥²/d over 1000 iterations, with autocorrelation plots up to lag 10000, for Gaussian RWM, pCN and MpCN.]
2-b. Stochastic processes

Realistic examples, in R with the Yuima package. We consider Bayesian parameter estimation for discretely observed stochastic processes.

Note
• LA (likelihood analysis) is not available, so we treat QLA (quasi-likelihood analysis).
• QLA has been studied extensively; see Yoshida [9] and references therein.
Consider

  dX_t = a(X_t, θ)dt + b(X_t)dW_t;  X_0 = 2, t ∈ [0, T]

where

  a(x, θ) = θ_1 − θ_2 x + 2 sin(θ_3 x),  b(x) = (0.5 + x²)/(1 + 0.3x²),

with N = 5000, T = 250.

  P(dθ|X^N) ∝ exp( −(1/2) ∑_{n=1}^N (X_{nh} − X_{(n−1)h} − a(X_{(n−1)h}, θ)h)² / (h b(X_{(n−1)h})²) ) P(dθ)

where h = T/N (so Nh³ = 0.625). The true value is θ = (3, 7, 5).
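The quasi-log-likelihood inside the exponent can be sketched as below (the function name is illustrative, and the reading of the garbled fraction as b(x) = (0.5 + x²)/(1 + 0.3x²) is an assumption):

```python
import numpy as np

def quasi_loglik(theta, X, h):
    """Euler-scheme Gaussian quasi-log-likelihood for
    dX_t = a(X_t, θ)dt + b(X_t)dW_t, up to a θ-free constant."""
    t1, t2, t3 = theta
    x = X[:-1]                                   # X_{(n-1)h}
    dx = np.diff(X)                              # X_{nh} - X_{(n-1)h}
    a = t1 - t2 * x + 2.0 * np.sin(t3 * x)       # drift a(x, θ)
    b = (0.5 + x**2) / (1.0 + 0.3 * x**2)        # diffusion b(x) (assumed reading)
    return -0.5 * np.sum((dx - a * h)**2 / (h * b**2))
```

Adding the log prior gives the target log-density that RWM, pCN or MpCN would evaluate once per iteration.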
• Generate discrete observations X^N from the model at the true parameter.
• Run each MCMC for M = 10^5 iterations from 100 different starting points.
• Plot the empirical average from each of the 100 trials as an approximation of ∫ θ P(dθ|X^N).
• We compare RWM (σ = 1.5, 2, 4), pCN and MpCN.
[Figure: empirical averages of the 100 chains in the (b1, b3) plane, coloured by quasi-log-likelihood (ll). RWMH sd = 1.5. Wide view: 0/100 points are out of this region; zoomed view near the true value: 89/100 points are out of this region.]
[Figure: same layout. RWMH sd = 2. Wide view: 0/100 points are out of this region; zoomed view: 88/100 points are out of this region.]
[Figure: same layout. RWMH sd = 4. Wide view: 0/100 points are out of this region; zoomed view: 50/100 points are out of this region.]
[Figure: same layout. pCN. Wide view: 54/100 points are out of this region; zoomed view: 97/100 points are out of this region.]
[Figure: same layout. MpCN. Wide view: 0/100 points are out of this region; zoomed view: 12/100 points are out of this region.]
[Figure: same layout. Optim. Wide view: 0/100 points are out of this region; zoomed view: 89/100 points are out of this region.]
MpCN
• (Essentially) no tuning parameter.
• No derivatives needed.
• Good performance.
3. Theoretical results
• We study high-dimensional asymptotics (HDA) for MCMC.
• HDA is a "strong assumption ⇒ strong conclusion" type of framework.
• HDA was developed by Gelman et al. [5] and Roberts et al. [8] (RGG97).
3-a. What is HDA? RGG97's results (here, X is the parameter).

• They considered asymptotic properties of the d-dimensional Markov chain X^d = (X^d_m)_{m∈N_0} as d → ∞, where X^d ∼ Gaussian RWM.
• Set

  P_d(dx) = ∏_{i=1}^d f(x_i)dx_i  (x = (x_1, . . . , x_d)).

• Under some regularity conditions on f, P_d ≈ N_d(0, σ²I_d).
• Introduce the time scaling t ↦ [dt], where [x] is the integer part of x, and consider X^d_{[dt]}.
• Introduce the projection π_E(x) = (x_i)_{i∈E} for E ⊂ {1, . . . , d}, where x = (x_1, . . . , x_d); e.g. π_{{3,5,10}}(x) = (x_3, x_5, x_10) for E = {3, 5, 10}. Consider

  Y^d_t := π_{{1}}(X^d_{[dt]}).

• Introduce the proposal scaling σ² = l²/d.
Theorem (RGG97). Y^d ⇒ Y where

  dY_t = h(l)(log f)′(Y_t)/2 dt + √(h(l)) dW_t

where

  h(l) = 2l²Φ(−l√I/2),  I = ∫ {(log f)′(x)}² f(x)dx.
Interpretation of RGG97's result
• The rate of convergence is d; thus the number of iterations should be proportional to d.
• For the limit process Y, the speed of convergence is determined by h(l).
• The function h(l) is maximised when the average acceptance probability is approximately 0.23.
The result gives a criterion for constructing a good RWM.
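The 0.23 criterion can be checked numerically by maximising h(l); in this framework the limiting average acceptance probability is 2Φ(−l√I/2). A sketch by grid search (variable names are illustrative):

```python
import numpy as np
from math import erf, sqrt

I = 1.0  # for f the standard normal density, I = ∫{(log f)'(x)}² f(x)dx = 1

def Phi(t):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def h(l):
    """Speed of the limiting diffusion: h(l) = 2 l² Φ(-l √I / 2)."""
    return 2.0 * l**2 * Phi(-l * sqrt(I) / 2.0)

ls = np.linspace(0.01, 10.0, 100000)
l_opt = ls[np.argmax([h(l) for l in ls])]        # maximiser of h
accept_opt = 2.0 * Phi(-l_opt * sqrt(I) / 2.0)   # limiting acceptance probability
```

For I = 1 this gives l_opt ≈ 2.38 and accept_opt ≈ 0.234, the familiar 0.234 rule.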
After the seminal paper RGG97, many studies have generalized the result.

Generalization of P_d: non-i.i.d. targets, Bedard [2]; perturbations of Gaussian targets, Beskos et al. [4]; etc.

Better convergence rates: Metropolis-adjusted Langevin algorithm (MALA, d^{1/3}, Roberts and Rosenthal [7]); Hybrid Monte Carlo (d^{1/4}, Beskos et al. [3]); Metropolis-coupled MCMC, Atchade et al. [1]; etc.
Our plan:
• (Perturbation of) Gaussian = the ideal situation; heavy tails ≈ the realistic, non-ideal situation. We want to know the rate of convergence (time scaling); it is d for RWM in the Gaussian case.
• We want to construct an MCMC method that works well for difficult target distributions.
• We only consider a special class of heavy-tailed distributions. This lets us apply Stein's techniques and Malliavin calculus.
3-b. Setting
• P_d is a scale mixture of the Gaussian distribution:

  P_d = L(X^d_0), where X^d_0 | Y ∼ N_d(0, I_d Y), Y ∼ Q(dy).

• The class of P_d ∋ N_d(0, I_d), Student t-distributions and stable distributions.

If P_d is heavy-tailed, the rate of convergence is difficult to define.
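Sampling from such a scale mixture is direct: draw Y, then draw a Gaussian with variance Y. For example (names below are illustrative), Y ∼ InverseGamma(ν/2, ν/2) gives the d-dimensional Student t with ν degrees of freedom, and a constant Y = 1 recovers N_d(0, I_d).

```python
import numpy as np

def scale_mixture_sample(d, sample_Y, rng):
    """Draw X ~ P_d where X | Y ~ N_d(0, Y I_d) and Y ~ Q."""
    Y = sample_Y(rng)                    # mixing variable Y ~ Q
    return np.sqrt(Y) * rng.normal(size=d)
```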
• (Usual) consistency: ξ^d = (ξ^d_m)_{m∈N_0} (d ∈ N) is consistent if

  (1/M) ∑_{m=0}^{M−1} f(ξ^d_m) − ∫ f(x)Π_d(dx) = o_P(1)  (M, d → ∞)

for any bounded continuous function f (K. 2014 [6]).
• Since the dimension grows as d → ∞, this notion is not suitable for the current study. We generalize the definition.
• If, for any bounded continuous function f : R^k → R and any sequence E^k_d ⊂ {1, . . . , d} with ♯E^k_d = k,

  (1/M_d) ∑_{m=0}^{M_d−1} f ∘ π_{E^k_d}(X^d_m) − ∫ f ∘ π_{E^k_d}(x)P_d(dx) = o_P(1)

for any M_d → ∞, then we call (X^d)_d consistent.
• If the above holds for all M_d such that M_d/T_d → ∞, then we call T_d the convergence rate.
• This is just a formalisation of the rate of convergence used in the HDA community.
3-c. Gaussian case; P_d = N_d(0, I_d)
• Let μ_k(σ) = E[|ξ|^k exp(−ξ_+)] for ξ ∼ N(σ²/2, σ²), and ξ_+ = max{0, ξ}.
• Let

  r_d(x) = √d (∥x∥²/d − 1)  (x ∈ R^d).
Proposition (Gaussian RWM). Consider the Gaussian RWM and set σ² = l²/d. Set Y^d_t = r_d(X^d_{[dt]}). Then Y^d ⇒ Y where

  dY_t = −(σ(l)²/4) Y_t dt + σ(l)dW_t;  Y_0 ∼ N(0, 2),

where σ(l)² = 4μ_2(l). By this, the Gaussian RWM is weakly consistent with rate d.

Theorem (Optimality). The above RWM attains the optimal rate among all RWM algorithms.

Proposition. Both pCN and MpCN have rate 1.
The key of the proof is reversibility:

  P(|∥X^d_1∥² − ∥X^d_0∥²| > ε) = 2P(∥X^d_1∥² − ∥X^d_0∥² < −ε)
    ≤ 2P(∥X^d_0 + W^d_1∥² − ∥X^d_0∥² < −ε)
    = 2P(2Z_d < −ε)

where

  Z_d := (∥X^d_0 + W^d_1∥² − ∥X^d_0∥²)/2 = ⟨X^d_0, W^d_1⟩ + ∥W^d_1∥²/2.

We have

  Z_d = (1/2){(∥W^d_1∥ + ⟨X^d_0, W^d_1/∥W^d_1∥⟩)² − ⟨X^d_0, W^d_1/∥W^d_1∥⟩²}
      ≥ −(1/2)⟨X^d_0, W^d_1/∥W^d_1∥⟩².
3-d. Heavy-tail case

Set

  r_d(x) = ∥x∥²/d  (x ∈ R^d).

Proposition. Let Γ_d = N_d(0, l²I_d/d) and set Y^d_t = r_d(X^d_{[d²t]}). Then Y^d ⇒ Y where

  dY_t = a(Y_t)dt + √(b(Y_t)) dW_t;  Y_0 ∼ Q

where

  a(y) = 2(y + (log q)′(y)y²)μ_2(l/√y) + l²μ_1(l/√y),  b(y) = 4y²μ_2(l/√y).

In particular, the Gaussian RWM has rate d².
Theorem. The above RWM attains the optimal rate for weak consistency; thus d² is the optimal rate for RWM.

Proposition. In this case, pCN does not have any polynomial rate, and MpCN has rate d.

Summary

          Light-tail   Heavy-tail
  RWM     d            d²
  pCN     1            ∞
  MpCN    1            d
Summary
• We propose a new MCMC algorithm, the MpCN algorithm.
• It works well both for toy models and for stochastic process examples.
• A high-dimensional asymptotic theory was provided.
[1] Yves F. Atchade, Gareth O. Roberts, and Jeffrey S. Rosenthal. Towards optimal scaling of Metropolis-coupled Markov chain Monte Carlo. Statistics and Computing, 21(4):555–568, 2011.
[2] Mylene Bedard. Weak convergence of Metropolis algorithms for non-i.i.d. target distributions. Ann. Appl. Probab., 17(4):1222–1244, 2007.
[3] A. Beskos, N. Pillai, G. O. Roberts, J.-M. Sanz-Serna, and A. M. Stuart. Optimal tuning of hybrid Monte Carlo. To appear, 2013.
[4] Alexandros Beskos, Gareth Roberts, and Andrew Stuart. Optimal scalings for local Metropolis-Hastings chains on nonproduct targets in high dimensions. Ann. Appl. Probab., 19(3):863–898, 2009.
[5] A. Gelman, G. O. Roberts, and W. R. Gilks. Efficient Metropolis jumping rules. In Bayesian Statistics 5 (Alicante, 1994), pages 599–607. Oxford Univ. Press, New York, 1996.
[6] Kengo Kamatani. Local consistency of Markov chain Monte Carlo methods. Ann. Inst. Statist. Math., 66(1):63–74, 2014.
[7] Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. J. R. Stat. Soc. Ser. B Stat. Methodol., 60(1):255–268, 1998.
[8] Gareth O. Roberts, Andrew Gelman, and Walter R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab., 7(1):110–120, 1997.
[9] Nakahiro Yoshida. Polynomial type large deviation inequalities and quasi-likelihood analysis for stochastic differential equations. Ann. Inst. Statist. Math., 63(3):431–479, 2011.