Efficient strategy of MCMC in
high-dimension and its application to
diffusion processes
Kengo Kamatani (Osaka Univ. and CREST, JST)
Mar 2015, Le Mans
1. New algorithm:

• Markov chain Monte Carlo (MCMC) produces a Markov chain X_0, . . . , X_{M−1} with a given invariant probability measure P. If it is ergodic, we have

  M^{−1} ∑_{m=0}^{M−1} f(X_m) → P(f) = ∫ f(x)P(dx),

so we can approximate P(f) by the empirical average.

• MCMC ∋ RWM, Gibbs, MALA, slice sampler, HMC, etc.

• Almost all MCMC methods satisfy reversibility, i.e., if X_0 ∼ P(dx),

  L(X_0, X_1, . . . , X_M) = L(X_M, X_{M−1}, . . . , X_0).
1-a. RWM Algorithm: Let P(dx) = p(x)dx be the target on R^d.

1. Generate x* = x + w where w ∼ N_d(0, σ²I_d) = Γ_d.
2. Accept x* as the next state with probability α(x, x*), and otherwise discard x*, where

  α(x, x*) = min{1, p(x*)/p(x)}.

The proposal kernel [x to x*] is reversible with respect to the uniform distribution (Lebesgue measure) on R^d.
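The two steps above can be sketched in a few lines; `rwm_step` and the standard-normal test target below are illustrative names, not part of the slides.

```python
import numpy as np

def rwm_step(x, log_p, sigma, rng):
    """One random-walk Metropolis transition targeting p(x) ∝ exp(log_p(x))."""
    x_star = x + rng.normal(0.0, sigma, size=x.shape)  # step 1: Gaussian proposal
    log_alpha = log_p(x_star) - log_p(x)               # log of p(x*)/p(x)
    if np.log(rng.uniform()) < log_alpha:              # step 2: accept/reject
        return x_star
    return x
```

Note that only p (through log p) is evaluated; no derivatives of log p are needed.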
1-b. pCN Algorithm: Fix ρ ∈ (0, 1). For x = (x_1, . . . , x_d) ∈ R^d let ∥x∥ = (∑_{i=1}^d x_i²)^{1/2}.

1. Generate x* = ρ^{1/2}x + (1 − ρ)^{1/2}w where w ∼ N_d(0, I_d).
2. Accept x* with probability α(x, x*) where

  α(x, x*) = min{1, p(x*)φ(x) / (p(x)φ(x*))},

and φ is the pdf of N_d(0, I_d).

The proposal kernel [x to x*] is reversible with respect to N_d(0, I_d).
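A minimal sketch of one pCN transition (`pcn_step` is an illustrative name; the φ(x)/φ(x*) factor is folded in through log φ(x) = −∥x∥²/2 + const):

```python
import numpy as np

def pcn_step(x, log_p, rho, rng):
    """One pCN transition with N_d(0, I) reference measure."""
    w = rng.normal(size=x.shape)
    x_star = np.sqrt(rho) * x + np.sqrt(1.0 - rho) * w
    # log acceptance ratio: log p(x*) - log p(x) + log φ(x) - log φ(x*)
    log_alpha = (log_p(x_star) - log_p(x)
                 + 0.5 * (np.sum(x_star**2) - np.sum(x**2)))
    if np.log(rng.uniform()) < log_alpha:
        return x_star
    return x
```

If the target equals the reference N_d(0, I_d), the ratio is identically 1 and every proposal is accepted.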
1-c. MpCN Algorithm (new method):

1. Generate r ∼ Gamma(d/2, ∥x∥²/2).
2. Generate x* = ρ^{1/2}x + (1 − ρ)^{1/2}r^{−1/2}w where w ∼ N_d(0, I_d).
3. Accept x* with probability α(x, x*) where

  α(x, x*) = min{1, p(x*)∥x∥^{−d} / (p(x)∥x*∥^{−d})}.

The proposal kernel [x to x*] is reversible with respect to ∥x∥^{−d}dx.
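The three steps above can be sketched as follows; `mpcn_step` is an illustrative name, and Gamma(a, b) is read as shape a, rate b (an assumed convention).

```python
import numpy as np

def mpcn_step(x, log_p, rho, rng):
    """One MpCN transition; requires x ≠ 0."""
    d = x.shape[0]
    sq = np.sum(x**2)                                   # ||x||²
    r = rng.gamma(d / 2.0, 2.0 / sq)                    # step 1: r ~ Gamma(d/2, ||x||²/2)
    w = rng.normal(size=d)
    x_star = np.sqrt(rho) * x + np.sqrt(1.0 - rho) * w / np.sqrt(r)  # step 2
    # step 3: log of p(x*)||x||^{-d} / (p(x)||x*||^{-d})
    log_alpha = (log_p(x_star) - log_p(x)
                 + 0.5 * d * (np.log(np.sum(x_star**2)) - np.log(sq)))
    if np.log(rng.uniform()) < log_alpha:
        return x_star
    return x
```

As with RWM and pCN, only log p is evaluated, so no derivative of the target is required.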
Note for application to Bayesian inference for complicated models

• x = θ and P(dx) = P(dθ|X^n) = p(θ|X^n)dθ.
• For many advanced MCMC methods, we need to calculate (log p(x))′ (≈ the score function) at each iteration (e.g. 10^6 times!). Sometimes we also need to calculate (log p(x))′′.
• The previous three methods, which need no derivatives, are attractive from this point of view as long as their performance is good.
2. Application
2-a. Toy examples; P (dx) = standard normal distribution
[Figure: trajectories of (∥x∥² − d)/√(2d) over 1000 iterations, with autocorrelation plots up to lag 100, for Gaussian RWM, pCN and MpCN.]
2-a. Toy examples; t-distribution
[Figure: trajectories of ∥x∥²/d over 1000 iterations, with autocorrelation plots up to lag 10000, for Gaussian RWM, pCN and MpCN.]
2-b. Stochastic processes

Realistic examples, in R with the Yuima package. We consider Bayesian parameter estimation for discretely observed stochastic processes.

Note
• LA (likelihood analysis) is not available, so we treat QLA (quasi-likelihood analysis).
• QLA has been studied extensively; see Yoshida [9] and references therein.
Consider

  dX_t = a(X_t, θ)dt + b(X_t)dW_t;  X_0 = 2, t ∈ [0, T]

where

  a(x, θ) = θ_1 − θ_2 x + 2 sin(θ_3 x),  b(x) = (0.5 + x²)/(1 + 0.3x²),

with N = 5000, T = 250.

  P(dθ|X^N) ∝ exp( −(1/2) ∑_{n=1}^N (X_{nh} − X_{(n−1)h} − a(X_{(n−1)h}, θ)h)² / (h b(X_{(n−1)h})²) ) P(dθ)

where h = T/N (so Nh³ = 0.625). The true value is θ = (3, 7, 5).
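The quasi-log-likelihood inside the exponent can be sketched as below (the function name is illustrative, and the reading of the garbled fraction as b(x) = (0.5 + x²)/(1 + 0.3x²) is an assumption):

```python
import numpy as np

def quasi_loglik(theta, X, h):
    """Euler-scheme Gaussian quasi-log-likelihood for
    dX_t = a(X_t, θ)dt + b(X_t)dW_t, up to a θ-free constant."""
    t1, t2, t3 = theta
    x = X[:-1]                                   # X_{(n-1)h}
    dx = np.diff(X)                              # X_{nh} - X_{(n-1)h}
    a = t1 - t2 * x + 2.0 * np.sin(t3 * x)       # drift a(x, θ)
    b = (0.5 + x**2) / (1.0 + 0.3 * x**2)        # diffusion b(x) (assumed reading)
    return -0.5 * np.sum((dx - a * h)**2 / (h * b**2))
```

Adding the log prior gives the target log-density that RWM, pCN or MpCN would evaluate once per iteration.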
• Generate discrete observations X^N from the model at the true parameter.
• Run each MCMC for M = 10^5 iterations from 100 different starting points.
• Plot the empirical average from each of the 100 trials as an approximation of ∫ θ P(dθ|X^N).
• We compare RWM (σ = 1.5, 2, 4), pCN and MpCN.
[Figure: empirical averages of the 100 chains in the (b1, b3) plane, coloured by quasi-log-likelihood (ll). RWMH sd = 1.5. Wide view: 0/100 points are out of this region; zoomed view near the true value: 89/100 points are out of this region.]
[Figure: same layout. RWMH sd = 2. Wide view: 0/100 points are out of this region; zoomed view: 88/100 points are out of this region.]
[Figure: same layout. RWMH sd = 4. Wide view: 0/100 points are out of this region; zoomed view: 50/100 points are out of this region.]
[Figure: same layout. pCN. Wide view: 54/100 points are out of this region; zoomed view: 97/100 points are out of this region.]
[Figure: same layout. MpCN. Wide view: 0/100 points are out of this region; zoomed view: 12/100 points are out of this region.]
[Figure: same layout. Optim. Wide view: 0/100 points are out of this region; zoomed view: 89/100 points are out of this region.]
MpCN
• (Essentially) no tuning parameter.
• No derivatives needed.
• Good performance.
3. Theoretical results
• We study high-dimensional asymptotics (HDA) for MCMC.
• HDA is a "strong assumption ⇒ strong conclusion" type of framework.
• HDA was developed by Gelman et al. [5] and Roberts et al. [8] (RGG97).
3-a. What is HDA? RGG97's results (here, X is the parameter).

• They considered asymptotic properties of the d-dimensional Markov chain X^d = (X^d_m)_{m∈N_0} as d → ∞, where X^d ∼ Gaussian RWM.
• Set

  P_d(dx) = ∏_{i=1}^d f(x_i)dx_i  (x = (x_1, . . . , x_d)).

• Under some regularity conditions on f, P_d ≈ N_d(0, σ²I_d).
• Introduce the time scaling t ↦ [dt], where [x] is the integer part of x, and consider X^d_{[dt]}.
• Introduce the projection π_E(x) = (x_i)_{i∈E} for E ⊂ {1, . . . , d}, where x = (x_1, . . . , x_d); e.g. π_{{3,5,10}}(x) = (x_3, x_5, x_10) for E = {3, 5, 10}. Consider

  Y^d_t := π_{{1}}(X^d_{[dt]}).

• Introduce the proposal scaling σ² = l²/d.
Theorem (RGG97). Y^d ⇒ Y where

  dY_t = h(l)(log f)′(Y_t)/2 dt + √(h(l)) dW_t

where

  h(l) = 2l²Φ(−l√I/2),  I = ∫ {(log f)′(x)}² f(x)dx.
Interpretation of RGG97's result
• The rate of convergence is d; thus the number of iterations should be proportional to d.
• For the limit process Y, the speed of convergence is determined by h(l).
• The function h(l) is maximised when the average acceptance probability is approximately 0.23.
The result gives a criterion for constructing a good RWM.
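The 0.23 criterion can be checked numerically by maximising h(l); in this framework the limiting average acceptance probability is 2Φ(−l√I/2). A sketch by grid search (variable names are illustrative):

```python
import numpy as np
from math import erf, sqrt

I = 1.0  # for f the standard normal density, I = ∫{(log f)'(x)}² f(x)dx = 1

def Phi(t):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def h(l):
    """Speed of the limiting diffusion: h(l) = 2 l² Φ(-l √I / 2)."""
    return 2.0 * l**2 * Phi(-l * sqrt(I) / 2.0)

ls = np.linspace(0.01, 10.0, 100000)
l_opt = ls[np.argmax([h(l) for l in ls])]        # maximiser of h
accept_opt = 2.0 * Phi(-l_opt * sqrt(I) / 2.0)   # limiting acceptance probability
```

For I = 1 this gives l_opt ≈ 2.38 and accept_opt ≈ 0.234, the familiar 0.234 rule.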
After the seminal paper RGG97, many studies have generalized the result.

Generalization of P_d: non-i.i.d. targets, Bedard [2]; perturbations of Gaussian targets, Beskos et al. [4]; etc.

Better convergence rates: Metropolis-adjusted Langevin algorithm (MALA, d^{1/3}, Roberts and Rosenthal [7]); Hybrid Monte Carlo (d^{1/4}, Beskos et al. [3]); Metropolis-coupled MCMC, Atchade et al. [1]; etc.
Our plan:
• (Perturbation of) Gaussian = the ideal situation; heavy tails ≈ the realistic, non-ideal situation. We want to know the rate of convergence (time scaling); it is d for RWM in the Gaussian case.
• We want to construct an MCMC method that works well for difficult target distributions.
• We only consider a special class of heavy-tailed distributions. This lets us apply Stein's techniques and Malliavin calculus.
3-b. Setting
• P_d is a scale mixture of the Gaussian distribution:

  P_d = L(X^d_0), where X^d_0 | Y ∼ N_d(0, I_d Y), Y ∼ Q(dy).

• The class of P_d ∋ N_d(0, I_d), Student t-distributions and stable distributions.

If P_d is heavy-tailed, the rate of convergence is difficult to define.
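Sampling from such a scale mixture is direct: draw Y, then draw a Gaussian with variance Y. For example (names below are illustrative), Y ∼ InverseGamma(ν/2, ν/2) gives the d-dimensional Student t with ν degrees of freedom, and a constant Y = 1 recovers N_d(0, I_d).

```python
import numpy as np

def scale_mixture_sample(d, sample_Y, rng):
    """Draw X ~ P_d where X | Y ~ N_d(0, Y I_d) and Y ~ Q."""
    Y = sample_Y(rng)                    # mixing variable Y ~ Q
    return np.sqrt(Y) * rng.normal(size=d)
```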
• (Usual) consistency: ξ^d = (ξ^d_m)_{m∈N_0} (d ∈ N) is consistent if

  (1/M) ∑_{m=0}^{M−1} f(ξ^d_m) − ∫ f(x)Π_d(dx) = o_P(1)  (M, d → ∞)

for any bounded continuous function f (K. 2014 [6]).
• Since the dimension grows as d → ∞, this notion is not suitable for the current study. We generalize the definition.
• If, for any bounded continuous function f : R^k → R and any sequence E^k_d ⊂ {1, . . . , d} with ♯E^k_d = k,

  (1/M_d) ∑_{m=0}^{M_d−1} f ∘ π_{E^k_d}(X^d_m) − ∫ f ∘ π_{E^k_d}(x)P_d(dx) = o_P(1)

for any M_d → ∞, then we call (X^d)_d consistent.
• If the above holds for all M_d such that M_d/T_d → ∞, then we call T_d the convergence rate.
• This is just a formalisation of the rate of convergence used in the HDA community.
3-c. Gaussian case; P_d = N_d(0, I_d)
• Let μ_k(σ) = E[|ξ|^k exp(−ξ_+)] for ξ ∼ N(σ²/2, σ²), and ξ_+ = max{0, ξ}.
• Let

  r_d(x) = √d (∥x∥²/d − 1)  (x ∈ R^d).
Proposition (Gaussian RWM). Consider the Gaussian RWM and set σ² = l²/d. Set Y^d_t = r_d(X^d_{[dt]}). Then Y^d ⇒ Y where

  dY_t = −(σ(l)²/4) Y_t dt + σ(l)dW_t;  Y_0 ∼ N(0, 2),

where σ(l)² = 4μ_2(l). By this, the Gaussian RWM is weakly consistent with rate d.

Theorem (Optimality). The above RWM attains the optimal rate among all RWM algorithms.

Proposition. Both pCN and MpCN have rate 1.
The key of the proof is reversibility:

  P(|∥X^d_1∥² − ∥X^d_0∥²| > ε) = 2P(∥X^d_1∥² − ∥X^d_0∥² < −ε)
    ≤ 2P(∥X^d_0 + W^d_1∥² − ∥X^d_0∥² < −ε)
    = 2P(2Z_d < −ε)

where

  Z_d := (∥X^d_0 + W^d_1∥² − ∥X^d_0∥²)/2 = ⟨X^d_0, W^d_1⟩ + ∥W^d_1∥²/2.

We have

  Z_d = (1/2){(∥W^d_1∥ + ⟨X^d_0, W^d_1/∥W^d_1∥⟩)² − ⟨X^d_0, W^d_1/∥W^d_1∥⟩²}
      ≥ −(1/2)⟨X^d_0, W^d_1/∥W^d_1∥⟩².
3-d. Heavy-tail case

Set

  r_d(x) = ∥x∥²/d  (x ∈ R^d).

Proposition. Let Γ_d = N_d(0, l²I_d/d) and set Y^d_t = r_d(X^d_{[d²t]}). Then Y^d ⇒ Y where

  dY_t = a(Y_t)dt + √(b(Y_t)) dW_t;  Y_0 ∼ Q

where

  a(y) = 2(y + (log q)′(y)y²)μ_2(l/√y) + l²μ_1(l/√y),  b(y) = 4y²μ_2(l/√y).

In particular, the Gaussian RWM has rate d².
Theorem. The above RWM attains the optimal rate for weak consistency; thus d² is the optimal rate for RWM.

Proposition. In this case, pCN does not have any polynomial rate, and MpCN has rate d.

Summary

          Light-tail   Heavy-tail
  RWM     d            d²
  pCN     1            ∞
  MpCN    1            d
Summary
• We propose a new MCMC algorithm, the MpCN algorithm.
• It works well both for toy models and for stochastic process examples.
• A high-dimensional asymptotic theory was provided.
[1] Yves F. Atchade, Gareth O. Roberts, and Jeffrey S. Rosenthal. Towards optimal scaling of Metropolis-coupled Markov chain Monte Carlo. Statistics and Computing, 21(4):555–568, 2011.
[2] Mylene Bedard. Weak convergence of Metropolis algorithms for non-i.i.d. target distributions. Ann. Appl. Probab., 17(4):1222–1244, 2007.
[3] A. Beskos, N. Pillai, G. O. Roberts, J.-M. Sanz-Serna, and A. M. Stuart. Optimal tuning of hybrid Monte Carlo. To appear, 2013.
[4] Alexandros Beskos, Gareth Roberts, and Andrew Stuart. Optimal scalings for local Metropolis-Hastings chains on nonproduct targets in high dimensions. Ann. Appl. Probab., 19(3):863–898, 2009.
[5] A. Gelman, G. O. Roberts, and W. R. Gilks. Efficient Metropolis jumping rules. In Bayesian Statistics 5 (Alicante, 1994), pages 599–607. Oxford Univ. Press, New York, 1996.
[6] Kengo Kamatani. Local consistency of Markov chain Monte Carlo methods. Ann. Inst. Statist. Math., 66(1):63–74, 2014.
[7] Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. J. R. Stat. Soc. Ser. B Stat. Methodol., 60(1):255–268, 1998.
[8] Gareth O. Roberts, Andrew Gelman, and Walter R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab., 7(1):110–120, 1997.
[9] Nakahiro Yoshida. Polynomial type large deviation inequalities and quasi-likelihood analysis for stochastic differential equations. Ann. Inst. Statist. Math., 63(3):431–479, 2011.