Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions
Accelerated kinetic Monte Carlo methods: Hierarchical Parallel Algorithms & Coarse-Graining
Markos Katsoulakis
University of Massachusetts & University of Crete
Funding: NSF-DMS, NSF-CMMI, U.S. DOE and E.C. FP7
Overview
1 Stochastic Lattice Systems & Applications
2 Kinetic Monte Carlo (KMC) methods
3 Coarse Graining (CG)
4 Hierarchical Parallel Algorithms
5 Benchmarks, examples and simulations
Stochastic Lattice Systems
Surface processes
Provide information for pattern formation, chemical reactions, phase transitions
Λ_N = (1/N) Z^d ∩ [0, 1)^d
Lattice size N ≫ 1
Configurations σ ∈ Σ_N := I^{Λ_N}, with I = {0, 1} or I = {−1, 1}
Equilibrium Theory
Hamiltonian: H_N(σ) = −(1/2) Σ_{x≠y} J(x, y)σ(x)σ(y) + h Σ_x σ(x)
- h: external field
- J: potential with interaction range L; V: R → R has compact support,
  J(x − y) = (1/L) V((x − y)/L).
Nearest-neighbor models (as truncations) and possibly combinations of short- and long-range interactions.
Potentials fitted to Molecular Dynamics simulations or data, e.g. Morse potentials
Gibbs States
At the inverse temperature β = 1/kT:
μ_{Λ,β}(σ = σ₀) = (1/Z_{Λ,β}) exp{−βH_N(σ₀)} P_N(σ = σ₀)
[Probability of the configuration σ₀]
Partition function: Z_{Λ,β} = Σ_{σ₀} exp{−βH_N(σ₀)} P_N(σ = σ₀)
Prior distribution (no interactions, high temp.):
P_N(σ = σ₀) = Π_{x∈Λ} P(σ(x) = σ₀(x))
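As a concrete illustration of the formulas above, a brute-force sketch (toy code of mine, not from the talk): enumerate all configurations of a tiny 1-D lattice with a hypothetical nearest-neighbor coupling J and field h, compute Z by direct summation, and normalize the Gibbs weights. All function names are illustrative.

```python
import itertools
import math

def hamiltonian(sigma, J=1.0, h=0.0):
    """Nearest-neighbor special case of H_N (periodic boundary; each unordered
    pair counted once, so the 1/2 double-count factor cancels)."""
    n = len(sigma)
    pair = sum(J * sigma[x] * sigma[(x + 1) % n] for x in range(n))
    return -pair + h * sum(sigma)

def gibbs_distribution(n=4, beta=1.0, J=1.0, h=0.0):
    """Brute-force Gibbs measure mu(sigma) = exp(-beta*H)/Z over I = {-1,1}^n,
    with a uniform prior (absorbed into the counting measure)."""
    configs = list(itertools.product([-1, 1], repeat=n))
    weights = [math.exp(-beta * hamiltonian(s, J, h)) for s in configs]
    Z = sum(weights)  # partition function by exhaustive enumeration
    return {s: w / Z for s, w in zip(configs, weights)}

mu = gibbs_distribution()
assert abs(sum(mu.values()) - 1.0) < 1e-9
```

The exponential cost of the enumeration (2^n configurations) is exactly the reason sampling methods such as KMC are needed.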
Kinetic Monte Carlo (KMC)
Dynamics
Adsorption/Desorption/Reactions/Surface diffusion
Markov Chain modeling with state space Σ = all configurations σ
Generator: ∂_t E f(σ) = E Σ_{x∈Λ} c(x, σ)[f(σ^x) − f(σ)] ≡ E L_N f(σ).
Multi-site updates σ^x for most systems, e.g.
Suchorski et al ChemPhysChem (2010)
Kinetic Monte Carlo (KMC)
[Schematic: continuous-time Markov chain. From the present state x, the transition probability p(x, y) to a possible future state y, z, or w does not depend on the past states x_k, k = 1, ..., k−1; the residence time τ_x is exponentially distributed with rate λ(x).]
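The jump-chain picture above can be sketched in a few lines. This is a generic CTMC step (illustrative names, not code from the talk): draw the residence time from an exponential clock with the total rate, then select the next state with the skeleton probabilities.

```python
import bisect
import random

def ctmc_step(rates, rng=random.Random(0)):
    """One continuous-time Markov chain step. rates[y] is the rate of jumping
    to state y from the current state. Returns (tau, y): the residence time
    tau ~ Exp(lambda) with lambda = sum of all rates, and the next state y
    chosen with probability rates[y] / lambda."""
    lam = sum(rates.values())
    tau = rng.expovariate(lam)          # residence time ~ Exp(lambda)
    states = list(rates)
    cum, acc = [], 0.0
    for s in states:                    # cumulative rates for selection
        acc += rates[s]
        cum.append(acc)
    u = rng.random() * lam
    idx = min(bisect.bisect_left(cum, u), len(states) - 1)
    return tau, states[idx]
```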
Kinetic Monte Carlo: Arrhenius dynamics
Transition rate to the gas phase: c(x, σ) ∼ d₀ exp[−βU(x, σ)]
Energy barrier: U(x, σ) = Σ_{z≠x} J(x − z)σ(z) − h.
- Exponential clock: for each configuration σ,
  λ(σ) = d₁(N − Σ_{x∈Λ_N} σ(x)) + Σ_{x∈Λ_N} d₀σ(x)e^{−βU(x,σ)}.
- Transition rates σ ↦ σ′ = σ^x:
  c(x, σ) = λ(σ)p(σ, σ^x) = d₁(1 − σ(x)) + d₀σ(x)e^{−βU(x,σ)}
Kinetic Monte Carlo: Arrhenius dynamics
References: Gillespie (chemical reactions); Bortz, Kalos, Lebowitz (Ising-type systems). The pseudo-algorithm suggests:
divide lattice sites x into classes of equal rates
pick a class using the relative weights
pick from each class a site x uniformly and update the configuration
However: for complex interactions (e.g. long-range),
U(x, σ) = Σ_{z≠x} J(x − z)σ(z) − h
yields a very high number of classes, making the algorithm impractical.
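A minimal sketch of the class-based (n-fold way) selection step described above, assuming sites have already been binned into classes of equal rate; all names are illustrative:

```python
import random

def nfold_step(site_class, class_rate, rng=random.Random(1)):
    """One n-fold-way (BKL) selection. site_class maps each lattice site to
    its rate class; class_rate gives the common per-site rate of each class.
    Pick a class with probability proportional to rate * (class size), then
    pick a site uniformly inside the chosen class."""
    members = {}
    for site, c in site_class.items():
        members.setdefault(c, []).append(site)
    weights = {c: class_rate[c] * len(m) for c, m in members.items()}
    total = sum(weights.values())
    u = rng.random() * total
    chosen = None
    for c, w in weights.items():        # roulette-wheel class selection
        u -= w
        if u <= 0.0:
            chosen = c
            break
    if chosen is None:                  # guard against floating-point leftovers
        chosen = c
    site = rng.choice(members[chosen])  # uniform within the class
    return chosen, site
```

The cost per step is governed by the number of classes, which is exactly why long-range interactions (many distinct values of U) make the scheme impractical.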
Towards accelerating molecular simulations
Bottlenecks in molecular simulation of extended systems.
Cannot simulate realistic spatio-temporal scales:
1 µm² ≈ 10,000² lattice
Difficult to carry out "systems tasks" for engineering applications:
sensitivity analysis, optimization, control
Coarse-Graining: from microscopics to mesoscopics
[Figure: spatial acceleration methods: microscopic lattice, coarse lattice, and non-uniform mesh over time; histogram of the probability of the average lattice coverage at t = 10 s, comparing KMC, multiscale CGMC, CGMC-MF, and CGMC-QC.]
1 Chatterjee et al., JCP 121, 11420 (2004); PRE 71, 026702 (2005); 2 Chatterjee and Vlachos, JCP 124, 064110 (2006)
• Spatial adaptivity¹: error estimates guide mesh refinement
• Multiscale MC methods for high accuracy²: higher order closures, multigrid
• Multicomponent interacting systems
c(x, t) ≈ local average v_N(x, t) = (1/|B_x|) Σ_{y∈B_x} σ_t(y), as N → ∞
"Closure": when does c = c(x, t) solve a PDE/stochastic PDE?
E.g. local mean-field limits; connections to Cahn-Hilliard (S)PDE for attractive interactions (J > 0).
Lebowitz, Orlandi, Presutti JSP '91; Giacomin, Lebowitz, Phys. Rev. Lett. '96; K., Vlachos, Phys. Rev. Lett. '00; J. Chem. Phys. '03.
Hierarchical Coarse-Graining
1. Coarse-graining of polymers; DPD methods
Briels et al. J. Chem. Phys. '01; Doi et al. J. Chem. Phys. '02; Kremer et al. Macromolecules '06; Müller-Plathe ChemPhysChem '00; Laaksonen et al. Soft Matter '03, etc.
Recent related work on simulating bio-membranes: Deserno et al. Nature '07.
2. Stochastic lattice dynamics / KMC
K., Majda, Vlachos, PNAS '03; K., Plechac, Sopasakis, SIAM Num. Anal. '06; Are, K., Plechac, Rey-Bellet, SIAM J. Sci. Comp. '08; Sinno et al. J. Chem. Phys. '08.
Coarse Graining in Lattice Systems
Divide a lattice of size N into M cells with q particles in each cell
[Schematic: coarse cells of size q with diffusion, adsorption, and desorption events; block spin η(k) = Σ_{x∈C_k} σ(x).]
Coarse map:
T: Σ_N → Σ_M
σ ↦ η := {η(k) = Σ_{x∈C_k} σ(x)}
Renormalization Group map:
H̄(η) = −(1/β) log ∫ exp{−βH_N(σ)} P(dσ|η)
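The coarse map T is straightforward to sketch in code (illustrative, with 1-D cells of q consecutive sites):

```python
def coarse_map(sigma, q):
    """Block-spin coarse map T: sigma -> eta, where eta(k) is the sum of
    sigma(x) over the k-th cell C_k of q consecutive sites (N = M*q)."""
    assert len(sigma) % q == 0, "lattice size must be a multiple of q"
    M = len(sigma) // q
    return [sum(sigma[k * q:(k + 1) * q]) for k in range(M)]

eta = coarse_map([1, 0, 1, 1, 0, 0, 1, 0], q=4)
# each eta(k) lies in {0, ..., q}
```

T is many-to-one: the lost intra-cell detail is exactly what the RG map integrates out via P(dσ|η).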
1-D example: n.n. Ising model
Approximation of the RG map: replace H̄(η) by a computable H̄⁽⁰⁾(η):
H_N(σ) = Σ_k H_k(σ) + Σ_k W_{k,k+1}(σ)
H_k: energy for the cell k with free boundary conditions; W_{k,k+1}: short-range interactions between cell k and cell k+1.
e^{−βH_N} P_N(dσ|η) = Π_{k odd} [e^{−β(W_{k−1,k}+W_{k,k+1})} e^{−βH_k} P_k(dσ_k|η(k))] × Π_{k even} e^{−βH_k} P_k(dσ_k|η(k))
1D Operator Splitting Algorithm (OpSpl):
1. Apply SSA on white cells in parallel until the synchronization time
2. Apply SSA on black cells in parallel until the synchronization time
3. Go to 1
A simple example
- When the W_{k,k+1} are disregarded (e.g. at high temperatures), there are intra-cell interactions, but no CG cell correlations:
H̄⁽⁰⁾_m(η) = Σ_k Ū⁽⁰⁾_k(η_k) = −Σ_k (1/β) log ∫ e^{−βH_k(σ)} P_k(dσ_k|η(k))
Sampling over a single coarse cell with free boundary conditions. Inverse Monte Carlo method: Laaksonen et al. Soft Matter '03.
Multi-body terms in Coarse Graining: K., Plechac, Rey-Bellet, Tsagkarogiannis ESAIM M2AN '07, preprint '10
Coarse Graining- Approximations heuristics
• CG Hamiltonian - Renormalization Group map: N = mq
H̄(η) = −(1/β) log ∫ exp{−βH_N(σ)} P(dσ|η)
• Correction terms around a first "good guess" H̄⁽⁰⁾_m:
H̄_m(η) = H̄⁽⁰⁾_m(η) − (1/β) log E[exp(−β(H_N − H̄⁽⁰⁾_m)) | η], m = N, N−1, ...
• Expansion of exp(−β∆H̄) and log:
= E[∆H̄ | η] + E[(∆H̄)² | η] − E[∆H̄ | η]² + O((∆H̄)³)
Formal calculations are inadequate since
∆H̄ ≡ H_N − H̄⁽⁰⁾_m = N · O(ε)
• the role of fluctuations and extensivity
• Rigorous analysis - cluster expansion around H̄⁽⁰⁾_m
K., Plechac, Rey-Bellet, Tsagkarogiannis ESAIM M2AN '07
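The displayed expansion can be organized as a cumulant series. A sketch of the second-order step, with the β prefactors written out (an assumption of mine following the standard cumulant expansion; the slide's compressed formula suppresses them):

```latex
-\frac{1}{\beta}\log \mathbb{E}\!\left[e^{-\beta\,\Delta\bar H}\,\middle|\,\eta\right]
= \mathbb{E}\!\left[\Delta\bar H \,\middle|\, \eta\right]
- \frac{\beta}{2}\left(\mathbb{E}\!\left[(\Delta\bar H)^2 \,\middle|\, \eta\right]
- \mathbb{E}\!\left[\Delta\bar H \,\middle|\, \eta\right]^2\right)
+ O\!\left(\beta^2 (\Delta\bar H)^3\right)
```

Since ∆H̄ = N · O(ε), each cumulant is extensive; truncating the series must therefore be justified term by term, which is the role of the cluster expansion.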
Coarse Graining, Long-range Interactions
- Very costly with KMC, but easy to CG:
H_N(σ) = −(1/2) Σ_{x∈Λ_N} Σ_{y≠x} J(x − y)σ(x)σ(y) + h Σ_x σ(x)
J(x − y) = (1/L^d) V(|x − y|/L), x, y ∈ Λ_N
J̄_{k,l} = (1/q²) Σ_{x∈C_k} Σ_{y∈C_l} J(x − y)
H̄⁽⁰⁾(η) = −(1/2) Σ_k Σ_{l≠k} J̄_{k,l} η(k)η(l) − (1/2) Σ_k J̄_{k,k}(η(k) − q) + h Σ_k η(k)
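A sketch of how the compressed couplings J̄_{k,l} can be assembled on a 1-D lattice (illustrative code of mine; the sums are computed once offline, after which the CG dynamics works only with the M×M coarse matrix instead of the N×N microscopic couplings):

```python
def coarse_couplings(J, N, q):
    """Compressed couplings Jbar[k][l] = (1/q^2) * sum over x in C_k, y in C_l
    of J(x - y), for cells of q consecutive sites on a 1-D lattice of N sites.
    J is a callable J(r); the x == y term is excluded as in the Hamiltonian."""
    M = N // q
    Jbar = [[0.0] * M for _ in range(M)]
    for k in range(M):
        for l in range(M):
            s = 0.0
            for x in range(k * q, (k + 1) * q):
                for y in range(l * q, (l + 1) * q):
                    if x != y:
                        s += J(x - y)
            Jbar[k][l] = s / (q * q)
    return Jbar
```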
Multi-body terms in Coarse Graining
Corrections to H̄⁽⁰⁾: H̄_m(η) = H̄⁽⁰⁾_m(η) + H̄⁽¹⁾_m(η) + ...
H̄⁽¹⁾(η) = β Σ_{k1} Σ_{k2>k1} Σ_{k3>k2} [ j²_{k1k2k3} (−E₁(k1)E₂(k2)E₁(k3) + ...
• "Moments" of the interaction potential J:
j²_{k1k2k3} = Σ_{x∈C_{k1}} Σ_{y∈C_{k2}} Σ_{z∈C_{k3}} (J(x−y) − J̄(k1, k2))(J(y−z) − J̄(k2, k3))
Typically omitted but essential to capture phase transitions, hysteresis
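The triple-sum "moment" can be sketched directly from its definition (illustrative code; `cells`, `J`, and `Jbar` are assumed inputs with the obvious meanings). A useful sanity check: the moment vanishes when the potential is constant across the cells, since then J̄ equals J.

```python
def j2_moment(J, Jbar, cells, k1, k2, k3):
    """Moment of the interaction potential:
    j2 = sum over x in C_k1, y in C_k2, z in C_k3 of
         (J(x-y) - Jbar[k1][k2]) * (J(y-z) - Jbar[k2][k3])."""
    return sum(
        (J(x - y) - Jbar[k1][k2]) * (J(y - z) - Jbar[k2][k3])
        for x in cells[k1] for y in cells[k2] for z in cells[k3]
    )
```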
[Figure: coverage vs. external field h (hysteresis loops) for N = 1024, q = 8, βJ₀ = 5.0, comparing MC, q = 8 with corrections & potential splitting, q = 8 with corrections & no potential splitting, and q = 8; inset: potential J(r) vs. r.]
Loss of Information & Coarse Graining
Relative entropy: R(μ|ν) = ∫ log(dμ/dν) dμ
Theorem [Error estimates]:
1. For ε = β (q/L) ‖∇J‖₁,
(1/N) R(μ⁽ᵖ⁾|μ) = O(ε^{p+2})
2. Cluster expansions → an a posteriori expansion for the relative entropy:
(1/N) R(μ⁽ᵖ⁾|μ) = E_{μ⁽⁰⁾}[I(η)] + log( E_{μ⁽⁰⁾}[e^{−I(η)}] ) + O(ε³)
The error indicator I(·) is given by the terms H̄⁽¹⁾, H̄⁽²⁾ and depends only on the coarse variable η ∼ μ⁽⁰⁾.
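For discrete measures the relative entropy reduces to a finite sum; a minimal sketch (illustrative code, assuming μ is absolutely continuous with respect to ν):

```python
import math

def relative_entropy(mu, nu):
    """R(mu|nu) = sum_s mu(s) * log(mu(s) / nu(s)) for discrete distributions
    given as dicts; terms with mu(s) = 0 contribute zero by convention."""
    return sum(p * math.log(p / nu[s]) for s, p in mu.items() if p > 0.0)

# R(mu|mu) = 0, and R(mu|nu) >= 0 in general (Gibbs' inequality)
```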
Parallel KMC Simulation in Lattice Systems
Markovian Dynamics: Adsorption/Desorption/Reaction/ Diffusion
Generator: ∂_t E f(σ) = E Σ_{x∈Λ} c(x, σ)[f(σ^x) − f(σ)].
Parallel KMC Simulation in Lattice Systems
Lubachevsky, JCP '88; Korniss, Novotny, Rikvold JCP '01, ...
Main idea in geometric parallelization: break up the lattice into smaller sub-lattices.
Run KMC on each sub-lattice on a separate processor and communicate across boundaries.
However: asynchronous updates in neighboring sites across processors in standard CTMC implementations
TABLE I. Mean time increments Δt (in MCSP) for the parallel n-fold way algorithm with different block sizes l at T = 0.7Tc, |H|/J = 0.2857, and for the serial algorithm (*). They are approximately independent of the full system size L and NPE; (*) the mean time increment for the serial algorithm is approximately independent of L.
l:    16   32   64   128  256  512  1024  serial (*)
Δt:   3.7  6.1  9.2  12.6 15.4 17.4 18.5  19.9
TABLE II. Mean time increments (in MCSP) for the serial and the parallel n-fold way algorithms for different temperatures and magnetic fields (NPE = 64, l = 128).
|H|/J:               0.1587  0.2222  0.2857  0.3492  0.4127
T/Tc = 0.6 serial       -     81.5    61.4    46.4    36.3
           parallel     -     23.4    21.4    19.3    17.4
T/Tc = 0.7 serial     33.8    25.4    19.9    16.5    14.3
           parallel   16.8    14.5    12.6    11.1    10.1
T/Tc = 0.8 serial     12.5    10.4     9.2     8.5     7.9
           parallel    9.2     8.0     7.4     6.9     6.5
FIG. 1. Schematic diagram of the spatial decomposition of the system (PE0-PE8) and its mapping onto a parallel machine. Here L = 12 and l = 4. Each of the NPE = (L/l)² = 9 processing elements (PEs) carries l² = 16 spins, confined by solid lines. The spins on the boundary are separated from those in the kernel by dashed lines.
Uniformization and Parallel Simulation in Lattice Systems
One solution is "uniformization": the same process in distribution, but we pick a clock with uniform rate λ* such that
max_x λ(x) ≤ λ*,
and a new skeleton process
p*(x, y) = 1 − λ(x)/λ*, if x = y; (λ(x)/λ*) p(x, y), if x ≠ y.
p*(x, y) introduces rejections in the algorithm.
Asynchronous algorithms, unless a boundary event occurs.
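A sketch of one uniformized (thinned) step under the construction above; `lam`, `p`, and `lam_star` are illustrative stand-ins for λ(·), the skeleton kernel, and the uniform bound:

```python
import random

def uniformized_step(x, lam, p, lam_star, rng=random.Random(2)):
    """One step of the uniformized chain: a global clock with rate
    lam_star >= max_x lam(x) fires after an Exp(lam_star) time; the proposed
    move x -> y (drawn from p(x, .)) is accepted with probability
    lam(x)/lam_star, otherwise the step is a rejection (self-loop at x)."""
    assert lam(x) <= lam_star, "lam_star must dominate all exit rates"
    tau = rng.expovariate(lam_star)
    if rng.random() < lam(x) / lam_star:
        return tau, p(x, rng)     # real jump via the skeleton chain
    return tau, x                 # rejection: stay at x
```

The closer λ* is to max_x λ(x), the fewer rejections; a loose bound wastes most clock firings, which is exactly the inefficiency noted on the next slide.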
Parallel Simulation in Lattice Systems
However,
we still have excessive communication between processors in the case of complex interactions: communication (boundary) regions can be "wide", in contrast to the n.n. case:
many rejections for poor upper bounds λ* ≥ max_x λ(x).
Variants: partially rejection-free methods
Synchronous algorithms: Exact Simulation
J. Phys.: Condens. Matter 21 (2009) 084214, G. Nandipati et al
Figure 6. Schematic diagram of square decomposition for Np = 9.Solid lines correspond to processor domains.
carrying out our parallel KMC simulations of coarsening we have tested and developed several different parallel algorithms in order to determine which is the most efficient. In particular, we have studied a recently developed rigorous algorithm, the 'optimistic synchronous relaxation' (OSR) algorithm, as well as a modified version of this algorithm which we refer to as 'optimistic synchronous relaxation with pseudo-rollback' (OSRPR). In addition, we have tested the recently developed semi-rigorous synchronous sublattice (SL) algorithm. Finally, to reduce the number of events corresponding to boundary events between processors we have also developed a new method, which we refer to as 'dynamic boundary allocation' (DBA). Below we discuss each of these methods and some of the details of their implementation.
4.1. Optimistic synchronous relaxation (OSR) algorithm
One of the first rigorous algorithms for parallel discrete-event simulations was the synchronous relaxation algorithm developed by Lubachevsky [9]. We note that the application of this algorithm to KMC simulations as well as its scaling as a function of the number of processors Np has been recently studied by Shim and Amar [11]. However, since this algorithm is relatively complex and requires multiple iterations for each cycle, Merrick and Fichthorn have recently developed a similar but simpler algorithm which they refer to as optimistic synchronous relaxation (OSR) [12].
Figure 6 shows a typical decomposition of a square system into Np square regions, where Np is the number of processors. Also indicated in figure 6 are the boundary and 'ghost' regions for the central processor, where the boundary region is defined as that portion of the processor's domain in which a change may affect neighboring processors. Similarly, the ghost region corresponds to that part of the neighboring processors' domains which can affect a given processor. Thus, in general the width of the boundary and ghost regions must be at least equal to the range of interaction.
As shown in figure 7, in the OSR algorithm in eachcycle all processors start with the same initial time and then
Figure 7. Time evolution of events for OSR and OSRPR algorithms with G = 4. Dashed lines correspond to selected events, while the dashed line with an X corresponds to an event exceeding tmin (see the text). In OSR this event is discarded while in OSRPR this event is added to the next cycle.
simultaneously and independently carry out KMC events in their domains until either the number of KMC events reaches a pre-determined fixed number G, or one of the events corresponds to a 'boundary event', i.e. an event which modifies the boundary region of the given processor and which can thus affect events in neighboring processors. Defining the time for the last event in each processor as tlast, a global communication is then carried out to determine the time tmin corresponding to the minimum of tlast over all processors. Each processor then 'rolls back' or undoes all KMC events which occur after tmin. If there are no boundary events then the processors all move on to the next cycle with the new starting time corresponding to tmin. However, if tmin corresponds to a boundary event, then an additional communication is needed to update the ghost and/or boundary regions of all processors affected by the boundary event.
We note that typically the OSR algorithm requires 2–3 global communications each cycle, one to determine tmin,another to determine if the event with tmin corresponded to aboundary event, and a third to update the boundary regionsof the affected processors if there was a boundary event.To reduce the number of global communications we haveencoded the processor identity as well as whether or not thelast event was a boundary event, along with the least advancedtime of each processor before doing a global communicationto determine tmin. This was done by replacing tlast with anumber whose most significant figures corresponded to tlast butwhose least significant figures contained information about theprocessor ID and whether or not that processor had a boundaryevent6. Thus, in our implementation of the OSR algorithm onlyone global communication was needed if tlast corresponded to anon-boundary event, while two communications were neededif it was a boundary event.
6 In this method, the time each processor advances from its previous cycleis multiplied by a very large number to form the integer part of the doubleprecision packed number. The ratio of the processor ID to the total numberof processors used Np is then added to the decimal part if there is a boundaryevent in that processor. If there is no boundary event in that processor a decimalnumber is added such that it does not correspond to any processor identity. Inour implementation the multiplying number was 1020, which leads to goodaccuracy.
5
Shim, Amar, PRB ’05, Merrick, Fichthorn, PRE ’07, etc
Synchronous algorithm: uniform time window for eachprocessor unless a boundary event occurs.
Resolve conflicts at boundary regions by communicating withneighboring processors and restart cycle.
Global communication overhead in each cycle.
Previous methods rely on exact simulation of the stochasticprocess.
Lattice Systems KMC Coarse-Graining Parallel KMC Conclusions
Synchronous algorithms: Exact Simulation
(from G. Nandipati et al., J. Phys.: Condens. Matter 21 (2009) 084214)
Figure 6. Schematic diagram of square decomposition for Np = 9.Solid lines correspond to processor domains.
carrying out our parallel KMC simulations of coarsening we have tested and developed several different parallel algorithms in order to determine which is the most efficient. In particular, we have studied a recently developed rigorous algorithm, the 'optimistic synchronous relaxation' (OSR) algorithm, as well as a modified version of this algorithm which we refer to as 'optimistic synchronous relaxation with pseudo-rollback' (OSRPR). In addition, we have tested the recently developed semi-rigorous synchronous sublattice (SL) algorithm. Finally, to reduce the number of events corresponding to boundary events between processors we have also developed a new method, which we refer to as 'dynamic boundary allocation' (DBA). Below we discuss each of these methods and some of the details of their implementation.
4.1. Optimistic synchronous relaxation (OSR) algorithm
One of the first rigorous algorithms for parallel discrete-event simulations was the synchronous relaxation algorithm developed by Lubachevsky [9]. We note that the application of this algorithm to KMC simulations, as well as its scaling as a function of the number of processors Np, has been recently studied by Shim and Amar [11]. However, since this algorithm is relatively complex and requires multiple iterations for each cycle, Merrick and Fichthorn have recently developed a similar but simpler algorithm which they refer to as optimistic synchronous relaxation (OSR) [12].
Figure 6 shows a typical decomposition of a square system into Np square regions, where Np is the number of processors. Also indicated in figure 6 are the boundary and 'ghost' regions for the central processor, where the boundary region is defined as that portion of the processor's domain in which a change may affect neighboring processors. Similarly, the ghost region corresponds to that part of the neighboring processors' domains which can affect a given processor. Thus, in general the width of the boundary and ghost regions must be at least equal to the range of interaction.
As shown in figure 7, in the OSR algorithm in each cycle all processors start with the same initial time and then simultaneously and independently carry out KMC events in their domains until either the number of KMC events reaches a pre-determined fixed number G, or one of the events corresponds to a 'boundary event', i.e. an event which modifies the boundary region of the given processor and which can thus affect events in neighboring processors. Defining the time for the last event in each processor as tlast, a global communication is then carried out to determine the time tmin corresponding to the minimum of tlast over all processors. Each processor then 'rolls back' or undoes all KMC events which occur after tmin. If there are no boundary events then the processors all move on to the next cycle with the new starting time corresponding to tmin. However, if tmin corresponds to a boundary event, then an additional communication is needed to update the ghost and/or boundary regions of all processors affected by the boundary event.

Figure 7. Time evolution of events for OSR and OSRPR algorithms with G = 4. Dashed lines correspond to selected events, while the dashed line with an X corresponds to an event exceeding tmin (see the text). In OSR this event is discarded, while in OSRPR it is added to the next cycle.
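The draw-then-roll-back structure of one OSR cycle can be mocked up serially. The sketch below uses hypothetical Poisson event times and an invented boundary-event probability; it illustrates the cycle logic, not the paper's implementation:

```python
import random

random.seed(2)

def osr_cycle(n_procs=4, G=7, rate=1.0, p_boundary=0.2):
    """One OSR cycle (serial mock): each processor draws KMC events until
    G events or a boundary event, then all roll back past t_min."""
    histories = []
    for _ in range(n_procs):
        t, events = 0.0, []
        for _ in range(G):
            t += random.expovariate(rate)            # exponential waiting time
            boundary = random.random() < p_boundary  # event touches boundary?
            events.append((t, boundary))
            if boundary:                             # stop this cycle early
                break
        histories.append(events)
    t_min = min(ev[-1][0] for ev in histories)       # global communication
    kept = [[e for e in ev if e[0] <= t_min] for ev in histories]  # rollback
    return t_min, histories, kept

t_min, histories, kept = osr_cycle()
```

The processor that attains t_min keeps all of its events; every other processor discards (OSR) or, in the OSRPR variant below, carries over the events past t_min.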
We note that typically the OSR algorithm requires 2–3 global communications each cycle: one to determine tmin, another to determine if the event with tmin corresponded to a boundary event, and a third to update the boundary regions of the affected processors if there was a boundary event. To reduce the number of global communications we have encoded the processor identity, as well as whether or not the last event was a boundary event, along with the least advanced time of each processor before doing a global communication to determine tmin. This was done by replacing tlast with a number whose most significant figures corresponded to tlast but whose least significant figures contained information about the processor ID and whether or not that processor had a boundary event (see footnote 6). Thus, in our implementation of the OSR algorithm only one global communication was needed if tlast corresponded to a non-boundary event, while two communications were needed if it was a boundary event.
Footnote 6: In this method, the time each processor advances from its previous cycle is multiplied by a very large number to form the integer part of the double precision packed number. The ratio of the processor ID to the total number of processors used, Np, is then added to the decimal part if there is a boundary event in that processor. If there is no boundary event in that processor, a decimal number is added such that it does not correspond to any processor identity. In our implementation the multiplying number was 10^20, which leads to good accuracy.
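The packing trick in the footnote can be sketched as follows. `SCALE` stands in for the paper's 10^20 multiplier, and the sentinel value used for "no boundary event" is an assumption of this sketch; a single min-reduction over the packed values then yields tmin, its owner, and the boundary-event flag at once:

```python
SCALE = 1e6  # stands in for the 1e20 multiplier of the text; exact for this demo

def pack(t_last, pid, n_procs, boundary_event):
    """Most significant figures carry the time; the decimal part carries the
    processor ID only when that processor had a boundary event."""
    whole = float(int(t_last * SCALE))
    frac = pid / n_procs if boundary_event else 1.0 / (2 * n_procs)  # sentinel
    return whole + frac

def unpack(packed, n_procs):
    whole = int(packed)
    frac = packed - whole
    pid = round(frac * n_procs)
    boundary = abs(frac - pid / n_procs) < 1e-9  # does frac match a real ID?
    return whole / SCALE, (pid if boundary else None), boundary

# One "allreduce-min" on the packed values gives t_min and, simultaneously,
# whether the slowest processor had a boundary event.
last_events = [(0.031, False), (0.012, True), (0.044, False), (0.019, False)]
packed = [pack(t, pid, 4, b) for pid, (t, b) in enumerate(last_events)]
t_min, owner, had_boundary = unpack(min(packed), 4)
```

This is essentially a hand-rolled version of a min-with-location reduction, folding the second communication of the naive scheme into the first.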
Shim, Amar, PRB '05; Merrick, Fichthorn, PRE '07, etc.

Synchronous algorithm: uniform time window for each processor unless a boundary event occurs.

Resolve conflicts at boundary regions by communicating with neighboring processors and restart the cycle.

Global communication overhead in each cycle.

Previous methods rely on exact simulation of the stochastic process.
Synchronous algorithms: Sub-Lattice Method
(from G. Nandipati et al., J. Phys.: Condens. Matter 21 (2009) 084214)
Figure 8. Comparison between parallel results using the OSRPR algorithm with square decomposition (Np = 4) and serial results for a fractal model with D/F = 10^5 and G = 7.
We note that, in the OSR algorithm for a given configuration, there is an optimal value of G which takes into account the tradeoffs between communication time (which is wasted if there are no boundary events) and rollbacks. While, in general, an adaptive method could be used to attempt to optimize the value of G from cycle to cycle, in practice we have found it more efficient to simply use trial and error to find the optimal fixed value of G for our simulation (see section 4.4).
4.2. Optimistic synchronous relaxation with pseudo-rollback (OSRPR) algorithm
In the OSR algorithm each processor discards all KMC events which occur after tmin. However, this is unnecessary if there are no boundary events in any of the processors. Therefore, we have considered a variation of the OSR algorithm (optimistic synchronous relaxation with pseudo-rollback) in which, when there are no boundary events in the system, those events that would have been discarded are added to the next cycle. This can reduce the loss of computational time due to undoing and then 'redoing' events and thus enhances the computational efficiency. As a test of the OSRPR algorithm, we have carried out parallel simulations using this algorithm for a 'fractal' model of irreversible submonolayer growth in which only monomer deposition and diffusion processes are included [11], with Np = 4. As expected, there is excellent agreement between serial and parallel results for the island and monomer densities (see figure 8).
4.3. Synchronous sublattice (SL) algorithm
In order to maximize the parallel efficiency we have also carried out simulations using the semi-rigorous synchronous sublattice (SL) algorithm recently developed by Shim and Amar [13]. To avoid conflicts between processors, in the SL algorithm each processor domain is divided into subregions or sublattices (see figure 9). A complete synchronous cycle corresponding to a cycle time τ is then as follows. At the beginning of a cycle, each processor's local time is initialized to zero. One of the sublattices (A or B) is then randomly selected so that all processors operate on the same sublattice during the cycle. Each processor then simultaneously and independently carries out KMC events in the selected sublattice until the time of the next event exceeds the time interval τ (see figure 10). The processors then communicate any necessary changes (boundary events) with their neighboring processors, update their event rates and move on to the next cycle using a new randomly chosen sublattice. We note that, in order to ensure accuracy, the cycle time must typically be less than or equal to the inverse of the fastest possible single-event rate in the system [13].

Figure 9. Schematic diagram of strip decomposition for Np = 2. Each processor domain is subdivided into A and B sublattices. Boundary and ghost regions for the B sublattice of processor 1 are also shown.

Figure 10. Time evolution in the SL algorithm. Dashed lines correspond to selected events, while the dashed line with an X corresponds to an event which is rejected since it exceeds the cycle time τ.
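A minimal serial mock of one SL cycle, assuming Poisson event times and an invented rate (this is a sketch of the cycle logic, not the Shim–Amar code):

```python
import random

random.seed(0)

def sl_cycle(n_procs=4, tau=0.5, rate=4.0):
    """One synchronous sublattice cycle (sketch): every processor works on
    the same randomly chosen sublattice; an event whose time would exceed
    the cycle time tau is rejected, and the processor waits out the cycle."""
    sublattice = random.choice("AB")        # same draw on every processor
    event_times = []
    for _ in range(n_procs):
        t, accepted = 0.0, []
        while True:
            t_next = t + random.expovariate(rate)
            if t_next > tau:                # exceeds the cycle time: reject
                break
            t = t_next
            accepted.append(t)
        event_times.append(accepted)
    # here: only local communication of boundary changes with neighbors
    return sublattice, event_times

sub, event_times = sl_cycle()
```

Because conflicts are excluded by construction (all processors work on the same sublattice), no rollback and no global reduction are needed, which is the source of the better scaling discussed next.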
Since it only requires local communication, the scaling behavior of the SL algorithm is significantly better than for the OSR and OSRPR algorithms. As a result, it has been shown to be relatively efficient in parallel KMC simulations of a variety of models of growth [13, 17] and island coarsening [18]. In addition, while it is not exact, in simulations of a variety of models [13, 17, 18] it was found that, unless the processor size is extremely small (smaller than a 'diffusion length') or the cycle time is too large, there is essentially perfect agreement with serial results.
Shim, Amar, PRB ’07
Synchronous algorithm: fixed time window (cycle).
Random choice of sublattice and restart of cycle.
No global communication overhead in each cycle.
Relies on an approximation of the stochastic process.
Hierarchical Parallel KMC algorithms
M.K., P. Plechac (U of TN and ORNL), G. Arabatzis (U of Crete) and L. Xu (CS, UDel), preprint (2010)
Markovian Dynamics: Adsorption/Desorption/Reaction/Diffusion
Adsorption/Desorption/Reaction Generator:

LN f(σ) = Σ_{x∈Λ} c(x, σ)[f(σ^x) − f(σ)]
Multi-site updates σx for most systems, e.g.
Suchorski et al., ChemPhysChem (2010). Complex behavior: bistability, oscillations, chaos, patterning, etc.
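For concreteness, a minimal SSA (direct-method) step realizing a generator of this form with single-site flips; the adsorption/desorption rates are hypothetical placeholders for c(x, σ):

```python
import random

random.seed(1)

def kmc_step(sigma, rate_fn):
    """One SSA step: choose site x with probability c(x, sigma)/total_rate,
    flip it, and advance time by an Exp(total_rate) increment."""
    rates = [rate_fn(x, sigma) for x in range(len(sigma))]
    total = sum(rates)
    u = random.random() * total
    x, acc = 0, rates[0]
    while acc < u:                  # linear search; real codes use rate trees
        x += 1
        acc += rates[x]
    sigma = sigma[:x] + [1 - sigma[x]] + sigma[x + 1:]   # single-site update
    return sigma, random.expovariate(total)

# hypothetical rates: adsorption at rate 1.0 on empty sites, desorption at 0.5
c = lambda x, s: 1.0 if s[x] == 0 else 0.5
sigma, t = [0] * 8, 0.0
for _ in range(50):
    sigma, dt = kmc_step(sigma, c)
    t += dt
```

Multi-site updates σ^x (as in the reaction mechanisms above) would replace the single flip with a cluster move, but the select-and-advance structure is the same.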
Analogy to coarse-graining
Decompose the particle system into parts communicating minimally; thus, local information is represented by suitable coarse variables, or computed on separate processors within a parallel architecture.
Example: a 1-D equilibrium calculation

H^s_N(σ) = Σ_k H^s_k(σ) + Σ_k W_{k,k+1}(σ)

H^s_k: short-range Hamiltonian for the k-th CG cell with free boundary conditions

W_{k,k+1}: short-range interactions between the k-th and (k+1)-th CG cells

e^{−βH^s_N} P_N(dσ|η) = Π_{k odd} [ e^{−β(W_{k−1,k}+W_{k,k+1})} e^{−βH^s_k} P_k(dσ_k|η^{(k)}) ] × Π_{k even} e^{−βH^s_k} P_k(dσ_k|η^{(k)})

1D Operator Splitting Algorithm (OpSpl):
1. Apply SSA on white cells in parallel until the synchronization time.
2. Apply SSA on black cells in parallel until the synchronization time.
3. Go to 1.
Analogy to coarse-graining
Simplified CG: when the W_{k,k+1} are disregarded, there are no CG cell correlations (though intra-cell correlations remain) and the CG Hamiltonian is

H^{(s,0)}_m(η) = Σ_k U^{(s,0)}_k(η_k) = − Σ_k (1/β) log ∫ e^{−βH^s_k(σ)} P_k(dσ_k|η^{(k)}),

i.e. independent sampling over each coarse cell with free boundary conditions.

1D Operator Splitting Algorithm (OpSpl):
1. Apply SSA on white cells in parallel until the synchronization time.
2. Apply SSA on black cells in parallel until the synchronization time.
3. Go to 1.
Parallelization: trivial, no communication between CG cells.
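The simplified CG potential can be tabulated exactly for a small cell. The sketch below assumes a 1-D nearest-neighbor cell Hamiltonian with coupling J, free boundary conditions, and a uniform conditional prior at fixed coverage η; these choices are illustrative, not the model of the text:

```python
import itertools
import math

def cg_potential(q=4, beta=1.0, J=1.0):
    """U(eta) = -(1/beta) log E[exp(-beta H_k) | coverage eta] for one
    q-site cell, averaging over the uniform conditional prior."""
    def H(sig):  # short-range (nearest-neighbor) cell Hamiltonian, free b.c.
        return -J * sum(sig[i] * sig[i + 1] for i in range(q - 1))
    z = [0.0] * (q + 1)   # conditional sums of Boltzmann weights per coverage
    n = [0] * (q + 1)     # number of microstates per coverage level
    for sig in itertools.product((0, 1), repeat=q):
        eta = sum(sig)
        z[eta] += math.exp(-beta * H(sig))
        n[eta] += 1
    return [-math.log(z[e] / n[e]) / beta for e in range(q + 1)]

U = cg_potential()
```

For larger cells the enumeration is replaced by sampling, but the key point survives: each U_k depends only on its own cell, so the table is computed once per cell type with no communication.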
Step 1: Generator decomposition
L_N f(σ) = Σ_{x∈Λ} c(x, σ)[f(σ^x) − f(σ)]
         = Σ_k Σ_{x∈C_k} c(x, σ)[f(σ^x) − f(σ)]
         = Σ_{k odd} L_k f(σ) + Σ_{k even} L_k f(σ)
         := L_O f(σ) + L_E f(σ)

1D OpSpl:
1. Apply SSA on white cells in parallel until the synchronization time.
2. Apply SSA on black cells in parallel until the synchronization time.
3. Go to 1.

2D OpSpl (checkerboard decomposition):
1. Apply SSA on white cells in parallel until the synchronization time.
2. Apply SSA on black cells in parallel until the synchronization time.
3. Go to 1.
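The odd/even alternation can be sketched directly from the decomposition; the toy constant-rate spin-flip dynamics below stands in for c(x, σ), and the sequential loop over cells stands in for genuinely parallel execution:

```python
import random

random.seed(3)

def ssa_on_cells(sigma, cells, dt, rate=1.0):
    """Run SSA independently on each cell for time dt; no communication.
    Toy dynamics: every site flips at a constant rate."""
    for cell in cells:                      # each cell on its own processor
        t = 0.0
        while True:
            t += random.expovariate(rate * len(cell))
            if t > dt:
                break
            x = random.choice(cell)
            sigma[x] = 1 - sigma[x]
    return sigma

def opspl_step(sigma, cells, dt):
    """One fractional step: exp(L_O dt) then exp(L_E dt)."""
    sigma = ssa_on_cells(sigma, cells[0::2], dt)   # odd cells  (L_O)
    return ssa_on_cells(sigma, cells[1::2], dt)    # even cells (L_E)

N, q = 16, 4
cells = [list(range(k, k + q)) for k in range(0, N, q)]
sigma = opspl_step([0] * N, cells, dt=0.5)
```

Within each half-step the cells of one color evolve with frozen neighbors of the other color, which is exactly what makes them independent tasks.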
Step 2: Trotter product & Fractional Step Approximation
Trotter product formula for semigroups (Trotter, Proc. AMS (1959)):

lim_{h→0} (e^{Ah} e^{Bh})^{[t/h]} f = e^{(A+B)t} f

Random Trotter product formula for jump processes: Kurtz, Proc. AMS (1972).

Approximation of the Markov semigroup based on the (random) Trotter theorem, via Lie or Strang splitting:

e^{L_N Δt} ≈ e^{L_O Δt} e^{L_E Δt}  (Lie),   e^{L_N Δt} ≈ e^{L_O Δt/2} e^{L_E Δt} e^{L_O Δt/2}  (Strang)

For short-range interactions, the processes generated by the L_k are independent and can be simulated on separate processors:

e^{L_N Δt} ≈ e^{L_O Δt} e^{L_E Δt} ≈ Π_{k odd} e^{L_k Δt} × Π_{k even} e^{L_k Δt},

with each factor e^{L_k Δt} simulated on the k-th processor.
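The first-order accuracy of the Lie splitting can be checked numerically on a toy pair of non-commuting generators; the sketch uses plain matrix semigroups (with a Taylor-series exponential) rather than jump processes:

```python
def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(A, terms=40):
    """Matrix exponential by truncated Taylor series (fine for small A)."""
    n = len(A)
    E = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    T = [row[:] for row in E]
    for k in range(1, terms):
        T = mat_mul(T, [[a / k for a in row] for row in A])
        E = [[E[i][j] + T[i][j] for j in range(n)] for i in range(n)]
    return E

def lie_split(A, B, t, n):
    """(exp(A t/n) exp(B t/n))^n, the Lie-Trotter approximation."""
    dt = t / n
    step = mat_mul(mat_exp([[a * dt for a in row] for row in A]),
                   mat_exp([[b * dt for b in row] for row in B]))
    P = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        P = mat_mul(P, step)
    return P

A = [[0.0, 1.0], [0.0, 0.0]]   # toy non-commuting generators
B = [[0.0, 0.0], [1.0, 0.0]]
exact = mat_exp([[A[i][j] + B[i][j] for j in range(2)] for i in range(2)])

def split_err(n):
    P = lie_split(A, B, 1.0, n)
    return max(abs(P[i][j] - exact[i][j]) for i in range(2) for j in range(2))
```

Halving the step roughly halves the error (first order in Δt); the Strang arrangement would gain one order at the cost of an extra half-step per cycle.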
Benchmarks and Error Analysis
(Poster: "Contd: new parallel algorithms for kinetic Monte Carlo simulations"; M.A. Katsoulakis (UMASS), P. Plechac (UTK/ORNL), D.G. Vlachos (UDel); postdoc: E. Kalligiannaki.)

Mathematical tools and algorithms:
Kinetic Monte Carlo methods amenable to parallelization on GPU clusters
Benchmark model defined and accuracy tested
Simulation of real chemical processes (oxidation)
Distributed (MPI) version implemented
1000x speed-up compared to standard implementation
Controlled approximation of the original Markov jump process

Simulation of an oxidation process on the 2D lattice; domain decomposition depicted together with the workload on GPU cells. Dynamic load balancing: example of the algorithm.

Figure: Phase diagram of the critical 2D Ising model used as a benchmark for accuracy (Onsager solution).

Figure: No load balancing. Figure: With load balancing.
Flexibility in choosing suitable decompositions

Controlled error approximations for observables with a similar toolbox as in CG
K., Plechac, Sopasakis, SIAM Num. Anal. ’06
Algorithm Performance
Figure: GPU and sequential code execution time (in seconds) versus lattice size N (lattice = N^2), for the sequential code and GPU runs with dt = 0.01, 0.1, 1, 10 on Fermi and Tesla architectures.
GPU simulation with various architectures (e.g. Fermi) vs. DNS
Dynamic load balancing
Figure: No load balancing
Figure: With load balancing
Load balancing controlled by the number of jumps executed on each sub-domain.

Mass transport to a uniform histogram.

The fractional step approximation allows for tuning the balancing.
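One crude way to realize "mass transport to a uniform histogram" is to re-draw strip boundaries from the per-cell jump counts by a greedy prefix split; the workload numbers below are invented for illustration:

```python
def rebalance(jumps_per_cell, n_procs):
    """Choose strip boundaries so that each processor's share of the jump
    histogram is as close to uniform as the greedy prefix split allows."""
    total = sum(jumps_per_cell)
    target = total / n_procs
    bounds, acc, cut = [0], 0.0, 1
    for i, j in enumerate(jumps_per_cell):
        acc += j
        if acc >= cut * target and len(bounds) < n_procs:
            bounds.append(i + 1)      # close the current strip after cell i
            cut += 1
    bounds.append(len(jumps_per_cell))
    return bounds  # processor p owns cells [bounds[p], bounds[p+1])

# hypothetical workload: activity concentrated in the middle of the lattice
jumps = [1, 1, 2, 8, 9, 10, 9, 7, 2, 1, 1, 1]
bounds = rebalance(jumps, 4)
loads = [sum(jumps[bounds[p]:bounds[p + 1]]) for p in range(4)]
```

The greedy split is not optimal, but it already beats the naive equal-width decomposition on skewed workloads, and since the fractional step approximation decouples the sub-domains, boundaries can be moved between cycles at little cost.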
Hierarchical Structure and multiple GPUs: 2D OpSpl on multi-GPUs

Hierarchical methods are well suited for current architectures which have sophisticated memory hierarchies, e.g. GPUs.

Hierarchical lattice partitioning on a GPU cluster: macro-, meso-, and micro-cells
MPI/OpenMP communication between GPUs
Hierarchical Structure-Algorithm Performance
Figure: GPU and sequential code execution time (in seconds) versus lattice size N (lattice = N^2): two sequential runs compared with 64-GPU MPI runs and single-GPU (Fermi) runs for dt = 0.01, 0.1, 1, 10.
2D unimolecular reaction/diffusion particle system

Up to 10^5x speed-up compared to the standard serial implementation

With 64 GPUs one can simulate with relative ease 10^8 particles (approx. 1 µm^2).
Hierarchical Structure & Multiple Scales
1. Micromechanisms with (very) different time scales, e.g. fast diffusion in CO oxidation on Pt [Suchorski et al., Phys. Rev. Lett. '99]:

ε^{−1} L_diff + L_reaction,  ε ≪ 1

Combine the hierarchical structure with established uses of Trotter products for molecular systems with fast/slow mechanisms. Molecular Dynamics: Tuckerman et al., J. Chem. Phys. '92.
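A toy sub-cycling (multiple-time-stepping) sketch of this fast/slow splitting: many diffusion sub-steps per slow reaction step. The exclusion-diffusion and decay dynamics are invented for illustration, not the CO-oxidation model:

```python
import math
import random

random.seed(5)

def multiscale_step(occ, dt, eps=0.01, k_rxn=0.5, n_sub=20):
    """One fractional step with sub-cycling: n_sub fast diffusion sub-steps
    (rate ~ 1/eps) followed by one slow reaction step over the full dt."""
    L = len(occ)
    sub_dt = dt / n_sub
    for _ in range(n_sub):            # fast operator: exp((1/eps) L_diff sub_dt)
        for x in range(L):
            if occ[x] and random.random() < min(1.0, sub_dt / eps):
                y = (x + random.choice((-1, 1))) % L
                if not occ[y]:        # exclusion: hop only onto empty sites
                    occ[x], occ[y] = 0, 1
    p_decay = 1.0 - math.exp(-k_rxn * dt)  # slow operator: exp(L_rxn dt)
    for x in range(L):
        if occ[x] and random.random() < p_decay:
            occ[x] = 0                # unimolecular desorption/decay
    return occ

occ = [1] * 10 + [0] * 10
for _ in range(5):
    occ = multiscale_step(occ, dt=0.1)
```

The point of the arrangement is that the stiff factor is sub-cycled with its own small step sub_dt, while communication and the slow mechanism are handled once per Δt, mirroring the reversible multiple-time-stepping integrators of molecular dynamics.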
2. Optimizing the algorithms: computation vs. communication. K., Plechac, Xu, Taufer (CS, UDel).

Figure: Central cells; simulating white cells on the boundary.
Conclusions - Further Work
Kinetic Monte Carlo methods amenable to parallelization on GPU clusters
Benchmark model defined and accuracy tested
Distributed (MPI) version implemented
Controlled approximation of the original Markov jump process
Capability to simulate realistic chemical processes at large spatiotemporal scales
Conclusions - Further Work
However, challenges remain:

Systems in surface processes with short- and long-range interactions, patterning, etc.

Optimizing the algorithms: computation vs. communication.

Figure: Central cells; simulating white cells on the boundary.