Optimization of Communications towards Scalable Algorithms on Post-Petascale Supercomputers

Kengo Nakajima
Information Technology Center, The University of Tokyo
ScalA15: Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, in conjunction with SC15
November 16, 2015, Austin, Texas
• ppOpen-HPC
• ppOpen-MATH
  – ppOpen-MATH/MG: Multigrid Solver
  – Target Problems, Computer Systems
  – Optimization of Serial Communication
  – Optimization of Parallel Comm. (I): CGA
  – Optimization of Parallel Comm. (II): hCGA
• Summary
System Software in Post K Supercomputer: Yutaka Ishikawa (RIKEN), Tuesday 10:30-11:15
Post-Peta CREST: Development of System Software Technologies for Post-Peta Scale High Performance Computing
• Objectives
  – Co-design of system software with applications and post-petascale computer architectures
  – Development of deliverable software pieces
• Research Supervisor
  – Prof. Mitsuhisa Sato (RIKEN AICS)
• Run by JST (Japan Science and Technology Agency)
• Budget and Formation (2010 to 2018)
  – 55M-60M$ in total
  – Round 1: From 2010 for 5.5 years (5 Teams)
  – Round 2: From 2011 for 5.5 years (5 Teams)
  – Round 3: From 2012 for 5.5 years (4 Teams)
http://www.postpeta.jst.go.jp/en/
System Software (c/o Y. Ishikawa, RIKEN)
• Taisuke Boku, U. of Tsukuba: Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era
• Atsushi Hori, RIKEN AICS: Parallel System Software for Multi-core and Many-core
• Toshio Endo, Tokyo Tech.: Software Technology that Deals with Deeper Memory Hierarchy in Post-petascale Era
• Takeshi Nanri, Kyushu University: Development of Scalable Communication Library with Technologies for Memory Saving and Runtime Optimization
• Osamu Tatebe, U. of Tsukuba: System Software for Post Petascale Data Intensive Science
• Masaaki Kondo, U. of Tokyo: Power Management Framework for Post-Petascale Supercomputers
[Timeline 2013-2017: Round 1: 5 teams run; Round 2: 5 teams run; Round 3: 4 teams run]
Programming Models & Languages (c/o Y. Ishikawa, RIKEN)
• Naoya Maruyama, RIKEN AICS: Highly Productive, High Performance Application Frameworks for Post Petascale Computing
• Hiroyuki Takizawa, Tohoku University: An evolutionary approach to construction of a software development environment for massively-parallel heterogeneous systems
• Shigeru Chiba, U. Tokyo: Software development for post petascale supercomputing --- Modularity for Super Computing
[Timeline 2013-2017: Round 1: 5 teams run; Round 2: 5 teams run; Round 3: 4 teams run]
Applications & Numerical Libraries (c/o Y. Ishikawa, RIKEN)
[Timeline 2013-2017: Round 1: 5 teams run; Round 2: 5 teams run; Round 3: 4 teams run]
• Kengo Nakajima, University of Tokyo: ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT)
• Tetsuya Sakurai, University of Tsukuba: Development of an Eigen-Supercomputing Engine using a Post-Petascale Hierarchical Model
• Ryuji Shioya, Toyo University: Development of a Numerical Library based on Hierarchical Domain Decomposition for Post Petascale Simulation
• Katsuki Fujisawa, Kyushu University: Advanced Computing and Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers
• Itsuki Noda, AIST: Framework for Administration of Social Simulations on Massively Parallel Computers
ppOpen-HPC: Overview
• Application framework with automatic tuning (AT)
• "pp": post-peta-scale
• Five-year project (FY.2011-2015) (since April 2011)
• P.I.: Kengo Nakajima (ITC, The University of Tokyo)
• Part of "Development of System Software Technologies for Post-Peta Scale High Performance Computing" funded by JST/CREST (Supervisor: Prof. Mitsuhisa Sato, Co-Director, RIKEN AICS)
7
•Te
am w
ith 7
inst
itute
s, >
50 p
eopl
e (5
PD
s) fr
om v
ario
us fi
elds
: Co-
Des
ign
•IT
C/U
.Tok
yo, A
OR
I/U.T
okyo
, ER
I/U.T
okyo
, FS/
U.T
okyo
•H
okka
ido
U.,
Kyot
o U
., JA
MST
EC
•G
roup
Lea
ders
–M
asak
i Sat
oh (A
OR
I/U.T
okyo
)–
Taka
shi F
urum
ura
(ER
I/U.T
okyo
)–
Hiro
shi O
kuda
(GSF
S/U
.Tok
yo)
–Ta
kesh
i Iw
ashi
ta (K
yoto
U.,
ITC
/Hok
kaid
o U
.)–
Hid
e Sa
kagu
chi(
IFR
EE/J
AMST
EC)
•M
ain
Mem
bers
–
Taka
hiro
Kat
agiri
(ITC
/U.T
okyo
)–
Mas
ahar
uM
atsu
mot
o (IT
C/U
.Tok
yo)
–H
idey
uki J
itsum
oto
(Tok
yo T
ech)
–Sa
tosh
i Ohs
him
a (IT
C/U
.Tok
yo)
–H
iroya
su H
asum
i(AO
RI/U
.Tok
yo)
–Ta
kash
i Ara
kaw
a (R
IST)
–Fu
tosh
iMor
i (ER
I/U.T
okyo
)–
Take
shi K
itaya
ma
(GSF
S/U
.Tok
yo)
–Ak
ihiro
Ida
(AC
CM
S/Ky
oto
U.)
–M
iki Y
amam
oto
(IFR
EE/J
AMST
EC)
–D
aisu
ke N
ishi
ura
(IFR
EE/J
AMST
EC)
8
[Figure: what ppOpen-HPC covers — application development framework, math libraries, automatic tuning (AT), and system software]
Supercomputers in U.Tokyo: 2 big systems, 6-yr. cycle (FY.2005-2019)
• Hitachi SR11000/J2: 18.8 TFLOPS, 16.4 TB — fat nodes with large memory; (Flat) MPI, good comm. performance
• Hitachi HA8000 (T2K): 140 TFLOPS, 31.3 TB — turning point to the hybrid parallel prog. model (the 京(=K) / peta era)
• Hitachi SR16000/M1 based on IBM Power-7: 54.9 TFLOPS, 11.2 TB — our last SMP, to be switched to MPP
• Fujitsu PRIMEHPC FX10 based on SPARC64 IXfx: 1.13 PFLOPS, 150 TB
• Post T2K: 25+ PFLOPS (initial plan)
Target of ppOpen-HPC: Post T2K System
• Target system is the Post T2K system
  − 25+ PFLOPS, FY.2016
    ✓ JCAHPC (Joint Center for Advanced High Performance Computing): U. Tsukuba & U. Tokyo
    ✓ http://jcahpc.jp/
  − Many-core based (e.g. Intel MIC/Xeon Phi)
    ✓ MPI + OpenMP + X
  − ppOpen-HPC helps smooth transition of users (> 2,000) to the new system
    ✓ K/FX10, Cray, Xeon clusters are also in scope
Schedule of Public Release (with English Documents, MIT License)
http://ppopenhpc.cc.u-tokyo.ac.jp/
• Released at SC-XY (or can be downloaded)
• Multicore/manycore cluster version (Flat MPI, OpenMP/MPI Hybrid) with documents in English
• We are now focusing on MIC/Xeon Phi
• Collaborations are welcome
• History
  – SC12, Nov 2012 (Ver.0.1.0)
  – SC13, Nov 2013 (Ver.0.2.0)
  – SC14, Nov 2014 (Ver.0.3.0)
  – SC15, Nov 2015 (Ver.1.0.0)
New Features in Ver.1.0.0
http://ppopenhpc.cc.u-tokyo.ac.jp/
• HACApK library for H-matrix computations in ppOpen-APPL/BEM (OpenMP/MPI Hybrid Version)
  – First open-source library of this kind parallelized by OpenMP/MPI hybrid
• ppOpen-MATH/MP (coupler for multiphysics simulations, loose coupling of FEM & FDM)
• Matrix assembly and linear solvers for ppOpen-APPL/FVM
HACApK library [A. Ida & T. Iwashita]
■ Library for simulations using the integral equation method
  ・Open source ・MIT license
  (Equations: integral operator with a singular kernel; matrix blocks are classified as Full-Rank or Low-Rank)
■ For large-scale simulations
  ► Approximation technique for matrices: H-matrices with ACA (Adaptive Cross Approximation)
  ► Parallel computing: hybrid MPI+OpenMP programming model
■ Download site: http://ppopenhpc.cc.u-tokyo.ac.jp
Overview of H-matrices with ACA
[Figure: a full-rank dense matrix (20,000 unknowns) obtained from discretization is permuted and partitioned; ACA is applied to the blocks, yielding H-matrices with ACA]
ACA: Adaptive Cross Approximation [A. Ida & T. Iwashita]
■ Approximation technique for matrices arising from integral operators
  ・A low-rank matrix can be approximated by some pivot columns and rows.
  ・Pivot column and pivot row are alternately selected, vector by vector.
  ・Approximation error estimation (heuristic): the size of the newest cross ‖u_k‖·‖v_k‖ relative to Σ_j ‖u_j‖·‖v_j‖
■ Memory usage and approximation accuracy are controllable by the number of selected vectors.
■ Applied to blocks detected as possible low-rank submatrices
Low-rank approximation using ACA [A. Ida & T. Iwashita]
[Equations: the first pivot is an arbitrary column (e.g. the leftmost); the k-th pivot row is chosen as i_k := argmax of the residual entries in magnitude]
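The alternating pivot selection sketched above can be written down compactly. The following is an illustrative dense-matrix ACA in Python, not HACApK itself (a real H-matrix code never forms the residual explicitly; it evaluates kernel entries on demand and works block by block). Function names and the stopping constants are assumptions.

```python
import numpy as np

def aca(A, tol=1e-8, max_rank=50):
    # Partial-pivoted ACA sketch: approximate A ~ U @ V.T by alternately
    # picking pivot rows and columns of the residual, vector by vector.
    m, n = A.shape
    R = np.array(A, dtype=float)               # explicit residual (illustration only)
    U, V = np.zeros((m, 0)), np.zeros((n, 0))
    i, used = 0, {0}                           # arbitrary first pivot row
    for _ in range(min(max_rank, m, n)):
        j = int(np.argmax(np.abs(R[i, :])))    # pivot column in row i
        if abs(R[i, j]) < 1e-14:
            break
        v = R[i, :] / R[i, j]                  # scaled pivot row
        u = R[:, j].copy()                     # pivot column
        R -= np.outer(u, v)                    # subtract the rank-1 cross
        U = np.column_stack([U, u])
        V = np.column_stack([V, v])
        # heuristic stopping test on the size of the newest cross
        if np.linalg.norm(u) * np.linalg.norm(v) < tol * np.linalg.norm(U @ V.T):
            break
        cand = np.abs(u)                       # next pivot row: largest residual
        cand[list(used)] = -1.0                # entry of u among unused rows
        i = int(np.argmax(cand))
        used.add(i)
    return U, V
```

On a smooth, well-separated kernel block the returned rank is far below the block dimension, which is exactly the memory/accuracy trade-off controlled by the number of selected vectors.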
■ Earthquake Cycle Simulation
  ➢ equation of motion  ➢ friction law
[Equations not cleanly recoverable from extraction: the governing integral operator over the fault and the rate- and state-dependent friction law; the fault surface is subdivided into cells]
Example analysis using HACApK [A. Ida & T. Iwashita]
Example analysis using HACApK [A. Ida & T. Iwashita]
■ Static electric field analysis
  ・Potential operator: [equation]
  ・Surface charge is calculated in a half-infinite domain.
[Figure: analysis condition and analysis result; ground]
20
■St
atic
ele
ctric
fiel
d an
alys
is・Po
tent
ial o
pera
tor:
Num
eric
al re
sult
Exam
ple
anal
ysis
usi
ng H
ACAp
K
Ana
lysi
s co
nditi
on
0.5m
1V
Air
Conductor
Ground
0.25m
・Su
rface
cha
rge
is c
alcu
late
d in
hal
f-inf
inite
dom
ain.
[A. I
da &
T. I
was
hita
]
Memory usage of HACApK and original dense matrices [A. Ida & T. Iwashita]
■ H-matrices with ACA reduce memory usage.
[Figure (log-log): memory [GB] (0.1-1000) vs. number of unknowns (10^4-10^8); dense matrices vs. HACApK (static electric field) and HACApK (earthquake cycle)]
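The reduction shown in the figure follows directly from storing each admissible block as two thin factors instead of a full block: in double precision a dense m×n block costs 8mn bytes, while a rank-k factorization costs 8k(m+n). A toy estimate (the block size and rank below are illustrative, not the measured HACApK figures):

```python
def dense_bytes(m, n):
    return 8 * m * n                        # full m x n block, double precision

def lowrank_bytes(m, n, k):
    return 8 * k * (m + n)                  # factors U (m x k) and V (n x k)

# e.g. a 10,000 x 10,000 off-diagonal block approximated with rank 20:
full = dense_bytes(10_000, 10_000)          # 800 MB
aca  = lowrank_bytes(10_000, 10_000, 20)    # 3.2 MB
```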
To apply H-matrices to huge-sized problems [A. Ida & T. Iwashita]
■ Parallelization to exploit SMP cluster systems
■ Improvement for large-sized problems
  ・Conventional H-matrices can fail to make an efficient approximation when applied to large-scale problems.
[Figure: number of iterations vs. log(||r||/||b||) for 21,600 / 100,000 / 338,000 / 1,000,000 unknowns; N=128,000, N=288,000]
Our efforts include:
■ New algorithms for the linear solver
  ・BiCGSTAB and GCR are available.
  ・Is any preconditioner needed?
Parallelization of H-matrices in HACApK [A. Ida & T. Iwashita]
  step 1: Make cluster tree — redundant computation on all MPI processes
  step 2: Make H-matrix structure — redundant computation on all MPI processes
  step 3: Fill in sub-matrices (ACA) — parallel computing
■ When constructing H-matrices
  ・Only step 3 (the time-consuming part) is parallelized.
  ・No MPI communication is needed.
■ When performing HMVM (H-matrix-vector multiplication)
  ・All MPI processes have the full multiplicand vector.
  ・MPI communications are needed to realize it.
■ In both parallelizations above [A. Ida & T. Iwashita]
  ・The same assignment is used.
  ・Arithmetic is conducted sub-matrix by sub-matrix.
  ・The assignment to each process is a collection of sub-matrices.
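The sub-matrix-by-sub-matrix arithmetic can be sketched as follows: a "process" owns a list of blocks, dense or low-rank, each block multiplies its slice of the (fully replicated) multiplicand vector, and the partial results are summed — the summation is what the MPI communication realizes in practice. The block representation and names here are illustrative, not HACApK's API.

```python
import numpy as np

def block_hmvm(blocks, x, n):
    # blocks: list of (row_slice, col_slice, data); data is either a dense
    # ndarray or a (U, V) pair representing the low-rank block U @ V.T
    y = np.zeros(n)
    for rows, cols, data in blocks:
        if isinstance(data, tuple):          # low-rank: U @ (V.T @ x) is cheap
            U, V = data
            y[rows] += U @ (V.T @ x[cols])
        else:                                # small dense (full-rank) block
            y[rows] += data @ x[cols]
    return y
```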
Intention for the assignment strategy [A. Ida & T. Iwashita]
■ For MPI processes
  ① Minimize the maximum as much as possible ⇒ reducing the transferred data size
  ② Minimize the load imbalance among MPI processes
■ For OpenMP threads
  ・Minimize the load imbalance among OpenMP threads
Difference in assignment between strategies [A. Ida & T. Iwashita]
[Figure: submatrices assigned to MPI processes in HACApK vs. an assignment optimized for load balance]
Performance test of HACApK
  Computer: Fujitsu FX10 at the University of Tokyo
  Processor: SPARC64 IXfx (16 cores/node)
  Memory: 32 GB
  Network: 5 GB/s, Tofu
  Number of unknowns — case 1: N=1,000; case 2: N=10,000; case 3: N=100,000
Parallel scalability is examined [A. Ida & T. Iwashita]
  ・when constructing H-matrices
  ・when performing HMVM
■ Test model
Parallel scalability of HACApK (Flat-MPI) [A. Ida & T. Iwashita]
■ The larger the data size becomes, the better the parallel scalability HACApK attains, in both cases.
■ Better parallel scalability is shown when constructing H-matrices.
■ Parallel speed-up in HMVM strongly depends on the data size.
[Figures: speed-up vs. number of processors (0-60) for 1,000 / 10,000 / 100,000 unknowns — constructing H-matrices (left) and H-matrix-vector multiplication (right)]
Effects of using hybrid MPI+OpenMP in HMVM (FX10) [A. Ida & T. Iwashita]
We examined speed-up vs. the time of the Flat-MPI version on 1 node.
■ Parallel scalability is improved in the hybrid MPI+OpenMP case by reducing the MPI communication cost.
■ Speed-up reaches a limit around 96 cores in the Flat-MPI case.
[Figure: speed-up vs. 16-core Flat-MPI over number of cores, for Flat-MPI and MPI+OMP with 2/4/8/16 threads; H-matrix-vector multiplication, 1,000,000 unknowns, FX10]
Collaborations, Outreaching
• Collaborations
  – International Collaborations
    • Lawrence Berkeley National Lab.
    • National Taiwan University
    • ESSEX/SPPEXA/DFG, Germany
    • IPCC (Intel Parallel Computing Center)
• Outreaching, Applications
  – Large-Scale Simulations
    • Geologic CO2 Storage
    • Astrophysics
    • Earthquake Simulations etc.
    • ppOpen-AT, ppOpen-MATH/VIS, ppOpen-MATH/MP, Linear Solvers
  – Intl. Workshops (2012, 13, 15)
  – Tutorials, Classes
• ppOpen-HPC
• ppOpen-MATH
  – ppOpen-MATH/MG: Multigrid Solver
  – Target Problems, Computer Systems
  – Optimization of Serial Communication
  – Optimization of Parallel Comm. (I): CGA
  – Optimization of Parallel Comm. (II): hCGA
• Summary
Sparse Linear Solvers in ppOpen-HPC
• (OpenMP+MPI) Hybrid
• Multicoloring/RCM/CM-RCM for OpenMP
  – Coloring procedures are NOT parallelized yet
• ppOpen-APPL/FEM, FVM, FDM
  – ILU/BILU(p,d,t) + CG/GPBiCG/GMRES, depth of overlapping
  – Hierarchical Interface Decomposition (HID) [Henon & Saad 2007], Extended HID [KN 2010]
• ppOpen-MATH/MG
  – Geometric multigrid solvers/preconditioners
  – Comm./synch. avoiding/reducing based on hCGA [KN 2014, Best Paper Award in IEEE/ICPADS 2014]
• ppOpen-APPL/BEM
  – H-matrix solver: HACApK
  – Only open-source H-matrix solver library by OpenMP/MPI
ppOpen-MATH
• A set of common numerical libraries
  – Multigrid solvers (ppOpen-MATH/MG)
  – Parallel graph libraries (ppOpen-MATH/GRAPH)
    • Multithreaded RCM for reordering (under development)
  – Parallel visualization (ppOpen-MATH/VIS)
  – Library for coupled multi-physics simulations (loose coupling) (ppOpen-MATH/MP)
• Originally developed as a coupler for NICAM (atmosphere, unstructured) and COCO (ocean, structured) in global climate simulations using the K computer
  – Both codes are major codes on the K computer.
    » Prof. Masaki Satoh (AORI/U.Tokyo): NICAM
    » Prof. Hiroyasu Hasumi (AORI/U.Tokyo): COCO
• The developed coupler is extended to more general use.
  – Coupled seismic simulations
pGW3D-FVM with ppOpen-MATH/MG
• 3D Groundwater Flow via Heterogeneous Porous Media
  − Poisson's equation: ∂/∂x(λ ∂φ/∂x) + ∂/∂y(λ ∂φ/∂y) + ∂/∂z(λ ∂φ/∂z) = q
  − Randomly distributed water conductivity λ = 10^-5 ~ 10^+5, average: 1.00
  − Finite-Volume Method on a cubic voxel mesh
  – MGCG solver with IC(0) smoother
• Multigrid
  − Scalable; one of the choices for post-peta/exascale HPC
  − HPCG
Linear Solvers
• Preconditioned CG Method
  – (Geometric) Multigrid Preconditioning (MGCG)
  – IC(0) for the smoothing operator (smoother): good for ill-conditioned problems
• Parallel Geometric Multigrid Method
  – 8 fine meshes (children) form 1 coarse mesh (parent) in an isotropic manner (octree)
  – V-cycle
  – Domain-decomposition-based: localized block-Jacobi, Overlapped Additive Schwarz Domain Decomposition (ASDD)
  – Operations using a single core at the coarsest level (redundant)
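The V-cycle structure just described (pre-smooth, restrict the residual, recurse, prolong the correction, post-smooth, with a direct solve at the coarsest level) can be sketched in 1-D. This is an illustrative toy, not ppOpen-MATH/MG: the talk's 8-children octree coarsening becomes 2-to-1 coarsening, and weighted Jacobi stands in for the IC(0) smoother.

```python
import numpy as np

def A_apply(x, h):                       # 1-D Poisson: (1/h^2) tridiag(-1, 2, -1)
    y = 2.0 * x.copy()
    y[:-1] -= x[1:]
    y[1:]  -= x[:-1]
    return y / h**2

def smooth(x, b, h, nu=2, w=2.0/3.0):    # weighted Jacobi (stand-in smoother)
    for _ in range(nu):
        x = x + w * (h**2 / 2.0) * (b - A_apply(x, h))
    return x

def restrict(r):                         # full weighting: n = 2^k-1 -> 2^(k-1)-1
    return 0.25*r[0:-2:2] + 0.5*r[1:-1:2] + 0.25*r[2::2]

def prolong(e):                          # linear interpolation back to the fine grid
    nc = len(e)
    x = np.zeros(2*nc + 1)
    x[1::2] = e
    x[2:-1:2] = 0.5 * (e[:-1] + e[1:])
    x[0], x[-1] = 0.5*e[0], 0.5*e[-1]
    return x

def vcycle(x, b, h):
    n = len(b)
    if n <= 3:                           # coarsest level: small direct solve
        A = (np.diag(2.0*np.ones(n)) - np.diag(np.ones(n-1), 1)
             - np.diag(np.ones(n-1), -1)) / h**2
        return np.linalg.solve(A, b)
    x = smooth(x, b, h)                  # pre-smoothing
    r = b - A_apply(x, h)
    e = vcycle(np.zeros((n - 1)//2), restrict(r), 2.0*h)
    x = x + prolong(e)                   # coarse grid correction
    return smooth(x, b, h)               # post-smoothing
```

A few V-cycles reduce the residual by orders of magnitude independently of the grid size, which is what makes multigrid a candidate for post-peta/exascale solvers.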
Computations on Fujitsu FX10
• Fujitsu PRIMEHPC FX10 at U.Tokyo (Oakleaf-FX)
  – Commercial version of K
  – 16 cores/node, flat/uniform access to memory
  – 4,800 nodes, 1.043 PF (74th, TOP500, 2015 Nov.)
• Up to 4,096 nodes (65,536 cores) (Large-Scale HPC Challenge)
  – Max 17,179,869,184 unknowns
  – Flat MPI, HB 4x4, HB 8x2, HB 16x1
• Weak Scaling
• Strong Scaling
  – 128^3 × 8 = 16,777,216 unknowns, from 8 to 4,096 nodes
• Network topology is not specified
  – 1D
[Figure: node diagrams (L1 caches, L2, memory) illustrating the HB MxN notation — number of OpenMP threads per single MPI process × number of MPI processes per single node]
Reordering Methods for IC/ILU Factorization & Forward/Backward Substitution on Each MPI Process
Elements in the "same color" are independent: to be parallelized by OpenMP on each MPI process.
[Figure: 8x8 grids reordered by RCM (Reverse Cuthill-McKee), MC (Multicoloring, Color#=4), and CM-RCM (Cyclic MC + RCM, Color#=4)]
• MC: Good parallel efficiency with a smaller # of colors, but bad convergence. Better convergence with many colors, but synch. overhead.
• RCM: Good convergence; poor parallel efficiency, synch. overhead.
• CM-RCM: Reasonable convergence & efficiency.
[Figure repeated: RCM / MC (Color#=4) / CM-RCM (Color#=4) grids]
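Why same-color rows can be swept concurrently is easy to see in a minimal greedy coloring: no vertex shares a color with a neighbor, so all unknowns of one color can be updated in parallel by OpenMP threads. This toy illustrates only the independence property; the deck's MC/CM-RCM orderings additionally control the number of colors and data locality.

```python
def greedy_coloring(adj):
    # adj[v] = list of neighbors of vertex v; first-fit greedy coloring
    colors = [-1] * len(adj)
    for v in range(len(adj)):
        taken = {colors[w] for w in adj[v] if colors[w] >= 0}
        c = 0
        while c in taken:
            c += 1
        colors[v] = c
    return colors

def grid_adj(nx, ny):
    # 5-point stencil adjacency on an nx-by-ny grid (FVM/FDM-like sparsity)
    def vid(i, j): return j * nx + i
    adj = [[] for _ in range(nx * ny)]
    for j in range(ny):
        for i in range(nx):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                if 0 <= i + di < nx and 0 <= j + dj < ny:
                    adj[vid(i, j)].append(vid(i + di, j + dj))
    return adj
```

On a 5-point grid in natural ordering this degenerates to the classic red-black (2-color) checkerboard.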
Communications in MGCG are expensive! — pGW3D-FVM with ppOpen-MATH/MG
• Serial Communications
  – Data transfer through the memory hierarchy
    ▸ Sparse matrix operations in parallel MG
• Parallel Communications
  – Message passing through the network
• Storage format of coefficient matrices (serial communication)
  – CRS (Compressed Row Storage)
  – ELL (Ellpack-Itpack)
• Communication/synchronization reducing MG (parallel communication)
  – Coarse Grid Aggregation (CGA)
  – Hierarchical CGA: comm.-reducing CGA
• ppOpen-HPC
• ppOpen-MATH
  – ppOpen-MATH/MG: Multigrid Solver
  – Target Problems, Computer Systems
  – Optimization of Serial Communication
  – Optimization of Parallel Comm. (I): CGA
  – Optimization of Parallel Comm. (II): hCGA
• Summary
ELL: Fixed Loop-length, Nice for Pre-fetching (if ROW-major)
[Figure: the same sparse matrix stored as (a) CRS and (b) ELL; ELL pads short rows, costing additional memory & computations. ELL with row-wise sweeping = CRS with fixed length]
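The two formats can be contrasted on a toy matrix (the values below are made up; the slide's own example did not survive extraction). CRS stores only the nonzeros with row pointers, so the inner-loop length varies per row; ELL pads every row to the longest row, spending extra memory and arithmetic to get the fixed loop length that prefetches and vectorizes well.

```python
import numpy as np

# toy sparse rows: list of (column, value) pairs per row
rows = [[(0, 5.0)],
        [(0, 1.0), (2, 4.0), (4, 7.0)],
        [(1, 3.0)],
        [(1, 3.0), (2, 1.0), (3, 4.0)],
        [(0, 5.0), (3, 2.0), (4, 1.0)]]
n = len(rows)

# CRS: ptr/idx/val, variable inner-loop length per row
ptr = np.cumsum([0] + [len(r) for r in rows])
idx = np.array([c for r in rows for c, _ in r])
val = np.array([v for r in rows for _, v in r])

def crs_spmv(x):
    y = np.zeros(n)
    for i in range(n):
        for k in range(ptr[i], ptr[i + 1]):
            y[i] += val[k] * x[idx[k]]
    return y

# ELL: pad every row to the longest row (pad: value 0.0 at column 0)
w = max(len(r) for r in rows)
eidx = np.zeros((n, w), dtype=int)
eval_ = np.zeros((n, w))
for i, r in enumerate(rows):
    for k, (c, v) in enumerate(r):
        eidx[i, k], eval_[i, k] = c, v

def ell_spmv(x):                         # fixed loop length w for every row
    return (eval_ * x[eidx]).sum(axis=1)
```

Both kernels produce the same result; the padded entries multiply by zero and contribute nothing.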
Backward Substitution

!$omp parallel
do icol= 1, NCOLORtot
!$omp do
  do ip= 1, PEsmpTOT
    do i= Index(ip-1,icol)+1, Index(ip,icol)
      do k= 1, 6
        Z(i)= Z(i) - AMU(k,i)*Z(IAMU(k,i))
      enddo
      Z(i)= Z(i) / DD(i)
    enddo
  enddo
enddo
!$omp end parallel
Special Treatment for "Boundary" Meshes Connected to the "Halo"
• Distribution of lower/upper non-zero off-diagonal components
• If we adopt RCM (or CM) reordering ...
  • Pure internal meshes — L: ~3, U: ~3
  • Boundary meshes — L: ~3, U: ~6
[Figure: pure internal meshes, internal meshes on the boundary, and external (halo) meshes; connections marked internal (lower), internal (upper), external (upper)]
Original ELL: Backward Substitution
Number of non-zero off-diagonal components for the upper triangular part.
Cache is not well-utilized: IAUnew(6,N), AUnew(6,N)
[Figure: pure internal cells and boundary cells are both stored in AUnew(6,N); rows use up to 6 or only up to 3 of the 6 slots]
Original ELL: Backward Substitution — cache is not well-utilized: IAUnew(6,N), AUnew(6,N)

do icol= NHYP(lev), 1, -1
  if (mod(icol,2).eq.1) then
!$omp parallel do private (ip,icel,j,SW)
    do ip= 1, PEsmpTOT
      do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
        SW= 0.0d0
        do j= 1, 6
          SW= SW + AUnew(j,icel)*Rmg(IAUnew(j,icel))
        enddo
        Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
      enddo
    enddo
  else
!$omp parallel do private (ip,icel,j,SW)
    do ip= 1, PEsmpTOT
      do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
        SW= 0.0d0
        do j= 1, 3
          SW= SW + AUnew(j,icel)*Rmg(IAUnew(j,icel))
        enddo
        Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
      enddo
    enddo
  endif
enddo

IAUnew(6,N), AUnew(6,N) — for pure internal cells / for boundary cells
Improved ELL: Backward Substitution — separate arrays introduced
Cache is well-utilized: AUnew3/AUnew6
Sliced ELL [Monakov et al. 2010] (for SpMV/GPU)
[Figure: pure internal cells stored in AUnew3(3,N); boundary cells in AUnew6(6,N) — separate arrays are introduced]
Improved ELL: Backward Substitution — cache is well-utilized, separated: AUnew3/AUnew6

do icol= NHYP(lev), 1, -1
  if (mod(icol,2).eq.1) then
!$omp parallel do private (ip,icel,j,SW)
    do ip= 1, PEsmpTOT
      do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
        SW= 0.0d0
        do j= 1, 6
          SW= SW + AUnew6(j,icel)*Rmg(IAUnew6(j,icel))
        enddo
        Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
      enddo
    enddo
  else
!$omp parallel do private (ip,icel,j,SW)
    do ip= 1, PEsmpTOT
      do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
        SW= 0.0d0
        do j= 1, 3
          SW= SW + AUnew3(j,icel)*Rmg(IAUnew3(j,icel))
        enddo
        Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
      enddo
    enddo
  endif
enddo

IAUnew3(3,N), AUnew3(3,N) for pure internal cells
IAUnew6(6,N), AUnew6(6,N) for boundary cells
There are a lot of "X"-ELLs
• Mainly focusing on SpMV computations
• SELL-C-σ
  – M. Kreutzer et al.: A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM SISC 36(5), pp.401-423 (2014)
• Recently, "X"-ELLs are applied to forward/backward substitutions with data dependency
  – Most of the HPCG implementations: SC14 BoF
  – They are focusing on Gauss-Seidel: much easier
• ILU
  – Upper/lower components must be treated separately
  – More difficult, complicated
  – (In this case L/U components are separately stored)
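The common idea behind these "X"-ELL variants can be sketched as a sliced ELL: rows are grouped into chunks of C, and each chunk is padded only to its own widest row, so the padding overhead of plain ELL shrinks while the fixed inner-loop length (and SIMD-friendliness) is kept per chunk. This toy omits the σ row sorting of SELL-C-σ (i.e. it assumes σ = 1) and is not Kreutzer et al.'s exact layout.

```python
import numpy as np

def to_sliced_ell(rows, C):
    # rows: list of (column, value) pair lists; returns one padded
    # (start_row, idx, val) triple per chunk of C consecutive rows
    chunks = []
    for s in range(0, len(rows), C):
        part = rows[s:s + C]
        w = max(len(r) for r in part)          # chunk-local width
        idx = np.zeros((len(part), w), dtype=int)
        val = np.zeros((len(part), w))
        for i, r in enumerate(part):
            for k, (c, v) in enumerate(r):
                idx[i, k], val[i, k] = c, v
        chunks.append((s, idx, val))
    return chunks

def sliced_ell_spmv(chunks, x, n):
    y = np.zeros(n)
    for s, idx, val in chunks:                 # fixed loop length per chunk
        y[s:s + idx.shape[0]] = (val * x[idx]).sum(axis=1)
    return y
```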
Analyses by the Detailed Profiler of Fujitsu FX10 — single node, Flat MPI, RCM (multigrid part), 64^3 cells/core, 1 node

               Instruction  L1D miss   L2 miss    SIMD Op. Ratio  GFLOPS
CRS            1.53x10^9    2.32x10^7  1.67x10^7  30.14%          6.05
Original ELL   4.91x10^8    1.67x10^7  1.27x10^7  93.88%          6.99
Improved ELL   4.91x10^8    1.67x10^7  9.14x10^6  93.88%          8.56
• ppOpen-HPC
• ppOpen-MATH
  – ppOpen-MATH/MG: Multigrid Solver
  – Target Problems, Computer Systems
  – Optimization of Serial Communication
  – Optimization of Parallel Comm. (I): CGA
  – Optimization of Parallel Comm. (II): hCGA
• Summary
Parallel Multigrid: Original Approach — coarse grid solver at a single core [KN 2010]
[Figure: levels 1 ... m, fine to coarse; mesh # per MPI process = 1 at the coarsest level; communication overhead at coarser levels; coarse grid solver on a single core (further multigrid)]
Coarse Grid Aggregation (CGA) — coarse grid solver is multithreaded [KN 2012]
[Figure: levels 1 ... m-2, fine to coarse; the coarse grid solver runs on a single MPI process (multi-threaded, further multigrid)]
• Communication overhead could be reduced.
• The coarse grid solver is more expensive than in the original approach.
• If the process number is larger, this effect might be significant.
Weak Scaling: up to 4,096 nodes, up to 17,179,869,184 meshes (64^3 meshes/core). DOWN is GOOD.
[Figures: sec. vs. core # (100-100,000) for HB 8x2 with C0-C3, and for C3 with Flat MPI / HB 4x4 / HB 8x2 / HB 16x1]

Case  Matrix        Coarse Grid
C0    CRS           Single Core
C1    ELL (org)     Single Core
C2    ELL (org)     CGA
C3    ELL (sliced)  CGA
Weak Scaling: up to 4,096 nodes, up to 17,179,869,184 meshes (64^3 meshes/core). DOWN is GOOD.
[Same figures, highlighting CRS vs. SELL + CGA: x1.90]
Weak Scaling: C3 — results at 4,096 nodes (1.72x10^10 DOF)
[Figure: sec. for Flat MPI:C3:64, HB 4x4:C3:59, HB 8x2:C3:55, HB 16x1:C3:55; broken down into Rest / Coarse Grid Solver / MPI_Allgather / MPI_Isend/Irecv/Allreduce]
Weak Scaling: C2 (with CGA) — time for the coarse grid solver
Efficiency of the coarse grid solver for HB 16x1 is x256 that of Flat MPI (1/16 problem size, x16 resource for the coarse grid solver).
[Figure: sec. vs. core # (1024-65536) for Flat MPI / HB 4x4 / HB 8x2 / HB 16x1]
Summary so far ...
• "Coarse Grid Aggregation (CGA)" is effective for stabilization of convergence at O(10^4) cores for MGCG
  – Smaller number of parallel domains
  – HB 8x2 is the best at 4,096 nodes
  – For Flat MPI and HB 4x4, coarse grid solvers are more expensive, because their numbers of MPI processes are larger than those of HB 8x2 and HB 16x1.
• The ELL format is effective!
  – C0 (CRS) -> C1 (ELL-org.): +20-30%
  – C2 (ELL-org) -> C3 (ELL-new): +20-30%
  – C0 -> C3: +80-90%
• Coarse Grid Solver
  – Very expensive for cases with more than O(10^5) cores
  – Memory of a single node is not enough
  – Multiple nodes should be utilized for the coarse grid solver

Case  Matrix        Coarse Grid
C0    CRS           Single Core
C1    ELL (org)     Single Core
C2    ELL (org)     CGA
C3    ELL (sliced)  CGA
• ppOpen-HPC
• ppOpen-MATH
  – ppOpen-MATH/MG: Multigrid Solver
  – Target Problems, Computer Systems
  – Optimization of Serial Communication
  – Optimization of Parallel Comm. (I): CGA
  – Optimization of Parallel Comm. (II): hCGA
• Summary
Hierarchical CGA: Comm.-Reducing MG — reduced number of MPI processes [KN 2013]
[Figure: levels 1 ... m-2; the problem is repartitioned to fewer MPI processes at an intermediate level; the coarse grid solver runs on a single MPI process (multi-threaded, further multigrid)]
hCGA: Related Work
• Not a new idea, but very few implementations.
  – Not effective for peta-scale systems (Dr. U.M. Yang (LLNL), developer of Hypre)
• Existing works: repartitioning at coarse levels
  – Lin, P.T., Improving multigrid performance for unstructured mesh drift-diffusion simulations on 147,000 cores, International Journal for Numerical Methods in Engineering 91 (2012) 971-989 (Sandia)
  – Sundar, H. et al., Parallel Geometric-Algebraic Multigrid on Unstructured Forests of Octrees, ACM/IEEE Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC12) (2012) (UT Austin)
  – Flat MPI, repartitioning if DOF < O(10^3) on each process
hCGA in the present work
• Accelerate the coarser grid solver
  – using multiple processes instead of a single process in CGA
  – Only 64 cells on each process at lev=6 in the figure
• Straightforward approach
  – MPI_Comm_split, MPI_Gather, MPI_Bcast etc.
[Figure: sec. for ELL-CGA at lev=6: 51, lev=7: 55, lev=8: 60, ELL: 65 (no CGA), CRS: 66 (no CGA); broken down into Rest / Coarse Grid Solver / MPI_Allgather / MPI_Isend/Irecv/Allreduce]
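The choice of switching level is essentially arithmetic on the octree coarsening: starting from 64^3 cells per process on the finest level, an 8-to-1 reduction per level quickly drops the per-process cell count below any useful threshold, at which point the remaining levels should run on fewer processes. A toy helper (the threshold and the level numbering are illustrative and do not reproduce the talk's exact lev values):

```python
def switching_level(cells_per_proc, max_levels, threshold):
    # Walk down the octree hierarchy (8 children -> 1 parent per level)
    # and report the first level at which a process holds at most
    # `threshold` cells, i.e. too few to keep all MPI processes busy.
    n = cells_per_proc
    for lev in range(1, max_levels + 1):
        n //= 8
        if n <= threshold:
            return lev, n
    return max_levels, n
```

For example, with 64^3 cells per process and a 64-cell threshold, the count falls to 64 after four coarsenings.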
Weak Scaling: up to 4,096 nodes, up to 17,179,869,184 meshes (64^3 meshes/core). DOWN is GOOD.
[Figures: C3 vs. C4 at 4,096 nodes; sec. vs. core # for Flat MPI:C3, Flat MPI:C4, HB 4x4:C4, HB 8x2:C3, HB 16x1:C3; Flat MPI improves x1.61]

Case  Matrix        Coarse Grid
C0    CRS           Single Core
C1    ELL (org)     Single Core
C2    ELL (org)     CGA
C3    ELL (sliced)  CGA
C4    ELL (sliced)  hCGA
Optimum Parameters at 4,096 nodes, Weak Scaling
• Optimum level for switching to a reduced number of MPI processes for CGA (lev_CGAopt) and hCGA (lev_hCGAopt)
• N@lev_CGAopt, N@lev_hCGAopt
  – Number of unknowns per MPI process at the switching level (much smaller than the O(10^3) used in related works)
• Optimum # of MPI processes after repartitioning (PE_rep)

Case          lev_opt  N@lev_opt  PE_rep                  Iter's  sec.
Flat MPI, C3     7         1      -                         64    13.2
Flat MPI, C4     6         8      128 proc's, 8 nodes       61     8.22
HB 4x4,  C3      8         1      -                         59     8.08
HB 4x4,  C4      6        32      256 proc's, 64 nodes      56     7.97

[Figure: CGA keeps all processes down to lev_CGAopt (N@lev_CGAopt unknowns/process); hCGA repartitions at lev_hCGAopt (N@lev_hCGAopt unknowns/process)]
Strong Scaling at 4,096 nodes: 268,435,456 meshes, 16^3 meshes/core at 4,096 nodes. UP is GOOD.
Flat MPI/ELL (C3), 8 nodes (128 cores): 100%
[Figure: parallel performance (%) vs. core # (1024-65536) for Flat MPI:C3 and Flat MPI:C4; hCGA improves x6.27]

Case  Matrix        Coarse Grid
C0    CRS           Single Core
C1    ELL (org)     Single Core
C2    ELL (org)     CGA
C3    ELL (sliced)  CGA
C4    ELL (sliced)  hCGA
• ppOpen-HPC
• ppOpen-MATH
  – ppOpen-MATH/MG: Multigrid Solver
  – Target Problems, Computer Systems
  – Optimization of Serial Communication
  – Optimization of Parallel Comm. (I): CGA
  – Optimization of Parallel Comm. (II): hCGA
• Summary
Summary
• hCGA is effective, but not so significant (except for Flat MPI)
  – Flat MPI: x1.61 for weak scaling, x6.27 for strong scaling at 4,096 nodes of Fujitsu FX10
  – hCGA will be effective for HB 16x1 with more than 2.50x10^5 nodes (= 4.00x10^6 cores) of FX10 (= 60 PFLOPS)
• Comp. time of the coarse grid solver is significant for Flat MPI with >10^3 nodes
  – Communication overhead has been (slightly) reduced by hCGA
Future Works, Open Problems
• Improvement of hCGA
  – Overhead by MPI_Allreduce etc. -> P2P comm.: Put-Get
• Algorithms
  – CA-Multigrid (for coarser levels), CA-SPAI, Pipelined Method (Tianhe-2)
• Strategy for automatic selection
  – switching level, number of processes for hCGA, optimum color #
  – effects on convergence
• More flexible ELL for unstructured grids
  – SELL-C-σ
• Xeon Phi clusters
Number of Colors and Comp. Time
• ICCG solvers
• FX10 • Ivy Bridge (IvyB) • KNC (MIC)
• The "optimum" number for each architecture is different
[Figures: comp. time (sec.) and iteration counts (260-400) vs. color # (1-1000) for FX10 / MIC / IvyB with AR-1 and BR-1 orderings]
Overhead by Collective Comm.
[Figure: sec. per MPI_Allreduce vs. MPI process # (100-100,000) for Flat MPI / HB 4x4 / HB 8x2 / HB 16x1 — overhead by MPI_Allreduce for the MGCG case]
• Overhead by global collective comm. (e.g. MPI_Allreduce)
• Change the original Krylov solver so that the comm. overhead of global collective comm. is hidden by overlapping with other computations (Gropp's asynch. CG, s-step, pipelined ...)
• "MPI_Iallreduce" in MPI-3: MPI-3 on FX10, December 2015
SELL-C-σ for PCG in FEM, Intel Xeon Phi (KNC)
[Figures: GFLOPS (0-25) and ratio to CRS (0-2) vs. C of SELL-C-σ (1-1000) for MIC: HB 240x1 / HB 120x2 / HB 60x4]
Next Stage of ppOpen-HPC
• FY.2016-FY.2018
  – JST/CREST & DFG/SPPEXA (Germany) Collaboration
  – ESSEX: Equipping Sparse Solvers for Exascale
    • http://blogs.fau.de/essex/
    • Leading PI: Prof. Gerhard Wellein (U. Erlangen)
  – ESSEX II: ESSEX, Sakurai-T, Nakajima-T
• Iterative solver for quantum chemistry: pK-Open-SOL
  – Multigrid/low-rank approximation
  – DLR (German Aerospace Research Center)
• Performance model for stencil computation: pK-Open-AT
  – U. Erlangen
  – kerncraft: Loop Kernel Analysis and Performance Modeling Toolkit
    » https://github.com/cod3monk/kerncraft
Please visit the booth of the Oakleaf/Kashiwa Alliance, the University of Tokyo: #2203