Optimization of Communications towards Scalable Algorithms on Post Petascale Supercomputers
Kengo Nakajima
Information Technology Center, The University of Tokyo
ScalA15: Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, in conjunction with SC15
November 16, 2015, Austin, Texas



Page 2

• ppOpen-HPC
• ppOpen-MATH
  – ppOpen-MATH/MG: Multigrid Solver
  – Target Problems, Computer Systems
  – Optimization of Serial Communication
  – Optimization of Parallel Comm. (I): CGA
  – Optimization of Parallel Comm. (II): hCGA
• Summary

System Software in Post K Supercomputer, Yutaka Ishikawa (RIKEN), Tuesday 10:30-11:15

Page 3: Post-Peta CREST

Development of System Software Technologies for Post-Peta Scale High Performance Computing

• Objectives
  – Co-design of system software with applications and post-petascale computer architectures
  – Development of deliverable software pieces
• Research Supervisor
  – Prof. Mitsuhisa Sato (RIKEN AICS)
• Run by JST (Japan Science and Technology Agency)
• Budget and Formation (2010 to 2018)
  – 55M-60M $ in total
  – Round 1: From 2010 for 5.5 years (5 Teams)
  – Round 2: From 2011 for 5.5 years (5 Teams)
  – Round 3: From 2012 for 5.5 years (4 Teams)

http://www.postpeta.jst.go.jp/en/

Page 4: System Software

• Taisuke Boku, U. of Tsukuba: Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era
• Atsushi Hori, RIKEN AICS: Parallel System Software for Multi-core and Many-core
• Toshio Endo, Tokyo Tech.: Software Technology that Deals with Deeper Memory Hierarchy in Post-petascale Era
• Takeshi Nanri, Kyushu University: Development of Scalable Communication Library with Technologies for Memory Saving and Runtime Optimization
• Osamu Tatebe, U. of Tsukuba: System Software for Post Petascale Data Intensive Science
• Masaaki Kondo, U. of Tokyo: Power Management Framework for Post-Petascale Supercomputers

c/o Y. Ishikawa (RIKEN)

[Figure: timeline 2013-2017. Round 1: 5 teams run; Round 2: 5 teams run; Round 3: 4 teams run]

Page 5: Programming Models & Languages

• Naoya Maruyama, Riken AICS: Highly Productive, High Performance Application Frameworks for Post Petascale Computing
• Hiroyuki Takizawa, Tohoku University: An evolutionary approach to construction of a software development environment for massively-parallel heterogeneous systems
• Shigeru Chiba, U. Tokyo: Software development for post petascale super computing --- Modularity for Super Computing

c/o Y. Ishikawa (RIKEN)

[Figure: timeline 2013-2017. Round 1: 5 teams run; Round 2: 5 teams run; Round 3: 4 teams run]

Page 6: Applications & Numerical Libraries

• Kengo Nakajima, University of Tokyo: ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT)
• Tetsuya Sakurai, University of Tsukuba: Development of an Eigen-Supercomputing Engine using a Post-Petascale Hierarchical Model
• Ryuji Shioya, Toyo University: Development of a Numerical Library based on Hierarchical Domain Decomposition for Post Petascale Simulation
• Katsuki Fujisawa, Kyushu University: Advanced Computing and Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers
• Itsuki Noda, AIST: Framework for Administration of Social Simulations on Massively Parallel Computers

c/o Y. Ishikawa (RIKEN)

[Figure: timeline 2013-2017. Round 1: 5 teams run; Round 2: 5 teams run; Round 3: 4 teams run]

Page 7: ppOpen-HPC: Overview

• Application framework with automatic tuning (AT)
• "pp": post-peta-scale
• Five-year project (FY.2011-2015) (since April 2011)
• P.I.: Kengo Nakajima (ITC, The University of Tokyo)
• Part of "Development of System Software Technologies for Post-Peta Scale High Performance Computing" funded by JST/CREST (Supervisor: Prof. Mitsuhisa Sato, Co-Director, RIKEN AICS)
• Team with 7 institutes, >50 people (5 PDs) from various fields: Co-Design
  • ITC/U.Tokyo, AORI/U.Tokyo, ERI/U.Tokyo, FS/U.Tokyo
  • Hokkaido U., Kyoto U., JAMSTEC

Page 8

• Group Leaders
  – Masaki Satoh (AORI/U.Tokyo)
  – Takashi Furumura (ERI/U.Tokyo)
  – Hiroshi Okuda (GSFS/U.Tokyo)
  – Takeshi Iwashita (Kyoto U., ITC/Hokkaido U.)
  – Hide Sakaguchi (IFREE/JAMSTEC)
• Main Members
  – Takahiro Katagiri (ITC/U.Tokyo)
  – Masaharu Matsumoto (ITC/U.Tokyo)
  – Hideyuki Jitsumoto (Tokyo Tech)
  – Satoshi Ohshima (ITC/U.Tokyo)
  – Hiroyasu Hasumi (AORI/U.Tokyo)
  – Takashi Arakawa (RIST)
  – Futoshi Mori (ERI/U.Tokyo)
  – Takeshi Kitayama (GSFS/U.Tokyo)
  – Akihiro Ida (ACCMS/Kyoto U.)
  – Miki Yamamoto (IFREE/JAMSTEC)
  – Daisuke Nishiura (IFREE/JAMSTEC)

Page 9

[Figure: ppOpen-HPC components: Framework (Appl. Dev.), Math Libraries, Automatic Tuning (AT), System Software]

Page 10: ppOpen-HPC covers …

[Figure not recovered]

Page 11: Supercomputers in U.Tokyo

2 big systems, 6 yr. cycle

[Figure: system timeline, FY05 to FY19]

• Hitachi SR11000/J2: 18.8 TFLOPS, 16.4 TB
• Hitachi HA8000 (T2K): 140 TFLOPS, 31.3 TB
• Hitachi SR16000/M1 based on IBM Power-7: 54.9 TFLOPS, 11.2 TB
• Fujitsu PRIMEHPC FX10 based on SPARC64 IXfx: 1.13 PFLOPS, 150 TB
• Post T2K: 25+ PFLOPS
• K (京)

Annotations in the figure: "Fat nodes with large memory"; "(Flat) MPI, good comm. performance"; "Peta"; "Turning point to Hybrid Parallel Prog. Model"; "Our last SMP, to be switched to MPP"; "Initial Plan"

Page 12: Target of ppOpen-HPC: Post T2K System

• Target system is Post T2K system
  – 25+ PFLOPS, FY.2016
    » JCAHPC (Joint Center for Advanced High Performance Computing): U. Tsukuba & U. Tokyo
    » http://jcahpc.jp/
  – Many-core based (e.g. Intel MIC/Xeon Phi)
    » MPI + OpenMP + X
  – ppOpen-HPC helps smooth transition of users (> 2,000) to new system
    » K/FX10, Cray, Xeon clusters are also in scope

Page 13: Schedule of Public Release (with English Documents, MIT License)

http://ppopenhpc.cc.u-tokyo.ac.jp/

• Released at SC-XY (or can be downloaded)
• Multicore/manycore cluster version (Flat MPI, OpenMP/MPI Hybrid) with documents in English
• We are now focusing on MIC/Xeon Phi
• Collaborations are welcome
• History
  – SC12, Nov 2012 (Ver.0.1.0)
  – SC13, Nov 2013 (Ver.0.2.0)
  – SC14, Nov 2014 (Ver.0.3.0)
  – SC15, Nov 2015 (Ver.1.0.0)

Page 14: New Features in Ver.1.0.0

http://ppopenhpc.cc.u-tokyo.ac.jp/

• HACApK library for H-matrix comp. in ppOpen-APPL/BEM (OpenMP/MPI Hybrid Version)
  – First Open Source Library by OpenMP/MPI Hybrid
• ppOpen-MATH/MP (Coupler for Multiphysics Simulations, Loose Coupling of FEM & FDM)
• Matrix Assembly and Linear Solvers for ppOpen-APPL/FVM

Page 15: HACApK library

■ Library for simulations using the integral equation method
  ・Open source
  ・MIT license

[Equation lost in extraction: integral equation with a singular kernel]

■ For large-scale simulations
  ► Approximation technique for matrices
    ・H-matrices with ACA (Adaptive Cross Approximation)
  ► Parallel computing
    ・Hybrid MPI+OpenMP programming model
■ Download site: http://ppopenhpc.cc.u-tokyo.ac.jp

[A. Ida & T. Iwashita]

Page 16: Overview of H-matrices with ACA

■ Approximation technique for matrices from an integral operator.
  ・A low-rank matrix can be approximated by some pivot columns and rows.

ACA: Adaptive Cross Approximation

[Figure: discretization of an integral operator with a singular kernel gives a full-rank dense matrix (size label: 20,000); permutation and partition split it into full-rank and low-rank blocks; ACA then yields the H-matrix with ACA]

[A. Ida & T. Iwashita]

Page 17: Low-rank approximation using ACA

Applied to blocks detected as possible low-rank submatrices.

・Pivot column and pivot row are alternately selected vector by vector.
  [Selection formulas lost in extraction. Surviving fragments: the first pivot is an arbitrary column (e.g. leftmost); a following pivot index is defined by an argmax of absolute values.]
・Approximation error estimation: [formula lost in extraction]
・Heuristic: [formula lost in extraction; the surviving fragments are products of two norms and a sum of such products]

■ Memory usage and approximation accuracy are controllable by the number of the selected vectors.

[A. Ida & T. Iwashita]
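The alternating pivot selection described on this slide can be sketched as follows. This is a minimal sketch of partially pivoted ACA in its common textbook form, not the HACApK implementation. The kernel 1/|x-y|, the two point sets, the sizes m, n, kmax and the tolerance eps are all hypothetical test data, and the stopping test (new cross small relative to the running Frobenius norm) is a commonly used heuristic that may differ from the one on the slide.

```fortran
! Sketch of partially pivoted ACA in its textbook form (not the HACApK source).
! The kernel, point sets, sizes and tolerance are hypothetical test data.
program aca_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: m = 40, n = 40, kmax = 20
  real(dp), parameter :: eps = 1.0e-6_dp
  real(dp) :: u(m,kmax), v(n,kmax), col(m), row(n), a(m,n)
  real(dp) :: fro2, err
  logical  :: used(n)
  integer  :: i, j, l, k, ip, jp, nrank

  used  = .false.
  nrank = 0
  fro2  = 0.0_dp            ! running value of ||A_k||_F**2
  jp    = 1                 ! arbitrary first pivot column (leftmost)

  do k = 1, kmax
    ! residual of the pivot column: A(:,jp) - sum_l u_l * v_l(jp)
    do i = 1, m
      col(i) = kernel(i, jp)
    end do
    do l = 1, nrank
      col = col - v(jp,l) * u(:,l)
    end do
    used(jp) = .true.
    ip = maxloc(abs(col), dim=1)          ! pivot row: largest residual entry
    if (abs(col(ip)) < 1.0e-14_dp) exit   ! column already reproduced
    col = col / col(ip)

    ! residual of the pivot row: A(ip,:) - sum_l u_l(ip) * v_l
    do j = 1, n
      row(j) = kernel(ip, j)
    end do
    do l = 1, nrank
      row = row - u(ip,l) * v(:,l)
    end do

    ! update ||A_k||_F**2, then store the new pair (u_k, v_k)
    do l = 1, nrank
      fro2 = fro2 + 2.0_dp * dot_product(col, u(:,l)) * dot_product(row, v(:,l))
    end do
    fro2  = fro2 + dot_product(col, col) * dot_product(row, row)
    nrank = nrank + 1
    u(:,nrank) = col
    v(:,nrank) = row

    ! stop when the new cross is small relative to the approximation so far
    if (norm2(col) * norm2(row) <= eps * sqrt(fro2)) exit
    jp = maxloc(abs(row), dim=1, mask=.not. used)   ! next pivot column
  end do

  ! check against the full block (only affordable for this tiny test)
  do j = 1, n
    do i = 1, m
      a(i,j) = kernel(i, j)
    end do
  end do
  err = maxval(abs(a - matmul(u(:,1:nrank), transpose(v(:,1:nrank)))))
  print '(a,i0,a,i0)', 'rank used: ', nrank, ' of ', min(m, n)
  print '(a,es10.3)', 'max abs error: ', err
  print '(a,i0,a,i0)', 'stored values: ', nrank*(m+n), ' vs dense ', m*n

contains

  ! Hypothetical kernel 1/|x-y| on two well-separated point sets
  pure function kernel(i, j) result(g)
    integer, intent(in) :: i, j
    real(dp) :: g, x, y
    x = real(i-1, dp) / real(m, dp)            ! points in [0,1)
    y = 3.0_dp + real(j-1, dp) / real(n, dp)   ! points in [3,4)
    g = 1.0_dp / abs(x - y)
  end function kernel

end program aca_sketch
```

Only the pivot rows and columns are evaluated inside the loop; the full block is built at the end purely to measure the error. The number of stored vectors, nrank, is the knob that trades memory against accuracy.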

Page 18: Example analysis using HACApK

■ Earthquake Cycle Simulation
  ➢ eq. motion: [equation lost in extraction]
  ➢ friction law: [equations lost in extraction]

Integral operator with [kernel lost in extraction]
Subdivide fault surface

[A. Ida & T. Iwashita]

Page 19: Example analysis using HACApK

■ Static electric field analysis
  ・Potential operator: [formula lost in extraction]
  ・Surface charge is calculated in half-infinite domain.

[Figure: panels "Analysis condition" and "Analysis result"; label: Ground]

[A. Ida & T. Iwashita]

Page 20: Example analysis using HACApK

■ Static electric field analysis
  ・Potential operator: [formula lost in extraction]
  ・Surface charge is calculated in half-infinite domain.

[Figure: panels "Analysis condition" and "Numerical result"; labels: Air, Conductor, Ground, 1V, 0.5m, 0.25m]

[A. Ida & T. Iwashita]

Page 21: Memory usage of HACApK and original dense matrices

■ H-matrices with ACA reduce memory usage.

[Figure: Memory usage (Log-Log scale). x-axis: Number of unknowns (10^4 to 10^8); y-axis: Memory [GB] (0.1 to 1000); series: Dense matrices, HACApK (Static electric field), HACApK (Earthquake cycle)]

[A. Ida & T. Iwashita]
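For scale, the dense-matrix curve follows from simple storage arithmetic. The worked numbers below assume 8-byte double-precision entries (the slide does not state the precision), and the second line is the usual textbook storage estimate for an H-matrix with block rank k, not a figure measured for HACApK.

```latex
M_{\mathrm{dense}}(N) = 8N^{2}\ \text{bytes}
\quad\Rightarrow\quad
M_{\mathrm{dense}}(10^{5}) = 8\times 10^{10}\ \text{bytes} = 80\ \text{GB},
\qquad
M_{\mathrm{dense}}(10^{6}) = 8\times 10^{12}\ \text{bytes} = 8\ \text{TB}

M_{\mathcal{H}}(N) = O(k\,N\log N)
```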

Page 22: To apply H-matrices for huge-sized problems

Our efforts include:

■ Parallelization to exploit SMP cluster system
■ Improvement for large-sized problems
  ・Conventional H-matrices can fail to make an efficient approximation when applied to large-scale problems.
■ New algorithm of linear solver
  ・BiCGSTAB and GCR are available.
  ・Is any preconditioner needed?

[Figure: convergence history. x-axis: Number of iterations (0 to 100); y-axis: log(||r||/||b||) (-8 to -4); series: 21,600 unknowns, 100,000 unknowns, 338,000 unknowns, 1,000,000 unknowns; additional labels: N 128,000 and N 288,000]

[A. Ida & T. Iwashita]

Page 23: Parallelization of H-matrices in HACApK

step 1: Make cluster tree
step 2: Make H-matrix structure
step 3: Fill in sub-matrices (ACA)

[Figure labels: "Redundant computation on all MPI processors" (steps 1 and 2); "parallel computing" (step 3)]

■ When constructing H-matrices
  ・Only step 3 (the time-consuming part) is parallelized.
  ・No MPI communication is needed.
■ When performing HMVM (H-matrix-vector multiplication)
  ・All MPI processes have the full multiplicand vector.
  ・MPI communications are needed to realize it.
■ In both parallelizations above
  ・The same assignment is used.
  ・Arithmetic is conducted sub-matrix by sub-matrix.
  ・The assignment to each process is a collection of sub-matrices.

[A. Ida & T. Iwashita]

Page 24: Intention for assignment strategy

■ For MPI-processes
  ① Minimize the maximum [quantity lost in extraction] as much as possible ⇒ Reducing transferred data size
  ② Minimize the load imbalance among MPI processes
■ For OpenMP-threads
  ・Minimize the load imbalance among OpenMP-threads

[A. Ida & T. Iwashita]

Page 25: Difference in assignment between strategies

[Figure: two panels, "Assigned submatrices to MPI-processes in HACApK" and "Assigned submatrices optimized for load balance"]

[A. Ida & T. Iwashita]

Page 26: Performance test of HACApK

Parallel scalability is examined
  ・when constructing H-matrices
  ・when performing HMVM

■ Test model
  The number of unknowns (N): case 1: 1,000; case 2: 10,000; case 3: 100,000

Computer: Fujitsu FX10 at the University of Tokyo
  Processor: SPARC64™ IXfx (16 cores/node)
  Memory: 32GB
  Network: 5 GB/s, Tofu

[A. Ida & T. Iwashita]

Page 27: Parallel Scalability of HACApK (Flat-MPI)

■ The larger the data size becomes, the better parallel scalability HACApK attains in both cases.
■ Better parallel scalability is shown when constructing H-matrices.
■ Parallel speed-up in an HMVM strongly depends on the data size.

[Figure: two panels, "H-matrix vector multiplication" and "Constructing H-matrices". x-axis: Number of Processors (0 to 60); y-axis: Speed-up (0 to 60); series: 100,000 unknowns, 10,000 unknowns, 1,000 unknowns]

[A. Ida & T. Iwashita]

Page 28: Effects of using Hybrid MPI+OpenMP in HMVM (FX10)

We examined speedup vs. the time of the Flat-MPI ver. on 1 node.

■ Parallel scalability is improved in case of hybrid MPI+OpenMP by reducing MPI communication cost.
■ Speed-up reaches a limit around 96 cores in case of Flat-MPI.

[Figure: "Parallel scalability when performing an H-matrix vector multiplication". x-axis: Number of cores (0 to 250); y-axis: Speed-up vs. 16 core flat-MPI (0 to 10); series: Flat-MPI, MPI+OMP 2 threads, MPI+OMP 4 threads, MPI+OMP 8 threads, MPI+OMP 16 threads; conditions: 1,000,000 unknowns, FX10]

[A. Ida & T. Iwashita]

Page 29: Collaborations, Outreaching

• Collaborations
  – International Collaborations
    • Lawrence Berkeley National Lab.
    • National Taiwan University
    • ESSEX/SPPEXA/DFG, Germany
    • IPCC (Intel Parallel Computing Center)
• Outreaching, Applications
  – Large-Scale Simulations
    • Geologic CO2 Storage
    • Astrophysics
    • Earthquake Simulations etc.
    • ppOpen-AT, ppOpen-MATH/VIS, ppOpen-MATH/MP, Linear Solvers
  – Intl. Workshops (2012,13,15)
  – Tutorials, Classes


Page 31: Sparse Linear Solvers in ppOpen-HPC

• (OpenMP+MPI) Hybrid
• Multicoloring/RCM/CM-RCM for OpenMP
  – Coloring procedures are NOT parallelized yet
• ppOpen-APPL/FEM, FVM, FDM
  – ILU/BILU(p,d,t)+CG/GPBiCG/GMRES, Depth of Overlapping
  – Hierarchical Interface Decomposition (HID) [Henon & Saad 2007], Extended HID [KN 2010]
• ppOpen-MATH/MG
  – Geometric Multigrid Solvers/Preconditioners
  – Comm./synch. avoiding/reducing based on hCGA
    • [KN 2014, Best Paper Award in IEEE/ICPADS 2014]
• ppOpen-APPL/BEM
  – H-Matrix Solver: HACApK
  – Only Open-Source H-Matrix Solver Library by OpenMP/MPI


Page 33: ppOpen-MATH

• A set of common numerical libraries
  – Multigrid solvers (ppOpen-MATH/MG)
  – Parallel graph libraries (ppOpen-MATH/GRAPH)
    • Multithreaded RCM for reordering (under development)
  – Parallel visualization (ppOpen-MATH/VIS)
  – Library for coupled multi-physics simulations (loose-coupling) (ppOpen-MATH/MP)
    • Originally developed as a coupler for NICAM (atmosphere, unstructured), and COCO (ocean, structured) in global climate simulations using K computer
      – Both codes are major codes on the K computer.
        » Prof. Masaki Satoh (AORI/U.Tokyo): NICAM
        » Prof. Hiroyasu Hasumi (AORI/U.Tokyo): COCO
    • Developed coupler is extended to more general use.
      – Coupled seismic simulations

Page 34: pGW3D-FVM with ppOpen-MATH/MG

• 3D Groundwater Flow via Heterogeneous Porous Media
  – Poisson's equation: $\nabla \cdot \left( \lambda(x,y,z)\, \nabla \phi \right) = q$
  – Randomly distributed water conductivity
  – Finite-Volume Method on Cubic Voxel Mesh
  – $\lambda = 10^{-5} \sim 10^{+5}$, Average: 1.00
  – MGCG solver with IC(0) smoother
• Multigrid
  – Scalable, one of the choices for post-peta/exascale HPC
  – HPCG

Page 35: Linear Solvers

• Preconditioned CG Method
  – (Geometric) Multigrid Preconditioning (MGCG)
  – IC(0) for Smoothing Operator (Smoother): good for ill-conditioned problems
• Parallel Geometric Multigrid Method
  – 8 fine meshes (children) form 1 coarse mesh (parent) in isotropic manner (octree)
  – V-cycle
  – Domain-Decomposition-based: Localized Block-Jacobi, Overlapped Additive Schwarz Domain Decomposition (ASDD)
  – Operations using a single core at the coarsest level (redundant)
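With isotropic 8-to-1 coarsening, the grid hierarchy shrinks geometrically. As a worked illustration (the symbols N_0 for the number of fine-grid cells, l for the level index and L for the coarsest level are notation introduced here, not taken from the slide):

```latex
N_{\ell} = \frac{N_{0}}{8^{\ell}}, \qquad
\sum_{\ell=0}^{L} N_{\ell} \;<\; N_{0}\sum_{\ell=0}^{\infty} 8^{-\ell} \;=\; \frac{8}{7}\,N_{0}
```

All levels together therefore hold less than about 14% more cells than the fine grid alone, while the coarsest levels hold very few cells per process.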

Page 36: Computations on Fujitsu FX10

• Fujitsu PRIMEHPC FX10 at U.Tokyo (Oakleaf-FX)
  – Commercial version of K
  – 16 cores/node, flat/uniform access to memory
  – 4,800 nodes, 1.043 PF (74th, TOP 500, 2015 Nov.)
• Up to 4,096 nodes (65,536 cores) (Large-Scale HPC Challenge)
  – Max 17,179,869,184 unknowns
  – Flat MPI, HB 4x4, HB 8x2, HB 16x1
• Weak Scaling
• Strong Scaling
  – 128³×8 = 16,777,216 unknowns, from 8 to 4,096 nodes
• Network Topology is not specified
  – 1D

[Figure: FX10 node: 16 cores (C), each with its own L1 cache, sharing L2 and Memory]
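The problem sizes quoted on this slide can be checked by hand. The per-core counts below assume that the unknowns are spread evenly over all cores, which the slide does not state explicitly.

```latex
128^{3}\times 8 = 2{,}097{,}152 \times 8 = 16{,}777{,}216 = 2^{24}

17{,}179{,}869{,}184 = 2^{34} = 65{,}536 \times 64^{3}
\quad (64^{3} = 262{,}144\ \text{unknowns per core on } 65{,}536\ \text{cores})

\text{Strong scaling: } \frac{2^{24}}{128} = 131{,}072\ \text{unknowns per core on 8 nodes},
\qquad \frac{2^{24}}{65{,}536} = 256\ \text{unknowns per core on 4,096 nodes}
```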

Page 37: HB M x N

M: Number of OpenMP threads per a single MPI process
N: Number of MPI processes per a single node

[Figure: FX10 node: 16 cores (C), each with its own L1 cache, sharing L2 and Memory]
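As a worked example of the HB M x N notation on a 16-core node: M times N equals the core count per node, and a run on P nodes uses N times P MPI processes. Treating Flat MPI as the case M = 1, N = 16 is an assumption made here for comparison; P = 4,096 is the largest node count quoted in this deck.

```latex
M \times N = 16, \qquad \#\text{MPI processes} = N \times P

\begin{array}{lrrr}
\text{Configuration} & M & N & N \times P \ (P = 4{,}096) \\
\text{Flat MPI}      & 1 & 16 & 65{,}536 \\
\text{HB } 4\times4  & 4 & 4  & 16{,}384 \\
\text{HB } 8\times2  & 8 & 2  & 8{,}192 \\
\text{HB } 16\times1 & 16 & 1 & 4{,}096
\end{array}
```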

Page 38: Reordering Methods for IC/ILU Fact. & F/B Subst. on Each MPI Proc.

Elements in "same color" are independent: to be parallelized by OpenMP on each MPI process.

[Figure: numbering of an 8×8 grid under three reorderings: RCM (Reverse Cuthill-McKee); MC (Multicoloring, Color#=4); CM-RCM (Cyclic MC + RCM, Color#=4)]

RCM numbering recovered from the figure:

64 63 61 58 54 49 43 36
62 60 57 53 48 42 35 28
59 56 52 47 41 34 27 21
55 51 46 40 33 26 20 15
50 45 39 32 25 19 14 10
44 38 31 24 18 13  9  6
37 30 23 17 12  8  5  3
29 22 16 11  7  4  2  1

Page 39

• MC: Good parallel efficiency with smaller # of colors, bad convergence. Better convergence with many colors, synch. overhead
• RCM: Good convergence, poor parallel efficiency, synch. overhead
• CM-RCM: Reasonable convergence & efficiency

[Figure: numbering of an 8×8 grid under three reorderings: RCM (Reverse Cuthill-McKee); MC (Multicoloring, Color#=4); CM-RCM (Cyclic MC + RCM, Color#=4)]
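The idea behind CM-RCM (level sets colored cyclically, so that cells of one color never touch) can be sketched on a small structured grid. This is an illustrative sketch under stated assumptions, not the ppOpen-HPC routine: it assumes a 5-point stencil on an 8×8 grid, takes the level sets to be the anti-diagonals counted from corner (1,1), and the within-level numbering is an arbitrary choice, so the printed numbers need not match the slide's figure.

```fortran
! Sketch of cyclic multicoloring on level sets (CM-RCM idea) for an 8x8 grid.
! Assumptions: 5-point stencil, levels are anti-diagonals from corner (1,1).
program cmrcm_sketch
  implicit none
  integer, parameter :: nx = 8, ny = 8, ncolor = 4
  integer :: level(nx,ny), color(nx,ny), newid(nx,ny)
  integer :: i, j, icol, lev, id
  logical :: independent

  ! level set of each cell and its cyclic color
  do j = 1, ny
    do i = 1, nx
      level(i,j) = i + j - 1
      color(i,j) = mod(level(i,j) - 1, ncolor) + 1
    end do
  end do

  ! renumber color by color; within a color, level by level
  id = 0
  do icol = 1, ncolor
    do lev = icol, nx + ny - 1, ncolor
      do j = 1, ny
        i = lev - j + 1
        if (i >= 1 .and. i <= nx) then
          id = id + 1
          newid(i,j) = id
        end if
      end do
    end do
  end do

  ! cells of the same color must not be stencil neighbours
  independent = .true.
  do j = 1, ny
    do i = 1, nx
      if (i < nx) then
        if (color(i,j) == color(i+1,j)) independent = .false.
      end if
      if (j < ny) then
        if (color(i,j) == color(i,j+1)) independent = .false.
      end if
    end do
  end do

  print '(a,l2)', 'same-color cells independent: ', independent
  print '(a,i0)', 'cells numbered: ', id
  do j = ny, 1, -1
    print '(8i4)', newid(:,j)
  end do
end program cmrcm_sketch
```

Because stencil neighbours always sit on adjacent levels, any ncolor of 2 or more keeps same-color cells independent; a larger ncolor gives fewer cells per color and more synchronization points per sweep.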

Page 40: Communications in MGCG are expensive!

• Serial Communications
  – Data Transfer through Memory Hierarchy
    ➢ Sparse Matrix Operations in Parallel MG
• Parallel Communications
  – Message Passing through Network

Page 41: pGW3D-FVM with ppOpen-MATH/MG

• Storage format of coefficient matrices (Serial Communication)
  – CRS (Compressed Row Storage)
  – ELL (Ellpack-Itpack)
• Communication/Synchronization Reducing MG (Parallel Communication)
  – Coarse Grid Aggregation (CGA)
  – Hierarchical CGA: Comm. Reducing CGA

Page 42

• ppOpen-HPC
• ppOpen-MATH
  – ppOpen-MATH/MG: Multigrid Solver
  – Target Problems, Computer Systems
  – Optimization of Serial Communication
  – Optimization of Parallel Comm. (I): CGA
  – Optimization of Parallel Comm. (II): hCGA
• Summary

Page 43

ELL: Fixed Loop-length, Nice for Pre-fetching (if ROW major)

[Figure: an example sparse matrix and its storage in (a) CRS and (b) ELL. ELL pads short rows with zeros: Additional Memory & Computations]
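The difference between the two formats can be sketched as follows. This is a minimal illustration under stated assumptions, not the ppOpen-MATH/MG implementation: the function names and the 4x4 example matrix are hypothetical, and padded ELL slots are assumed to store value 0.0 and column 0.

```python
def crs_to_ell(index, item, val):
    """Convert CRS arrays to zero-padded ELL arrays.

    index : row pointers (length n+1), item : column ids, val : values.
    Returns (ell_col, ell_val, width); every row has exactly `width` slots.
    Padded slots hold value 0.0 and column 0, which is the additional
    memory and computation that ELL pays for its fixed loop length.
    """
    n = len(index) - 1
    width = max(index[i + 1] - index[i] for i in range(n))
    ell_col = [[0] * width for _ in range(n)]
    ell_val = [[0.0] * width for _ in range(n)]
    for i in range(n):
        for k, p in enumerate(range(index[i], index[i + 1])):
            ell_col[i][k] = item[p]
            ell_val[i][k] = val[p]
    return ell_col, ell_val, width


def spmv_crs(index, item, val, x):
    """y = A x; the inner loop length varies from row to row."""
    return [sum(val[p] * x[item[p]] for p in range(index[i], index[i + 1]))
            for i in range(len(index) - 1)]


def spmv_ell(ell_col, ell_val, width, x):
    """y = A x; the inner loop length is the same for every row."""
    return [sum(ell_val[i][k] * x[ell_col[i][k]] for k in range(width))
            for i in range(len(ell_val))]


# Hypothetical matrix: [[1,2,0,0],[0,3,0,0],[4,0,5,6],[0,0,0,7]]
index = [0, 2, 3, 6, 7]
item = [0, 1, 1, 0, 2, 3, 3]
val = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
x = [1.0, 1.0, 1.0, 1.0]

ell_col, ell_val, width = crs_to_ell(index, item, val)
print(spmv_crs(index, item, val, x))
print(spmv_ell(ell_col, ell_val, width, x))
print(len(val), "stored values in CRS,", width * len(ell_val), "slots in ELL")
```

Both products agree. The ELL version stores and multiplies the padding zeros, but its inner loop has a fixed trip count, which is what makes it friendly to prefetching and SIMD.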

Page 44

ELL with Row-wise Sweeping: CRS with fixed length
Backward Substitution

    !$omp parallel
    do icol= 1, NCOLORtot
    !$omp do
      do ip= 1, PEsmpTOT
        do i= Index(ip-1,icol)+1, Index(ip,icol)
          do k= 1, 6
            Z(i)= Z(i) - AMU(k,i)*Z(IAMU(k,i))
          enddo
          Z(i)= Z(i) / DD(i)
        enddo
      enddo
    enddo
    !$omp end parallel

[Figure: layout of the ELL arrays, indexed by i and k]

Page 45

Special Treatment for “Boundary” Meshes connected to “Halo”

• Distribution of Lower/Upper Non-Zero Off-Diagonal Components
• If we adopt RCM (or CM) reordering ...
• Pure Internal Meshes
  – L: ~3, U: ~3
• Boundary Meshes
  – L: ~3, U: ~6

[Figure: two views (axes x, y, z) of External Meshes, Internal Meshes on Boundary, and Pure Internal Meshes. Legend: ● Internal (lower), ● Internal (upper), ● External (upper)]

Page 46

Original ELL: Backward Substitution
Number of Non-Zero Off-Diag. Components for Upper Tri. Part
Cache is not well-utilized: IAUnew(6,N), AUnew(6,N)

[Figure: a single array AUnew(6,N) holds both Pure Internal Cells (up to 3 components) and Boundary Cells (up to 6 components)]

Page 47

Original ELL: Backward Subst.
Cache is not well-utilized: IAUnew(6,N), AUnew(6,N)

    do icol= NHYP(lev), 1, -1
      if (mod(icol,2).eq.1) then
    ! Boundary Cells: up to 6 components, IAUnew(6,N), AUnew(6,N)
    !$omp parallel do private (ip,icel,j,SW)
        do ip= 1, PEsmpTOT
          do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
            SW= 0.0d0
            do j= 1, 6
              SW= SW + AUnew(j,icel)*Rmg(IAUnew(j,icel))
            enddo
            Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
          enddo
        enddo
      else
    ! Pure Internal Cells: up to 3 components, same arrays IAUnew(6,N), AUnew(6,N)
    !$omp parallel do private (ip,icel,j,SW)
        do ip= 1, PEsmpTOT
          do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
            SW= 0.0d0
            do j= 1, 3
              SW= SW + AUnew(j,icel)*Rmg(IAUnew(j,icel))
            enddo
            Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
          enddo
        enddo
      endif
    enddo

Page 48

Improved ELL: Backward Substitution
Separate Arrays Introduced
Cache is well-utilized: AUnew3/AUnew6
Sliced ELL [Monakov et al. 2010] (for SpMV/GPU)

[Figure: separate arrays are introduced: AUnew3(3,N) for Pure Internal Cells, AUnew6(6,N) for Boundary Cells]

Page 49

Improved ELL: Backward Subst.
Cache is well-utilized, separated: AUnew3/AUnew6

    do icol= NHYP(lev), 1, -1
      if (mod(icol,2).eq.1) then
    ! Boundary Cells: IAUnew6(6,N), AUnew6(6,N)
    !$omp parallel do private (ip,icel,j,SW)
        do ip= 1, PEsmpTOT
          do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
            SW= 0.0d0
            do j= 1, 6
              SW= SW + AUnew6(j,icel)*Rmg(IAUnew6(j,icel))
            enddo
            Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
          enddo
        enddo
      else
    ! Pure Internal Cells: IAUnew3(3,N), AUnew3(3,N)
    !$omp parallel do private (ip,icel,j,SW)
        do ip= 1, PEsmpTOT
          do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
            SW= 0.0d0
            do j= 1, 3
              SW= SW + AUnew3(j,icel)*Rmg(IAUnew3(j,icel))
            enddo
            Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
          enddo
        enddo
      endif
    enddo
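The storage effect of introducing separate arrays can be sketched in a few lines. This is a hedged illustration, not the author's code: the function names, the `(column, value)` row representation, and the example row counts are all assumptions made here.

```python
def split_ell(rows, short=3, long=6):
    """Split rows into two fixed-width ELL blocks (widths `short` and `long`).

    rows : list of rows, each a list of (column, value) pairs with at most
           `long` entries.
    Returns (short_rows, long_rows); each is a list of (row_id, padded_entries).
    """
    short_rows, long_rows = [], []
    for rid, entries in enumerate(rows):
        if len(entries) > long:
            raise ValueError("row %d has more than %d entries" % (rid, long))
        width = short if len(entries) <= short else long
        # Pad with (column 0, value 0.0) up to the block width.
        padded = list(entries) + [(0, 0.0)] * (width - len(entries))
        (short_rows if width == short else long_rows).append((rid, padded))
    return short_rows, long_rows


def stored_slots(short_rows, long_rows, short=3, long=6):
    """Total number of stored slots in the two blocks."""
    return short * len(short_rows) + long * len(long_rows)


# Hypothetical example: 8 rows with 3 entries, 2 rows with 6 entries.
rows = [[(c, 1.0) for c in range(3)] for _ in range(8)] \
     + [[(c, 1.0) for c in range(6)] for _ in range(2)]

short_rows, long_rows = split_ell(rows)
print("single width-6 array:", 6 * len(rows), "slots")
print("separate arrays     :", stored_slots(short_rows, long_rows), "slots")
```

In this example the split layout stores 36 slots instead of 60, so the loop over short rows no longer strides over unused padding. That is the kind of cache behaviour the slide attributes to AUnew3/AUnew6.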

Page 50

There are a lot of “X”-ELL’s

• Mainly focusing on SpMV computations
• SELL-C-σ
  – M. Kreutzer et al.: A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM SISC 36(5), pp. 401–423 (2014)
• Recently, “X”-ELL’s are applied to forward/backward substitutions with data dependency
  – Most of HPCG implementations: SC14 BoF
  – They are focusing on Gauss-Seidel: much easier
• ILU
  – Upper/lower components must be treated separately
  – More difficult, complicated
  – (In this case L/U components are separately stored)

Page 51

Analyses by Detailed Profiler of Fujitsu FX10, single node, Flat MPI, RCM (Multigrid Part), 64^3 cells/core

               Instruction   L1D miss    L2 miss     SIMD Op. Ratio   GFLOPS
CRS            1.53x10^9     2.32x10^7   1.67x10^7   30.14%           6.05
Original ELL   4.91x10^8     1.67x10^7   1.27x10^7   93.88%           6.99
Improved ELL   4.91x10^8     1.67x10^7   9.14x10^6   93.88%           8.56
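The profiler figures are easier to compare as ratios to the CRS baseline. The short script below only re-expresses the GFLOPS and L2 miss values from the table; the function name and dictionary layout are illustrative choices.

```python
# Values copied from the profiler table (L2 misses and GFLOPS).
profile = {
    "CRS":          {"l2_miss": 1.67e7, "gflops": 6.05},
    "Original ELL": {"l2_miss": 1.27e7, "gflops": 6.99},
    "Improved ELL": {"l2_miss": 9.14e6, "gflops": 8.56},
}


def relative_to(profile, base_name):
    """Return {name: (GFLOPS ratio, L2 miss ratio)} relative to base_name."""
    base = profile[base_name]
    return {
        name: (p["gflops"] / base["gflops"], p["l2_miss"] / base["l2_miss"])
        for name, p in profile.items()
    }


for name, (speedup, l2_ratio) in relative_to(profile, "CRS").items():
    print(f"{name:13s} GFLOPS x{speedup:.2f}   L2 misses x{l2_ratio:.2f}")
```

The instruction count and L1D misses are identical for the two ELL variants in the table, so the GFLOPS difference between them coincides with the drop in L2 misses.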

Page 52

• ppOpen-HPC
• ppOpen-MATH
  – ppOpen-MATH/MG: Multigrid Solver
  – Target Problems, Computer Systems
  – Optimization of Serial Communication
  – Optimization of Parallel Comm. (I): CGA
  – Optimization of Parallel Comm. (II): hCGA
• Summary

Page 53

Parallel Multigrid: Original Approach
Coarse grid solver at a single core [KN 2010]

[Figure: multigrid hierarchy from Fine to Coarse: Level=1, Level=2, ..., Level=m-3, Level=m-2, Level=m-1, Level=m. Labels: Mesh # for each MPI = 1; Communication Overhead at Coarser Levels; Coarse grid solver on a single core (further multigrid)]

Page 54

Coarse Grid Aggregation (CGA)
Coarse Grid Solver is multithreaded [KN 2012]

[Figure: multigrid hierarchy from Fine to Coarse: Level=1, Level=2, ..., Level=m-3, Level=m-2. Label: Coarse grid solver on a single MPI process (multi-threaded, further multigrid)]

• Communication overhead could be reduced
• Coarse grid solver is more expensive than original approach.
• If process number is larger, this effect might be significant

Page 55

Weak Scaling: ~4,096 nodes
up to 17,179,869,184 meshes (64^3 meshes/core)
DOWN is GOOD

[Figure, left: sec. vs. CORE# (log scale) for HB 8x2: C0, HB 8x2: C1, HB 8x2: C2, HB 8x2: C3]
[Figure, right: sec. vs. CORE# (log scale) for Flat MPI:C3, HB 4x4:C3, HB 8x2:C3, HB 16x1:C3]

      Matrix         Coarse Grid
C0    CRS            Single Core
C1    ELL (org)      Single Core
C2    ELL (org)      CGA
C3    ELL (sliced)   CGA

Page 56

Weak Scaling: ~4,096 nodes
up to 17,179,869,184 meshes (64^3 meshes/core)
DOWN is GOOD

[Figure, left: sec. vs. CORE# (log scale) for HB 8x2: C0, HB 8x2: C1, HB 8x2: C2, HB 8x2: C3. Annotation: CRS vs. SELL + CGA: x1.90]
[Figure, right: sec. vs. CORE# (log scale) for Flat MPI:C3, HB 4x4:C3, HB 8x2:C3, HB 16x1:C3]

      Matrix         Coarse Grid
C0    CRS            Single Core
C1    ELL (org)      Single Core
C2    ELL (org)      CGA
C3    ELL (sliced)   CGA

Page 57

Weak Scaling: C3
Results at 4,096 nodes (1.72x10^10 DOF)

[Figure: stacked bars of time (sec.) for Flat MPI:C3:64, HB 4x4:C3:59, HB 8x2:C3:55, HB 16x1:C3:55. Components: Rest, Coarse Grid Solver, MPI_Allgather, MPI_Isend/Irecv/Allreduce]

Page 58

Weak Scaling: C2 (with CGA)
Time for Coarse Grid Solver

Efficiency of coarse grid solver for HB 16x1 is x256 of that of flat MPI (1/16 problem size, x16 resource for coarse grid solver)

[Figure: sec. vs. CORE# (1024, 2048, 4096, 8192, 16384, 32768, 49152, 65536) for Flat MPI, HB 4x4, HB 8x2, HB 16x1]
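One way to read the x256 figure is as the product of the two factors named in the parentheses. The expression below is a sketch of that reading: it assumes the coarse grid problem size is proportional to the number of MPI processes, and that with CGA the coarse grid is solved on one MPI process using all of its threads. The symbol E is introduced here for illustration only.

```latex
\frac{E_{\mathrm{HB}\ 16\times 1}}{E_{\mathrm{Flat\ MPI}}}
  = \underbrace{16}_{\text{1/16 problem size}}
    \times
    \underbrace{16}_{\text{16 threads instead of 1}}
  = 256
```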

Page 59

Summary so far ...

• “Coarse Grid Aggregation (CGA)” is effective for stabilization of convergence at O(10^4) cores for MGCG
  – Smaller number of parallel domains
  – HB 8x2 is the best at 4,096 nodes
  – Flat MPI, HB 4x4
    • Coarse grid solvers are more expensive, because their number of MPI processes is larger than that of HB 8x2 and HB 16x1.
• ELL format is effective!
  – C0 (CRS) -> C1 (ELL-org): +20-30%
  – C2 (ELL-org) -> C3 (ELL-new): +20-30%
  – C0 -> C3: +80-90%
• Coarse Grid Solver
  – Very expensive for cases with more than O(10^5) cores
    Memory of a single node is not enough
  – Multiple nodes should be utilized for coarse grid solver

      Matrix         Coarse Grid
C0    CRS            Single Core
C1    ELL (org)      Single Core
C2    ELL (org)      CGA
C3    ELL (sliced)   CGA

Page 60

• ppOpen-HPC
• ppOpen-MATH
  – ppOpen-MATH/MG: Multigrid Solver
  – Target Problems, Computer Systems
  – Optimization of Serial Communication
  – Optimization of Parallel Comm. (I): CGA
  – Optimization of Parallel Comm. (II): hCGA
• Summary

Page 61

Hierarchical CGA: Comm. Reducing MG
Reduced number of MPI processes [KN 2013]

[Figure: multigrid hierarchy from Fine to Coarse: Level=1, Level=2, ..., Level=m-3, Level=m-3 (shown twice), Level=m-2. Label: Coarse grid solver on a single MPI process (multi-threaded, further multigrid)]

Page 62

hCGA: Related Work

• Not a new idea, but very few implementations.
  – Not effective for peta-scale systems (Dr. U.M. Yang (LLNL), developer of Hypre)
• Existing Works: Repartitioning at Coarse Levels
  – Lin, P.T., Improving multigrid performance for unstructured mesh drift-diffusion simulations on 147,000 cores, International Journal for Numerical Methods in Engineering 91 (2012) 971-989 (Sandia)
  – Sundar, H. et al, Parallel Geometric-Algebraic Multigrid on Unstructured Forests of Octrees, ACM/IEEE Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC12) (2012) (UT Austin)
  – Flat MPI, Repartitioning if DOF < O(10^3) on each process

Page 63

hCGA in the present work

• Accelerate the coarser grid solver
  – using multiple processes instead of a single process in CGA
  – Only 64 cells on each process of lev=6 in the figure
• Straightforward Approach
  – MPI_Comm_split, MPI_Gather, MPI_Bcast etc.

[Figure: stacked bars of time (sec.) for ELL-CGA, lev=6: 51; ELL-CGA, lev=7: 55; ELL-CGA, lev=8: 60; ELL: 65 (NO CGA); CRS: 66 (NO CGA). Components: Rest, Coarse Grid Solver, MPI_Allgather, MPI_Isend/Irecv/Allreduce]
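The grouping step of such a straightforward approach can be sketched without MPI. The code below is pure Python and makes no MPI calls; it only computes the `(color, key)` pair that each rank could pass to MPI_Comm_split. Grouping contiguous ranks and making the rank with key 0 the group leader are assumptions made for this illustration, since the slides name only the MPI routines.

```python
def split_ranks(nprocs, nrep):
    """Group `nprocs` ranks into `nrep` contiguous groups.

    Returns a list of (color, key) per rank, in the sense of the arguments
    to MPI_Comm_split: color selects the group, key orders ranks inside it.
    Assumes nprocs is a multiple of nrep.
    """
    if nprocs % nrep != 0:
        raise ValueError("nprocs must be a multiple of nrep")
    size = nprocs // nrep
    return [(rank // size, rank % size) for rank in range(nprocs)]


def leaders(nprocs, nrep):
    """World ranks with key == 0 (assumed here to gather their group's data)."""
    return [rank for rank, (_, key) in enumerate(split_ranks(nprocs, nrep))
            if key == 0]


# Hypothetical sizes: 16 processes repartitioned onto 4 processes.
print(split_ranks(16, 4)[5])
print(leaders(16, 4))
```

In a real code each group would gather its coarse grid data on its leader (for example with MPI_Gather on the split communicator), the leaders would continue the multigrid cycle among themselves, and the result would be returned to the group members (for example with MPI_Bcast).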

Page 64

Weak Scaling: ~4,096 nodes
up to 17,179,869,184 meshes (64^3 meshes/core)
DOWN is GOOD

[Figure, left: bars of time (sec.) for Flat MPI, HB 4x4, HB 8x2, HB 16x1, comparing C3 and C4 at 4,096 nodes]
[Figure, right: sec. vs. CORE# (log scale) for Flat MPI:C3, Flat MPI:C4, HB 4x4:C4, HB 8x2:C3, HB 16x1:C3. Annotation: x1.61]

      Matrix         Coarse Grid
C0    CRS            Single Core
C1    ELL (org)      Single Core
C2    ELL (org)      CGA
C3    ELL (sliced)   CGA
C4    ELL (sliced)   hCGA

Page 65

Optimum Parameters at 4,096 nodes
Weak Scaling

• Optimum level for switching to reduced number of MPI processes for CGA (lev CGAopt) and hCGA (lev hCGAopt)
• N@lev CGAopt, N@lev hCGAopt
  – Number of unknowns per each MPI process at the switching level (much smaller than O(10^3) used in related works)
• Optimum # of MPI processes after repartitioning (PErep)

                lev CGAopt /   N@lev CGAopt /
                lev hCGAopt    N@lev hCGAopt    PErep                   Iter’s   sec.
Flat MPI   C3   7              1                -                       64       13.2
Flat MPI   C4   6              8                128 proc’s, 8 nodes     61       8.22
HB 4x4     C3   8              1                -                       59       8.08
HB 4x4     C4   6              32               256 proc’s, 64 nodes    56       7.97

Page 66

[Figure, hCGA: multigrid hierarchy from Fine to Coarse: Level=1, Level=2, ..., Level=m-3, Level=m-3 (shown twice), Level=m-2. Labels: lev hCGAopt, N@lev hCGAopt]
[Figure, CGA: multigrid hierarchy from Fine to Coarse: Level=1, Level=2, ..., Level=m-3, Level=m-2. Labels: lev CGAopt, N@lev CGAopt]

Page 67

Strong Scaling at 4,096 nodes
268,435,456 meshes, 16^3 meshes/core at 4,096 nodes
UP is GOOD
Flat MPI/ELL (C3), 8 nodes (128 cores): 100%

[Figure: Parallel Performance (%) vs. CORE# (1024, 8192, 65536) for Flat MPI:C3 and Flat MPI:C4. Annotation: x6.27]

      Matrix         Coarse Grid
C0    CRS            Single Core
C1    ELL (org)      Single Core
C2    ELL (org)      CGA
C3    ELL (sliced)   CGA
C4    ELL (sliced)   hCGA
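The problem sizes quoted for the weak and strong scaling runs follow directly from the node count. The check below assumes 16 cores per node, as implied by "8 nodes (128 cores)"; the function names are illustrative.

```python
CORES_PER_NODE = 16  # implied by "8 nodes (128 cores)"


def weak_scaling_meshes(nodes, edge_per_core=64):
    """Total meshes when every core holds edge_per_core^3 meshes."""
    return nodes * CORES_PER_NODE * edge_per_core ** 3


def strong_scaling_meshes_per_core(total_meshes, nodes):
    """Meshes per core when a fixed total is spread over all cores."""
    return total_meshes // (nodes * CORES_PER_NODE)


print(weak_scaling_meshes(4096))                          # weak scaling total
print(strong_scaling_meshes_per_core(268_435_456, 4096))  # strong scaling, per core
print(16 ** 3)
```

The first value reproduces the 17,179,869,184 meshes of the weak scaling case, and the second equals 16^3, the per-core size of the strong scaling case at 4,096 nodes.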

Page 68

• ppOpen-HPC
• ppOpen-MATH
  – ppOpen-MATH/MG: Multigrid Solver
  – Target Problems, Computer Systems
  – Optimization of Serial Communication
  – Optimization of Parallel Comm. (I): CGA
  – Optimization of Parallel Comm. (II): hCGA
• Summary

Page 69

Summary

• hCGA is effective, but not so significant (except flat MPI)
  – flat MPI: x1.61 for weak scaling, x6.27 for strong scaling at 4,096 nodes of Fujitsu FX10
  – hCGA will be effective for HB 16x1 with more than 2.50x10^5 nodes (= 4.00x10^6 cores) of FX10 (= 60 PFLOPS)
• Comp. time of coarse grid solver is significant for Flat MPI with >10^3 nodes
  – Communication overhead has been (slightly) reduced by hCGA

Page 70

Future Works, Open Problems

• Improvement of hCGA
  – Overhead by MPI_Allreduce etc. -> P2P comm.: Put-Get
• Algorithms
  – CA-Multigrid (for coarser levels), CA-SPAI, Pipelined Method (Tianhe-2)
• Strategy for Automatic Selection
  – switching level, number of processes for hCGA, optimum color #
  – effects on convergence
• More Flexible ELL for Unstructured Grids
  – SELL-C-σ
• Xeon Phi Clusters

Page 71

Number of Colors and Comp. Time

• ICCG Solvers
• FX10
• Ivy-Bridge (IvyB)
• KNC (MIC)
• “Optimum” number for each architecture is different

[Figure, left: sec. vs. Color# (log scale, 1 to 1000) for FX10: AR-1, FX10: BR-1, MIC: AR-1, MIC: BR-1, IvyB: AR-1, IvyB: BR-1]
[Figure, right: Iterations vs. Color# (log scale, 1 to 1000)]

Page 72

Overhead by Collective Comm.

[Figure: Overhead by MPI_Allreduce for MGCG case: sec./MPI_Allreduce vs. MPI Process # (log scale) for Flat MPI, HB 4x4, HB 8x2, HB 16x1]

• Overhead by global collective comm. (e.g. MPI_Allreduce)
• Change original Krylov solver so that comm. overhead by global coll. comm. is hidden by overlapping with other computations (Gropp’s asynch. CG, s-step, pipelined ...)
• “MPI_Iallreduce” in MPI-3: MPI-3 on FX10, December 2015

Page 73

SELL-C-σ for PCG in FEM
Intel Xeon Phi (KNC)

[Figure, left: GFLOPS vs. C of SELL-C-Sigma (log scale, 1 to 1000) for MIC: HB 240x1, MIC: HB 120x2, MIC: HB 60x4]
[Figure, right: Ratio to CRS vs. C of SELL-C-Sigma (log scale, 1 to 1000) for MIC: HB 240x1, MIC: HB 120x2, MIC: HB 60x4]

Page 74

Next Stage of ppOpen-HPC

• FY.2016-FY.2018
  – JST/CREST & DFG/SPPEXA (Germany) Collaboration
  – ESSEX: Equipping Sparse Solvers for Exascale
    • http://blogs.fau.de/essex/
    • Leading PI: Prof. Gerhard Wellein (U. Erlangen)
  – ESSEX II: ESSEX, Sakurai-T, Nakajima-T
• Iterative Solver for Quantum Chemistry: pK-Open-SOL
  – Multigrid/Low-Rank Approximation
  – DLR (German Aerospace Research Center)
• Performance Model for Stencil Computation: pK-Open-AT
  – U. Erlangen
  – kerncraft: Loop Kernel Analysis and Performance Modeling Toolkit
    » https://github.com/cod3monk/kerncraft

Page 75

Please visit the booth of Oakleaf/Kashiwa Alliance, the University of Tokyo
#2203