PTG: an abstraction for unhindered parallelism (hpc.pnl.gov/conf/wolfhpc/2014/talks/danalis.pdf)

Page 1:

PTG: an abstraction for unhindered parallelism

Anthony Danalis Innovative Computing Laboratory

University of Tennessee WOLFHPC 2014, New Orleans

George Bosilca Aurelien Bouteiller Mathieu Faverge Thomas Herault Jack Dongarra

Page 2:

Programming Paradigms

• PTG: Parameterized Task Graph
• Key abstraction behind the PaRSEC runtime
• PTG revives an old programming paradigm:
  • Dataflow-based execution
• What is wrong with current prog. paradigms?
• What is the current programming paradigm?

Page 3:

Serial Code

do i = 1, len
  y(i) = y(i) + a*x(i)
enddo

3

Page 4:

Serial Code

do i = 1, len
  y(i) = y(i) + a*x(i)
enddo

4

We tell the computer what to do and exactly in what order to do it.

Page 5:

Vector Architectures

do i = 1, len
  y(i) = y(i) + a*x(i)
enddo

5

Page 6:

VLIW

do i = 1, len
  y(i) = y(i) + a*x(i)
enddo

6

Page 7:

Superscalar

do i = 1, len
  y(i) = y(i) + a*x(i)
enddo

7

Page 8:

OpenMP Code

!$omp parallel do
do i = 1, len
  y(i) = y(i) + a*x(i)
enddo
!$omp end parallel do

8

Page 9:

OpenMP Code

!$omp parallel do
do i = 1, len
  y(i) = y(i) + a*x(i)
enddo
!$omp end parallel do

9

Page 10:

MPI Code

// Exchange initial data
do i = my_start, my_len
  y(i) = y(i) + a*x(i)
enddo
// Exchange results

10

Page 11:

MPI Code

// Exchange initial data
do i = my_start, my_len
  y(i) = y(i) + a*x(i)
enddo
// Exchange results

11

Page 12:

MPI Code

// Exchange initial data
do i = my_start, my_len
  y(i) = y(i) + a*x(i)
enddo
// Exchange results

12
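To make the "exchange, compute, exchange" pattern above concrete, here is a minimal sketch of a block-distributed axpy in C with MPI. This is a hypothetical illustration, not code from the talk: the distribution (root scatters the inputs, every rank updates its local block, root gathers the results) and the names my_x, my_y, my_len are assumptions for the example.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical sketch: block-distributed y = y + a*x with explicit exchanges. */
int main(int argc, char **argv)
{
    int rank, size, len = 1 << 20;
    double a = 2.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int my_len = len / size;                  /* assume size divides len */
    double *x = NULL, *y = NULL;
    double *my_x = malloc(my_len * sizeof(double));
    double *my_y = malloc(my_len * sizeof(double));
    if (rank == 0) {                          /* root owns the full vectors */
        x = calloc(len, sizeof(double));
        y = calloc(len, sizeof(double));
    }

    /* Exchange initial data */
    MPI_Scatter(x, my_len, MPI_DOUBLE, my_x, my_len, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(y, my_len, MPI_DOUBLE, my_y, my_len, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Local computation */
    for (int i = 0; i < my_len; i++)
        my_y[i] = my_y[i] + a * my_x[i];

    /* Exchange results */
    MPI_Gather(my_y, my_len, MPI_DOUBLE, y, my_len, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}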

Page 13:

MPI+X Code

• Works well (maybe) only for isomorphic systems
• Plagued by the limitations of MPI and of X:
  • Dynamic load balancing
  • Handling jitter
  • Data distribution
  • Difficult to develop
  • Difficult to debug
• Replaces the “big hammer” with two big hammers!

13

Page 14:

Current Programming Paradigm

Coarse Grain Parallelism with Explicit Message Passing (CGP)

14

Most parallel programs are essentially sequential, with some code to coordinate.

Page 15:

Panel vs Tile Algorithms

15

[Figure: LAPACK / panel algorithm vs. PLASMA / tile algorithm]

Page 16:

What does the code look like?

16

for (k = 0; k < MT; k++) {
  Insert_Task( geqrt, A[k][k], INOUT,
                      T[k][k], OUTPUT );
  for (m = k+1; m < MT; m++) {
    Insert_Task( tsqrt, A[k][k], INOUT | REGION_D | REGION_U,
                        A[m][k], INOUT | LOCALITY,
                        T[m][k], OUTPUT );
  }
  for (n = k+1; n < NT; n++) {
    Insert_Task( unmqr, A[k][k], INPUT | REGION_L,
                        T[k][k], INPUT,
                        A[k][n], INOUT );
    for (m = k+1; m < MT; m++) {
      Insert_Task( tsmqr, A[k][n], INOUT,
                          A[m][n], INOUT | LOCALITY,
                          A[m][k], INPUT,
                          T[m][k], INPUT );
    }
  }
}

Page 17:

What’s wrong with this code

It has:
✗ Control Flow
✗ Hints for the runtime to infer Data Flow
✗ High memory requirements or reduced parallelism (the DAG size is quantified in the sketch below)

It should have:
✓ No (or minimal, user-defined) Control Flow
✓ Explicit Data Flow
✓ Unhindered parallelism

17
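To make the memory point concrete, here is a small hypothetical helper (not from the talk) that mirrors the Insert_Task loop nest on the previous slide and counts the tasks an enumerated DAG would have to hold. For MT = NT = t the count grows roughly like t^3/3, whereas the PTG representation introduced later stays fixed-size.

#include <stdio.h>

/* Hypothetical helper: count the tasks the Insert_Task loop nest above
 * would enqueue for an MT x NT tile grid (assuming MT <= NT). */
static long count_qr_tasks(int MT, int NT)
{
    long tasks = 0;
    for (int k = 0; k < MT; k++) {
        tasks += 1;                                      /* geqrt(k)       */
        tasks += MT - (k + 1);                           /* tsqrt(k, m)    */
        tasks += NT - (k + 1);                           /* unmqr(k, n)    */
        tasks += (long)(MT - (k + 1)) * (NT - (k + 1));  /* tsmqr(k, m, n) */
    }
    return tasks;
}

int main(void)
{
    /* e.g. a 200 x 200 tile grid -> about 2.7 million tasks */
    printf("%ld tasks\n", count_qr_tasks(200, 200));
    return 0;
}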

Page 18:

What captures the semantics?

Task classes and execution spaces:
  GEQRT(k)       k = 0 .. mt-1
  TSQRT(k,m)     k = 0 .. mt-1,  m = k+1 .. mt-1
  UNMQR(k,n)     k = 0 .. mt-1,  n = k+1 .. mt-1
  TSMQR(k,m,n)   k = 0 .. mt-1,  m = k+1 .. mt-1,  n = k+1 .. mt-1

Dependence relations (edges of the task graph):
  {[k]->[k,k+1] : k<=mt-2}
  {[k]->[k,n] : k<n<nt && k<nt-1}
  {[k,n]->[k,k+1,n] : k<mt-1}
  {[k,m]->[k,m+1] : m<mt-1}
  {[k,m]->[k,m,n] : k<nt-1 && k<n<nt}
  {[k,m,n]->[n] : k+1==n && k+1==m}
  {[k,m,n]->[n,m] : k+1==n && m>n}
  {[k,m,n]->[k+1,n] : k+1==m && n>m}
  {[k,m,n]->[k+1,m,n] : n>k+1 && m>k+1}
  {[k,m,n]->[k,m+1,n] : m<mt-1}

Page 19:

Parameterized Task Graph (PTG)

✓ Task Classes w/ parameters: geqrt(k), tsqrt(k,m), unmqr(k,n), tsmqr(k,m,n)
✓ Precedence constraints between Tasks
✓ Compressed form of the Execution DAG
✓ Fixed size (problem size independent)

19

Page 20:

Why not discover the whole DAG?

20

[Figure: the execution DAG of a tiled Cholesky factorization (PO, TR, SY, GE tasks) distributed across four nodes, Node0 to Node3]

Page 21:

Why not discover the whole DAG?

21

[Figure: the same task DAG as on the previous slide]

Page 22:

Why not discover the whole DAG?

[Figure: the same task DAG as on the previous slide]

Page 23:

When building the DTG:

The runtime uses a single:

•  Insert_Task() function

• Structure for the elements of the DAG

• Function for traversing a task’s neighborhood

The Dynamic Task Graph machinery is not program specific (see the sketch below).

23
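The bullets above describe generic machinery. Here is a minimal C sketch of what such program-independent DTG machinery might look like; this is a hypothetical illustration (not the code of PaRSEC or of any particular Insert_Task-based runtime): one task-node structure, one Insert_Task() entry point, and one generic routine that walks a completed task's neighborhood.

/* Hypothetical sketch of program-independent dynamic-task-graph machinery.
 * Every inserted task becomes a node; edges are discovered from the order
 * in which tasks access the same data. */

typedef void (*task_body_t)(void **data);

typedef struct dtg_task {
    task_body_t        body;     /* kernel to execute                  */
    void             **data;     /* data items this task reads/writes  */
    int                ndeps;    /* unresolved incoming dependencies   */
    struct dtg_task  **succ;     /* edges to dependent tasks           */
    int                nsucc;
} dtg_task_t;

/* The single insertion entry point: the runtime matches the accessed data
 * against previously inserted tasks to create the incoming edges. */
dtg_task_t *Insert_Task(task_body_t body, void **data, int ndata);

/* The single, generic neighborhood traversal: when a task finishes,
 * decrement each successor's dependence count and schedule the ready
 * ones (atomicity omitted for brevity). */
void release_successors(dtg_task_t *t, void (*schedule)(dtg_task_t *))
{
    for (int i = 0; i < t->nsucc; i++) {
        dtg_task_t *s = t->succ[i];
        if (--s->ndeps == 0)
            schedule(s);
    }
}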

Page 24:

When using a PTG:

TSQRT(k,m)   k = 0 .. mt-1,  m = k+1 .. mt-1

Edges incident to TSQRT (from the task graph on Page 18):
  {[k]->[k,k+1] : k<=mt-2}
  {[k,m,n]->[n,m] : k+1==n && m>n}
  {[k,m]->[k,m,n] : k<nt-1 && k<n<nt}
  {[k,m]->[k,m+1] : m<mt-1}

Program-specific DAG walkers (sketched below)
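With a PTG, the walkers are generated per program from relations like the ones above. As a hypothetical C sketch (not PaRSEC's actual generated code), a successor walker for TSQRT(k,m) can enumerate its successors directly from the two relations whose source is [k,m], reading the three-parameter target as TSMQR and the two-parameter target as TSQRT; no DAG is ever stored.

/* Hypothetical sketch of a program-specific successor walker for
 * TSQRT(k, m), derived from the two outgoing relations above. */
typedef void (*visit_fn)(const char *task_class, int k, int m, int n);

void tsqrt_successors(int k, int m, int mt, int nt, visit_fn visit)
{
    /* {[k,m] -> [k,m,n] : k < nt-1 && k < n < nt}  ->  TSMQR(k, m, n) */
    if (k < nt - 1)
        for (int n = k + 1; n < nt; n++)
            visit("TSMQR", k, m, n);

    /* {[k,m] -> [k,m+1] : m < mt-1}  ->  TSQRT(k, m+1) */
    if (m < mt - 1)
        visit("TSQRT", k, m + 1, -1);   /* -1: unused third parameter */
}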

Page 25:

PTG: PING-PONG

25
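As a purely illustrative sketch of how a ping-pong could be expressed as a PTG (a hypothetical example in the same style as the walker above, not the slide's original content), the whole exchange can be written as one parameterized task class STEP(s), s = 0 .. nsteps-1, placed on rank s % 2, whose only successor is STEP(s+1).

/* Illustrative only: a ping-pong as a single parameterized task class.
 * The execution space, the placement rule, and the successor relation
 * describe the whole DAG; nothing is enumerated or stored. */
typedef void (*visit_step_fn)(int s);

int step_rank(int s) { return s % 2; }          /* alternate between two ranks */

void step_successors(int s, int nsteps, visit_step_fn visit)
{
    /* {[s] -> [s+1] : s < nsteps-1} */
    if (s < nsteps - 1)
        visit(s + 1);
}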

Page 26:

PTG: Binary Tree Reduction

26

Page 27:

Concepts

•  Separation of roles: compiler optimizes each task, developer describes dependencies between tasks, runtime orchestrates dynamic execution

•  Separate algorithms from data distribution
•  Avoid limitations of control flow execution

Runtime

•  Portability layer for heterogeneous architectures

•  Scheduling policies adapt execution to the hardware & ongoing system status

•  Data movements between consumers are inferred from dependencies. Communication/computation overlap

•  Coherency protocols minimize data movements

•  Memory hierarchy (including NVRAM and disk) integral part of the scheduling decisions

PaRSEC: focus on the algorithm

Page 28:

Developer’s view of PaRSEC

28

Page 29:

Load Balance, Idle time & Jitter

[Figure: execution traces over time, SPMD/MPI vs. PaRSEC]

29

Page 30:

Performance: (H)QR

[Figure: Solving a Linear Least Squares Problem (DGEQRF). Performance (GFlop/s) vs. matrix size M (N = 4,480) on a 60-node, 480-core, 2.27 GHz Intel Xeon Nehalem system with IB 20G; theoretical peak 4358.4 GFlop/s. Series: LibSCI ScaLAPACK, DPLASMA DGEQRF, Hierarchical QR.]

30

Page 31:

Performance: Systolic QR

[Figure: DGEQRF performance, strong scaling. Performance (TFlop/s) vs. number of cores (768 to 23,868) on Cray XT5 (Kraken), N = M = 41,472. Series: LibSCI ScaLAPACK, PaRSEC DPLASMA.]

31

Page 32:

Distributed CPUs + GPUs

[Figure: Distributed hybrid DPOTRF weak scaling on Keeneland, 1 to 64 nodes (16 cores, 3 M2090 GPUs, InfiniBand 20G per node). Performance (TFlop/s) vs. number of cores+GPUs, from 16+3 (N = 31k) to 1024+192 (N = 246k). Series: PaRSEC DPOTRF (3 GPUs per node), ideal scaling.]

32

Page 33:

NWChem Coupled Cluster (CC)

• Computational Chemistry

• TCE Coupled Cluster

• Machine generated Fortran 77 code

• Behavior depends on dynamic, immutable data

• Long sequential chains of DGEMMs

• Work (chain) stealing through GA atomics (pattern sketched below)

33
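As a schematic of the chain-stealing pattern mentioned above: NWChem uses a Global Arrays atomic counter shared across processes, while in this hypothetical sketch a process-local C11 atomic and an assumed helper execute_chain_of_dgemms() stand in to show the idea. Each worker atomically claims the index of the next unprocessed chain and runs that chain of DGEMMs.

#include <stdatomic.h>

/* Hypothetical sketch of dynamic chain stealing via an atomic counter. */
extern void execute_chain_of_dgemms(int chain_id);   /* assumed helper */

static atomic_int next_chain = 0;

void worker(int total_chains)
{
    for (;;) {
        /* atomically claim the next unprocessed chain */
        int mine = atomic_fetch_add(&next_chain, 1);
        if (mine >= total_chains)
            break;
        execute_chain_of_dgemms(mine);   /* long sequential DGEMM chain */
    }
}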

Page 34:

CC t1_2_2_2


34

Page 35:

CC t2_8

[Figure: Execution time (sec) of the icsd_t2_8() subroutine in CCSD of NWChem vs. nodes x cores/node (16x1, 16x2, 16x4, 16x8). Series: Original (isolated) icsd_t2_8(), PaRSEC (isolated) icsd_t2_8().]

35

Page 36:

Summary

• PTG offers a Data-Flow based Prog. Paradigm
• PaRSEC offers state-of-the-art performance
• Data distribution is decoupled from Algorithm
• Control Flow limitations are avoided
• CGP != highest performance
• CGP != easiest to program (?)

36

Page 37:

Backup slides

37

Page 38:

Why not discover the whole DAG?

Quantify Reduced Parallelism (if using window)

38

Question: Given window size W and P processors, what is the highest level of parallelism that can be missed because of control flow limitations? Assuming P<W

Page 39:

PTG vs DTG

39

for (i = 0; i < W; i++) {
    Task(RW: A[i][0]);
    for (j = 1; j < a*W; j++) {
        Task(R: A[i][j-1], W: A[i][j]);
    }
}

[Figure: W independent chains, each of a*W tasks]

Page 40:

PTG vs DTG

40

for (i = 0; i < W; i++) {
    Task(RW: A[i][0]);
    for (j = 1; j < a*W; j++) {
        Task(R: A[i][j-1], W: A[i][j]);
    }
}

[Figure: W independent chains, each of a*W tasks]

Page 41:

PTG vs DTG

41

[Figure: W independent chains, each of a*W tasks]

Dynamic Task Graph (DTG): a*W+(W-1)*(a-1)*W = O(a*W^2)

Page 42:

PTG vs DTG

42

[Figure: W independent chains, each of a*W tasks]

Dynamic Task Graph (DTG): a*W+(W-1)*(a-1)*W = O(a*W^2)

Parameterized Task Graph (PTG):

O(a*W^2/P)

Page 43:

PTG vs DTG

43

[Figure: W independent chains, each of a*W tasks]

Dynamic Task Graph (DTG): a*W+(W-1)*(a-1)*W = O(a*W^2)

Parameterized Task Graph (PTG):

O(a*W^2/P)

O(DTG)/O(PTG) = P