PTG: an abstraction for unhindered parallelism (hpc.pnl.gov/conf/wolfhpc/2014/talks/danalis.pdf)

Page 1:

PTG: an abstraction for unhindered parallelism

Anthony Danalis Innovative Computing Laboratory

University of Tennessee WOLFHPC 2014, New Orleans

George Bosilca Aurelien Bouteiller Mathieu Faverge Thomas Herault Jack Dongarra

Page 2:

Programming Paradigms

• PTG: Parameterized Task Graph
• Key abstraction behind the PaRSEC runtime
• PTG revives an old programming paradigm:
  • Dataflow-based execution
• What is wrong with current prog. paradigms?
• What is the current programming paradigm?

Page 3:

Serial Code

do i = 1, len
  y(i) = y(i) + a*x(i)
enddo

3

Page 4:

Serial Code

do i = 1, len
  y(i) = y(i) + a*x(i)
enddo

4

We tell the computer what to do and exactly in what order to do it.

Page 5:

Vector Architectures

do i = 1, len
  y(i) = y(i) + a*x(i)
enddo

5

Page 6:

VLIW

do i = 1, len
  y(i) = y(i) + a*x(i)
enddo

6

Page 7:

Superscalar

do i = 1, len
  y(i) = y(i) + a*x(i)
enddo

7

Page 8:

OpenMP Code

!$omp parallel do
do i = 1, len
  y(i) = y(i) + a*x(i)
enddo
!$omp end parallel do

8

Page 9:

OpenMP Code

!$omp parallel do
do i = 1, len
  y(i) = y(i) + a*x(i)
enddo
!$omp end parallel do

9

Page 10:

MPI Code

// Exchange initial data
do i = my_start, my_len
  y(i) = y(i) + a*x(i)
enddo
// Exchange results

10

Page 11:

MPI Code

// Exchange initial data
do i = my_start, my_len
  y(i) = y(i) + a*x(i)
enddo
// Exchange results

11

Page 12:

MPI Code

// Exchange initial data
do i = my_start, my_len
  y(i) = y(i) + a*x(i)
enddo
// Exchange results

12
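To make the "exchange, compute, exchange" pattern above concrete, here is a minimal sketch of a block-distributed axpy in C with MPI. This is a hypothetical illustration, not code from the talk: the distribution (root scatters the inputs, every rank updates its local block, root gathers the results) and the names my_x, my_y, my_len are assumptions for the example.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical sketch: block-distributed y = y + a*x with explicit exchanges. */
int main(int argc, char **argv)
{
    int rank, size, len = 1 << 20;
    double a = 2.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int my_len = len / size;                  /* assume size divides len */
    double *x = NULL, *y = NULL;
    double *my_x = malloc(my_len * sizeof(double));
    double *my_y = malloc(my_len * sizeof(double));
    if (rank == 0) {                          /* root owns the full vectors */
        x = calloc(len, sizeof(double));
        y = calloc(len, sizeof(double));
    }

    /* Exchange initial data */
    MPI_Scatter(x, my_len, MPI_DOUBLE, my_x, my_len, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(y, my_len, MPI_DOUBLE, my_y, my_len, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Local computation */
    for (int i = 0; i < my_len; i++)
        my_y[i] = my_y[i] + a * my_x[i];

    /* Exchange results */
    MPI_Gather(my_y, my_len, MPI_DOUBLE, y, my_len, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}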

Page 13:

MPI+X Code

• Works well (maybe) only for isomorphic systems
• Plagued by the limitations of MPI and of X:
  • Dynamic load balancing
  • Handling jitter
  • Data distribution
  • Difficult to develop
  • Difficult to debug
• Replaces the “big hammer” with two big hammers!

13

Page 14:

Current Programming Paradigm

Coarse Grain Parallelism with Explicit Message Passing (CGP)

14

Most parallel programs are essentially sequential, with some code to coordinate.

Page 15:

Panel vs Tile Algorithms

15

[Figure: LAPACK / panel algorithm vs. PLASMA / tile algorithm]

Page 16:

What does the code look like?

16

for (k = 0; k < MT; k++) {
  Insert_Task( geqrt, A[k][k], INOUT,
                      T[k][k], OUTPUT );
  for (m = k+1; m < MT; m++) {
    Insert_Task( tsqrt, A[k][k], INOUT | REGION_D | REGION_U,
                        A[m][k], INOUT | LOCALITY,
                        T[m][k], OUTPUT );
  }
  for (n = k+1; n < NT; n++) {
    Insert_Task( unmqr, A[k][k], INPUT | REGION_L,
                        T[k][k], INPUT,
                        A[k][n], INOUT );
    for (m = k+1; m < MT; m++) {
      Insert_Task( tsmqr, A[k][n], INOUT,
                          A[m][n], INOUT | LOCALITY,
                          A[m][k], INPUT,
                          T[m][k], INPUT );
    }
  }
}

Page 17:

What’s wrong with this code

It has:
✗ Control Flow
✗ Hints for the runtime to infer Data Flow
✗ High memory requirements or reduced parallelism (the DAG size is quantified in the sketch below)

It should have:
✓ No (or minimal, user-defined) Control Flow
✓ Explicit Data Flow
✓ Unhindered parallelism

17
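To make the memory point concrete, here is a small hypothetical helper (not from the talk) that mirrors the Insert_Task loop nest on the previous slide and counts the tasks an enumerated DAG would have to hold. For MT = NT = t the count grows roughly like t^3/3, whereas the PTG representation introduced later stays fixed-size.

#include <stdio.h>

/* Hypothetical helper: count the tasks the Insert_Task loop nest above
 * would enqueue for an MT x NT tile grid (assuming MT <= NT). */
static long count_qr_tasks(int MT, int NT)
{
    long tasks = 0;
    for (int k = 0; k < MT; k++) {
        tasks += 1;                                      /* geqrt(k)       */
        tasks += MT - (k + 1);                           /* tsqrt(k, m)    */
        tasks += NT - (k + 1);                           /* unmqr(k, n)    */
        tasks += (long)(MT - (k + 1)) * (NT - (k + 1));  /* tsmqr(k, m, n) */
    }
    return tasks;
}

int main(void)
{
    /* e.g. a 200 x 200 tile grid -> about 2.7 million tasks */
    printf("%ld tasks\n", count_qr_tasks(200, 200));
    return 0;
}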

Page 18:

What captures the semantics?

Task classes and execution spaces:
  GEQRT(k)       k = 0 .. mt-1
  TSQRT(k,m)     k = 0 .. mt-1,  m = k+1 .. mt-1
  UNMQR(k,n)     k = 0 .. mt-1,  n = k+1 .. mt-1
  TSMQR(k,m,n)   k = 0 .. mt-1,  m = k+1 .. mt-1,  n = k+1 .. mt-1

Dependence relations (edges of the task graph):
  {[k]->[k,k+1] : k<=mt-2}
  {[k]->[k,n] : k<n<nt && k<nt-1}
  {[k,n]->[k,k+1,n] : k<mt-1}
  {[k,m]->[k,m+1] : m<mt-1}
  {[k,m]->[k,m,n] : k<nt-1 && k<n<nt}
  {[k,m,n]->[n] : k+1==n && k+1==m}
  {[k,m,n]->[n,m] : k+1==n && m>n}
  {[k,m,n]->[k+1,n] : k+1==m && n>m}
  {[k,m,n]->[k+1,m,n] : n>k+1 && m>k+1}
  {[k,m,n]->[k,m+1,n] : m<mt-1}

Page 19:

Parameterized Task Graph (PTG)

✓ Task Classes w/ parameters: geqrt(k), tsqrt(k,m), unmqr(k,n), tsmqr(k,m,n)
✓ Precedence constraints between Tasks
✓ Compressed form of the Execution DAG
✓ Fixed size (problem size independent)

19

Page 20:

Why not discover the whole DAG?

20

[Figure: the execution DAG of a tiled Cholesky factorization (PO, TR, SY, GE tasks) distributed across four nodes, Node0 to Node3]

Page 21:

Why not discover the whole DAG?

21

[Figure: the same task DAG as on the previous slide]

Page 22:

Why not discover the whole DAG?

[Figure: the same task DAG as on the previous slide]

Page 23:

When building the DTG:

The runtime uses a single:

•  Insert_Task() function

• Structure for the elements of the DAG

• Function for traversing a task’s neighborhood

The Dynamic Task Graph machinery is not program specific (see the sketch below).

23
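The bullets above describe generic machinery. Here is a minimal C sketch of what such program-independent DTG machinery might look like; this is a hypothetical illustration (not the code of PaRSEC or of any particular Insert_Task-based runtime): one task-node structure, one Insert_Task() entry point, and one generic routine that walks a completed task's neighborhood.

/* Hypothetical sketch of program-independent dynamic-task-graph machinery.
 * Every inserted task becomes a node; edges are discovered from the order
 * in which tasks access the same data. */

typedef void (*task_body_t)(void **data);

typedef struct dtg_task {
    task_body_t        body;     /* kernel to execute                  */
    void             **data;     /* data items this task reads/writes  */
    int                ndeps;    /* unresolved incoming dependencies   */
    struct dtg_task  **succ;     /* edges to dependent tasks           */
    int                nsucc;
} dtg_task_t;

/* The single insertion entry point: the runtime matches the accessed data
 * against previously inserted tasks to create the incoming edges. */
dtg_task_t *Insert_Task(task_body_t body, void **data, int ndata);

/* The single, generic neighborhood traversal: when a task finishes,
 * decrement each successor's dependence count and schedule the ready
 * ones (atomicity omitted for brevity). */
void release_successors(dtg_task_t *t, void (*schedule)(dtg_task_t *))
{
    for (int i = 0; i < t->nsucc; i++) {
        dtg_task_t *s = t->succ[i];
        if (--s->ndeps == 0)
            schedule(s);
    }
}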

Page 24:

When using a PTG:

TSQRT(k,m)   k = 0 .. mt-1,  m = k+1 .. mt-1

Edges incident to TSQRT (from the task graph on Page 18):
  {[k]->[k,k+1] : k<=mt-2}
  {[k,m,n]->[n,m] : k+1==n && m>n}
  {[k,m]->[k,m,n] : k<nt-1 && k<n<nt}
  {[k,m]->[k,m+1] : m<mt-1}

Program-specific DAG walkers (sketched below)
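With a PTG, the walkers are generated per program from relations like the ones above. As a hypothetical C sketch (not PaRSEC's actual generated code), a successor walker for TSQRT(k,m) can enumerate its successors directly from the two relations whose source is [k,m], reading the three-parameter target as TSMQR and the two-parameter target as TSQRT; no DAG is ever stored.

/* Hypothetical sketch of a program-specific successor walker for
 * TSQRT(k, m), derived from the two outgoing relations above. */
typedef void (*visit_fn)(const char *task_class, int k, int m, int n);

void tsqrt_successors(int k, int m, int mt, int nt, visit_fn visit)
{
    /* {[k,m] -> [k,m,n] : k < nt-1 && k < n < nt}  ->  TSMQR(k, m, n) */
    if (k < nt - 1)
        for (int n = k + 1; n < nt; n++)
            visit("TSMQR", k, m, n);

    /* {[k,m] -> [k,m+1] : m < mt-1}  ->  TSQRT(k, m+1) */
    if (m < mt - 1)
        visit("TSQRT", k, m + 1, -1);   /* -1: unused third parameter */
}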

Page 25:

PTG: PING-PONG

25
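As a purely illustrative sketch of how a ping-pong could be expressed as a PTG (a hypothetical example in the same style as the walker above, not the slide's original content), the whole exchange can be written as one parameterized task class STEP(s), s = 0 .. nsteps-1, placed on rank s % 2, whose only successor is STEP(s+1).

/* Illustrative only: a ping-pong as a single parameterized task class.
 * The execution space, the placement rule, and the successor relation
 * describe the whole DAG; nothing is enumerated or stored. */
typedef void (*visit_step_fn)(int s);

int step_rank(int s) { return s % 2; }          /* alternate between two ranks */

void step_successors(int s, int nsteps, visit_step_fn visit)
{
    /* {[s] -> [s+1] : s < nsteps-1} */
    if (s < nsteps - 1)
        visit(s + 1);
}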

Page 26:

PTG: Binary Tree Reduction

26

Page 27:

Concepts

•  Separation of roles: compiler optimizes each task, developer describes dependencies between tasks, runtime orchestrates dynamic execution

•  Separate algorithms from data distribution
•  Avoid limitations of control flow execution

Runtime

•  Portability layer for heterogeneous architectures

•  Scheduling policies adapt execution to the hardware & ongoing system status

•  Data movements between consumers are inferred from dependencies. Communication/computation overlap

•  Coherency protocols minimize data movements

•  Memory hierarchy (including NVRAM and disk) integral part of the scheduling decisions

PaRSEC: focus on the algorithm

Page 28:

Developer’s view of PaRSEC

28

Page 29:

Load Balance, Idle time & Jitter

[Figure: execution traces over time, SPMD/MPI vs. PaRSEC]

29

Page 30:

Performance: (H)QR

[Figure: Solving a Linear Least Squares Problem (DGEQRF). Performance (GFlop/s) vs. matrix size M (N = 4,480) on a 60-node, 480-core, 2.27 GHz Intel Xeon Nehalem system with IB 20G; theoretical peak 4358.4 GFlop/s. Series: LibSCI ScaLAPACK, DPLASMA DGEQRF, Hierarchical QR.]

30

Page 31:

Performance: Systolic QR

[Figure: DGEQRF performance, strong scaling. Performance (TFlop/s) vs. number of cores (768 to 23,868) on Cray XT5 (Kraken), N = M = 41,472. Series: LibSCI ScaLAPACK, PaRSEC DPLASMA.]

31

Page 32:

Distributed CPUs + GPUs

[Figure: Distributed hybrid DPOTRF weak scaling on Keeneland, 1 to 64 nodes (16 cores, 3 M2090 GPUs, InfiniBand 20G per node). Performance (TFlop/s) vs. number of cores+GPUs, from 16+3 (N = 31k) to 1024+192 (N = 246k). Series: PaRSEC DPOTRF (3 GPUs per node), ideal scaling.]

32

Page 33:

NWChem Coupled Cluster (CC)

• Computational Chemistry

• TCE Coupled Cluster

• Machine generated Fortran 77 code

• Behavior depends on dynamic, immutable data

• Long sequential chains of DGEMMs

• Work (chain) stealing through GA atomics (pattern sketched below)

33
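As a schematic of the chain-stealing pattern mentioned above: NWChem uses a Global Arrays atomic counter shared across processes, while in this hypothetical sketch a process-local C11 atomic and an assumed helper execute_chain_of_dgemms() stand in to show the idea. Each worker atomically claims the index of the next unprocessed chain and runs that chain of DGEMMs.

#include <stdatomic.h>

/* Hypothetical sketch of dynamic chain stealing via an atomic counter. */
extern void execute_chain_of_dgemms(int chain_id);   /* assumed helper */

static atomic_int next_chain = 0;

void worker(int total_chains)
{
    for (;;) {
        /* atomically claim the next unprocessed chain */
        int mine = atomic_fetch_add(&next_chain, 1);
        if (mine >= total_chains)
            break;
        execute_chain_of_dgemms(mine);   /* long sequential DGEMM chain */
    }
}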

Page 34:

CC t1_2_2_2


34

Page 35:

CC t2_8

[Figure: Execution time (sec) of the icsd_t2_8() subroutine in CCSD of NWChem vs. nodes x cores/node (16x1, 16x2, 16x4, 16x8). Series: Original (isolated) icsd_t2_8(), PaRSEC (isolated) icsd_t2_8().]

35

Page 36:

Summary

• PTG offers a Data-Flow based Prog. Paradigm
• PaRSEC offers state-of-the-art performance
• Data distribution is decoupled from Algorithm
• Control Flow limitations are avoided
• CGP != highest performance
• CGP != easiest to program (?)

36

Page 37:

Backup slides

37

Page 38:

Why not discover the whole DAG?

Quantify Reduced Parallelism (if using window)

38

Question: Given window size W and P processors, what is the highest level of parallelism that can be missed because of control flow limitations? Assuming P<W

Page 39:

PTG vs DTG

39

for (i = 0; i < W; i++) {
    Task(RW: A[i][0]);
    for (j = 1; j < a*W; j++) {
        Task(R: A[i][j-1], W: A[i][j]);
    }
}

[Figure: W independent chains, each of a*W tasks]

Page 40:

PTG vs DTG

40

for (i = 0; i < W; i++) {
    Task(RW: A[i][0]);
    for (j = 1; j < a*W; j++) {
        Task(R: A[i][j-1], W: A[i][j]);
    }
}

[Figure: W independent chains, each of a*W tasks]

Page 41:

PTG vs DTG

41

[Figure: W independent chains, each of a*W tasks]

Dynamic Task Graph (DTG): a*W+(W-1)*(a-1)*W = O(a*W^2)

Page 42:

PTG vs DTG

42

[Figure: W independent chains, each of a*W tasks]

Dynamic Task Graph (DTG): a*W+(W-1)*(a-1)*W = O(a*W^2)

Parameterized Task Graph (PTG):

O(a*W^2/P)

Page 43:

PTG vs DTG

43

[Figure: W independent chains, each of a*W tasks]

Dynamic Task Graph (DTG): a*W+(W-1)*(a-1)*W = O(a*W^2)

Parameterized Task Graph (PTG):

O(a*W^2/P)

O(DTG)/O(PTG) = P