PaRSEC Parallel Runtime Scheduling and Execution Control
George Bosilca, Aurelien Bouteiller,
Anthony Danalis, Mathieu Faverge, Thomas Herault,
Stephanie Moreau, Jack Dongarra
Where we are today
• Today software developers face systems with:
  • ~1 TFLOP of compute power per node
  • 32+ cores, 100+ hardware threads
  • Highly heterogeneous architectures (cores + specialized cores + accelerators/coprocessors)
  • Deep memory hierarchies
  • But wait, there is more: they are distributed!
• Consequence: systemic load imbalance
• And we still ask the same question: how to harness these devices productively?
  • SPMD produces choke points and wasted wait times
  • We need to improve efficiency, power and reliability
The missing parallelism
• Too difficult to find parallelism, to debug, to maintain, and to get good performance for everyone
• Increasing gaps between the capabilities of today’s programming environments, the requirements of emerging applications, and the challenges of future parallel architectures
Popular belief
[Figure: task granularity — “over-sized” versus “decent sizes”]
Decouple “system issues” from the algorithm
• Keep the algorithm simple
  • Depict only the flow of data between tasks
• Distributed dataflow environment based on dynamic scheduling of (micro) tasks
• Programmability: layered approach
  • Algorithm / data distribution
  • Parallel applications without parallel programming
• Portability / efficiency
  • Use all available hardware; overlap data movements with computation
  • Progress is still possible when imbalance arises
Dataflow with Runtime Scheduling
• Algorithms need help to unleash their power
• Hardware specificities: a runtime can provide portability, performance, scheduling heuristics, heterogeneity management, data movement, …
• Scalability: maximize parallelism extraction, but no centralized scheduling or entire DAG unpacking: dynamic and independent discovery of the relevant portions during the execution
• Jitter resilience: do not support explicit communications; instead, make them implicit and schedule to maximize overlap and load balance
• The need to express the algorithms differently
[Figure: DAG of a tiled factorization (PO, TR, SY, GE tasks) distributed across Node0..Node3]
Divide-and-orchestrate
• Leave the optimization of each level to specialized entities
  • Compilers do marvels when the cost of memory accesses is known
  • Think hierarchical super-scalar
• Compilers focus on problems that fit in the cache
• Humans focus on depicting the flow of data between tasks
• And a runtime orchestrates the flows to maximize the throughput of the architecture
[Figure: tiled QR kernels (GEQRT, TSQRT, UNMQR, TSMQR); ScaLAPACK QR versus PLASMA QR]
The PaRSEC framework
[Figure: the PaRSEC framework stack —
Hardware: cores, memory hierarchies, coherence, data movement, accelerators;
Parallel Runtime: scheduling, data movement, tasks, data;
Domain Specific Extensions: dense LA, sparse LA, …; compact representation (PTG), dynamically discovered representation (DTG), specialized kernels]
Domain Specific Extensions
• DSEs ⇒ higher productivity for developers
  • High-level data types & operations tailored to the domain (e.g., relations, matrices, triangles, …)
• Portable and scalable specification of parallelism
  • Automatically adjust data structures, mapping, and scheduling as systems scale up
  • Toolkit of classical data distributions, etc.
PaRSEC toolchain
[Figure: the PaRSEC (DAGuE) toolchain — the programmer provides serial code, codelets, and a data distribution; the DAGuE compiler produces the dataflow representation; the dataflow compiler generates parallel task stubs; the system compiler links them with the application code & codelets, additional libraries (PLASMA, MAGMA), and the runtime (MPI, CUDA, pthreads) to run on the supercomputer; the Domain Specific Extensions sit on top of this toolchain]
Serial Code to Dataflow Representation
[Toolchain figure repeated, highlighting the PaRSEC (DAGuE) compiler step from the serial code to the dataflow representation]
Example: QR Factorization
Kernels: GEQRT, TSQRT, UNMQR, TSMQR
FOR k = 0 .. SIZE - 1
  A[k][k], T[k][k] <- GEQRT( A[k][k] )
  FOR m = k+1 .. SIZE - 1
    A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
  FOR n = k+1 .. SIZE - 1
    A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
    FOR m = k+1 .. SIZE - 1
      A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
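To get a feel for the DAG this loop nest generates (simple counting from the loop bounds above, not a measured figure): for SIZE = 4 tiles it produces 4 GEQRT, 6 TSQRT, 6 UNMQR and 14 TSMQR tasks, i.e. 30 tasks for a 4x4 tile matrix. The TSMQR count grows as the cube of the number of tiles, which is why the DAG quickly becomes too large to build explicitly.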
Input Format – Quark/StarPU/MORSE

for (k = 0; k < A.mt; k++) {
    Insert_Task( zgeqrt, A[k][k], INOUT,
                         T[k][k], OUTPUT );
    for (m = k+1; m < A.mt; m++) {
        Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D | REGION_U,
                             A[m][k], INOUT | LOCALITY,
                             T[m][k], OUTPUT );
    }
    for (n = k+1; n < A.nt; n++) {
        Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                             T[k][k], INPUT,
                             A[k][n], INOUT );
        for (m = k+1; m < A.mt; m++)
            Insert_Task( ztsmqr, A[k][n], INOUT,
                                 A[m][n], INOUT | LOCALITY,
                                 A[m][k], INPUT,
                                 T[m][k], INPUT );
    }
}

• Sequential C code
• Annotated through some specific syntax
  • Insert_Task
  • INOUT, OUTPUT, INPUT
  • REGION_L, REGION_U, REGION_D, …
  • LOCALITY
Dataflow Analysis
• Data flow analysis: example on task DGEQRT of QR
• Polyhedral analysis through the Omega Test
• Compute algebraic expressions for:
  • Source and destination tasks
  • Necessary conditions for that data flow to exist
FOR k = 0 .. SIZE - 1
  A[k][k], T[k][k] <- GEQRT( A[k][k] )
  FOR m = k+1 .. SIZE - 1
    A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
  FOR n = k+1 .. SIZE - 1
    A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
    FOR m = k+1 .. SIZE - 1
      A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
[Figure: incoming and outgoing data of the analyzed task — memory (MEM), loop indices k = 0 .. SIZE-1, m = k+1, n = k+1, LOWER and UPPER tile regions]
[Toolchain figure repeated]
Intermediate Representation: Job Data Flow
[Figure: QR task DAG — GEQRT, TSQRT, UNMQR, TSMQR]
Control flow is eliminated, therefore maximum parallelism is possible
GEQRT(k)
  /* Execution space */
  k = 0 .. ( (MT < NT) ? MT-1 : NT-1 )
  /* Locality */
  : A(k, k)

  RW A    <- (k == 0) ? A(k, k) : A1 TSMQR(k-1, k, k)
          -> (k < NT-1) ? A UNMQR(k, k+1 .. NT-1)  [type = LOWER]
          -> (k < MT-1) ? A1 TSQRT(k, k+1)         [type = UPPER]
          -> (k == MT-1) ? A(k, k)                 [type = UPPER]
  WRITE T <- T(k, k)
          -> T(k, k)
          -> (k < NT-1) ? T UNMQR(k, k+1 .. NT-1)

  /* Priority */
  ; (NT-k)*(NT-k)*(NT-k)

BODY
  zgeqrt( A, T )
END
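Reading the JDF above concretely (a hand-worked instance, not generated output): for a 3x3 tile matrix (MT = NT = 3), GEQRT(0) reads A(0,0) directly from memory since k == 0, sends the lower triangular part of its output to UNMQR(0,1) and UNMQR(0,2), sends the upper part to TSQRT(0,1), and forwards T(0,0) to the same UNMQR tasks; only GEQRT(2), the last panel, writes its upper part back to A(2,2).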
Dataflow Representation → PaRSEC Compiler
[Toolchain figure repeated, highlighting the step from the dataflow representation to the generated parallel task stubs and the runtime]
Example: Reduction Operation
• Reduction: apply a user-defined operator to each data element and store the result in a single location (assume the operator is associative and commutative)
• Issue: non-affine loops lead to non-polyhedral array accesses
for (s = 1; s < N; s = 2*s)
    for (i = 0; i < N-s; i += 2*s)
        operator(V[i], V[i+s]);
Example: Reduction Operation

reduce(l, p)
  : V(p)
  l = 1 .. depth+1
  p = 0 .. (MT / (1<<l))

  RW A   <- (1 == l) ? V(2*p) : A reduce( l-1, 2*p )
         -> ((depth+1) == l) ? V(0)
         -> (0 == (p%2)) ? A reduce(l+1, p/2) : B reduce(l+1, p/2)
  READ B <- ((p*(1<<l) + (1<<(l-1))) > MT) ? V(0)
         <- (1 == l) ? V(2*p+1)
         <- (1 != l) ? A reduce( l-1, p*2+1 )

BODY
  operator(A, B);
END
Solution: hand-write the data dependencies using the intermediate dataflow representation
Data/Task Distribution
• Flexible data distribution
  • Decoupled from the algorithm
  • Expressed as a user-defined function (a minimal sketch follows below)
  • Only limitation: it must evaluate uniformly across all nodes
• Common distributions provided in the DSEs
  • 1D cyclic, 2D cyclic, etc.
  • Symbol matrix for sparse direct solvers
[Toolchain figure repeated, highlighting the data distribution input]
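As a concrete illustration of such a user-defined function, here is a minimal sketch in C (a hypothetical example, not PaRSEC's actual data descriptor API): a classical 2D block-cyclic mapping of tiles to ranks. The only property the runtime relies on is that every node evaluates it identically.

  #include <assert.h>

  /* Hypothetical 2D block-cyclic distribution over a P x Q process grid
   * (not PaRSEC's actual descriptor interface). */
  typedef struct {
      int P, Q;   /* process grid dimensions: P * Q == number of nodes */
  } grid_2d_t;

  /* Rank owning tile (m, n); must return the same value on every node. */
  static int rank_of_tile(const grid_2d_t *g, int m, int n)
  {
      assert(m >= 0 && n >= 0);
      return (m % g->P) * g->Q + (n % g->Q);
  }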
Parallel Runtime
Algorithm is now expressed as a DAG (potentially parameterized)
• DAG too large to be generated ahead of time
  • Generate it dynamically
  • Merge parameterized DAGs with dynamically generated DAGs
• HPC is about distributed heterogeneous resources
  • Have to get involved in message passing
  • Distributed management of the scheduling
  • Dynamically deal with heterogeneity
Runtime DAG scheduling
• Every process has the symbolic DAG representation
  • Only the (node-local) frontier of the DAG is considered (sketched below)
• Distributed scheduling based on remote completion notifications
• Background remote data transfers, automatic and overlapped with computation
• NUMA / cache aware scheduling
  • Work stealing and sharing based on memory hierarchies
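A minimal sketch of what this completion-driven scheduling means (hypothetical C, not PaRSEC's internal API): each task keeps a counter of unsatisfied inputs; when a task completes, the counters of its successors are decremented, newly ready local tasks join the frontier, and remote successors receive a completion notification together with the data.

  #include <stddef.h>

  typedef struct task_s task_t;
  struct task_s {
      int      deps_remaining;   /* inputs not yet produced             */
      int      owner_rank;       /* node that will execute this task    */
      task_t **successors;       /* tasks consuming this task's outputs */
      int      nb_successors;
  };

  void complete_task(task_t *t, int my_rank,
                     void (*push_ready)(task_t *),
                     void (*notify_remote)(int rank, task_t *))
  {
      for (int i = 0; i < t->nb_successors; i++) {
          task_t *s = t->successors[i];
          if (s->owner_rank != my_rank) {
              /* Remote successor: ship the output and a completion
               * notification; the owner updates its own frontier. */
              notify_remote(s->owner_rank, s);
          } else if (--s->deps_remaining == 0) {
              /* Local successor just entered the ready frontier. */
              push_ready(s);
          }
      }
  }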
[Figure: the factorization DAG (PO, TR, SY, GE tasks) partitioned across Node0..Node3; each node unfolds only its local portion]
Scheduling Heuristics in PaRSEC
• Manages parallelism & locality
  • Achieve efficient execution (performance, power, …)
  • Handles specifics of the HW system (hyper-threading, NUMA, …)
• Per-object capabilities
  • Read-only or write-only, output data, private, relaxed coherence
  • The engine tracks data usage and aims to improve data reuse
  • NUMA-aware hierarchical bounded buffers implement work stealing (see the sketch below)
• User hints: expressions for the distance to the critical path
  • Insertion in the waiting queue abides by priority, but work stealing can alter this ordering due to locality
• Communication heuristics
  • Communications inherit the priority of the destination task
• Algorithm-defined scheduling
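A small sketch of those NUMA-aware hierarchical queues (hypothetical structures, not PaRSEC's internals): a worker pops from ready queues ordered by memory distance, so stolen work stays as close as possible to its data.

  #include <stddef.h>

  typedef struct task_s task_t;

  /* One bounded ready queue per level of the memory hierarchy. */
  typedef struct {
      task_t **items;
      int      head, tail, capacity;
  } bounded_queue_t;

  static task_t *queue_pop(bounded_queue_t *q)
  {
      if (q->head == q->tail) return NULL;            /* empty */
      return q->items[q->head++ % q->capacity];
  }

  /* queues[0] = this core, queues[1] = same socket, queues[2] = whole node. */
  static task_t *next_task(bounded_queue_t *queues[], int nb_levels)
  {
      for (int lvl = 0; lvl < nb_levels; lvl++) {
          task_t *t = queue_pop(queues[lvl]);
          if (t != NULL)
              return t;                               /* nearest available work wins */
      }
      return NULL;                                    /* idle until a new completion */
  }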
What’s next? Now we have a runtime and some algorithms
Auto-tuning
• Multi-level tuning
  • Tune the kernels based on the local architecture
  • Then tune the algorithm: depends on the network, and on the type and number of cores
• For a fixed-size matrix, increasing the task duration (or the tile size) decreases parallelism
• For best performance: auto-tune per system
[Plot: QR efficiency (%) versus block size NB (120 to 1000), for 1 node (8 cores), 4 nodes (32 cores), and 81 nodes (648 cores); small blocks yield not enough parallelism. Companion figure: QR kernel DAG (GEQRT, TSQRT, UNMQR, TSMQR), weak-scaling annotation]
Platform: 81 nodes, dual Intel Xeon (2x4 cores/node), 648 cores total; MX 10 Gb/s interconnect; Intel MKL; ScaLAPACK.
[Plot: DPOTRF performance, problem scaling on 648 cores (Myrinet 10G) — theoretical peak, practical peak (GEMM), DAGuE, DSBP, ScaLAPACK; TFlop/s versus matrix size N (13k to 130k)]
[Plot: DGEQRF performance, problem scaling on 648 cores (Myrinet 10G) — theoretical peak, practical peak (GEMM), DAGuE, ScaLAPACK; TFlop/s versus matrix size N (13k to 130k)]
[Plot: DGETRF performance, problem scaling on 648 cores (Myrinet 10G) — theoretical peak, practical peak (GEMM), DAGuE, HPL, ScaLAPACK; TFlop/s versus matrix size N (13k to 130k)]
• Extracts more parallelism
• Independence between data layout and algorithm
• Competes with hand-tuned codes
• Hardware-conscious scheduling
Scalability in Distributed Memory
• Parameterized Task Graph representation
• Independent distributed scheduling
• Scalable
[Plot: DGEQRF performance, weak scaling on a Cray XT5 (108 to 3072 cores) — practical peak (GEMM), DAGuE, libSCI ScaLAPACK; performance (TFlop/s) versus number of cores]
HPL or LU with Partial Pivoting Inria Bordeaux, UTK
[Plot: LU factorization on Dancer (16 nodes x 8 cores E5520, IB 20 Gb/s, Intel MKL) — GEMM peak, without pivoting, incremental pivoting, partial pivoting, ScaLAPACK; solid = 8 cores/node, dashed = 7 cores/node; GFlop/s (0 to 1200) versus matrix size (0 to 50000)]
Figure 6.1: Performance of LU decomposition with PTG
is a mathematical library for parallel distributed-memory machines using the LAPACK library. We used the Netlib1 version. ScaLAPACK was run with one MPI process per core.
We implemented Task flow 3 for the panel factorization and Task flow 5 for the swapping operations in the update. We used a 2D block-cyclic distribution for the matrix. The tile size was 200 for DAGuE and 120 for ScaLAPACK, chosen to obtain the best asymptotic performance. The matrices were generated randomly in the same way for all measurements.
For each measurement, we run five executions, discard the maximal and minimal values obtained, and report the average of the three remaining values. In most of the tests we obtain a low standard deviation (less than 1 GFlop/s).
The results obtained are encouraging: our implementation outperforms the ScaLAPACK implementation and reaches 75% of the GEMM peak. However, the incremental pivoting algorithm, which enables more parallelism and fewer synchronizations in the panel factorization, still provides better performance.
1http://www.netlib.org/
• Why is LU different? Two reasons:
  • The panel is expressed using dataflow, generating many tiny tasks
    • Provides very poor performance
  • The DAG depends on the content of the data
• The strict dataflow model we imposed on this exercise prevents message aggregation
  • Our implementation is not message-optimal
• Explore all cases with as few tasks as possible
Hierarchical QR — ENS Lyon, Inria Bordeaux, UTK, UCD
• How to compose trees to get the best pipeline?
  • Flat, binary, Fibonacci, greedy, …
• Study on critical path lengths (see the counting note below)
• Surprisingly, flat trees are better for communications on square cases:
  • Fewer communications
  • Good pipeline
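For intuition on why the tree shape matters (simple counting, not a result taken from the study): reducing a panel of p tile rows with a flat tree puts p-1 dependent kernels on the critical path, while a binary tree needs only about log2(p) levels; for p = 16 that is 15 versus 4 steps. Trees therefore shorten the critical path on tall-and-skinny panels, whereas flat trees pipeline better across successive panels, which is why they win on square matrices.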
Sparse Direct Solvers — Inria Bordeaux, UTK
• Based on the PaStiX solver
  • Super-nodal method, as in SuperLU
  • Exploits an elimination tree => DAG
• LU, LLt & LDLt factorizations for shared memory with DAGuE/PaRSEC and StarPU
• GPU panel and update kernels
  • Based on a blocked representation
• Problems:
  • Around 40 times more tasks than an equivalent dense factorization
  • Average task size can be very small (20x20)
  • Sizes are not regular
TABLE I: Matrices description

Name | N          | NNZ(A)     | Fill ratio | OPC      | Type    | Factorization | Source
MHD  | 485,597    | 12,359,369 | 61.20      | 9.84e+12 | Real    | LU            | University of Minnesota
Audi | 943,695    | 39,297,771 | 31.28      | 5.23e+12 | Real    | LLT           | PARASOL Collection
10M  | 10,423,737 | 89,072,871 | 75.66      | 1.72e+14 | Complex | LDLT          | French CEA-Cesta
[Fig. 6: LU decomposition on MHD (double precision) — factorization time (s, 0 to 1600) versus number of threads (1 to 48), for PaStiX, PaStiX with StarPU, and PaStiX with DAGuE]
original scheduler on the Audi and MHD test matrices. A good scalability was obtained using all three approaches. On these two test cases, the generic schedulers behaved quite well and obtained performance similar to the fine-tuned PaStiX scheduler. On a small number of processors, we could even obtain better results using the generic schedulers. DAGuE lost some performance when more than a socket (12 cores) was used, but recovered as soon as the computation spanned more than 2 sockets. The DAGuE scheduler did not seem able to extract any performance gain between 12 and 24 cores.
With the 10M test matrix (Fig. 7), the finely-tuned scheduler of PaStiX outperformed the generic runtimes. Specifically, StarPU showed its limitations as the number of threads increased, while DAGuE maintained a good scalability. This can be explained by the fact that StarPU does not take data locality into account.
[Fig. 7: LDLT decomposition on 10M (double complex) — factorization time (s, 0 to 40,000) versus number of threads (1 to 48), for PaStiX, PaStiX with StarPU, and PaStiX with DAGuE]
B. Experiments with GPUs
Fig. 8 and 9 show the results of using one and two GPUs. For these experiments, we did not use all 48 available cores for computation, as one core was dedicated to each GPU. The CUDA kernel gave good acceleration in both single and double precision. The factorization time was reduced significantly using the first GPU when the number of cores was small; speedups of up to 5 were obtained using one GPU. However, the second GPU was relevant only with a small number of cores (fewer than 4 threads in single precision (Fig. 8), and fewer than 12 threads in double precision (Fig. 9)). With one GPU, once the cores on a socket (12 cores on Romulus, and 6 on Mirage) were fully utilized, the GPU had no effect on the factorization time.
We also conducted additional GPU experiments using a compute node of the Mirage machine from Inria Bordeaux. Mirage nodes are composed of two hexa-core processors (Westmere Intel Xeon X5650) with 36 GB of RAM. The results on Mirage (Fig. 10) were similar to the ones on Romulus.
V. CONCLUSION
In this paper, we examined the potential benefits of using generic runtime systems, DAGuE and StarPU, in the parallel sparse direct solver PaStiX. The experimental results, using up to 48 cores and two NVIDIA Tesla GPUs, demonstrated the potential of this approach to design a sparse direct solver on heterogeneous manycore architectures with accelerators using a uniform interface.
Heterogeneity Support
• A BODY is a task implementation for a specific device (a codelet)
• Currently the system supports CUDA devices and cores
• A CUDA device is considered as one additional memory level
• Data locality and data versioning define the transfers to and from the GPU/co-processors
/* POTRF Lower case */
GEMM(k, m, n)
  // Execution space
  k = 0 .. MT-3
  m = k+2 .. MT-1
  n = k+1 .. m-1
  // Parallel partitioning
  : A(m, n)
  // Parameters
  READ A <- C TRSM(m, k)
  READ B <- C TRSM(n, k)
  RW   C <- (k == 0) ? A(m, n) : C GEMM(k-1, m, n)
       -> (n == k+1) ? C TRSM(m, n) : C GEMM(k+1, m, n)
BODY [CPU, CUDA, MIC, *]
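To make the multi-device BODY concrete, here is a hedged sketch of what the CPU and CUDA variants of this GEMM codelet could call (standard BLAS/cuBLAS calls are assumed; this is not PaRSEC's actual BODY syntax, and the handle/stream management done by the runtime is omitted):

  #include <cblas.h>
  #include <cublas_v2.h>

  /* CPU codelet: C = C - A * B^T on nb x nb tiles (the Cholesky trailing update). */
  void gemm_body_cpu(int nb, const double *A, const double *B, double *C)
  {
      cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                  nb, nb, nb, -1.0, A, nb, B, nb, 1.0, C, nb);
  }

  /* CUDA codelet: same operation, on tiles already resident in GPU memory. */
  void gemm_body_cuda(cublasHandle_t handle, int nb,
                      const double *dA, const double *dB, double *dC)
  {
      const double alpha = -1.0, beta = 1.0;
      cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                  nb, nb, nb, &alpha, dA, nb, dB, nb, &beta, dC, nb);
  }

The runtime selects one of these bodies at execution time, based on where the task's data currently resides.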
• Single node
  • 4x Tesla (C1060)
  • 16 cores (AMD Opteron)
• Multi-GPU, single node
• Multi-GPU, distributed
[Plot: SPOTRF performance, problem scaling on 12 GPU nodes (Infiniband 20G) — practical peak (GEMM), C2070 + 8 cores per node, 8 cores per node; performance (TFlop/s, 0 to 10) versus matrix size (20k to 120k)]
[Plot: SPOTRF performance, weak scaling on 12 GPU nodes (Infiniband 20G) — practical peak (GEMM), C2070 + 8 cores per node, 8 cores per node; performance (TFlop/s, 0 to 10) versus number of nodes and matrix size (1;54k up to 12;176k)]
Scalability
• 12 nodes
• 12x Fermi (C2070)
• 8 cores/node (Intel Core2)
[Plot: single-node performance with 1 to 4 Tesla C1060 GPUs — GFlop/s (0 to 1400) versus matrix size N (10k to 50k)]
Energy efficiency
• Energy used depending on the number of cores
• Up to 62% more energy efficient while using a high-performance tuned scheduling
• Power-efficient scheduling
Total energy consumption QR factorization (256 cores)
Work in progress with Hatem Ltaief
[Figure 11: power profiles of the Cholesky factorization — System, CPU, Memory and Network power (Watts, 0 to 25000) over time (0 to 35 s), for (a) ScaLAPACK and (b) DPLASMA]
[Figure 12: power profiles of the QR factorization — System, CPU, Memory and Network power (Watts, 0 to 25000) over time (0 to 100 s), for (a) ScaLAPACK and (b) DPLASMA]
smaller number of cores. The engine could then decide to turn off or lower the frequencies of the cores using Dynamic Voltage Frequency Scaling [16], a commonly used technique with which it is possible to achieve a reduction of energy consumption.
Figure 13. Total amount of energy (joules) used for each test, based on the number of cores

# Cores | Library   | Cholesky | QR
128     | ScaLAPACK | 192000   | 672000
128     | DPLASMA   | 128000   | 540000
256     | ScaLAPACK | 240000   | 816000
256     | DPLASMA   | 96000    | 540000
512     | ScaLAPACK | 325000   | 1000000
512     | DPLASMA   | 125000   | 576000
SystemG: Virginia Tech energy-monitored cluster (IB 40G, Intel, 8 cores/node)
Analysis Tools
Hermitian Band Diagonal; 16x16 tiles
Conclusion
• Programming must be made easy(ier)
  • Portability: inherently take advantage of all hardware capabilities
  • Efficiency: deliver the best performance on several families of algorithms
• Computer scientists were spoiled by MPI
  • Now let’s think about our users
• Let different people focus on different problems
  • Application developers on their algorithms
  • System developers on system issues
  • Compilers on whatever they can
The end
Resilience
• Faults propagate in the system based on the data dependencies
• However, if the original data can be recovered, the execution completes without user interaction
• Automatic recovery made simple
Composition
• An algorithm is a series of operations with data dependencies
• A sequential composition limits the parallelism due to strict synchronizations
• By following the flow of data we can loosen the synchronizations and transform them into data dependencies
Other Systems

System  | Scheduling      | Language                          | Accelerator | Availability
PaRSEC  | Distr. (1/core) | Internal, or Seq. w/ affine loops | GPU         | Public
SMPSs   | Repl. (1/node)  | Seq. w/ add_task                  | GPU         | Public
StarPU  | Repl. (1/node)  | Seq. w/ add_task                  | GPU         | Public
Charm++ | Distr. (Actors) | Msg-driven objects                | GPU         | Public
FLAME   | w/ SuperMatrix  | Internal (LA DSL)                 | GPU         | Public
QUARK   | Repl. (1/node)  | Seq. w/ add_task                  |             | Public
Tblas   | Centr.          | Seq. w/ add_task                  |             | Not avail.
PTG     | Centr.          | Internal                          |             | Not avail.

Early stage: ParalleX. Non-academic: Swarm, MadLINQ, CnC.
All projects support distributed and shared memory (QUARK with QUARKd; FLAME with Elemental).
History: Beginnings of Data Flow
• “Design of a separable transition-diagram compiler”, M. E. Conway, Comm. ACM, 1963
  • Coroutines, flow of data between processes
• J. B. Dennis, 1960s
  • Data flow representation of programs
  • Reasoning about parallelism, equivalence of programs, …
• “The semantics of a simple language for parallel programming”, G. Kahn
  • Kahn networks