PaRSEC Parallel Runtime Scheduling and Execution Control
George Bosilca, Aurelien Bouteiller,
Anthony Danalis, Mathieu Faverge, Thomas Herault,
Stephanie Moreau, Jack Dongarra
Where we are today
• Today software developers face systems with:
  • ~1 TFLOP of compute power per node
  • 32+ cores, 100+ hardware threads
  • Highly heterogeneous architectures (cores + specialized cores + accelerators/coprocessors)
  • Deep memory hierarchies
  • But wait, there is more: they are distributed!
• Consequence: systemic load imbalance
• And we still ask the same question: how to harness these devices productively?
  • SPMD produces choke points and wasted wait times
  • We need to improve efficiency, power and reliability
The missing parallelism
• Too difficult to find parallelism, to debug, to maintain, and to get good performance for everyone
• Increasing gaps between the capabilities of today’s programming environments, the requirements of emerging applications, and the challenges of future parallel architectures
Popular belief
[Figure: task granularity — “over-sized” versus “decent sizes”]
Decouple “system issues” from the algorithm
• Keep the algorithm simple
  • Depict only the flow of data between tasks
• Distributed dataflow environment based on dynamic scheduling of (micro) tasks
• Programmability: layered approach
  • Algorithm / data distribution
  • Parallel applications without parallel programming
• Portability / efficiency
  • Use all available hardware; overlap data movements with computation
  • Progress is still possible when imbalance arises
Dataflow with Runtime Scheduling
• Algorithms need help to unleash their power
• Hardware specificities: a runtime can provide portability, performance, scheduling heuristics, heterogeneity management, data movement, …
• Scalability: maximize parallelism extraction, but no centralized scheduling or entire DAG unpacking: dynamic and independent discovery of the relevant portions during the execution
• Jitter resilience: do not support explicit communications; instead, make them implicit and schedule to maximize overlap and load balance
• The need to express the algorithms differently
[Figure: DAG of a tiled factorization (PO, TR, SY, GE tasks) distributed across Node0..Node3]
Divide-and-orchestrate
• Leave the optimization of each level to specialized entities
  • Compilers do marvels when the cost of memory accesses is known
  • Think hierarchical super-scalar
• Compilers focus on problems that fit in the cache
• Humans focus on depicting the flow of data between tasks
• And a runtime orchestrates the flows to maximize the throughput of the architecture
[Figure: tiled QR kernels (GEQRT, TSQRT, UNMQR, TSMQR); ScaLAPACK QR versus PLASMA QR]
The PaRSEC framework
[Figure: the PaRSEC framework stack —
Hardware: cores, memory hierarchies, coherence, data movement, accelerators;
Parallel Runtime: scheduling, data movement, tasks, data;
Domain Specific Extensions: dense LA, sparse LA, …; compact representation (PTG), dynamically discovered representation (DTG), specialized kernels]
Domain Specific Extensions
• DSEs ⇒ higher productivity for developers
  • High-level data types & operations tailored to the domain (e.g., relations, matrices, triangles, …)
• Portable and scalable specification of parallelism
  • Automatically adjust data structures, mapping, and scheduling as systems scale up
  • Toolkit of classical data distributions, etc.
PaRSEC toolchain
[Figure: the PaRSEC (DAGuE) toolchain — the programmer provides serial code, codelets, and a data distribution; the DAGuE compiler produces the dataflow representation; the dataflow compiler generates parallel task stubs; the system compiler links them with the application code & codelets, additional libraries (PLASMA, MAGMA), and the runtime (MPI, CUDA, pthreads) to run on the supercomputer; the Domain Specific Extensions sit on top of this toolchain]
Serial Code to Dataflow Representation
[Toolchain figure repeated, highlighting the PaRSEC (DAGuE) compiler step from the serial code to the dataflow representation]
Example: QR Factorization
Kernels: GEQRT, TSQRT, UNMQR, TSMQR
FOR k = 0 .. SIZE - 1
  A[k][k], T[k][k] <- GEQRT( A[k][k] )
  FOR m = k+1 .. SIZE - 1
    A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
  FOR n = k+1 .. SIZE - 1
    A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
    FOR m = k+1 .. SIZE - 1
      A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
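To get a feel for the DAG this loop nest generates (simple counting from the loop bounds above, not a measured figure): for SIZE = 4 tiles it produces 4 GEQRT, 6 TSQRT, 6 UNMQR and 14 TSMQR tasks, i.e. 30 tasks for a 4x4 tile matrix. The TSMQR count grows as the cube of the number of tiles, which is why the DAG quickly becomes too large to build explicitly.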
Input Format – Quark/StarPU/MORSE

for (k = 0; k < A.mt; k++) {
    Insert_Task( zgeqrt, A[k][k], INOUT,
                         T[k][k], OUTPUT );
    for (m = k+1; m < A.mt; m++) {
        Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D | REGION_U,
                             A[m][k], INOUT | LOCALITY,
                             T[m][k], OUTPUT );
    }
    for (n = k+1; n < A.nt; n++) {
        Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                             T[k][k], INPUT,
                             A[k][n], INOUT );
        for (m = k+1; m < A.mt; m++)
            Insert_Task( ztsmqr, A[k][n], INOUT,
                                 A[m][n], INOUT | LOCALITY,
                                 A[m][k], INPUT,
                                 T[m][k], INPUT );
    }
}

• Sequential C code
• Annotated through some specific syntax
  • Insert_Task
  • INOUT, OUTPUT, INPUT
  • REGION_L, REGION_U, REGION_D, …
  • LOCALITY
Dataflow Analysis
• Data flow analysis: example on task DGEQRT of QR
• Polyhedral analysis through the Omega Test
• Compute algebraic expressions for:
  • Source and destination tasks
  • Necessary conditions for that data flow to exist
FOR k = 0 .. SIZE - 1
  A[k][k], T[k][k] <- GEQRT( A[k][k] )
  FOR m = k+1 .. SIZE - 1
    A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
  FOR n = k+1 .. SIZE - 1
    A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
    FOR m = k+1 .. SIZE - 1
      A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
[Figure: incoming and outgoing data of the analyzed task — memory (MEM), loop indices k = 0 .. SIZE-1, m = k+1, n = k+1, LOWER and UPPER tile regions]
[Toolchain figure repeated]
Intermediate Representation: Job Data Flow
[Figure: QR task DAG — GEQRT, TSQRT, UNMQR, TSMQR]
Control flow is eliminated, therefore maximum parallelism is possible
GEQRT(k)
  /* Execution space */
  k = 0 .. ( (MT < NT) ? MT-1 : NT-1 )
  /* Locality */
  : A(k, k)

  RW A    <- (k == 0) ? A(k, k) : A1 TSMQR(k-1, k, k)
          -> (k < NT-1) ? A UNMQR(k, k+1 .. NT-1)  [type = LOWER]
          -> (k < MT-1) ? A1 TSQRT(k, k+1)         [type = UPPER]
          -> (k == MT-1) ? A(k, k)                 [type = UPPER]
  WRITE T <- T(k, k)
          -> T(k, k)
          -> (k < NT-1) ? T UNMQR(k, k+1 .. NT-1)

  /* Priority */
  ; (NT-k)*(NT-k)*(NT-k)

BODY
  zgeqrt( A, T )
END
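Reading the JDF above concretely (a hand-worked instance, not generated output): for a 3x3 tile matrix (MT = NT = 3), GEQRT(0) reads A(0,0) directly from memory since k == 0, sends the lower triangular part of its output to UNMQR(0,1) and UNMQR(0,2), sends the upper part to TSQRT(0,1), and forwards T(0,0) to the same UNMQR tasks; only GEQRT(2), the last panel, writes its upper part back to A(2,2).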
Dataflow Representation → PaRSEC Compiler
[Toolchain figure repeated, highlighting the step from the dataflow representation to the generated parallel task stubs and the runtime]
Example: Reduction Operation
• Reduction: apply a user-defined operator to each data element and store the result in a single location (assume the operator is associative and commutative)
• Issue: non-affine loops lead to non-polyhedral array accesses
for (s = 1; s < N; s = 2*s)
    for (i = 0; i < N-s; i += 2*s)
        operator(V[i], V[i+s]);
Example: Reduction Operation

reduce(l, p)
  : V(p)
  l = 1 .. depth+1
  p = 0 .. (MT / (1<<l))

  RW A   <- (1 == l) ? V(2*p) : A reduce( l-1, 2*p )
         -> ((depth+1) == l) ? V(0)
         -> (0 == (p%2)) ? A reduce(l+1, p/2) : B reduce(l+1, p/2)
  READ B <- ((p*(1<<l) + (1<<(l-1))) > MT) ? V(0)
         <- (1 == l) ? V(2*p+1)
         <- (1 != l) ? A reduce( l-1, p*2+1 )

BODY
  operator(A, B);
END
Solution: hand-write the data dependencies using the intermediate dataflow representation
Data/Task Distribution
• Flexible data distribution
  • Decoupled from the algorithm
  • Expressed as a user-defined function (a minimal sketch follows below)
  • Only limitation: it must evaluate uniformly across all nodes
• Common distributions provided in the DSEs
  • 1D cyclic, 2D cyclic, etc.
  • Symbol matrix for sparse direct solvers
[Toolchain figure repeated, highlighting the data distribution input]
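As a concrete illustration of such a user-defined function, here is a minimal sketch in C (a hypothetical example, not PaRSEC's actual data descriptor API): a classical 2D block-cyclic mapping of tiles to ranks. The only property the runtime relies on is that every node evaluates it identically.

  #include <assert.h>

  /* Hypothetical 2D block-cyclic distribution over a P x Q process grid
   * (not PaRSEC's actual descriptor interface). */
  typedef struct {
      int P, Q;   /* process grid dimensions: P * Q == number of nodes */
  } grid_2d_t;

  /* Rank owning tile (m, n); must return the same value on every node. */
  static int rank_of_tile(const grid_2d_t *g, int m, int n)
  {
      assert(m >= 0 && n >= 0);
      return (m % g->P) * g->Q + (n % g->Q);
  }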
Parallel Runtime
Algorithm is now expressed as a DAG (potentially parameterized)
• DAG too large to be generated ahead of time
  • Generate it dynamically
  • Merge parameterized DAGs with dynamically generated DAGs
• HPC is about distributed heterogeneous resources
  • Have to get involved in message passing
  • Distributed management of the scheduling
  • Dynamically deal with heterogeneity
Runtime DAG scheduling
• Every process has the symbolic DAG representation
  • Only the (node-local) frontier of the DAG is considered (sketched below)
• Distributed scheduling based on remote completion notifications
• Background remote data transfers, automatic and overlapped with computation
• NUMA / cache aware scheduling
  • Work stealing and sharing based on memory hierarchies
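A minimal sketch of what this completion-driven scheduling means (hypothetical C, not PaRSEC's internal API): each task keeps a counter of unsatisfied inputs; when a task completes, the counters of its successors are decremented, newly ready local tasks join the frontier, and remote successors receive a completion notification together with the data.

  #include <stddef.h>

  typedef struct task_s task_t;
  struct task_s {
      int      deps_remaining;   /* inputs not yet produced             */
      int      owner_rank;       /* node that will execute this task    */
      task_t **successors;       /* tasks consuming this task's outputs */
      int      nb_successors;
  };

  void complete_task(task_t *t, int my_rank,
                     void (*push_ready)(task_t *),
                     void (*notify_remote)(int rank, task_t *))
  {
      for (int i = 0; i < t->nb_successors; i++) {
          task_t *s = t->successors[i];
          if (s->owner_rank != my_rank) {
              /* Remote successor: ship the output and a completion
               * notification; the owner updates its own frontier. */
              notify_remote(s->owner_rank, s);
          } else if (--s->deps_remaining == 0) {
              /* Local successor just entered the ready frontier. */
              push_ready(s);
          }
      }
  }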
[Figure: the factorization DAG (PO, TR, SY, GE tasks) partitioned across Node0..Node3; each node unfolds only its local portion]
Scheduling Heuristics in PaRSEC
• Manages parallelism & locality
  • Achieve efficient execution (performance, power, …)
  • Handles specifics of the HW system (hyper-threading, NUMA, …)
• Per-object capabilities
  • Read-only or write-only, output data, private, relaxed coherence
  • The engine tracks data usage and aims to improve data reuse
  • NUMA-aware hierarchical bounded buffers implement work stealing (see the sketch below)
• User hints: expressions for the distance to the critical path
  • Insertion in the waiting queue abides by priority, but work stealing can alter this ordering due to locality
• Communication heuristics
  • Communications inherit the priority of the destination task
• Algorithm-defined scheduling
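A small sketch of those NUMA-aware hierarchical queues (hypothetical structures, not PaRSEC's internals): a worker pops from ready queues ordered by memory distance, so stolen work stays as close as possible to its data.

  #include <stddef.h>

  typedef struct task_s task_t;

  /* One bounded ready queue per level of the memory hierarchy. */
  typedef struct {
      task_t **items;
      int      head, tail, capacity;
  } bounded_queue_t;

  static task_t *queue_pop(bounded_queue_t *q)
  {
      if (q->head == q->tail) return NULL;            /* empty */
      return q->items[q->head++ % q->capacity];
  }

  /* queues[0] = this core, queues[1] = same socket, queues[2] = whole node. */
  static task_t *next_task(bounded_queue_t *queues[], int nb_levels)
  {
      for (int lvl = 0; lvl < nb_levels; lvl++) {
          task_t *t = queue_pop(queues[lvl]);
          if (t != NULL)
              return t;                               /* nearest available work wins */
      }
      return NULL;                                    /* idle until a new completion */
  }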
What’s next? Now we have a runtime and some algorithms
Auto-tuning
• Multi-level tuning
  • Tune the kernels based on the local architecture
  • Then tune the algorithm: depends on the network, and on the type and number of cores
• For a fixed-size matrix, increasing the task duration (or the tile size) decreases parallelism
• For best performance: auto-tune per system
[Plot: QR efficiency (%) versus block size NB (120 to 1000), for 1 node (8 cores), 4 nodes (32 cores), and 81 nodes (648 cores); small blocks yield not enough parallelism. Companion figure: QR kernel DAG (GEQRT, TSQRT, UNMQR, TSMQR), weak-scaling annotation]
Platform: 81 nodes, dual Intel Xeon (2x4 cores/node), 648 cores total; MX 10 Gb/s interconnect; Intel MKL; ScaLAPACK.
[Plot: DPOTRF performance, problem scaling on 648 cores (Myrinet 10G) — theoretical peak, practical peak (GEMM), DAGuE, DSBP, ScaLAPACK; TFlop/s versus matrix size N (13k to 130k)]
[Plot: DGEQRF performance, problem scaling on 648 cores (Myrinet 10G) — theoretical peak, practical peak (GEMM), DAGuE, ScaLAPACK; TFlop/s versus matrix size N (13k to 130k)]
[Plot: DGETRF performance, problem scaling on 648 cores (Myrinet 10G) — theoretical peak, practical peak (GEMM), DAGuE, HPL, ScaLAPACK; TFlop/s versus matrix size N (13k to 130k)]
• Extracts more parallelism
• Independence between data layout and algorithm
• Competes with hand-tuned codes
• Hardware-conscious scheduling
Scalability in Distributed Memory
• Parameterized Task Graph representation
• Independent distributed scheduling
• Scalable
[Plot: DGEQRF performance, weak scaling on a Cray XT5 (108 to 3072 cores) — practical peak (GEMM), DAGuE, libSCI ScaLAPACK; performance (TFlop/s) versus number of cores]
HPL or LU with Partial Pivoting Inria Bordeaux, UTK
[Plot: LU factorization on Dancer (16 nodes x 8 cores E5520, IB 20 Gb/s, Intel MKL) — GEMM peak, without pivoting, incremental pivoting, partial pivoting, ScaLAPACK; solid = 8 cores/node, dashed = 7 cores/node; GFlop/s (0 to 1200) versus matrix size (0 to 50000)]
Figure 6.1: Performance of LU decomposition with PTG
is a mathematical library for parallel distributed-memory machines using the LAPACK library. We used the Netlib1 version. ScaLAPACK was run with one MPI process per core.
We implemented Task flow 3 for the panel factorization and Task flow 5 for the swapping operations in the update. We used a 2D block-cyclic distribution for the matrix. The tile size was 200 for DAGuE and 120 for ScaLAPACK, chosen to obtain the best asymptotic performance. The matrices were generated randomly in the same way for all measurements.
For each measurement, we run five executions, discard the maximal and minimal values obtained, and report the average of the three remaining values. In most of the tests we obtain a low standard deviation (less than 1 GFlop/s).
The results obtained are encouraging: our implementation outperforms the ScaLAPACK implementation and reaches 75% of the GEMM peak. However, the incremental pivoting algorithm, which enables more parallelism and fewer synchronizations in the panel factorization, still provides better performance.
1http://www.netlib.org/
• Why is LU different? Two reasons:
  • The panel is expressed using dataflow, generating many tiny tasks
    • Provides very poor performance
  • The DAG depends on the content of the data
• The strict dataflow model we imposed on this exercise prevents message aggregation
  • Our implementation is not message-optimal
• Explore all cases with as few tasks as possible
Hierarchical QR — ENS Lyon, Inria Bordeaux, UTK, UCD
• How to compose trees to get the best pipeline?
  • Flat, binary, Fibonacci, greedy, …
• Study on critical path lengths (see the counting note below)
• Surprisingly, flat trees are better for communications on square cases:
  • Fewer communications
  • Good pipeline
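For intuition on why the tree shape matters (simple counting, not a result taken from the study): reducing a panel of p tile rows with a flat tree puts p-1 dependent kernels on the critical path, while a binary tree needs only about log2(p) levels; for p = 16 that is 15 versus 4 steps. Trees therefore shorten the critical path on tall-and-skinny panels, whereas flat trees pipeline better across successive panels, which is why they win on square matrices.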
Sparse Direct Solvers — Inria Bordeaux, UTK
• Based on the PaStiX solver
  • Super-nodal method, as in SuperLU
  • Exploits an elimination tree => DAG
• LU, LLt & LDLt factorizations for shared memory with DAGuE/PaRSEC and StarPU
• GPU panel and update kernels
  • Based on a blocked representation
• Problems:
  • Around 40 times more tasks than an equivalent dense factorization
  • Average task size can be very small (20x20)
  • Sizes are not regular
TABLE I: Matrices description

Name | N          | NNZ(A)     | Fill ratio | OPC      | Type    | Factorization | Source
MHD  | 485,597    | 12,359,369 | 61.20      | 9.84e+12 | Real    | LU            | University of Minnesota
Audi | 943,695    | 39,297,771 | 31.28      | 5.23e+12 | Real    | LLT           | PARASOL Collection
10M  | 10,423,737 | 89,072,871 | 75.66      | 1.72e+14 | Complex | LDLT          | French CEA-Cesta
[Fig. 6: LU decomposition on MHD (double precision) — factorization time (s, 0 to 1600) versus number of threads (1 to 48), for PaStiX, PaStiX with StarPU, and PaStiX with DAGuE]
original scheduler on the Audi and MHD test matrices. A good scalability was obtained using all three approaches. On these two test cases, the generic schedulers behaved quite well and obtained performance similar to the fine-tuned PaStiX scheduler. On a small number of processors, we could even obtain better results using the generic schedulers. DAGuE lost some performance when more than a socket (12 cores) was used, but recovered as soon as the computation spanned more than 2 sockets. The DAGuE scheduler did not seem able to extract any performance gain between 12 and 24 cores.
With the 10M test matrix (Fig. 7), the finely-tuned scheduler of PaStiX outperformed the generic runtimes. Specifically, StarPU showed its limitations as the number of threads increased, while DAGuE maintained a good scalability. This can be explained by the fact that StarPU does not take data locality into account.
[Fig. 7: LDLT decomposition on 10M (double complex) — factorization time (s, 0 to 40,000) versus number of threads (1 to 48), for PaStiX, PaStiX with StarPU, and PaStiX with DAGuE]
B. Experiments with GPUs
Fig. 8 and 9 show the results of using one and two GPUs. For these experiments, we did not use all 48 available cores for computation, as one core was dedicated to each GPU. The CUDA kernel gave good acceleration in both single and double precision. The factorization time was reduced significantly using the first GPU when the number of cores was small; speedups of up to 5 were obtained using one GPU. However, the second GPU was relevant only with a small number of cores (fewer than 4 threads in single precision (Fig. 8), and fewer than 12 threads in double precision (Fig. 9)). With one GPU, once the cores on a socket (12 cores on Romulus, and 6 on Mirage) were fully utilized, the GPU had no effect on the factorization time.
We also conducted additional GPU experiments using a compute node of the Mirage machine from Inria Bordeaux. Mirage nodes are composed of two hexa-core processors (Westmere Intel Xeon X5650) with 36 GB of RAM. The results on Mirage (Fig. 10) were similar to the ones on Romulus.
V. CONCLUSION
In this paper, we examined the potential benefits of using generic runtime systems, DAGuE and StarPU, in the parallel sparse direct solver PaStiX. The experimental results, using up to 48 cores and two NVIDIA Tesla GPUs, demonstrated the potential of this approach to design a sparse direct solver on heterogeneous manycore architectures with accelerators using a uniform interface.
Heterogeneity Support
• A BODY is a task implementation for a specific device (a codelet)
• Currently the system supports CUDA devices and cores
• A CUDA device is considered as one additional memory level
• Data locality and data versioning define the transfers to and from the GPU/co-processors
/* POTRF Lower case */
GEMM(k, m, n)
  // Execution space
  k = 0 .. MT-3
  m = k+2 .. MT-1
  n = k+1 .. m-1
  // Parallel partitioning
  : A(m, n)
  // Parameters
  READ A <- C TRSM(m, k)
  READ B <- C TRSM(n, k)
  RW   C <- (k == 0) ? A(m, n) : C GEMM(k-1, m, n)
       -> (n == k+1) ? C TRSM(m, n) : C GEMM(k+1, m, n)
BODY [CPU, CUDA, MIC, *]
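To make the multi-device BODY concrete, here is a hedged sketch of what the CPU and CUDA variants of this GEMM codelet could call (standard BLAS/cuBLAS calls are assumed; this is not PaRSEC's actual BODY syntax, and the handle/stream management done by the runtime is omitted):

  #include <cblas.h>
  #include <cublas_v2.h>

  /* CPU codelet: C = C - A * B^T on nb x nb tiles (the Cholesky trailing update). */
  void gemm_body_cpu(int nb, const double *A, const double *B, double *C)
  {
      cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                  nb, nb, nb, -1.0, A, nb, B, nb, 1.0, C, nb);
  }

  /* CUDA codelet: same operation, on tiles already resident in GPU memory. */
  void gemm_body_cuda(cublasHandle_t handle, int nb,
                      const double *dA, const double *dB, double *dC)
  {
      const double alpha = -1.0, beta = 1.0;
      cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                  nb, nb, nb, &alpha, dA, nb, dB, nb, &beta, dC, nb);
  }

The runtime selects one of these bodies at execution time, based on where the task's data currently resides.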
• Single node
  • 4x Tesla (C1060)
  • 16 cores (AMD Opteron)
• Multi-GPU, single node
• Multi-GPU, distributed
[Plot: SPOTRF performance, problem scaling on 12 GPU nodes (Infiniband 20G) — practical peak (GEMM), C2070 + 8 cores per node, 8 cores per node; performance (TFlop/s, 0 to 10) versus matrix size (20k to 120k)]
[Plot: SPOTRF performance, weak scaling on 12 GPU nodes (Infiniband 20G) — practical peak (GEMM), C2070 + 8 cores per node, 8 cores per node; performance (TFlop/s, 0 to 10) versus number of nodes and matrix size (1;54k up to 12;176k)]
Scalability
• 12 nodes
• 12x Fermi (C2070)
• 8 cores/node (Intel Core2)
[Plot: single-node performance with 1 to 4 Tesla C1060 GPUs — GFlop/s (0 to 1400) versus matrix size N (10k to 50k)]
Energy efficiency
• Energy used depending on the number of cores
• Up to 62% more energy efficient while using a high-performance tuned scheduling
• Power-efficient scheduling
Total energy consumption QR factorization (256 cores)
Work in progress with Hatem Ltaief
[Figure 11: power profiles of the Cholesky factorization — System, CPU, Memory and Network power (Watts, 0 to 25000) over time (0 to 35 s), for (a) ScaLAPACK and (b) DPLASMA]
[Figure 12: power profiles of the QR factorization — System, CPU, Memory and Network power (Watts, 0 to 25000) over time (0 to 100 s), for (a) ScaLAPACK and (b) DPLASMA]
smaller number of cores. The engine could then decide to turn off or lower the frequencies of the cores using Dynamic Voltage Frequency Scaling [16], a commonly used technique with which it is possible to achieve a reduction of energy consumption.
Figure 13. Total amount of energy (joules) used for each test, based on the number of cores

# Cores | Library   | Cholesky | QR
128     | ScaLAPACK | 192000   | 672000
128     | DPLASMA   | 128000   | 540000
256     | ScaLAPACK | 240000   | 816000
256     | DPLASMA   | 96000    | 540000
512     | ScaLAPACK | 325000   | 1000000
512     | DPLASMA   | 125000   | 576000
SystemG: Virginia Tech energy-monitored cluster (IB 40G, Intel, 8 cores/node)
Analysis Tools
Hermitian Band Diagonal; 16x16 tiles
Conclusion
• Programming must be made easy(ier)
  • Portability: inherently take advantage of all hardware capabilities
  • Efficiency: deliver the best performance on several families of algorithms
• Computer scientists were spoiled by MPI
  • Now let’s think about our users
• Let different people focus on different problems
  • Application developers on their algorithms
  • System developers on system issues
  • Compilers on whatever they can
The end
Resilience
• Faults propagate in the system based on the data dependencies
• However, if the original data can be recovered, the execution completes without user interaction
• Automatic recovery made simple
Composition
• An algorithm is a series of operations with data dependencies
• A sequential composition limits the parallelism due to strict synchronizations
• By following the flow of data we can loosen the synchronizations and transform them into data dependencies
Other Systems

System  | Scheduling      | Language                          | Accelerator | Availability
PaRSEC  | Distr. (1/core) | Internal, or Seq. w/ affine loops | GPU         | Public
SMPSs   | Repl. (1/node)  | Seq. w/ add_task                  | GPU         | Public
StarPU  | Repl. (1/node)  | Seq. w/ add_task                  | GPU         | Public
Charm++ | Distr. (Actors) | Msg-driven objects                | GPU         | Public
FLAME   | w/ SuperMatrix  | Internal (LA DSL)                 | GPU         | Public
QUARK   | Repl. (1/node)  | Seq. w/ add_task                  |             | Public
Tblas   | Centr.          | Seq. w/ add_task                  |             | Not avail.
PTG     | Centr.          | Internal                          |             | Not avail.

Early stage: ParalleX. Non-academic: Swarm, MadLINQ, CnC.
All projects support distributed and shared memory (QUARK with QUARKd; FLAME with Elemental).
History: Beginnings of Data Flow
• “Design of a separable transition-diagram compiler”, M. E. Conway, Comm. ACM, 1963
  • Coroutines, flow of data between processes
• J. B. Dennis, 1960s
  • Data flow representation of programs
  • Reasoning about parallelism, equivalence of programs, …
• “The semantics of a simple language for parallel programming”, G. Kahn
  • Kahn networks