[Soccer] is a very simple game. It's just very hard to play it simple. - Johan Cruyff

[Dense Linear Algebra] is a very simple subject. It's just very hard to make it simple. - RvdG
Designing a Library to be Multi-Accelerator Ready: A Case Study

Robert van de Geijn
Department of Computer Science
Institute for Computational Engineering and Sciences
The University of Texas at Austin
Outline

• Motivation
• What is FLAME?
• Deriving algorithms to be correct
• Representing algorithms in code
• Of blocked algorithms and algorithms-by-blocks
• Runtime support for multicore, GPU, and multiGPU
• Extensions to distributed memory platforms
• Related work
• Conclusion
Moments of Inspiration

Birth of multi-threaded libflame:
• Aug. 2006: an insight: libflame + algorithm-by-blocks + out-of-order scheduling (runtime) = multithreaded library
• Sept. 2006: working prototype (by G. Quintana)
• Oct. 2006: grant proposal (to NSF, later funded)
• Jan. 2007: paper submitted (to SPAA07, accepted)
• April 2007: released with libflame R1.0

Birth of multi-GPU libflame:
• Fall 2007: the runtime used to manage data and tasks on a single GPU (UJI, Spain)
• March 2008: NVIDIA donates a 4-GPU Tesla S870 system
• Two hours after unboxing the boards, multiple heuristics for the multiGPU runtime were implemented
• Then the power cord fried…
[Photos: "After two hours" / "Shortly after"]
Birth of MultiGPU libflame
G. Quintana, Igual, E. Quintana, van de Geijn. "Solving Dense Linear Algebra Problems on Platforms with Multiple Hardware Accelerators." PPoPP’09.
What Supports our Productivity/Performance?

• Deep understanding of the domain
• Foundational computer science: derivation of algorithms, software implementation of hardware techniques, blocking for performance
• Abstraction
• Separation of concerns
What is FLAME?

• A notation for expressing linear algebra algorithms
• A methodology for deriving such algorithms
• A set of abstractions for representing such algorithms in LaTeX, M-script, C, etc.
• A modern library (libflame): an alternative to BLAS, LAPACK, ScaLAPACK, and related efforts, with many new contributions to the theory and practice of dense linear algebra; also banded and Krylov subspace methods
• A set of tools supporting the above: mechanical derivation, automatic generation of code, Design-by-Transformation (DxT)
Who is FLAME?
Deriving Algorithms to be Correct

Include all algorithms for a given operation, so that the right algorithm can be picked for a given architecture.

Problem: how to find the right algorithm?

Solution: formal derivation (Hoare, Dijkstra, …): given an operation, systematically derive the family of algorithms for computing it.
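To make this concrete: for Cholesky, A = L L^T, the derivation is driven by loop invariants over the partitioned matrix. The following is our illustration of one such invariant (notation as in the repartitioning slides below; hats denote the original contents of A), not a formula taken from the deck:

\[
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & \star \\ \hat{A}_{BL} L_{TL}^{-T} & \hat{A}_{BR} - L_{BL} L_{BL}^{T} \end{pmatrix},
\qquad \hat{A}_{TL} = L_{TL} L_{TL}^{T},\; L_{BL} = \hat{A}_{BL} L_{TL}^{-T}.
\]

Different invariants (e.g., one that leaves A_BR untouched) yield different members of the family of algorithms for the same operation.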
Notation
The notation used to express an algorithm should reflect how the algorithm is naturally explained.
Example: The Cholesky Factorization

Lower triangular case: A = L L^T, with L lower triangular.

Key in the solution of s.p.d. linear systems:
A x = b  becomes  (L L^T) x = b;  solve L y = b for y, then L^T x = y for x.
Algorithm Loop: Repartition

  [ ATL   *  ]       [ A00    *    *  ]
  [ ABL  ABR ]  -->  [ a10^T a11   *  ]
                     [ A20   a21  A22 ]

(Indexing operations only; no computation.)
Algorithm Loop: Update

a11 := sqrt(a11)
a21 := a21 / a11
A22 := A22 - a21 a21^T

(Real computation, as opposed to indexing.)
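To make the update concrete, here is a plain-C unblocked Cholesky that performs exactly these per-iteration steps on a column-major lower triangle. This is our sketch for illustration, not libflame code:

#include <math.h>

/* Unblocked Cholesky (lower): overwrite the lower triangle of the
   n x n column-major matrix A (leading dimension lda) with L such
   that A = L L^T. Returns 0 on success, j+1 if the j-th leading
   minor is not positive definite. */
int chol_unb_lower( int n, double *A, int lda )
{
  for ( int j = 0; j < n; j++ ) {
    double ajj = A[ j + j*lda ];
    if ( ajj <= 0.0 ) return j + 1;
    ajj = sqrt( ajj );                    /* a11 := sqrt(a11)       */
    A[ j + j*lda ] = ajj;
    for ( int i = j+1; i < n; i++ )       /* a21 := a21 / a11       */
      A[ i + j*lda ] /= ajj;
    for ( int k = j+1; k < n; k++ )       /* A22 := A22 - a21 a21^T */
      for ( int i = k; i < n; i++ )       /* (lower triangle only)  */
        A[ i + k*lda ] -= A[ i + j*lda ] * A[ k + j*lda ];
  }
  return 0;
}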
Algorithm Loop: Merging

  [ A00    *    *  ]       [ ATL   *  ]
  [ a10^T a11   *  ]  -->  [ ABL  ABR ]
  [ A20   a21  A22 ]

after the update a11 := sqrt(a11), a21 := a21 / a11, A22 := A22 - a21 a21^T.
(Indexing operation only.)
Worksheet for Cholesky Factorization
Mechanical Derivation of Algorithms

A mechanical procedure develops the algorithms from the mathematical specification (e.g., A = L L^T).

Paolo Bientinesi. "Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms." Ph.D. Dissertation, UT-Austin, August 2006.
Is Formal Derivation Practical?

• libflame: 128+ matrix operations, 1389+ implementations of algorithms
• Test suite created in 2011: 126,756 tests executed
• Only 3 minor bugs found in the library… (now fixed)
Impact on (Single) GPU Computing

[Performance chart; baseline: CUBLAS 2009, which has been optimized since]
Impact on (Single) GPU Computing

Fran Igual. "Matrix Computations on Graphics Processors and Clusters of GPUs." Ph.D. Dissertation, Univ. Jaume I, May 2011.
Igual, G. Quintana, and van de Geijn. "Level-3 BLAS on a GPU: Picking the Low Hanging Fruit." FLAME Working Note #37, Universidad Jaume I, updated May 21, 2009.

[Performance chart vs. CUBLAS 2009]
A Sampling of Functionality

Operation                     | Classic FLAME | SuperMatrix (multithreaded/multiGPU) | lapack2flame
Level-3 BLAS                  | y             | y                                    | N.A.
Cholesky                      | y             | y                                    | y
LU with partial pivoting      | y             | y                                    | y
LU with incremental pivoting  | y             | y                                    | N.A.
QR (UT)                       | y             | y                                    | y
LQ (UT)                       | y             | y                                    | y
SPD/HPD inversion             | y             | y                                    | y
Triangular inversion          | y             | y                                    | y
Triangular Sylvester          | y             | y                                    | y
Triangular Lyapunov           | y             | y                                    | y
Up-and-downdate (UT)          | y             |                                      | N.A.
SVD                           | next week     | soon                                 |
EVD                           | next week     | soon                                 |
Representing Algorithms in Code
Code should closely resemble how an algorithm is presented so that no bugs can be introduced when translating an algorithm to code.
Spark + APIs (C, F77, Matlab, LabView, LaTeX):
http://www.cs.utexas.edu/users/flame/Spark/
FLAME/C API

FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
  b = min( FLA_Obj_length( ABR ), nb_alg );

  FLA_Repart_2x2_to_3x3(
      ATL, /**/ ATR,       &A00,  /**/ &a01,     &A02,
    /* ************* */  /* ************************** */
                           &a10t, /**/ &alpha11, &a12t,
      ABL, /**/ ABR,       &A20,  /**/ &a21,     &A22,
      1, 1, FLA_BR );
  /*--------------------------------------*/
  FLA_Sqrt( alpha11 );
  FLA_Inv_scal( alpha11, a21 );
  FLA_Syr( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
           FLA_MINUS_ONE, a21, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2(
      &ATL, /**/ &ATR,       A00,  a01,     /**/ A02,
                             a10t, alpha11, /**/ a12t,
    /* ************** */   /* ************************/
      &ABL, /**/ &ABR,       A20,  a21,     /**/ A22,
      FLA_TL );
}
For now, libflame employs external BLAS: GotoBLAS, MKL, ACML, CUBLAS
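For completeness, a minimal sketch of a driver around the loop above, assuming the libflame conventions shown in this deck (FLA_Init/FLA_Finalize, object creation, FLA_Chol). Exact argument lists may differ between libflame versions, so treat this as illustrative:

#include "FLAME.h"

int main( void )
{
  FLA_Obj A;
  int     n = 1000;

  FLA_Init();

  /* Create an n x n double-precision matrix object
     (0, 0 requests default row/column strides). */
  FLA_Obj_create( FLA_DOUBLE, n, n, 0, 0, &A );

  /* ... fill A with a symmetric positive definite matrix ... */

  /* A := L, its lower-triangular Cholesky factor. */
  FLA_Chol( FLA_LOWER_TRIANGULAR, A );

  FLA_Obj_free( &A );
  FLA_Finalize();
  return 0;
}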
Multicore/MultiGPU: Issues

Manage computation:
• Assignment of tasks to cores and/or GPUs
• Granularity is important

Manage memory:
• Manage data transfer between the "host" and the caches of the cores, or between the host and the GPUs' local memories
• Granularity is important
• Keep data in local memory as long as possible
Where have we seen this before?

Computer architecture, late 1960s: superscalar units proposed.
• Unit of data: floating point number
• Unit of computation: floating point operation
• Examine dependencies
• Execute out-of-order, prefetch, cache data, etc., to keep the computational units busy
• Extract parallelism from a sequential instruction stream

R. M. Tomasulo. "An Efficient Algorithm for Exploiting Multiple Arithmetic Units." IBM J. of R&D, 1967. The basis for the exploitation of ILP on current superscalar processors!
Of Blocks and Tasks

Dense matrix computation:
• Unit of data: a block of the matrix
• Unit of computation (task): an operation with blocks
• Dependency: input/output of an operation with blocks
• Instruction stream: the sequential libflame code, which generates a DAG
• The runtime system schedules the tasks

Goal: minimize data transfer and maximize utilization. (A sketch of blocks and tasks as data structures follows.)
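As a sketch of what "blocks as units of data, tasks as units of computation" might look like as data structures (our illustration; field and type names are hypothetical, not SuperMatrix internals):

enum TaskKind { TASK_CHOL, TASK_TRSM, TASK_SYRK, TASK_GEMM };

/* Unit of data: a b x b tile of the matrix. */
typedef struct Block {
  double *data;         /* contiguous b x b tile                  */
  int     last_writer;  /* index of the last task that writes it, */
} Block;                /* or -1; gives flow dependencies         */

/* Unit of computation: an operation with blocks. */
typedef struct Task {
  enum TaskKind kind;
  Block        *in[2];  /* input blocks                           */
  Block        *inout;  /* block that is read and overwritten     */
  int           ndeps;  /* unsatisfied input dependencies         */
} Task;

/* As the sequential code runs, each call appends a Task and wires
   an edge from the last writer of each operand: the resulting
   graph is the DAG the runtime schedules. */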
Review: Blocked Algorithms

Cholesky factorization proceeds iteration by iteration (1st, 2nd, 3rd, …), performing per iteration:

A11 := L11, where A11 = L11 L11^T
A21 := L21 = A21 L11^-T
A22 := A22 - L21 L21^T

(A plain-C sketch of this blocked algorithm follows.)
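Rendered in plain C (our sketch, reusing chol_unb_lower from earlier; a real library would call BLAS trsm/syrk here), the blocked algorithm is:

/* Blocked Cholesky (lower), column-major, blocksize nb. */
void chol_blk_lower( int n, int nb, double *A, int lda )
{
  for ( int k = 0; k < n; k += nb ) {
    int     b   = ( n - k < nb ) ? n - k : nb;
    int     m   = n - k - b;                /* rows below the panel */
    double *A11 = &A[  k    +  k   *lda ];
    double *A21 = &A[ (k+b) +  k   *lda ];
    double *A22 = &A[ (k+b) + (k+b)*lda ];

    chol_unb_lower( b, A11, lda );          /* A11 := L11           */

    for ( int j = 0; j < b; j++ )           /* A21 := A21 L11^{-T}  */
      for ( int i = 0; i < m; i++ ) {
        double s = A21[ i + j*lda ];
        for ( int p = 0; p < j; p++ )
          s -= A21[ i + p*lda ] * A11[ j + p*lda ];
        A21[ i + j*lda ] = s / A11[ j + j*lda ];
      }

    for ( int c = 0; c < m; c++ )           /* A22 -= A21 A21^T     */
      for ( int r = c; r < m; r++ ) {       /* (lower triangle)     */
        double s = 0.0;
        for ( int p = 0; p < b; p++ )
          s += A21[ r + p*lda ] * A21[ c + p*lda ];
        A22[ r + c*lda ] -= s;
      }
  }
}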
Blocked Algorithms

Cholesky factorization, A = L L^T, via the APIs + tools:

FLA_Part_2x2(…);
while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
  FLA_Repart_2x2_to_3x3(…);
  /*--------------------------------------*/
  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
            FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A11, A21 );
  FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2(…);
}
Simple Parallelization: Blocked Algorithms

Link the blocked code above with a multi-threaded BLAS; each FLA_Chol, FLA_Trsm, and FLA_Syrk call is then parallelized internally:

A11 := L11, where A11 = L11 L11^T
A21 := L21 = A21 L11^-T
A22 := A22 - L21 L21^T
Blocked Algorithms

There is more parallelism! Independent tasks exist both inside the same iteration (1st iteration) and in different iterations (1st and 2nd iterations).
Coding Algorithm-by-Blocks

Algorithm-by-blocks:

FLA_Part_2x2(…);
while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
  FLA_Repart_2x2_to_3x3(…);
  /*--------------------------------------*/
  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2(…);
}

A11 := L11, where A11 = L11 L11^T
A21 := L21 = A21 L11^-T
A22 := A22 - L21 L21^T
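The FLASH_ routines above operate on hierarchically stored matrices: a matrix of blocks, each block contiguous so that it can serve as the unit of both computation and data transfer. A sketch of the idea (ours; not the actual FLASH object layout):

/* A hierarchical matrix: an mb x nb array of pointers to b x b
   tiles. Each tile is stored contiguously, so moving a block to
   a GPU is a single transfer and a task touches whole tiles. */
typedef struct HierMatrix {
  int      mb, nb;    /* number of block rows / columns    */
  int      b;         /* storage blocksize                 */
  double **blocks;    /* blocks[ i + j*mb ] -> tile (i,j)  */
} HierMatrix;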
Generating a DAG

[Figure: the DAG of tasks 1-10 produced for a small Cholesky factorization]

Executing the sequential algorithm-by-blocks loop above performs no computation directly: each FLASH_* call enqueues a task on blocks, and the input/output blocks of those tasks define the edges of a DAG. (A sketch that enumerates these tasks follows.)
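The tasks in that DAG can be enumerated directly from the loop; the following plain-C sketch (ours) prints them for a matrix of p x p blocks. For p = 3 it emits exactly ten tasks, matching the figure:

#include <stdio.h>

/* Enumerate the tasks of algorithm-by-blocks Cholesky on a p x p
   matrix of blocks; each task depends on the tasks that last wrote
   its operands (flow dependencies). */
void print_chol_tasks( int p )
{
  for ( int k = 0; k < p; k++ ) {
    printf( "CHOL A(%d,%d)\n", k, k );
    for ( int i = k+1; i < p; i++ )
      printf( "TRSM A(%d,%d) using A(%d,%d)\n", i, k, k, k );
    for ( int j = k+1; j < p; j++ )
      for ( int i = j; i < p; i++ )
        printf( "%s A(%d,%d) -= A(%d,%d) A(%d,%d)^T\n",
                ( i == j ) ? "SYRK" : "GEMM", i, j, i, k, j, k );
  }
}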
Managing Tasks and Blocks

Separation of concerns:
• The sequential libflame routine generates the DAG.
• The runtime system (SuperMatrix) manages and schedules the DAG.
• As one moves from one architecture to another, only the runtime system needs to be updated: multicore, out-of-core, single GPU, multiGPU, distributed runtime, …

(A skeleton of such a dependency-driven runtime follows.)
Runtime System: SuperMatrix

[Figure: SuperMatrix schedules the DAG of tasks 1-10 onto a multicore, guided by a heuristic]
Runtime System for GPU: SuperMatrix

[Figure: SuperMatrix schedules the DAG onto CPU + GPU and manages data transfer to the accelerator]
Runtime System for MultiGPU: SuperMatrix

[Figure: SuperMatrix schedules the DAG onto CPU + multiple GPUs and manages data transfer to the accelerators]
MultiGPU

How do we program these?

[Diagram: CPU(s) attached via the PCI-e bus to GPU #0, #1, #2, and #3, with an interconnect]
MultiGPU: a User's View

FLA_Obj A;
// Initialize conventional matrix: buffer, m, rs, cs
// Obtain storage blocksize, # of threads: b, n_threads

FLA_Init();
FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A );
FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A );
FLASH_Queue_set_num_threads( n_threads );
FLASH_Queue_enable_gpu();
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
FLASH_Obj_free( &A );
FLA_Finalize();
MultiGPU: Under the Cover

Naïve approach:
• Before execution, transfer the data to the device.
• Call CUBLAS operations (their implementation is "someone else's problem").
• Upon completion, retrieve the results back to the host.

Result: poor data locality. (A sketch of this pattern follows.)
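In CUDA terms the naive pattern looks roughly like this; cudaMemcpy is the real CUDA runtime call, while run_task_on_gpu stands in for whatever CUBLAS routine the task maps to (our sketch):

#include <cuda_runtime.h>

extern void run_task_on_gpu( double *d_blk );  /* hypothetical */

/* Naive per-task pattern: every task pays two PCIe transfers,
   even when the very next task reuses the same block. */
void naive_task( double *h_blk, double *d_blk, size_t bytes )
{
  cudaMemcpy( d_blk, h_blk, bytes, cudaMemcpyHostToDevice ); /* in   */
  run_task_on_gpu( d_blk );                                  /* work */
  cudaMemcpy( h_blk, d_blk, bytes, cudaMemcpyDeviceToHost ); /* out  */
}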
MultiGPU: Under the Cover

How do we program these? View the system as a shared-memory multiprocessor plus a Distributed Shared Memory (DSM) architecture.
MultiGPU: Under the Cover

View the system as a shared-memory multiprocessor (a multi-core processor with hardware coherence):

[Diagram: MP = P0+Cache0, P1+Cache1, P2+Cache2, P3+Cache3, mapped onto the CPU(s) and the four GPUs]
MultiGPU: Under the Cover

Software Distributed-Shared Memory (DSM):
• Software: flexibility vs. efficiency
• The underlying distributed memory is hidden
• Reduce memory transfers using write-back, write-invalidate, …
• A well-known approach, though not too efficient as middleware for general applications

The regularity of dense linear algebra makes a difference!
MultiGPU: Under the Cover

Reduce the number of data transfers: the run-time handles device memory as a software cache:
• Operate at the block level
• Software flexibility
• Write-back
• Write-invalidate

(A sketch of such a per-device block cache follows.)
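A sketch of the per-device block cache (ours; names hypothetical, not the SuperMatrix implementation):

/* MESI-like state of one block's copy on one device. */
enum BlockState { B_INVALID, B_SHARED, B_DIRTY };

typedef struct CacheEntry {
  void            *dev_ptr;  /* copy in this GPU's memory */
  enum BlockState  state;
} CacheEntry;

/* Policy sketch:
   - Before a task READS a block on device d: if the entry is
     B_INVALID, transfer the block in and mark it B_SHARED.
   - After a task WRITES a block on device d: mark it B_DIRTY
     there and B_INVALID on every other device (write-invalidate).
   - The host copy is refreshed only when some other device or
     the host actually needs the block (write-back). */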
MultiGPU: Under the Cover

SuperMatrix executes the algorithm-by-blocks loop:

FLA_Part_2x2(…);
while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
  FLA_Repart_2x2_to_3x3(…);
  /*--------------------------------------*/
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2(…);
}

Step by step:
• Factor A11 on the host.
• Transfer A11 from the host to the appropriate devices before using it in subsequent computations (write-update).
• Cache A11 in the receiving device(s) in case it is needed in subsequent computations.
• Send blocks to the devices and perform Trsm on the blocks of A21 (hopefully using the cached A11); keep the updated A21 in the device until needed by other GPU(s) (write-back).
• Send blocks to the devices and perform Syrk/Gemm on the blocks of A22 (hopefully using the cached blocks of A21); keep the updated A22 in the device until needed by other GPU(s) (write-back).
Performance results:

[Charts: C = C + A B^T on the S1070 (4 x Tesla); Cholesky on the S1070 (4 x Tesla); a sampling of LAPACK functionality on the S2050 (4 x Fermi)]
libflame for Cluster (+ Accelerators)
PLAPACK:
• Distributed memory (MPI); inspired FLAME
• Recently modified so that each node can have a GPU
• Keeps data in GPU memory as much as possible

Elemental (Jack Poulson):
• Distributed memory (MPI); inspired by FLAME/PLAPACK
• Can use a GPU at each node/core

libflame + SuperMatrix:
• The runtime schedules tasks and data transfers
• Appropriate for small clusters
PLAPACK + GPU Accelerators

Fogue, Igual, E. Quintana, van de Geijn. "Retargeting PLAPACK to Clusters with Hardware Accelerators." WEHA 2010.

Each node: Xeon Nehalem (8 cores) + 2 NVIDIA C1060 (Tesla).
Targeting Clusters with GPUs: SuperMatrix Distributed Runtime

Igual, G. Quintana, van de Geijn. "Scheduling Algorithms-by-Blocks on Small Clusters." Concurrency and Computation: Practice and Experience. In review.

Each node: Xeon Nehalem (8 cores) + 1 NVIDIA C2050 (Fermi).
Elemental Cholesky Factorization
Elemental vs. ScaLAPACK

Cholesky on 8192 cores of a BlueGene/P.

Elemental has full ScaLAPACK functionality (except the nonsymmetric eigenvalue problem).
Poulson, Marker, Hammond, Romero, van de Geijn. "Elemental: A New Framework for Distributed Memory Dense Matrix Computations." ACM TOMS. Submitted.
Single-Chip Cloud Computer

• Intel SCC research processor: a 48-core concept vehicle
• Created for many-core software research
• Custom communication library (RCCE)
SCC Results

• 48 Pentium cores
• MPI replaced by RCCE
Igual, G. Quintana, van de Geijn. “Scheduling Algorithms-by-Blocks on Small Clusters.” Concurrency and Computation: Practice and Experience. In review.
Marker, Chan, Poulson, van de Geijn, Van der Wijngaart, Mattson, Kubaska. "Programming Many-Core Architectures - A Case Study: Dense Matrix Computations on the Intel SCC Processor." Concurrency and Computation: Practice and Experience. To Appear.
Related Work

Data-flow parallelism, dynamic scheduling, runtimes:
• Cilk
• OpenMP (task queues)
• StarSs (SMPSs)
• StarPU
• Threading Building Blocks (TBB)
• …

What we have is very specific to dense linear algebra.
Dense Linear Algebra Libraries

Target Platform                     | LAPACK Project     | FLAME Project
Sequential                          | LAPACK             | libflame
Sequential + multithreaded BLAS     | LAPACK             | libflame
Multicore/multithreaded             | PLASMA             | libflame + SuperMatrix
Multicore + out-of-order scheduling | PLASMA + Quark     | libflame + SuperMatrix
CPU + single GPU                    | MAGMA              | libflame + SuperMatrix
Multicore + multiGPU                | DAGuE?             | libflame + SuperMatrix
Distributed memory                  | ScaLAPACK          | libflame + SuperMatrix, PLAPACK, Elemental
Distributed memory + GPU            | DAGuE?, ScaLAPACK? | libflame + SuperMatrix, PLAPACK, Elemental
Out-of-Core                         | ?                  | libflame + SuperMatrix
Comparison with Quark

Agullo, Bouwmeester, Dongarra, Kurzak, Langou, Rosenberg. "Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures." VecPar, 2010.
Conclusions

• Programmability is the key to harnessing parallel computation: one code, many target platforms.
• Formal derivation provides confidence in the code: if there is a problem, it is not in the library!
• Separation of concerns:
  - The library developer derives algorithms and codes them.
  - Execution of the routines generates a DAG.
  - Parallelism, temporal locality, and spatial locality are captured in the DAG.
  - The runtime system uses appropriate heuristics to schedule it.
What does this mean for you?

One successful approach:
1. Identify units of data and units of computation.
2. Write a sequential program that generates a DAG.
3. Hand the DAG to a runtime for scheduling.
The Future

Currently, the library is an instantiation in code. In the future:
• Create a repository of algorithms, expert knowledge about algorithms, and knowledge about target architectures.
• Mechanically generate a library for a target architecture, exactly as an expert would.
• Design-by-Transformation (DxT).

Bryan Marker, Andy Terrel, Jack Poulson, Don Batory, and Robert van de Geijn. "Mechanizing the Expert Dense Linear Algebra Developer." FLAME Working Note #58, 2011.
Availability

Everything that has been discussed is available under the LGPL or BSD license:
• libflame + SuperMatrix: http://www.cs.utexas.edu/users/flame/
• Elemental: http://code.google.com/p/elemental/
[Soccer] is a very simple game. It's just very hard to play it simple. - Johan Cruyff

[Dense Linear Algebra] is a very simple subject. It's just very hard to make it simple. - RvdG