RUNTIME DATA FLOW SCHEDULING OF MATRIX COMPUTATIONS
ERNIE CHAN
Iteration 1:
  CHOL0:  A0,0 := CHOL(A0,0)
  TRSM1:  A1,0 := A1,0 A0,0^-T
  TRSM2:  A2,0 := A2,0 A0,0^-T
  TRSM3:  A3,0 := A3,0 A0,0^-T
  SYRK4:  A1,1 := A1,1 - A1,0 A1,0^T
  GEMM5:  A2,1 := A2,1 - A2,0 A1,0^T
  GEMM6:  A3,1 := A3,1 - A3,0 A1,0^T
  SYRK7:  A2,2 := A2,2 - A2,0 A2,0^T
  GEMM8:  A3,2 := A3,2 - A3,0 A2,0^T
  SYRK9:  A3,3 := A3,3 - A3,0 A3,0^T
Iteration 2:
  CHOL10: A1,1 := CHOL(A1,1)
  TRSM11: A2,1 := A2,1 A1,1^-T
  TRSM12: A3,1 := A3,1 A1,1^-T
  SYRK13: A2,2 := A2,2 - A2,1 A2,1^T
  GEMM14: A3,2 := A3,2 - A3,1 A2,1^T
  SYRK15: A3,3 := A3,3 - A3,1 A3,1^T
Iteration 3:
  CHOL16: A2,2 := CHOL(A2,2)
  TRSM17: A3,2 := A3,2 A2,2^-T
  SYRK18: A3,3 := A3,3 - A3,2 A3,2^T
Iteration 4:
  CHOL19: A3,3 := CHOL(A3,3)
ABSTRACT

Keywords: Cholesky factorization, algorithm-by-blocks, directed acyclic graph, queueing theory, performance, SuperMatrix
A → L L^T
where A is a symmetric positive definite matrix and L is a lower triangular matrix.
Blocked right-looking algorithm (left) and implementation (right) for Cholesky factorization
We investigate the scheduling of matrix computations expressed as directed acyclic graphs for shared-memory parallelism. Because of the coarse data granularity in this problem domain, even slight variations in load balance or data locality can greatly affect performance. We provide a flexible framework for scheduling matrix computations, which we use to quantify different scheduling algorithms empirically. We have developed a scheduling algorithm that addresses both load balance and data locality simultaneously and show its performance benefits.
[DAG levels: CHOL0 → TRSM1, TRSM2, TRSM3 → SYRK4, GEMM5, GEMM6, SYRK7, GEMM8, SYRK9 → CHOL10 → TRSM11, TRSM12 → SYRK13, GEMM14, SYRK15 → CHOL16 → TRSM17 → SYRK18 → CHOL19]
CONCLUSION
Performance of Cholesky factorization using several high-performance implementations (left)
and finding the best block size for each problem size using SuperMatrix with cache affinity (right)
Algorithm-by-blocks for Cholesky factorization given a 4×4 matrix of blocks
Reformulate a blocked algorithm to an algorithm-by-blocks by decomposing sub-operations into component operations on blocks (tasks).
Directed acyclic graph for Cholesky factorization given a 4×4 matrix of blocks
REFERENCES
[1] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations on SMP and multi-core architectures. In Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 2007.
[2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 123-132, Salt Lake City, UT, USA, February 2008.
[3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Robert A. van de Geijn, Field G. Van Zee, and Ernie Chan. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3):14:1-14:26, July 2009.
Separation of concerns facilitates programmability and allows us to experiment with different scheduling algorithms and heuristics. Data locality is as important as load balance for scheduling matrix computations due to the coarse data granularity of this problem domain.
This research is partially funded by National Science Foundation (NSF) grants CCF-0540926 and CCF-0702714 and by Intel and Microsoft Corporations. We would like to thank the rest of the FLAME team for their support, namely Robert van de Geijn and Field Van Zee.
Map an algorithm-by-blocks to a directed acyclic graph by viewing tasks as the nodes and data dependencies between tasks as the edges in the graph.
TRSM1 reads A0,0 after CHOL0 overwrites it, which creates a flow (read-after-write) dependency between these two tasks.
ANALYZER / DISPATCHER
Use separation of concerns to divide the code that implements a linear algebra algorithm from the runtime system that exploits parallelism from an algorithm-by-blocks mapped to a directed acyclic graph.
Delay the execution of operations and instead store all tasks sequentially in a global task queue.
Internally calculate all data dependencies between tasks using only the input and output parameters of each task.
Implicitly construct a directed acyclic graph (DAG) from the tasks and data dependencies.
Once the analyzer completes, the dispatcher is invoked to schedule and dispatch tasks to threads in parallel.
foreach task in DAG do
  if task is ready then
    Enqueue task
  end
end
while tasks are available do
  Dequeue task
  Execute task
  foreach dependent task do
    Update dependent task
    if dependent task is ready then
      Enqueue dependent task
    end
  end
end
Multi-queue multi-server system: each processing element PE1 … PEp enqueues to and dequeues from its own task queue.
Single-queue multi-server system: all processing elements PE1 … PEp enqueue to and dequeue from one shared task queue.
A single-queue multi-server system attains better load balance than a multi-queue multi-server system.
Matrix computations exhibit coarse data granularity, so the cost of performing the Enqueue and Dequeue routines is amortized over the cost of executing the tasks.
We developed the cache affinity scheduling algorithm that uses a single priority queue to address load balance and software caches with each thread to address data locality simultaneously.
FLA_Error FLASH_Chol_l_var3( FLA_Obj A )
{
  FLA_Obj ATL, ATR, A00, A01, A02,
          ABL, ABR, A10, A11, A12,
                    A20, A21, A22;

  FLA_Part_2x2( A, &ATL, &ATR,
                   &ABL, &ABR, 0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,   &A00, /**/ &A01, &A02,
                        /* ************* */ /* ******************** */
                                            &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,   &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*-----------------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*-----------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,   A00, A01, /**/ A02,
                                                 A10, A11, /**/ A12,
                           /* ************** */ /* ****************** */
                              &ABL, /**/ &ABR,   A20, A21, /**/ A22,
                              FLA_TL );
  }

  return FLA_SUCCESS;
}