RUNTIME DATA FLOW SCHEDULING OF MATRIX COMPUTATIONS
ERNIE CHAN
Iteration 1:
  CHOL0:  A0,0 := CHOL(A0,0)
  TRSM1:  A1,0 := A1,0 A0,0^-T
  TRSM2:  A2,0 := A2,0 A0,0^-T
  TRSM3:  A3,0 := A3,0 A0,0^-T
  SYRK4:  A1,1 := A1,1 - A1,0 A1,0^T
  GEMM5:  A2,1 := A2,1 - A2,0 A1,0^T
  GEMM6:  A3,1 := A3,1 - A3,0 A1,0^T
  SYRK7:  A2,2 := A2,2 - A2,0 A2,0^T
  GEMM8:  A3,2 := A3,2 - A3,0 A2,0^T
  SYRK9:  A3,3 := A3,3 - A3,0 A3,0^T
Iteration 2:
  CHOL10: A1,1 := CHOL(A1,1)
  TRSM11: A2,1 := A2,1 A1,1^-T
  TRSM12: A3,1 := A3,1 A1,1^-T
  SYRK13: A2,2 := A2,2 - A2,1 A2,1^T
  GEMM14: A3,2 := A3,2 - A3,1 A2,1^T
  SYRK15: A3,3 := A3,3 - A3,1 A3,1^T
Iteration 3:
  CHOL16: A2,2 := CHOL(A2,2)
  TRSM17: A3,2 := A3,2 A2,2^-T
  SYRK18: A3,3 := A3,3 - A3,2 A3,2^T
Iteration 4:
  CHOL19: A3,3 := CHOL(A3,3)
ABSTRACT

Keywords: Cholesky factorization, algorithm-by-blocks, directed acyclic graph, queueing theory, performance, SuperMatrix
A → L L^T
where A is a symmetric positive definite matrix and L is a lower triangular matrix.
Blocked right-looking algorithm (left) and implementation (right) for Cholesky factorization
We investigate the scheduling of matrix computations expressed as directed acyclic graphs for shared-memory parallelism. Because of the coarse data granularity in this problem domain, even slight variations in load balance or data locality can greatly affect performance. We provide a flexible framework for scheduling matrix computations, which we use to quantify different scheduling algorithms empirically. We have developed a scheduling algorithm that addresses both load balance and data locality simultaneously and show its performance benefits.
[DAG levels: CHOL0 → TRSM1, TRSM2, TRSM3 → SYRK4, GEMM5, GEMM6, SYRK7, GEMM8, SYRK9 → CHOL10 → TRSM11, TRSM12 → SYRK13, GEMM14, SYRK15 → CHOL16 → TRSM17 → SYRK18 → CHOL19]
CONCLUSION
Performance of Cholesky factorization using several high-performance implementations (left)
and finding the best block size for each problem size using SuperMatrix with cache affinity (right)
Algorithm-by-blocks for Cholesky factorization given a 4×4 matrix of blocks
Reformulate a blocked algorithm to an algorithm-by-blocks by decomposing sub-operations into component operations on blocks (tasks).
Directed acyclic graph for Cholesky factorization given a 4×4 matrix of blocks
REFERENCES
[1] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations on SMP and multi-core architectures. In Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 2007.
[2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 123-132, Salt Lake City, UT, USA, February 2008.
[3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Robert A. van de Geijn, Field G. Van Zee, and Ernie Chan. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3):14:1-14:26, July 2009.
Separation of concerns facilitates programmability and allows us to experiment with different scheduling algorithms and heuristics. Data locality is as important as load balance for scheduling matrix computations due to the coarse data granularity of this problem domain.
This research is partially funded by National Science Foundation (NSF) grants CCF-0540926 and CCF-0702714 and by Intel and Microsoft Corporations. We would like to thank the rest of the FLAME team for their support, namely Robert van de Geijn and Field Van Zee.
Map an algorithm-by-blocks to a directed acyclic graph by viewing tasks as the nodes and data dependencies between tasks as the edges in the graph.
TRSM1 reads A0,0 after CHOL0 overwrites it, which creates a flow (read-after-write) dependency between these two tasks.
ANALYZER / DISPATCHER
Use separation of concerns to divide the code that implements a linear algebra algorithm from the runtime system that exploits parallelism from an algorithm-by-blocks mapped to a directed acyclic graph.
Delay the execution of operations and instead store all tasks sequentially in a global task queue.
Internally calculate all data dependencies between tasks using only the input and output parameters of each task.
Implicitly construct a directed acyclic graph (DAG) from the tasks and data dependencies.
Once the analyzer completes, the dispatcher is invoked to schedule and dispatch tasks to threads in parallel.
foreach task in DAG do
  if task is ready then
    Enqueue task
  end
end
while tasks are available do
  Dequeue task
  Execute task
  foreach dependent task do
    Update dependent task
    if dependent task is ready then
      Enqueue dependent task
    end
  end
end
Multi-queue multi-server system: each processing element PE1 … PEp enqueues to and dequeues from its own task queue.
Single-queue multi-server system: all processing elements PE1 … PEp enqueue to and dequeue from one shared task queue.
A single-queue multi-server system attains better load balance than a multi-queue multi-server system.
Matrix computations exhibit coarse data granularity, so the cost of performing the Enqueue and Dequeue routines is amortized over the cost of executing the tasks.
We developed the cache affinity scheduling algorithm that uses a single priority queue to address load balance and software caches with each thread to address data locality simultaneously.
FLA_Error FLASH_Chol_l_var3( FLA_Obj A )
{
  FLA_Obj ATL, ATR, A00, A01, A02,
          ABL, ABR, A10, A11, A12,
                    A20, A21, A22;

  FLA_Part_2x2( A, &ATL, &ATR,
                   &ABL, &ABR, 0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,   &A00, /**/ &A01, &A02,
                        /* ************* */ /* ******************** */
                                            &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,   &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*-----------------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*-----------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,   A00, A01, /**/ A02,
                                                 A10, A11, /**/ A12,
                           /* ************** */ /* ****************** */
                              &ABL, /**/ &ABR,   A20, A21, /**/ A22,
                              FLA_TL );
  }

  return FLA_SUCCESS;
}