

RUNTIME DATA FLOW SCHEDULING OF MATRIX COMPUTATIONS

ERNIE CHAN

ABSTRACT

We investigate the scheduling of matrix computations expressed as directed acyclic graphs for shared-memory parallelism. Because of the coarse data granularity in this problem domain, even slight variations in load balance or data locality can greatly affect performance. We provide a flexible framework for scheduling matrix computations, which we use to empirically quantify different scheduling algorithms. We have developed a scheduling algorithm that addresses both load balance and data locality simultaneously, and we show its performance benefits.

CHOLESKY FACTORIZATION

A → L L^T

where A is a symmetric positive definite matrix and L is a lower triangular matrix.

[Figure: blocked right-looking algorithm (left) and implementation (right) for Cholesky factorization; the FLASH implementation is listed under IMPLEMENTATION below.]

ALGORITHM-BY-BLOCKS

Reformulate a blocked algorithm to an algorithm-by-blocks by decomposing its sub-operations into component operations on blocks (tasks). Algorithm-by-blocks for Cholesky factorization given a 4×4 matrix of blocks, with tasks numbered in sequential program order:

ITERATION 1
  CHOL 0    A0,0 := CHOL( A0,0 )
  TRSM 1    A1,0 := A1,0 A0,0^-T
  TRSM 2    A2,0 := A2,0 A0,0^-T
  TRSM 3    A3,0 := A3,0 A0,0^-T
  SYRK 4    A1,1 := A1,1 - A1,0 A1,0^T
  GEMM 5    A2,1 := A2,1 - A2,0 A1,0^T
  GEMM 6    A3,1 := A3,1 - A3,0 A1,0^T
  SYRK 7    A2,2 := A2,2 - A2,0 A2,0^T
  GEMM 8    A3,2 := A3,2 - A3,0 A2,0^T
  SYRK 9    A3,3 := A3,3 - A3,0 A3,0^T

ITERATION 2
  CHOL 10   A1,1 := CHOL( A1,1 )
  TRSM 11   A2,1 := A2,1 A1,1^-T
  TRSM 12   A3,1 := A3,1 A1,1^-T
  SYRK 13   A2,2 := A2,2 - A2,1 A2,1^T
  GEMM 14   A3,2 := A3,2 - A3,1 A2,1^T
  SYRK 15   A3,3 := A3,3 - A3,1 A3,1^T

ITERATION 3
  CHOL 16   A2,2 := CHOL( A2,2 )
  TRSM 17   A3,2 := A3,2 A2,2^-T
  SYRK 18   A3,3 := A3,3 - A3,2 A3,2^T

ITERATION 4
  CHOL 19   A3,3 := CHOL( A3,3 )
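To make the notion of a task concrete, below is a minimal C sketch of the descriptor such a decomposition could produce. All names here (block_t, task_op_t, task_t) are hypothetical illustrations, not the FLA_Obj-based representation that FLAME/SuperMatrix actually uses.

#include <stddef.h>

/* Hypothetical handle for one block of the matrix; id identifies the
   block so that dependence analysis can compare operands. */
typedef struct block { int id; double* data; } block_t;

typedef enum { TASK_CHOL, TASK_TRSM, TASK_SYRK, TASK_GEMM } task_op_t;

/* A task reads up to two blocks and overwrites one block (which the
   update operations also read); the dependence fields are filled in
   later by the analyzer. */
typedef struct task {
    task_op_t    op;
    block_t*     in[2];           /* blocks read (GEMM reads two)      */
    int          n_in;
    block_t*     out;             /* block overwritten                 */
    int          n_unmet;         /* unsatisfied incoming dependencies */
    struct task* dependents[16];  /* outgoing DAG edges (fixed bound
                                     for the sketch)                   */
    int          n_dependents;
} task_t;

For instance, TRSM 1 would carry op = TASK_TRSM, in = { A0,0 }, and out = A1,0.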

DIRECTED ACYCLIC GRAPH

Map an algorithm-by-blocks to a directed acyclic graph by viewing tasks as the nodes and data dependencies between tasks as the edges in the graph. For example, TRSM 1 reads A0,0 after CHOL 0 overwrites it, which leads to a flow dependency (read-after-write) between these two tasks.

[Figure: directed acyclic graph for Cholesky factorization given a 4×4 matrix of blocks; the nodes are the twenty tasks CHOL 0 through CHOL 19 listed above, and the edges are the data dependencies between them.]
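These edges can be derived mechanically by scanning tasks in program order and comparing operand blocks. Below is a minimal sketch reusing the hypothetical task_t above; it records flow (read-after-write) and output (write-after-write) dependencies, a full analyzer would also track anti (write-after-read) dependencies, and the pairwise scan may add edges already implied transitively, which is harmless for correctness.

/* Does task t read block b? (b is also read through t->out for the
   update operations, but that case is caught by the write-after-write
   test below.) */
static int reads(const task_t* t, const block_t* b) {
    for (int i = 0; i < t->n_in; i++)
        if (t->in[i]->id == b->id) return 1;
    return 0;
}

/* Add a DAG edge from each earlier task to every later task that reads
   or overwrites the block the earlier task writes. */
void build_dag(task_t* tasks, int n) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            int raw = reads(&tasks[j], tasks[i].out);         /* read-after-write  */
            int waw = (tasks[j].out->id == tasks[i].out->id); /* write-after-write */
            if (raw || waw) {
                tasks[i].dependents[tasks[i].n_dependents++] = &tasks[j];
                tasks[j].n_unmet++;
            }
        }
}

For the example above, this scan adds the edge from CHOL 0 to TRSM 1, since TRSM 1 reads A0,0 and CHOL 0 writes it.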

SUPERMATRIX

Use separation of concerns to decouple the code that implements a linear algebra algorithm from the runtime system that exploits parallelism from an algorithm-by-blocks mapped to a directed acyclic graph.

ANALYZER

Delay the execution of operations and instead store all tasks sequentially in a global task queue. Internally calculate all data dependencies between tasks using only the input and output parameters of each task. Implicitly construct a directed acyclic graph (DAG) from the tasks and their data dependencies.

DISPATCHER

Once the analyzer completes, the dispatcher is invoked to schedule and dispatch tasks to threads in parallel:

foreach task in DAG do
    if task is ready then
        Enqueue task
    end
end

while tasks are available do
    Dequeue task
    Execute task
    foreach dependent task do
        Update dependent task
        if dependent task is ready then
            Enqueue dependent task
        end
    end
end
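A minimal pthreads sketch of this dispatcher loop follows, reusing the hypothetical task_t from above; the single shared ready queue makes it the single-queue multi-server configuration discussed under QUEUEING THEORY below. dispatcher_t, QCAP, and execute() are assumed helpers for the sketch, not SuperMatrix's internal API.

#include <pthread.h>

#define QCAP 1024

typedef struct {
    task_t*         ready[QCAP];      /* shared ready queue (ring buffer) */
    int             head, tail;
    int             n_done, n_total;  /* termination detection            */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} dispatcher_t;

void execute(task_t* t);  /* assumed: runs the CHOL/TRSM/SYRK/GEMM kernel */

static void enqueue_ready(dispatcher_t* d, task_t* t) {
    d->ready[d->tail++ % QCAP] = t;   /* caller holds d->lock */
    pthread_cond_broadcast(&d->cond); /* wake idle workers    */
}

void* worker(void* arg) {
    dispatcher_t* d = arg;
    for (;;) {
        pthread_mutex_lock(&d->lock);
        while (d->head == d->tail && d->n_done < d->n_total)
            pthread_cond_wait(&d->cond, &d->lock);
        if (d->head == d->tail) {               /* no work left: all done */
            pthread_mutex_unlock(&d->lock);
            return NULL;
        }
        task_t* t = d->ready[d->head++ % QCAP]; /* Dequeue task */
        pthread_mutex_unlock(&d->lock);

        execute(t);                             /* Execute task */

        pthread_mutex_lock(&d->lock);
        for (int i = 0; i < t->n_dependents; i++)
            if (--t->dependents[i]->n_unmet == 0)   /* Update dependent task  */
                enqueue_ready(d, t->dependents[i]); /* Enqueue if now ready   */
        d->n_done++;
        pthread_cond_broadcast(&d->cond);       /* let waiters see completion */
        pthread_mutex_unlock(&d->lock);
    }
}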

QUEUEING THEORY

[Figure: a multi-queue multi-server system, in which each processing element PE_1 ... PE_p enqueues and dequeues tasks on its own queue, versus a single-queue multi-server system, in which all processing elements enqueue and dequeue tasks on one shared queue.]

A single-queue multi-server system attains better load balance than a multi-queue multi-server system. Matrix computations exhibit coarse data granularity, so the cost of performing the Enqueue and Dequeue routines is amortized over the cost of executing the tasks. We developed the cache affinity scheduling algorithm, which uses a single priority queue to address load balance and a software cache per thread to address data locality, tackling both simultaneously.
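As an illustration only, here is one way the cache affinity idea could be realized on top of the dispatcher sketch above: each thread models its processor's cache as a short list of recently written block ids and, while holding d->lock, prefers a ready task whose output block it touched recently. The FIFO ring stands in for SuperMatrix's priority queue, and the fixed-size FIFO eviction is a guess; the actual data structures and heuristics differ.

#define SW_CACHE_SLOTS 8

/* Per-thread software cache: ids of blocks this thread wrote recently. */
typedef struct { int block_ids[SW_CACHE_SLOTS]; int next; } sw_cache_t;

static void sw_cache_touch(sw_cache_t* c, int block_id) {
    c->block_ids[c->next++ % SW_CACHE_SLOTS] = block_id;  /* FIFO eviction */
}

static int sw_cache_hit(const sw_cache_t* c, int block_id) {
    for (int i = 0; i < SW_CACHE_SLOTS; i++)
        if (c->block_ids[i] == block_id) return 1;
    return 0;
}

/* Caller holds d->lock and guarantees the queue is non-empty: prefer a
   ready task whose output block this thread touched recently; fall back
   to the head of the queue otherwise. */
static task_t* dequeue_with_affinity(dispatcher_t* d, sw_cache_t* c) {
    for (int i = d->head; i != d->tail; i++) {
        task_t* t = d->ready[i % QCAP];
        if (sw_cache_hit(c, t->out->id)) {
            d->ready[i % QCAP] = d->ready[d->head % QCAP];  /* swap to head */
            d->ready[d->head % QCAP] = t;
            break;
        }
    }
    task_t* t = d->ready[d->head++ % QCAP];
    sw_cache_touch(c, t->out->id);
    return t;
}

A worker would call dequeue_with_affinity in place of the plain dequeue in the sketch above, passing its thread-local sw_cache_t.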

IMPLEMENTATION

FLASH implementation of the blocked right-looking Cholesky factorization. With the SuperMatrix runtime, each FLASH_* call enqueues tasks that operate on blocks rather than executing immediately:

FLA_Error FLASH_Chol_l_var3( FLA_Obj A )
{
  FLA_Obj ATL, ATR,    A00, A01, A02,
          ABL, ABR,    A10, A11, A12,
                       A20, A21, A22;

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */  /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*-----------------------------------------------*/
    /* A11 := CHOL( A11 ) */
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    /* A21 := A21 * inv( A11^T ) */
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    /* A22 := A22 - A21 * A21^T */
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*-----------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* ************** */  /* ****************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }

  return FLA_SUCCESS;
}

PERFORMANCE

[Figures: performance of Cholesky factorization using several high-performance implementations (left), and finding the best block size for each problem size using SuperMatrix with cache affinity (right).]

CONCLUSION

Separation of concerns facilitates programmability and allows us to experiment with different scheduling algorithms and heuristics. Data locality is as important as load balance for scheduling matrix computations due to the coarse data granularity of this problem domain.

This research is partially funded by National Science Foundation (NSF) grants CCF-0540926 and CCF-0702714 and by Intel and Microsoft Corporations. We would like to thank the rest of the FLAME team for their support, namely Robert van de Geijn and Field Van Zee.

REFERENCES

[1] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations on SMP and multi-core architectures. In Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 2007.
[2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 123-132, Salt Lake City, UT, USA, February 2008.
[3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Robert A. van de Geijn, Field G. Van Zee, and Ernie Chan. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3):14:1-14:26, July 2009.