
PACC2011, Sept. 2011

[Soccer] is a very simple game. It’s just very hard to play it simple. - Johan Cruyff

Dense Linear Algebra subject make

RvdG


Robert van de Geijn

Designing a Library to be Multi-Accelerator Ready: A Case Study

Department of Computer Science
Institute for Computational Engineering and Sciences

The University of Texas at Austin


Outline

• Motivation
• What is FLAME?
• Deriving algorithms to be correct
• Representing algorithms in code
• Of blocked algorithms and algorithms-by-blocks
• Runtime support for multicore, GPU, and multiGPU
• Extensions to distributed memory platforms
• Related work
• Conclusion


Moments of Inspiration

Birth of multi-threaded libflame
• Aug. 2006 - an insight: libflame + algorithm-by-blocks + out-of-order scheduling (runtime) = multithreaded library
• Sept. 2006 - working prototype (by G. Quintana)
• Oct. 2006 - grant proposal (to NSF, later funded)
• Jan. 2007 - paper submitted (to SPAA07, accepted)
• April 2007 - released with libflame R1.0

Birth of multi-GPU libflame
• Fall 2007 - runtime used to manage data and tasks on a single GPU. (UJI-Spain)
• March 2008 - NVIDIA donates 4GPU Tesla S870 system
• Two hours after unboxing the boards, multiple heuristics for multiGPU runtime implemented
• Then the power cord fried…


[Figure panels: "After two hours" and "Shortly after"]

Birth of MultiGPU libflame

G. Quintana, Igual, E. Quintana, van de Geijn. "Solving Dense Linear Algebra Problems on Platforms with Multiple Hardware Accelerators." PPoPP’09.


What Supports our Productivity/Performance?

• Deep understanding of the domain
• Foundational computer science
  • Derivation of algorithms
  • Software implementation of hardware techniques
  • Blocking for performance
• Abstraction
• Separation of concerns


What is FLAME?

• A notation for expressing linear algebra algorithms
• A methodology for deriving such algorithms
• A set of abstractions for representing such algorithms
  • In LaTeX, M-script, C, etc.
• A modern library (libflame)
  • Alternative to BLAS, LAPACK, ScaLAPACK, and related efforts
  • Many new contributions to theory and practice of dense linear algebra
  • Also banded and Krylov subspace methods
• A set of tools supporting the above
  • Mechanical derivation
  • Automatic generation of code
  • Design-by-Transformation (DxT)


Who is FLAME?


Deriving Algorithms to be Correct

Include all algorithms for a given operation: Pick the right algorithm for the given architecture

Problem: How to find the right algorithm

Solution: Formal derivation (Hoare, Dijkstra, …):

Given an operation, systematically derive a family of algorithms for computing it.


Notation

The notation used to express an algorithm should reflect how the algorithm is naturally explained.


Example: The Cholesky Factorization

Lower triangular case: $A = L L^T$

Key in the solution of s.p.d. linear systems:

$A x = b \;\Rightarrow\; (L L^T) x = b$: solve $L y = b$ for $y$, then solve $L^T x = y$ for $x$.
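A minimal sketch of the two triangular solves, in plain C rather than the FLAME API. The name `chol_solve` and the row-major storage are assumptions made for illustration; the factor L is taken as already computed.

```c
#include <assert.h>
#include <stddef.h>

/* Solve A x = b given the lower-triangular Cholesky factor L (A = L L^T).
 * L is n-by-n, row-major; only its lower triangle is read.
 * On entry x holds b; on exit it holds the solution. */
void chol_solve(size_t n, const double *L, double *x)
{
    assert(n > 0 && L != NULL && x != NULL);

    /* Forward substitution: L y = b (y overwrites x). */
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < i; ++j)
            x[i] -= L[i * n + j] * x[j];
        x[i] /= L[i * n + i];
    }

    /* Back substitution: L^T x = y, using (L^T)(i,j) = L(j,i). */
    for (size_t i = n; i-- > 0; ) {
        for (size_t j = i + 1; j < n; ++j)
            x[i] -= L[j * n + i] * x[j];
        x[i] /= L[i * n + i];
    }
}
```

Each solve costs about n^2 flops, so once L is available, additional right-hand sides are cheap compared with the factorization itself.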


Algorithm Loop: Repartition

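$$\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & a_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$$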

Indexing operations


Algorithm Loop: Update

$$a_{11} := \sqrt{a_{11}}, \qquad a_{21} := a_{21} / a_{11}, \qquad A_{22} := A_{22} - a_{21} a_{21}^T$$

($A_{00}$, $a_{10}^T$, and $A_{20}$ are unchanged.)

Real computation


Algorithm Loop: Merging

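$$\begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & a_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix} \rightarrow \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}$$

with $a_{11}$, $a_{21}$, and $A_{22}$ holding their updated values.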

Indexing operation


Worksheet for Cholesky Factorization


Is Formal Derivation Practical?

libflame:
• 128+ matrix operations
• 1389+ implementations of algorithms
• Test suite created in 2011
  • 126,756 tests executed
  • Only 3 minor bugs in library… (now fixed)


Impact on (Single) GPU Computing

CUBLAS 2009 (has been optimized since)


Impact on (Single) GPU Computing

Fran Igual. "Matrix Computations on Graphics Processors and Clusters of GPUs." Ph.D. Dissertation. Univ. Jaume I. May 2011.
Igual, G. Quintana, and van de Geijn. "Level-3 BLAS on a GPU: Picking the Low Hanging Fruit." FLAME Working Note #37. Universidad Jaume I, updated May 21, 2009.

CUBLAS 2009


A Sampling of Functionality

Operation                      Classic FLAME   SuperMatrix (MultiThreaded/MultiGPU)   lapack2flame
Level-3 BLAS                   y               y                                      N.A.
Cholesky                       y               y                                      y
LU with partial pivoting       y               y                                      y
LU with incremental pivoting   y               y                                      N.A.
QR (UT)                        y               y                                      y
LQ (UT)                        y               y                                      y
SPD/HPD inversion              y               y                                      y
Triangular inversion           y               y                                      y
Triangular Sylvester           y               y                                      y
Triangular Lyapunov            y               y                                      y

Up-and-downdate (UT) y N.A.

SVD next week soon

EVD next week soon


Representing Algorithms in Code

Code should closely resemble how an algorithm is presented so that no bugs can be introduced when translating an algorithm to code.


Representing algorithms in code

Spark + APIs: C, F77, Matlab, LabView, LaTeX

http://www.cs.utexas.edu/users/flame/Spark/


FLAME/C API

FLA_Part_2x2( A, &ATL, &ATR,
                 &ABL, &ABR, 0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){
  b = min( FLA_Obj_length( ABR ), nb_alg );

  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,     &A00,  /**/ &a01,     &A02,
                      /* ************* */ /* ************************** */
                                            &a10t, /**/ &alpha11, &a12t,
                         ABL, /**/ ABR,     &A20,  /**/ &a21,     &A22,
                         1, 1, FLA_BR );
  /*--------------------------------------*/
  FLA_Sqrt( alpha11 );
  FLA_Inv_scal( alpha11, a21 );
  FLA_Syr( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
           FLA_MINUS_ONE, a21, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,     A00,  a01,     /**/ A02,
                                                 a10t, alpha11, /**/ a12t,
                         /* ************** */ /* ************************/
                            &ABL, /**/ &ABR,     A20,  a21,     /**/ A22,
                            FLA_TL );
}

For now, libflame employs external BLAS: GotoBLAS, MKL, ACML, CUBLAS


Multicore/MultiGPU - Issues

• Manage computation
  • Assignment of tasks to cores and/or GPUs
  • Granularity is important
• Manage memory
  • Manage data transfer between “host” and caches of cores or host and GPU local memories
  • Granularity is important
  • Keep the data in the local memory as long as possible


Where have we seen this before?

Computer architecture late 1960s:
• Superscalar units proposed
• Unit of data: floating point number
• Unit of computation: floating point operation
• Examine dependencies
• Execute out-of-order, prefetch, cache data, etc., to keep computational units busy
• Extract parallelism from sequential instruction stream

R. M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units.” IBM J. of R&D, (1967)

Basis for exploitation of ILP on current superscalar processors!


Of Blocks and Tasks

Dense matrix computation
• Unit of data: block in matrix
• Unit of computation (task): operation with blocks
• Dependency: input/output of operation with blocks
• Instruction stream: sequential libflame code
  • Generates DAG
• Runtime system schedules tasks

Goal: minimize data transfer and maximize utilization


Review: Blocked Algorithms

Cholesky factorization

1st iteration 2nd iteration 3rd iteration

$A_{11} = L_{11} L_{11}^T$

$A_{21} := L_{21} = A_{21} L_{11}^{-T}$

$A_{22} := A_{22} - L_{21} L_{21}^T$
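Of the three updates, the middle one is the least obvious: it is a triangular solve with multiple right-hand sides, not an explicit inverse. A minimal plain-C sketch follows; the name `trsm_right_lower_trans` and the contiguous row-major block layout are assumptions for illustration, and a real library would call the BLAS `trsm` routine instead.

```c
#include <assert.h>
#include <stddef.h>

/* Overwrite the m-by-b block A21 with A21 * inv(L11)^T, where L11 is a
 * b-by-b lower-triangular block with nonzero diagonal. Both blocks are
 * stored row-major and contiguously (leading dimension b). */
void trsm_right_lower_trans(size_t m, size_t b, const double *L11, double *A21)
{
    assert(m > 0 && b > 0 && L11 != NULL && A21 != NULL);

    for (size_t r = 0; r < m; ++r) {
        double *x = &A21[r * b];   /* one row of A21, solved in place */
        /* Row r satisfies x * L11^T = a, i.e. forward substitution on L11. */
        for (size_t j = 0; j < b; ++j) {
            for (size_t p = 0; p < j; ++p)
                x[j] -= x[p] * L11[j * b + p];
            x[j] /= L11[j * b + j];
        }
    }
}
```

The rows of A21 are independent of one another, which is one source of the parallelism that the algorithm-by-blocks exposes.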


Blocked Algorithms

Cholesky factorization $A = L L^T$

APIs + Tools

FLA_Part_2x2(…);

while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
  FLA_Repart_2x2_to_3x3(…);
  /*--------------------------------------*/
  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
            FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A11, A21 );
  FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2(…);
}


Simple Parallelization: Blocked Algorithms

Link with multi-threaded BLAS

FLA_Part_2x2(…);

while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
  FLA_Repart_2x2_to_3x3(…);
  /*--------------------------------------*/
  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
            FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A11, A21 );
  FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2(…);
}

$A_{11} = L_{11} L_{11}^T$

$A_{21} := L_{21} = A_{21} L_{11}^{-T}$

$A_{22} := A_{22} - L_{21} L_{21}^T$


Blocked Algorithms

There is more parallelism!

1st iteration

Inside the same iteration

2nd iteration

In different iterations


Coding Algorithm-by-Blocks

Algorithm-by-blocks :

FLA_Part_2x2(…);

while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
  FLA_Repart_2x2_to_3x3(…);
  /*--------------------------------------*/
  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2(…);
}

$A_{11} = L_{11} L_{11}^T$

$A_{21} := L_{21} = A_{21} L_{11}^{-T}$

$A_{22} := A_{22} - L_{21} L_{21}^T$


Generating a DAG

[Figure: DAG with ten task nodes, numbered 1 to 10]

FLA_Part_2x2(…);

while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
  FLA_Repart_2x2_to_3x3(…);
  /*--------------------------------------*/
  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2(…);
}
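One way to picture DAG generation is to run the loop nest symbolically and record tasks instead of executing them. The sketch below is a plain-C illustration, not the SuperMatrix implementation: the types `task` and `dag` and the functions `add_task` and `build_chol_dag` are hypothetical. It tracks only flow (read-after-write) dependencies, which suffices here because every block a task reads is already final; a general runtime would also have to track anti- and output dependencies.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_BLOCKS = 8, MAX_TASKS = 256 };

typedef enum { TASK_CHOL, TASK_TRSM, TASK_SYRK, TASK_GEMM } task_kind;

typedef struct {
    task_kind kind;
    int out_i, out_j;   /* block updated in place by the task */
    int n_deps;         /* number of incoming dependency edges */
} task;

typedef struct {
    task tasks[MAX_TASKS];
    int n_tasks;
    int last_writer[MAX_BLOCKS][MAX_BLOCKS];   /* task index, or -1 */
} dag;

/* Append a task that reads n_in blocks and updates block (oi, oj).
 * One edge is counted per operand (inputs and the output block)
 * whose last writer is an earlier task. */
static void add_task(dag *g, task_kind kind, int oi, int oj,
                     int n_in, const int in[][2])
{
    assert(g->n_tasks < MAX_TASKS);
    task *t = &g->tasks[g->n_tasks];
    t->kind = kind; t->out_i = oi; t->out_j = oj; t->n_deps = 0;
    for (int p = 0; p < n_in; ++p)
        if (g->last_writer[in[p][0]][in[p][1]] >= 0)
            t->n_deps++;
    if (g->last_writer[oi][oj] >= 0)
        t->n_deps++;
    g->last_writer[oi][oj] = g->n_tasks++;
}

/* Enumerate the tasks of Cholesky by blocks on an nb-by-nb grid. */
void build_chol_dag(dag *g, int nb)
{
    assert(nb > 0 && nb <= MAX_BLOCKS);
    g->n_tasks = 0;
    for (int i = 0; i < MAX_BLOCKS; ++i)
        for (int j = 0; j < MAX_BLOCKS; ++j)
            g->last_writer[i][j] = -1;

    for (int k = 0; k < nb; ++k) {
        add_task(g, TASK_CHOL, k, k, 0, NULL);          /* A11 */
        for (int i = k + 1; i < nb; ++i) {              /* blocks of A21 */
            const int in[1][2] = {{k, k}};
            add_task(g, TASK_TRSM, i, k, 1, in);
        }
        for (int i = k + 1; i < nb; ++i)                /* blocks of A22 */
            for (int j = k + 1; j <= i; ++j) {
                const int in[2][2] = {{i, k}, {j, k}};
                if (i == j)
                    add_task(g, TASK_SYRK, i, i, 1, in);
                else
                    add_task(g, TASK_GEMM, i, j, 2, in);
            }
    }
}
```

For a 3x3 grid of blocks this enumeration yields ten tasks: 3 Chol, 3 Trsm, 3 Syrk, and 1 Gemm. A task with `n_deps == 0` is ready to run; a scheduler would decrement the counters of its successors as tasks complete.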


Managing Tasks and Blocks

Separation of concerns:
• Sequential libflame routine generates the DAG
• Runtime system (SuperMatrix) manages and schedules the DAG
• As one moves from one architecture to another, only the runtime system needs to be updated
  • Multicore
  • Out-of-core
  • Single GPU
  • MultiGPU
  • Distributed Runtime
  • …


Runtime system - SuperMatrix

[Figure labels: DAG (task nodes 1 to 10), SuperMatrix + heuristic, Multicore]


Runtime system for GPU - SuperMatrix

[Figure labels: DAG (task nodes 1 to 10), SuperMatrix, Manage data transfer, CPU + GPU accelerator]


Runtime system for MultiGPU - SuperMatrix

[Figure labels: DAG (task nodes 1 to 10), SuperMatrix, Manage data transfer, CPU + MultiGPU multi-accelerator]


MultiGPU

How do we program these?

[Figure: CPU(s), PCI-e bus, interconnect, and GPUs #0 to #3]


MultiGPU: a User’s View

FLA_Obj A;

// Initialize conventional matrix: buffer, m, rs, cs
// Obtain storage blocksize, # of threads: b, n_threads

FLA_Init();

FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A );
FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A );
FLASH_Queue_set_num_threads( n_threads );
FLASH_Queue_enable_gpu();
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );

FLASH_Obj_free( &A );

FLA_Finalize();


MultiGPU: Under the Cover

Naïve approach:
• Before execution, transfer data to device
• Call CUBLAS operations (implementation “someone else’s problem”)
• Upon completion, retrieve results back to host

Poor data locality

[Figure: CPU(s), PCI-e bus, interconnect, and GPUs #0 to #3]


MultiGPU: Under the Cover

How do we program these?

View as a… shared-memory multiprocessor + Distributed Shared Memory (DSM) architecture

[Figure: CPU(s), PCI-e bus, interconnect, and GPUs #0 to #3]


MultiGPU: Under the Cover

View the system as a shared-memory multiprocessor (multi-core processor with hardware coherence)

[Figure: multiprocessor (MP) with P0+Cache0, P1+Cache1, P2+Cache2, and P3+Cache3, shown next to the CPU(s), PCI-e bus, interconnect, and GPUs #0 to #3]


MultiGPU: Under the Cover

Software Distributed-Shared Memory (DSM)
• Software: flexibility vs. efficiency
• Underlying distributed memory hidden
• Reduce memory transfers using write-back, write-invalidate, …
• Well-known approach, not too efficient as a middleware for general apps.

Regularity of dense linear algebra makes a difference!


MultiGPU: Under the Cover

Reduce #data transfers:
• Run-time handles device memory as a software cache:
  • Operate at block level
  • Software flexibility
  • Write-back
  • Write-invalidate

[Figure: SuperMatrix mapping the DAG (task nodes 1 to 10) onto the CPU(s), PCI-e bus, interconnect, and GPUs #0 to #3]
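A block-level software cache with write-back and write-invalidate can be sketched with a few bits of state per block. The code below is an illustrative model only, not SuperMatrix's data structures: `block_state`, `xfer_count`, and the `cache_*` functions are hypothetical names, and all transfers are assumed to go through the host.

```c
#include <assert.h>
#include <stddef.h>

enum { HOST = -1, MAX_DEV = 4 };

typedef struct {
    unsigned valid;   /* bit d set: device d holds a valid copy */
    int owner;        /* device whose copy is newer than the host's, or HOST */
} block_state;

typedef struct { int host_to_dev, dev_to_host; } xfer_count;

/* Make the block readable on device dev, counting any transfers needed. */
void cache_read(block_state *b, int dev, xfer_count *x)
{
    assert(b != NULL && x != NULL && dev >= 0 && dev < MAX_DEV);
    if (b->valid & (1u << dev))
        return;                    /* hit: no transfer */
    if (b->owner != HOST) {        /* write-back: refresh the stale host copy */
        x->dev_to_host++;
        b->owner = HOST;
    }
    x->host_to_dev++;
    b->valid |= 1u << dev;
}

/* Device dev is about to update the block in place. */
void cache_write(block_state *b, int dev, xfer_count *x)
{
    cache_read(b, dev, x);         /* in-place update needs the current data */
    b->valid = 1u << dev;          /* write-invalidate: drop all other copies */
    b->owner = dev;                /* write-back: host copy is stale for now */
}

/* At the end of the computation, bring the host copy up to date. */
void cache_flush(block_state *b, xfer_count *x)
{
    assert(b != NULL && x != NULL);
    if (b->owner != HOST) {
        x->dev_to_host++;
        b->owner = HOST;
    }
}
```

Repeated updates of a block on the same device cost a single host-to-device transfer in this model; data moves back to the host only when another device needs the block or at the final flush.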


MultiGPU: Under the Cover

SuperMatrix

FLA_Part_2x2(…);

while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
  FLA_Repart_2x2_to_3x3(…);
  /*--------------------------------------*/
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2(…);
}

• Factor A11 on host


• Transfer A11 from host to appropriate devices before using it in subsequent computations (write-update)


• Cache A11 in receiving device(s) in case needed in subsequent computations


• Send blocks to devices
• Perform Trsm on blocks of A21 (hopefully using cached A11)
• Keep updated A21 in device till needed by other GPU(s) (write-back)


• Send blocks to devices
• Perform Syrk/Gemm on blocks of A22 (hopefully using cached blocks of A21)
• Keep updated A22 in device till needed by other GPU(s) (write-back)


$C = C + A B^T$ on S1070 (Tesla x 4)


Cholesky on S1070 (Tesla x 4)



Sampling of LAPACK functionality on S2050 (Fermi x 4)



libflame for Cluster (+ Accelerators)

PLAPACK
• Distributed memory (MPI)
• Inspired FLAME
• Recently modified so that each node can have GPU
• Keep data in GPU memory as much as possible

Elemental (Jack Poulson)
• Distributed memory (MPI)
• Inspired by FLAME/PLAPACK
• Can use GPU at each node/core

libflame + SuperMatrix
• Runtime schedules tasks and data transfer
• Appropriate for small clusters


PLAPACK + GPU accelerators

Fogue, Igual, E. Quintana, van de Geijn. "Retargeting PLAPACK to Clusters with Hardware Accelerators." (WEHA 2010).

Each node: Xeon Nehalem (8 cores) + 2 NVIDIA C1060 (Tesla)


Targeting Clusters with GPUs
SuperMatrix Distributed Runtime

Igual, G. Quintana, van de Geijn. "Scheduling Algorithms-by-Blocks on Small Clusters." Concurrency and Computation: Practice and Experience. In review.

Each node: Xeon Nehalem (8 cores) + 1 NVIDIA C2050 (Fermi)


Elemental Cholesky Factorization


Elemental vs. ScaLAPACK

Cholesky on 8192 cores - BlueGene/P

Elemental has full ScaLAPACK functionality (except the nonsymmetric eigenvalue problem).

Poulson, Marker, Hammond, Romero, van de Geijn. "Elemental: A New Framework for Distributed Memory Dense Matrix Computations." ACM TOMS. Submitted.


Single-Chip Cloud Computer

Intel SCC research processor
• 48-core concept vehicle
• Created for many-core software research
• Custom communication library (RCCE)


SCC Results

• 48 Pentium cores
• MPI replaced by RCCE

Igual, G. Quintana, van de Geijn. “Scheduling Algorithms-by-Blocks on Small Clusters.” Concurrency and Computation: Practice and Experience. In review.

Marker, Chan, Poulson, van de Geijn, Van der Wijngaart, Mattson, Kubaska. "Programming Many-Core Architectures - A Case Study: Dense Matrix Computations on the Intel SCC Processor." Concurrency and Computation: Practice and Experience. To Appear.


Related work

Data-flow parallelism, dynamic scheduling, runtime

• Cilk
• OpenMP (task queues)
• StarSs (SMPSs)
• StarPU
• Threading Building Blocks (TBB)
• …

What we have is very dense linear algebra specific


Dense Linear Algebra Libraries

Target Platform                       LAPACK Project       FLAME Project
Sequential                            LAPACK               libflame
Sequential + multithreaded BLAS       LAPACK               libflame
Multicore/multithreaded               PLASMA               libflame+SuperMatrix
Multicore + out-of-order scheduling   PLASMA+Quark         libflame+SuperMatrix
CPU + single GPU                      MAGMA                libflame+SuperMatrix
Multicore + multiGPU                  DAGuE?               libflame+SuperMatrix
Distributed memory                    ScaLAPACK            libflame+SuperMatrix, PLAPACK, Elemental
Distributed memory + GPU              DAGuE?, ScaLAPACK?   libflame+SuperMatrix, PLAPACK, Elemental
Out-of-Core                           ?                    libflame+SuperMatrix


Comparison with Quark

Agullo, Bouwmeester, Dongarra, Kurzak, Langou, Rosenberg. "Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures." VecPar, 2010.


Conclusions

• Programmability is the key to harnessing parallel computation
  • One code, many target platforms
• Formal derivation provides confidence in code
  • If there is a problem, it is not in the library!
• Separation of concerns
  • Library developer derives algorithms and codes them
  • Execution of routines generates DAG
  • Parallelism, temporal locality, and spatial locality are captured in DAG
  • Runtime system uses appropriate heuristics to schedule


What does this mean for you?

One successful approach:

• Identify units of data and units of computation
• Write a sequential program that generates a DAG
• Hand DAG to runtime for scheduling


The Future

Currently: Library is an instantiation in code

Future:
• Create repository of algorithms, expert knowledge about algorithms, and knowledge about a target architecture
• Mechanically generate a library for a target architecture, exactly as an expert would
• Design-by-Transformation (DxT)

Bryan Marker, Andy Terrel, Jack Poulson, Don Batory, and Robert van de Geijn. "Mechanizing the Expert Dense Linear Algebra Developer." FLAME Working Note #58. 2011


Availability

Everything that has been discussed is available under LGPL license or BSD license

libflame + SuperMatrix http://www.cs.utexas.edu/users/flame/

Elemental http://code.google.com/p/elemental/
