[Soccer] is a very simple game. It's just very hard to play it simple. - Johan Cruyff

[Dense Linear Algebra] is a very simple subject. It's just very hard to make it simple. - RvdG
Designing a Library to be Multi-Accelerator Ready: A Case Study

Robert van de Geijn
Department of Computer Science
Institute for Computational Engineering and Sciences
The University of Texas at Austin
Outline

• Motivation
• What is FLAME?
• Deriving algorithms to be correct
• Representing algorithms in code
• Of blocked algorithms and algorithms-by-blocks
• Runtime support for multicore, GPU, and multiGPU
• Extensions to distributed memory platforms
• Related work
• Conclusion
Moments of Inspiration

Birth of multi-threaded libflame:
• Aug. 2006: an insight: libflame + algorithm-by-blocks + out-of-order scheduling (runtime) = multithreaded library
• Sept. 2006: working prototype (by G. Quintana)
• Oct. 2006: grant proposal (to NSF, later funded)
• Jan. 2007: paper submitted (to SPAA07, accepted)
• April 2007: released with libflame R1.0

Birth of multi-GPU libflame:
• Fall 2007: the runtime used to manage data and tasks on a single GPU (UJI, Spain)
• March 2008: NVIDIA donates a 4-GPU Tesla S870 system
• Two hours after unboxing the boards, multiple heuristics for the multiGPU runtime were implemented
• Then the power cord fried…
[Photos: "After two hours" / "Shortly after"]
Birth of MultiGPU libflame
G. Quintana, Igual, E. Quintana, van de Geijn. "Solving Dense Linear Algebra Problems on Platforms with Multiple Hardware Accelerators." PPoPP’09.
What Supports our Productivity/Performance?

• Deep understanding of the domain
• Foundational computer science: derivation of algorithms, software implementation of hardware techniques, blocking for performance
• Abstraction
• Separation of concerns
What is FLAME?

• A notation for expressing linear algebra algorithms
• A methodology for deriving such algorithms
• A set of abstractions for representing such algorithms in LaTeX, M-script, C, etc.
• A modern library (libflame): an alternative to BLAS, LAPACK, ScaLAPACK, and related efforts, with many new contributions to the theory and practice of dense linear algebra; also banded and Krylov subspace methods
• A set of tools supporting the above: mechanical derivation, automatic generation of code, Design-by-Transformation (DxT)
Who is FLAME?
Deriving Algorithms to be Correct

Include all algorithms for a given operation, so that the right algorithm can be picked for a given architecture.

Problem: how to find the right algorithm?

Solution: formal derivation (Hoare, Dijkstra, …): given an operation, systematically derive the family of algorithms for computing it.
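To make this concrete: for Cholesky, A = L L^T, the derivation is driven by loop invariants over the partitioned matrix. The following is our illustration of one such invariant (notation as in the repartitioning slides below; hats denote the original contents of A), not a formula taken from the deck:

\[
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & \star \\ \hat{A}_{BL} L_{TL}^{-T} & \hat{A}_{BR} - L_{BL} L_{BL}^{T} \end{pmatrix},
\qquad \hat{A}_{TL} = L_{TL} L_{TL}^{T},\; L_{BL} = \hat{A}_{BL} L_{TL}^{-T}.
\]

Different invariants (e.g., one that leaves A_BR untouched) yield different members of the family of algorithms for the same operation.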
Notation
The notation used to express an algorithm should reflect how the algorithm is naturally explained.
Example: The Cholesky Factorization

Lower triangular case: A = L L^T, with L lower triangular.

Key in the solution of s.p.d. linear systems:
A x = b  becomes  (L L^T) x = b;  solve L y = b for y, then L^T x = y for x.
Algorithm Loop: Repartition

  [ ATL   *  ]       [ A00    *    *  ]
  [ ABL  ABR ]  -->  [ a10^T a11   *  ]
                     [ A20   a21  A22 ]

(Indexing operations only; no computation.)
Algorithm Loop: Update

a11 := sqrt(a11)
a21 := a21 / a11
A22 := A22 - a21 a21^T

(Real computation, as opposed to indexing.)
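To make the update concrete, here is a plain-C unblocked Cholesky that performs exactly these per-iteration steps on a column-major lower triangle. This is our sketch for illustration, not libflame code:

#include <math.h>

/* Unblocked Cholesky (lower): overwrite the lower triangle of the
   n x n column-major matrix A (leading dimension lda) with L such
   that A = L L^T. Returns 0 on success, j+1 if the j-th leading
   minor is not positive definite. */
int chol_unb_lower( int n, double *A, int lda )
{
  for ( int j = 0; j < n; j++ ) {
    double ajj = A[ j + j*lda ];
    if ( ajj <= 0.0 ) return j + 1;
    ajj = sqrt( ajj );                    /* a11 := sqrt(a11)       */
    A[ j + j*lda ] = ajj;
    for ( int i = j+1; i < n; i++ )       /* a21 := a21 / a11       */
      A[ i + j*lda ] /= ajj;
    for ( int k = j+1; k < n; k++ )       /* A22 := A22 - a21 a21^T */
      for ( int i = k; i < n; i++ )       /* (lower triangle only)  */
        A[ i + k*lda ] -= A[ i + j*lda ] * A[ k + j*lda ];
  }
  return 0;
}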
Algorithm Loop: Merging

  [ A00    *    *  ]       [ ATL   *  ]
  [ a10^T a11   *  ]  -->  [ ABL  ABR ]
  [ A20   a21  A22 ]

after the update a11 := sqrt(a11), a21 := a21 / a11, A22 := A22 - a21 a21^T.
(Indexing operation only.)
Worksheet for Cholesky Factorization
Mechanical Derivation of Algorithms

A mechanical procedure develops the algorithms from the mathematical specification (e.g., A = L L^T).

Paolo Bientinesi. "Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms." Ph.D. Dissertation, UT-Austin, August 2006.
Is Formal Derivation Practical?

• libflame: 128+ matrix operations, 1389+ implementations of algorithms
• Test suite created in 2011: 126,756 tests executed
• Only 3 minor bugs found in the library… (now fixed)
Impact on (Single) GPU Computing

[Performance chart; baseline: CUBLAS 2009, which has been optimized since]
Impact on (Single) GPU Computing

Fran Igual. "Matrix Computations on Graphics Processors and Clusters of GPUs." Ph.D. Dissertation, Univ. Jaume I, May 2011.
Igual, G. Quintana, and van de Geijn. "Level-3 BLAS on a GPU: Picking the Low Hanging Fruit." FLAME Working Note #37, Universidad Jaume I, updated May 21, 2009.

[Performance chart vs. CUBLAS 2009]
A Sampling of Functionality

Operation                     | Classic FLAME | SuperMatrix (multithreaded/multiGPU) | lapack2flame
Level-3 BLAS                  | y             | y                                    | N.A.
Cholesky                      | y             | y                                    | y
LU with partial pivoting      | y             | y                                    | y
LU with incremental pivoting  | y             | y                                    | N.A.
QR (UT)                       | y             | y                                    | y
LQ (UT)                       | y             | y                                    | y
SPD/HPD inversion             | y             | y                                    | y
Triangular inversion          | y             | y                                    | y
Triangular Sylvester          | y             | y                                    | y
Triangular Lyapunov           | y             | y                                    | y
Up-and-downdate (UT)          | y             |                                      | N.A.
SVD                           | next week     | soon                                 |
EVD                           | next week     | soon                                 |
Representing Algorithms in Code
Code should closely resemble how an algorithm is presented so that no bugs can be introduced when translating an algorithm to code.
Spark + APIs (C, F77, Matlab, LabView, LaTeX):
http://www.cs.utexas.edu/users/flame/Spark/
FLAME/C API

FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
  b = min( FLA_Obj_length( ABR ), nb_alg );

  FLA_Repart_2x2_to_3x3(
      ATL, /**/ ATR,       &A00,  /**/ &a01,     &A02,
    /* ************* */  /* ************************** */
                           &a10t, /**/ &alpha11, &a12t,
      ABL, /**/ ABR,       &A20,  /**/ &a21,     &A22,
      1, 1, FLA_BR );
  /*--------------------------------------*/
  FLA_Sqrt( alpha11 );
  FLA_Inv_scal( alpha11, a21 );
  FLA_Syr( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
           FLA_MINUS_ONE, a21, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2(
      &ATL, /**/ &ATR,       A00,  a01,     /**/ A02,
                             a10t, alpha11, /**/ a12t,
    /* ************** */   /* ************************/
      &ABL, /**/ &ABR,       A20,  a21,     /**/ A22,
      FLA_TL );
}
For now, libflame employs external BLAS: GotoBLAS, MKL, ACML, CUBLAS
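For completeness, a minimal sketch of a driver around the loop above, assuming the libflame conventions shown in this deck (FLA_Init/FLA_Finalize, object creation, FLA_Chol). Exact argument lists may differ between libflame versions, so treat this as illustrative:

#include "FLAME.h"

int main( void )
{
  FLA_Obj A;
  int     n = 1000;

  FLA_Init();

  /* Create an n x n double-precision matrix object
     (0, 0 requests default row/column strides). */
  FLA_Obj_create( FLA_DOUBLE, n, n, 0, 0, &A );

  /* ... fill A with a symmetric positive definite matrix ... */

  /* A := L, its lower-triangular Cholesky factor. */
  FLA_Chol( FLA_LOWER_TRIANGULAR, A );

  FLA_Obj_free( &A );
  FLA_Finalize();
  return 0;
}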
Multicore/MultiGPU: Issues

Manage computation:
• Assignment of tasks to cores and/or GPUs
• Granularity is important

Manage memory:
• Manage data transfer between the "host" and the caches of the cores, or between the host and the GPUs' local memories
• Granularity is important
• Keep data in local memory as long as possible
Where have we seen this before?

Computer architecture, late 1960s: superscalar units proposed.
• Unit of data: floating point number
• Unit of computation: floating point operation
• Examine dependencies
• Execute out-of-order, prefetch, cache data, etc., to keep the computational units busy
• Extract parallelism from a sequential instruction stream

R. M. Tomasulo. "An Efficient Algorithm for Exploiting Multiple Arithmetic Units." IBM J. of R&D, 1967. The basis for the exploitation of ILP on current superscalar processors!
Of Blocks and Tasks

Dense matrix computation:
• Unit of data: a block of the matrix
• Unit of computation (task): an operation with blocks
• Dependency: input/output of an operation with blocks
• Instruction stream: the sequential libflame code, which generates a DAG
• The runtime system schedules the tasks

Goal: minimize data transfer and maximize utilization. (A sketch of blocks and tasks as data structures follows.)
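As a sketch of what "blocks as units of data, tasks as units of computation" might look like as data structures (our illustration; field and type names are hypothetical, not SuperMatrix internals):

enum TaskKind { TASK_CHOL, TASK_TRSM, TASK_SYRK, TASK_GEMM };

/* Unit of data: a b x b tile of the matrix. */
typedef struct Block {
  double *data;         /* contiguous b x b tile                  */
  int     last_writer;  /* index of the last task that writes it, */
} Block;                /* or -1; gives flow dependencies         */

/* Unit of computation: an operation with blocks. */
typedef struct Task {
  enum TaskKind kind;
  Block        *in[2];  /* input blocks                           */
  Block        *inout;  /* block that is read and overwritten     */
  int           ndeps;  /* unsatisfied input dependencies         */
} Task;

/* As the sequential code runs, each call appends a Task and wires
   an edge from the last writer of each operand: the resulting
   graph is the DAG the runtime schedules. */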
Review: Blocked Algorithms

Cholesky factorization proceeds iteration by iteration (1st, 2nd, 3rd, …), performing per iteration:

A11 := L11, where A11 = L11 L11^T
A21 := L21 = A21 L11^-T
A22 := A22 - L21 L21^T

(A plain-C sketch of this blocked algorithm follows.)
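Rendered in plain C (our sketch, reusing chol_unb_lower from earlier; a real library would call BLAS trsm/syrk here), the blocked algorithm is:

/* Blocked Cholesky (lower), column-major, blocksize nb. */
void chol_blk_lower( int n, int nb, double *A, int lda )
{
  for ( int k = 0; k < n; k += nb ) {
    int     b   = ( n - k < nb ) ? n - k : nb;
    int     m   = n - k - b;                /* rows below the panel */
    double *A11 = &A[  k    +  k   *lda ];
    double *A21 = &A[ (k+b) +  k   *lda ];
    double *A22 = &A[ (k+b) + (k+b)*lda ];

    chol_unb_lower( b, A11, lda );          /* A11 := L11           */

    for ( int j = 0; j < b; j++ )           /* A21 := A21 L11^{-T}  */
      for ( int i = 0; i < m; i++ ) {
        double s = A21[ i + j*lda ];
        for ( int p = 0; p < j; p++ )
          s -= A21[ i + p*lda ] * A11[ j + p*lda ];
        A21[ i + j*lda ] = s / A11[ j + j*lda ];
      }

    for ( int c = 0; c < m; c++ )           /* A22 -= A21 A21^T     */
      for ( int r = c; r < m; r++ ) {       /* (lower triangle)     */
        double s = 0.0;
        for ( int p = 0; p < b; p++ )
          s += A21[ r + p*lda ] * A21[ c + p*lda ];
        A22[ r + c*lda ] -= s;
      }
  }
}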
Blocked Algorithms

Cholesky factorization, A = L L^T, via the APIs + tools:

FLA_Part_2x2(…);
while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
  FLA_Repart_2x2_to_3x3(…);
  /*--------------------------------------*/
  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
            FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A11, A21 );
  FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2(…);
}
Simple Parallelization: Blocked Algorithms

Link the blocked code above with a multi-threaded BLAS; each FLA_Chol, FLA_Trsm, and FLA_Syrk call is then parallelized internally:

A11 := L11, where A11 = L11 L11^T
A21 := L21 = A21 L11^-T
A22 := A22 - L21 L21^T
Blocked Algorithms

There is more parallelism! Independent tasks exist both inside the same iteration (1st iteration) and in different iterations (1st and 2nd iterations).
Coding Algorithm-by-Blocks

Algorithm-by-blocks:

FLA_Part_2x2(…);
while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
  FLA_Repart_2x2_to_3x3(…);
  /*--------------------------------------*/
  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2(…);
}

A11 := L11, where A11 = L11 L11^T
A21 := L21 = A21 L11^-T
A22 := A22 - L21 L21^T
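The FLASH_ routines above operate on hierarchically stored matrices: a matrix of blocks, each block contiguous so that it can serve as the unit of both computation and data transfer. A sketch of the idea (ours; not the actual FLASH object layout):

/* A hierarchical matrix: an mb x nb array of pointers to b x b
   tiles. Each tile is stored contiguously, so moving a block to
   a GPU is a single transfer and a task touches whole tiles. */
typedef struct HierMatrix {
  int      mb, nb;    /* number of block rows / columns    */
  int      b;         /* storage blocksize                 */
  double **blocks;    /* blocks[ i + j*mb ] -> tile (i,j)  */
} HierMatrix;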
Generating a DAG

[Figure: the DAG of tasks 1-10 produced for a small Cholesky factorization]

Executing the sequential algorithm-by-blocks loop above performs no computation directly: each FLASH_* call enqueues a task on blocks, and the input/output blocks of those tasks define the edges of a DAG. (A sketch that enumerates these tasks follows.)
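The tasks in that DAG can be enumerated directly from the loop; the following plain-C sketch (ours) prints them for a matrix of p x p blocks. For p = 3 it emits exactly ten tasks, matching the figure:

#include <stdio.h>

/* Enumerate the tasks of algorithm-by-blocks Cholesky on a p x p
   matrix of blocks; each task depends on the tasks that last wrote
   its operands (flow dependencies). */
void print_chol_tasks( int p )
{
  for ( int k = 0; k < p; k++ ) {
    printf( "CHOL A(%d,%d)\n", k, k );
    for ( int i = k+1; i < p; i++ )
      printf( "TRSM A(%d,%d) using A(%d,%d)\n", i, k, k, k );
    for ( int j = k+1; j < p; j++ )
      for ( int i = j; i < p; i++ )
        printf( "%s A(%d,%d) -= A(%d,%d) A(%d,%d)^T\n",
                ( i == j ) ? "SYRK" : "GEMM", i, j, i, k, j, k );
  }
}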
Managing Tasks and Blocks

Separation of concerns:
• The sequential libflame routine generates the DAG.
• The runtime system (SuperMatrix) manages and schedules the DAG.
• As one moves from one architecture to another, only the runtime system needs to be updated: multicore, out-of-core, single GPU, multiGPU, distributed runtime, …

(A skeleton of such a dependency-driven runtime follows.)
Runtime System: SuperMatrix

[Figure: SuperMatrix schedules the DAG of tasks 1-10 onto a multicore, guided by a heuristic]
Runtime System for GPU: SuperMatrix

[Figure: SuperMatrix schedules the DAG onto CPU + GPU and manages data transfer to the accelerator]
Runtime System for MultiGPU: SuperMatrix

[Figure: SuperMatrix schedules the DAG onto CPU + multiple GPUs and manages data transfer to the accelerators]
MultiGPU

How do we program these?

[Diagram: CPU(s) attached via the PCI-e bus to GPU #0, #1, #2, and #3, with an interconnect]
MultiGPU: a User's View

FLA_Obj A;
// Initialize conventional matrix: buffer, m, rs, cs
// Obtain storage blocksize, # of threads: b, n_threads

FLA_Init();
FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A );
FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A );
FLASH_Queue_set_num_threads( n_threads );
FLASH_Queue_enable_gpu();
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
FLASH_Obj_free( &A );
FLA_Finalize();
MultiGPU: Under the Cover

Naïve approach:
• Before execution, transfer the data to the device.
• Call CUBLAS operations (their implementation is "someone else's problem").
• Upon completion, retrieve the results back to the host.

Result: poor data locality. (A sketch of this pattern follows.)
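In CUDA terms the naive pattern looks roughly like this; cudaMemcpy is the real CUDA runtime call, while run_task_on_gpu stands in for whatever CUBLAS routine the task maps to (our sketch):

#include <cuda_runtime.h>

extern void run_task_on_gpu( double *d_blk );  /* hypothetical */

/* Naive per-task pattern: every task pays two PCIe transfers,
   even when the very next task reuses the same block. */
void naive_task( double *h_blk, double *d_blk, size_t bytes )
{
  cudaMemcpy( d_blk, h_blk, bytes, cudaMemcpyHostToDevice ); /* in   */
  run_task_on_gpu( d_blk );                                  /* work */
  cudaMemcpy( h_blk, d_blk, bytes, cudaMemcpyDeviceToHost ); /* out  */
}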
MultiGPU: Under the Cover

How do we program these? View the system as a shared-memory multiprocessor plus a Distributed Shared Memory (DSM) architecture.
MultiGPU: Under the Cover

View the system as a shared-memory multiprocessor (a multi-core processor with hardware coherence):

[Diagram: MP = P0+Cache0, P1+Cache1, P2+Cache2, P3+Cache3, mapped onto the CPU(s) and the four GPUs]
MultiGPU: Under the Cover

Software Distributed-Shared Memory (DSM):
• Software: flexibility vs. efficiency
• The underlying distributed memory is hidden
• Reduce memory transfers using write-back, write-invalidate, …
• A well-known approach, though not too efficient as middleware for general applications

The regularity of dense linear algebra makes a difference!
MultiGPU: Under the Cover

Reduce the number of data transfers: the run-time handles device memory as a software cache:
• Operate at the block level
• Software flexibility
• Write-back
• Write-invalidate

(A sketch of such a per-device block cache follows.)
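A sketch of the per-device block cache (ours; names hypothetical, not the SuperMatrix implementation):

/* MESI-like state of one block's copy on one device. */
enum BlockState { B_INVALID, B_SHARED, B_DIRTY };

typedef struct CacheEntry {
  void            *dev_ptr;  /* copy in this GPU's memory */
  enum BlockState  state;
} CacheEntry;

/* Policy sketch:
   - Before a task READS a block on device d: if the entry is
     B_INVALID, transfer the block in and mark it B_SHARED.
   - After a task WRITES a block on device d: mark it B_DIRTY
     there and B_INVALID on every other device (write-invalidate).
   - The host copy is refreshed only when some other device or
     the host actually needs the block (write-back). */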
MultiGPU: Under the Cover

SuperMatrix executes the algorithm-by-blocks loop:

FLA_Part_2x2(…);
while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
  FLA_Repart_2x2_to_3x3(…);
  /*--------------------------------------*/
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2(…);
}

Step by step:
• Factor A11 on the host.
• Transfer A11 from the host to the appropriate devices before using it in subsequent computations (write-update).
• Cache A11 in the receiving device(s) in case it is needed in subsequent computations.
• Send blocks to the devices and perform Trsm on the blocks of A21 (hopefully using the cached A11); keep the updated A21 in the device until needed by other GPU(s) (write-back).
• Send blocks to the devices and perform Syrk/Gemm on the blocks of A22 (hopefully using the cached blocks of A21); keep the updated A22 in the device until needed by other GPU(s) (write-back).
Performance results:

[Charts: C = C + A B^T on the S1070 (4 x Tesla); Cholesky on the S1070 (4 x Tesla); a sampling of LAPACK functionality on the S2050 (4 x Fermi)]
libflame for Cluster (+ Accelerators)
PLAPACK:
• Distributed memory (MPI); inspired FLAME
• Recently modified so that each node can have a GPU
• Keeps data in GPU memory as much as possible

Elemental (Jack Poulson):
• Distributed memory (MPI); inspired by FLAME/PLAPACK
• Can use a GPU at each node/core

libflame + SuperMatrix:
• The runtime schedules tasks and data transfers
• Appropriate for small clusters
PLAPACK + GPU Accelerators

Fogue, Igual, E. Quintana, van de Geijn. "Retargeting PLAPACK to Clusters with Hardware Accelerators." WEHA 2010.

Each node: Xeon Nehalem (8 cores) + 2 NVIDIA C1060 (Tesla).
Targeting Clusters with GPUs: SuperMatrix Distributed Runtime

Igual, G. Quintana, van de Geijn. "Scheduling Algorithms-by-Blocks on Small Clusters." Concurrency and Computation: Practice and Experience. In review.

Each node: Xeon Nehalem (8 cores) + 1 NVIDIA C2050 (Fermi).
Elemental Cholesky Factorization
Elemental vs. ScaLAPACK

Cholesky on 8192 cores of a BlueGene/P.

Elemental has full ScaLAPACK functionality (except the nonsymmetric eigenvalue problem).
Poulson, Marker, Hammond, Romero, van de Geijn. "Elemental: A New Framework for Distributed Memory Dense Matrix Computations." ACM TOMS. Submitted.
Single-Chip Cloud Computer

• Intel SCC research processor: a 48-core concept vehicle
• Created for many-core software research
• Custom communication library (RCCE)
SCC Results

• 48 Pentium cores
• MPI replaced by RCCE
Igual, G. Quintana, van de Geijn. “Scheduling Algorithms-by-Blocks on Small Clusters.” Concurrency and Computation: Practice and Experience. In review.
Marker, Chan, Poulson, van de Geijn, Van der Wijngaart, Mattson, Kubaska. "Programming Many-Core Architectures - A Case Study: Dense Matrix Computations on the Intel SCC Processor." Concurrency and Computation: Practice and Experience. To Appear.
Related Work

Data-flow parallelism, dynamic scheduling, runtimes:
• Cilk
• OpenMP (task queues)
• StarSs (SMPSs)
• StarPU
• Threading Building Blocks (TBB)
• …

What we have is very specific to dense linear algebra.
Dense Linear Algebra Libraries

Target Platform                     | LAPACK Project     | FLAME Project
Sequential                          | LAPACK             | libflame
Sequential + multithreaded BLAS     | LAPACK             | libflame
Multicore/multithreaded             | PLASMA             | libflame + SuperMatrix
Multicore + out-of-order scheduling | PLASMA + Quark     | libflame + SuperMatrix
CPU + single GPU                    | MAGMA              | libflame + SuperMatrix
Multicore + multiGPU                | DAGuE?             | libflame + SuperMatrix
Distributed memory                  | ScaLAPACK          | libflame + SuperMatrix, PLAPACK, Elemental
Distributed memory + GPU            | DAGuE?, ScaLAPACK? | libflame + SuperMatrix, PLAPACK, Elemental
Out-of-Core                         | ?                  | libflame + SuperMatrix
Comparison with Quark

Agullo, Bouwmeester, Dongarra, Kurzak, Langou, Rosenberg. "Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures." VecPar, 2010.
Conclusions

• Programmability is the key to harnessing parallel computation: one code, many target platforms.
• Formal derivation provides confidence in the code: if there is a problem, it is not in the library!
• Separation of concerns:
  - The library developer derives algorithms and codes them.
  - Execution of the routines generates a DAG.
  - Parallelism, temporal locality, and spatial locality are captured in the DAG.
  - The runtime system uses appropriate heuristics to schedule it.
What does this mean for you?

One successful approach:
1. Identify units of data and units of computation.
2. Write a sequential program that generates a DAG.
3. Hand the DAG to a runtime for scheduling.
The Future

Currently, the library is an instantiation in code. In the future:
• Create a repository of algorithms, expert knowledge about algorithms, and knowledge about target architectures.
• Mechanically generate a library for a target architecture, exactly as an expert would.
• Design-by-Transformation (DxT).

Bryan Marker, Andy Terrel, Jack Poulson, Don Batory, and Robert van de Geijn. "Mechanizing the Expert Dense Linear Algebra Developer." FLAME Working Note #58, 2011.
Availability

Everything that has been discussed is available under the LGPL or BSD license:
• libflame + SuperMatrix: http://www.cs.utexas.edu/users/flame/
• Elemental: http://code.google.com/p/elemental/
[Soccer] is a very simple game. It's just very hard to play it simple. - Johan Cruyff

[Dense Linear Algebra] is a very simple subject. It's just very hard to make it simple. - RvdG