Implementing Dense Linear Algebra Algorithms on Multi-Core Processors Using the Dataflow Execution Model



Jakub Kurzak, Jack Dongarra
University of Tennessee

John Shalf
Lawrence Berkeley National Laboratory

SIAM Parallel Processing for Scientific Computing
MS18: Dataflow 2.0: The Re-emergence and Evolution of Dataflow Programming Techniques for Parallel Computing
March 13, 2008

New Class of Algorithms

One-sided factorizations: LU, QR, Cholesky
Inspired by out-of-memory algorithms
Processing of input by small tiles
Block data layout
Fine granularity
Suitable for dynamic scheduling

LAPACK Working Notes 184, 190, 191

New Class of Algorithms

Classic multi-core (x86):
Faster than MKL
Faster than ACML
Way faster than LAPACK

Exotic multi-core (CELL):
Extremely fast
No alternatives

New Algorithms – LU on x86

New Algorithms – QR on x86

New Algorithms – Cholesky on x86

New Algorithms – Cholesky on the CELL

Going “Out of Sync”

CELL Cholesky – 8 cores

CELL Cholesky – 16 cores

PLASMA Framework

Don't reinvent the wheel:
SCHEDULE
HeNCE
Parametrized DAGs (Cosnard, Jeannot)
Dataflow systems (Simulink, Gedae)
Systolic arrays

PLASMA Goals (WHAT)

Scalability: massive parallelism

Performance: efficient runtime

Portability: cluster / MPP, SMP / CMP, out-of-memory / CELL

PLASMA Principles (HOW)

Data-driven (dynamic) execution
Explicit expression of parallelism
Implicit communication
Component architecture

“Language” / Runtime / Tools

Dataflow visualization
DAG visualization
Profiling
Tracing
.....

There is no such thing as auto-parallelization.

BTW,

There is no such thing as auto-SIMDization.

(see LAPACK Working Note 189)

Current Infrastructure (software stack)

ScaLAPACK
LAPACK / PBLAS
BLAS / BLACS
MPI

Target Infrastructure (software stack)

TRACE / VIS / PROF
PLASMA “Language”
PLASMA Runtime
μ-kernels (“μBLAS”) / RDMA-like (“μMPI”)

DAG / Task Graph

view computation as a DAG

use dependency- / data-driven scheduling

pursue the critical path

DAGs can be huge:
Difficult / impossible to store
Difficult / impossible to manage

I don't need to build the DAG; I need the ability to traverse the DAG.

DAG / Task Graph

Producer task knows where its output goes

Consumer task knows where its input comes from

Nobody knows the DAG

Parents know their children

Children know their parents

DAG / Task Graph

[Figure: example task graph. Four producer tasks P (i = 0..3) feed consumer tasks T at coordinates (0,0), (0,1), (0,2), (1,0), (1,1), (2,0).]

Task ID / Task Coordinates


// Task Name
P
// Execution space
i = 0 .. N
// Parallel partitioning
i % Ncores == coreID
// Parameters
input  T(i-1, 0)
output T(i, 0 .. i)

Task Definition


// Task Name
T
// Execution space
i = 0 .. N-1
j = 0 .. N-1-i
// Parallel partitioning
i % Ncores == coreID
// Parameters
input 1 P(i)
input 2 T(i-1, j+1)
output  T(i+1, j-1)

Task Definition

// Name
POTRF(k)
// Execution space
k = 0..SIZE
// Parallel partitioning
k % GRIDrows == rowRANK
k % GRIDcols == colRANK
// Parameters
INOUT T <- (k == 0) ? IN(k, k) : T SYRK(k, k-1)
        -> T TRSM(k, k+1..SIZE)
        -> OUT(k, k)
// Body
lapack_spotrf(lapack_lower, NB, T, NB, &info);

// Name
SYRK(k, n)
// Execution space
k = 0..SIZE
n = 0..k-1
// Parallel partitioning
k % GRIDrows == rowRANK
k % GRIDcols == colRANK
// Parameters
IN A <- C TRSM(n, k)
INOUT T <- (n == 0) ? IN(k, k) : T SYRK(k, n-1)
        -> (n == k-1) ? T POTRF(k) : T SYRK(k, n+1)
// Body
cblas_ssyrk(CblasColMajor, CblasLower, CblasNoTrans, NB, NB, -1.0, A, NB, 1.0, T, NB);

// Name
TRSM(k, m)
// Execution space
k = 0..SIZE
m = k+1..SIZE
// Parallel partitioning
m % GRIDrows == rowRANK
k % GRIDcols == colRANK
// Parameters
IN T <- T POTRF(k)
INOUT C <- (k == 0) ? IN(m, k) : C GEMM(k, m, k-1)
        -> A GEMM(m, m+1..SIZE, k)
        -> B GEMM(k+1..m-1, m, k)
        -> A SYRK(m, k)
        -> OUT(k, m)
// Body
cblas_strsm(CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit, NB, NB, 1.0, T, NB, C, NB);

// Name
GEMM(k, m, n)
// Execution space
k = 0..SIZE
m = k+1..SIZE
n = 0..k-1
// Parallel partitioning
m % GRIDrows == rowRANK
k % GRIDcols == colRANK
// Parameters
IN A <- C TRSM(n, k)
IN B <- C TRSM(n, m)
INOUT C <- (n == 0) ? IN(m, k) : C GEMM(k, m, n-1)
        -> (n == k-1) ? C TRSM(k, m) : C GEMM(k, m, n+1)
// Body
cblas_sgemm(CblasColMajor, CblasNoTrans, CblasTrans, NB, NB, NB, -1.0, B, NB, A, NB, 1.0, C, NB);

// Globals
GRIDrows, GRIDcols, NB

[Figure: task life-cycle. Dependency notifications and data requests/responses arrive, the task executes, then it sends completion notifications to its children.]

Task Life-Cycle

[Figure: life-cycle of a TRSM task. POTRF and GEMM completion notifications satisfy its dependencies; data requests and responses make its inputs available; TRSM then executes and sends its own completion notifications.]

Task Life-Cycle

no dependencies satisfied

some dependencies satisfied, waiting for all dependencies

all dependencies satisfied, some data delivered, waiting for all data

all data delivered, waiting for execution

Task Life-Cycle

Existing compiler technology:
task body – standard C / Fortran
formulas – C / Fortran expressions
metadata – human-readable ASCII

Compilation

Scheduler: management of task bins

Communication system:
shared mem – pointer aliasing
distributed mem – messaging
out-of-memory – mix

Runtime

Standalone ASCII code, little / no error checking

Visual tools to assist development:
plot dataflow diagram
instantiate small DAG

Visual Tools

[Figure: dataflow diagram of tile Cholesky. POTRF, TRSM, SYRK, and GEMM nodes are connected through their T, A, B, and C parameters, with IN and OUT data nodes.]

Dataflow Visualization

Task Graph Visualization

Tile LU Tile QR

Tile Cholesky

Connectivity / Topology

4D cube / quad-core

Connectivity / Topology

The communication matrix should resemble the connectivity matrix.

Thank You!

Questions?
