2/25/2009CS267 Lecture 101 Parallelism and Locality in Matrix Computations demmel/cs267_Spr09 Dense...

2/25/2009 CS267 Lecture 10 1

Parallelism and Locality inMatrix Computations

www.cs.berkeley.edu/~demmel/cs267_Spr09

Dense Linear Algebra:Optimizing Parallel Matrix Multiply

James DemmelEECS & Math Departments, UC Berkeley

demmel@cs.berkeley.edu

Outline of Dense Linear Algebra

• Optimizing Sequential Matrix Multiplication (MatMul)• Optimizing Parallel MatMul• Optimizing Gaussian Elimination• Lower bounds on communication (moving data)• The rest of dense linear algebra (eig, svd, etc.)

Outline of Dense Linear Algebra

• Optimizing Sequential Matrix Multiplication (MatMul)• Optimizing Parallel MatMul

• How to layout matrices across processors/memories• Warm-up: Matrix-Vector Multiplication• Reducing network communication for MatMul

• Cannon’s Algorithm, SUMMA Algorithm• Lower bound on network communication

• Optimizing Gaussian Elimination • Lower bounds on communication (moving data)• The rest of dense linear algebra (eig, svd, etc.)

2/25/2009 CS267 Lecture 10 4

Different Parallel Data Layouts for Matrices (not all!)

0123012301230123

0 1 2 3 0 1 2 3

1) 1D Column Blocked Layout 2) 1D Column Cyclic Layout

3) 1D Column Block Cyclic Layout

4) Row versions of the previous layouts

Generalizes others

0 1 0 1 0 1 0 12 3 2 3 2 3 2 30 1 0 1 0 1 0 12 3 2 3 2 3 2 30 1 0 1 0 1 0 12 3 2 3 2 3 2 30 1 0 1 0 1 0 12 3 2 3 2 3 2 3 6) 2D Row and Column

Block Cyclic Layout

0 1 2 3

5) 2D Row and Column Blocked Layout

2/25/2009 CS267 Lecture 10 5

Parallel Matrix-Vector Product• Compute y = y + A*x, where A is a dense matrix• Layout:

• 1D row blocked

• A(i) refers to the n by n/p block row that processor i owns

• x(i) and y(i) similarly refer to segments of x,y owned by i

• Algorithm:• Foreach processor i• Broadcast x(i)• Compute y(i) = A(i)*x

• Algorithm uses the formulay(i) = y(i) + A(i)*x = y(i) + j A(i,j)*x(j)

P0 P1 P2 P3

2/25/2009 CS267 Lecture 10 6

Matrix-Vector Product y = y + A*x

• A column layout of the matrix eliminates the broadcast of x• But adds a reduction to update the destination y

• A 2D blocked layout uses a broadcast and reduction, both on a subset of processors

• sqrt(p) for square processor grid

P0 P1 P2 P3

P4 P5 P6 P7

P8 P9 P10 P11

P12 P13 P14 P15

2/25/2009 CS267 Lecture 10 7

Parallel Matrix Multiply

• Computing C=C+A*B• Using basic algorithm: 2*n3 Flops• Depends on:

• Data layout• Topology of machine • Scheduling communication

• Use of simple performance model for algorithm design• Message Time = “latency” + #words * time-per-word

= + #words *• Efficiency:

• serial time / (p * parallel time)• perfect (linear) speedup efficiency = 1

2/25/2009 CS267 Lecture 10 8

Matrix Multiply with 1D Column Layout

• Assume matrices are n x n and n is divisible by p

• A(i) refers to the n by n/p block column that processor i owns (similiarly for B(i) and C(i))

• B(i,j) is the n/p by n/p sublock of B(i) • in rows j*n/p through (j+1)*n/p

• Algorithm uses the formulaC(i) = C(i) + A*B(i) = C(i) + j A(j)*B(j,i)

p0 p1 p2 p3 p5 p4 p6 p7

May be a reasonable assumption for analysis, not for code

2/25/2009 CS267 Lecture 10 9

Matrix Multiply: 1D Layout on Bus or Ring

• Algorithm uses the formulaC(i) = C(i) + A*B(i) = C(i) + j A(j)*B(j,i)

• First consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time (ethernet)

• Second consider a machine with processors on a ring: all processors may communicate with nearest neighbors simultaneously

2/25/2009 CS267 Lecture 10 10

MatMul: 1D layout on Bus without Broadcast

Naïve algorithm: C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)

for i = 0 to p-1 for j = 0 to p-1 except i if (myproc == i) send A(i) to processor j if (myproc == j) receive A(i) from processor i C(myproc) = C(myproc) + A(i)*B(i,myproc) barrier

Cost of inner loop: computation: 2*n*(n/p)2 = 2*n3/p2 communication: + *n2 /p

2/25/2009 CS267 Lecture 10 11

Naïve MatMul (continued)

Cost of inner loop: computation: 2*n*(n/p)2 = 2*n3/p2 communication: + *n2 /p … approximately

Only 1 pair of processors (i and j) are active on any iteration, and of those, only i is doing computation => the algorithm is almost entirely serial

Running time: = (p*(p-1) + 1)*computation + p*(p-1)*communication 2*n3 + p2* + p*n2*

This is worse than the serial time and grows with p.When might you still want to do this?

2/25/2009 CS267 Lecture 10 12

Matmul for 1D layout on a Processor Ring

• Pairs of adjacent processors can communicate simultaneously

Copy A(myproc) into Tmp

C(myproc) = C(myproc) + Tmp*B(myproc , myproc)

for j = 1 to p-1

Send Tmp to processor myproc+1 mod p

Receive Tmp from processor myproc-1 mod p

C(myproc) = C(myproc) + Tmp*B( myproc-j mod p , myproc)

• Need to be careful about talking to neighboring processors

• May want double buffering in practice for overlap

• Ignoring deadlock details in code• Time of inner loop = 2*( + *n2/p) + 2*n*(n/p)2

2/25/2009 CS267 Lecture 10 13

Matmul for 1D layout on a Processor Ring

• Time of inner loop = 2*( + *n2/p) + 2*n*(n/p)2

• Total Time = 2*n* (n/p)2 + (p-1) * Time of inner loop• 2*n3/p + 2*p* + 2**n2

• (Nearly) Optimal for 1D layout on Ring or Bus, even with Broadcast:• Perfect speedup for arithmetic• A(myproc) must move to each other processor, costs at least (p-1)*cost of sending n*(n/p) words

• Parallel Efficiency = 2*n3 / (p * Total Time) = 1/(1 + * p2/(2*n3) + * p/(2*n) ) = 1/ (1 + O(p/n))• Grows to 1 as n/p increases (or and shrink)

2/25/2009 CS267 Lecture 10 14

MatMul with 2D Layout

• Consider processors in 2D grid (physical or logical)• Processors communicate with 4 nearest neighbors

• Assume p processors form square s x s grid, s = p1/2

p(0,0) p(0,1) p(0,2)

p(1,0) p(1,1) p(1,2)

p(2,0) p(2,1) p(2,2)

p(0,0) p(0,1) p(0,2)

p(1,0) p(1,1) p(1,2)

p(2,0) p(2,1) p(2,2)

p(0,0) p(0,1) p(0,2)

p(1,0) p(1,1) p(1,2)

p(2,0) p(2,1) p(2,2)

2/25/2009 CS267 Lecture 10 15

Cannon’s Algorithm

… C(i,j) = C(i,j) + A(i,k)*B(k,j)… assume s = sqrt(p) is an integer forall i=0 to s-1 … “skew” A left-circular-shift row i of A by i … so that A(i,j) overwritten by A(i,(j+i)mod s) forall i=0 to s-1 … “skew” B up-circular-shift column i of B by i … so that B(i,j) overwritten by B((i+j)mod s), j) for k=0 to s-1 … sequential forall i=0 to s-1 and j=0 to s-1 … all processors in parallel C(i,j) = C(i,j) + A(i,j)*B(i,j) left-circular-shift each row of A by 1 up-circular-shift each column of B by 1

2/25/2009 CS267 Lecture 10 16

C(1,2) = A(1,0) * B(0,2) + A(1,1) * B(1,2) + A(1,2) * B(2,2)

Cannon’s Matrix Multiplication

2/25/2009 CS267 Lecture 10 17

Initial Step to Skew Matrices in Cannon

• Initial blocked input

• After skewing before initial block multiplies

A(1,0)

A(2,0)

A(0,1) A(0,2)

A(1,1)

A(2,1)

A(1,2)

A(2,2)

A(0,0)

B(0,1) B(0,2)

B(1,0)

B(2,0)

B(1,1) B(1,2)

B(2,1) B(2,2)

B(0,0)

A(1,0)

A(2,0)

A(0,1) A(0,2)

A(1,1)

A(2,1)

A(1,2)

A(2,2)

A(0,0)

B(0,1)

B(0,2)B(1,0)

B(2,0)

B(1,1)

B(1,2)

B(2,1)

B(2,2)B(0,0)

2/25/2008 CS267 Lecture 10 18

Skewing Steps in CannonAll blocks of A must multiply all like-colored blocks of B

• First step

• Second

• Third

A(1,0)

A(2,0)

A(0,1) A(0,2)

A(1,1)

A(2,1)

A(1,2)

A(2,2)

A(0,0)

B(0,1)

B(0,2)B(1,0)

B(2,0)

B(1,1)

B(1,2)

B(2,1)

B(2,2)B(0,0)

A(1,0)

A(2,0)

A(0,1) A(0,2)

A(2,1)

A(1,2) B(0,1)

B(0,2)B(1,0)

B(2,0)

B(1,1)

B(1,2)

B(2,1)

B(2,2)B(0,0)

A(1,0)

A(2,0)

A(0,1)A(0,2)

A(1,1)

A(2,1)

A(1,2)

A(2,2)

A(0,0) B(0,1)

B(0,2)B(1,0)

B(2,0)

B(1,1)

B(1,2)

B(2,1)

B(2,2)B(0,0)

A(1,1)

A(2,2)

A(0,0)

2/25/2009 CS267 Lecture 10 19

Cost of Cannon’s Algorithm forall i=0 to s-1 … recall s = sqrt(p) left-circular-shift row i of A by i … cost ≤ s*( + *n2/p) forall i=0 to s-1 up-circular-shift column i of B by i … cost ≤ s*( + *n2/p) for k=0 to s-1 forall i=0 to s-1 and j=0 to s-1

C(i,j) = C(i,j) + A(i,j)*B(i,j) … cost = 2*(n/s)3 = 2*n3/p3/2

left-circular-shift each row of A by 1 … cost = + *n2/p up-circular-shift each column of B by 1 … cost = + *n2/p

° Total Time = 2*n3/p + 4* s* + 4**n2/s ° Parallel Efficiency = 2*n3 / (p * Total Time)

= 1/( 1 + * 2*(s/n)3 + * 2*(s/n) ) = 1/(1 + O(sqrt(p)/n)) ° Grows to 1 as n/s = n/sqrt(p) = sqrt(data per processor) grows° Better than 1D layout, which had Efficiency = 1/(1 + O(p/n))

Cannon’s Algorithm is “optimal”• Optimal means

• Considering only O(n3) matmul algs (not Strassen)• Considering only O(n2/p) storage per processor

• Similar to Communication Lower bound from last lecture • A processor doing f flops from matmul with a local

memory of size M must do (f / M1/2 ) references to remote memory

• Local memory = O(n2/p) , f n3/p for at least one proc.• So f / M1/2 = ((n3/p)/ (n2/p)1/2 ) = ( n2/p1/2 )

• #Messages = ( #words sent / max message size) = ( (n2/p1/2)/(n2/p)) = (p1/2)

2/25/2009 CS267 Lecture 10 20

2/25/2009 CS267 Lecture 10 21

Pros and Cons of Cannon

• So what if it’s “optimal”, is it fast?• Yes: Local computation one call to (optimized) matrix-

multiply

• Hard to generalize for• p not a perfect square• A and B not square• Dimensions of A, B not perfectly divisible by s=sqrt(p)• A and B not “aligned” in the way they are stored on

processors• block-cyclic layouts

• Memory hog (extra copies of local matrices)

2/25/2009 CS267 Lecture 10 22

SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply • Slightly less efficient than Cannon, but simpler and easier to

generalize• Can accommodate any layout, dimensions, alignment• Uses broadcast of submatrices instead of circular shifts• Sends log p times as much data as Cannon• Can use much less extra memory than Cannon, but

send more messages

• Similar ideas appeared many times

• Used in practice in PBLAS = Parallel BLAS• www.netlib.org/lapack/lawns/lawn{96,100}.ps

2/25/2009 CS267 Lecture 10 23

A(i,k)

B(k,j)

• i, j represent all rows, columns owned by a processor• k is a block of b 1 rows or columns

• C(i,j) = C(i,j) + k A(i,k)*B(k,j)

• Assume a pr by pc processor grid (pr = pc = 4 above) • Need not be square

C(i,j)

2/25/2009 CS267 Lecture 10 24

For k=0 to n-1 … or n/b-1 where b is the block size

… = # cols in A(i,k) and # rows in B(k,j)

for all i = 1 to pr … in parallel

owner of A(i,k) broadcasts it to whole processor row

for all j = 1 to pc … in parallel

owner of B(k,j) broadcasts it to whole processor column

Receive A(i,k) into Acol

Receive B(k,j) into Brow

C_myproc = C_myproc + Acol * Brow

A(i,k)

B(k,j)

C(i,j)

2/28/2009 CS267 Lecture 10 25

SUMMA performance

For k=0 to n/b-1

for all i = 1 to s … s = sqrt(p)

owner of A(i,k) broadcasts it to whole processor row

… time = log s *( + * b*n/s), using a tree

for all j = 1 to s

owner of B(k,j) broadcasts it to whole processor column

… time = log s *( + * b*n/s), using a tree

Receive A(i,k) into Acol

Receive B(k,j) into Brow

C_myproc = C_myproc + Acol * Brow

… time = 2*(n/s)2*b

° Total time = 2*n3/p + * log p * n/b + * log p * n2 /s

° To simplify analysis only, assume s = sqrt(p)

2/25/2009 CS267 Lecture 10 26

SUMMA performance

• Total time = 2*n3/p + * log p * n/b + * log p * n2 /s

• Parallel Efficiency =

1/(1 + * log p * p / (2*b*n2) + * log p * s/(2*n) )

• Same term as Cannon, except for log p factor

log p grows slowly so this is ok

• Latency () term can be larger, depending on b

When b=1, get * log p * n

As b grows to n/s, term shrinks to

* log p * s (log p times Cannon)

• Temporary storage grows like 2*b*n/s

• Can change b to tradeoff latency cost with memory

2/25/2009 CS267 Lecture 8 27

PDGEMM = PBLAS routine for matrix multiply

Observations: For fixed N, as P increases Mflops increases, but less than 100% efficiency For fixed P, as N increases, Mflops (efficiency) rises

DGEMM = BLAS routine for matrix multiply

Maximum speed for PDGEMM = # Procs * speed of DGEMM

Observations (same as above): Efficiency always at least 48% For fixed N, as P increases, efficiency drops For fixed P, as N increases, efficiency increases

2/25/2009 CS267 Lecture 10 28

Summary of Parallel Matrix Multiplication• 1D Layout

• Bus without broadcast - slower than serial• Nearest neighbor communication on a ring (or bus with

broadcast): Efficiency = 1/(1 + O(p/n))• 2D Layout

• Cannon• Efficiency = 1/(1+O(sqrt(p) /n+* sqrt(p) /n)) – optimal!• Hard to generalize for general p, n, block cyclic, alignment

• SUMMA• Efficiency = 1/(1 + O(log p * p / (b*n2) + log p * sqrt(p) /n))• Very General• b small => less memory, lower efficiency• b large => more memory, high efficiency• Used in practice (PBLAS)

2/25/2009 CS267 Lecture 10 29

ScaLAPACK Parallel Library

2/25/2009 CS267 Lecture 10 30

Extra Slides

2/27/08 CS267 Guest Lecture 1 31

Recursive Layouts

• For both cache hierarchies and parallelism, recursive layouts may be useful

• Z-Morton, U-Morton, and X-Morton Layout

• Also Hilbert layout and others• What about the user’s view?

• Fortunately, many problems can be solved on a permutation

• Never need to actually change the user’s layout

02/09/2006 CS267 Lecture 8 32

Gaussian Elimination

Standard Waysubtract a multiple of a row

LINPACKapply sequence to a column

then apply nb to rest of matrix

a3=a3-a1*a2

a2 =L-1

nb LAPACKapply sequence to nb

Slide source: Dongarra

02/09/2006 CS267 Lecture 8 33

LU Algorithm: 1: Split matrix into two rectangles (m x n/2) if only 1 column, scale by reciprocal of pivot & return

2: Apply LU Algorithm to the left part

3: Apply transformations to right part (triangular solve A12 = L-1A12 and matrix multiplication A22=A22 -A21*A12 )

4: Apply LU Algorithm to right part

Gaussian Elimination via a Recursive Algorithm

A21 A22

F. Gustavson and S. Toledo

Most of the work in the matrix multiply Matrices of size n/2, n/4, n/8, …

02/09/2006 CS267 Lecture 8 34

Recursive Factorizations

• Just as accurate as conventional method• Same number of operations• Automatic variable blocking

• Level 1 and 3 BLAS only !• Extreme clarity and simplicity of expression• Highly efficient• The recursive formulation is just a rearrangement of the point-wise

LINPACK algorithm• The standard error analysis applies (assuming the matrix

operations are computed the “conventional” way).

02/09/2006 CS267 Lecture 8 35

DGEMM ATLAS & DGETRF Recursive

AMD Athlon 1GHz (~$1100 system)

500 1000 1500 2000 2500 3000

Pentium III 550 MHz Dual Processor LU Factorization

500 1000 1500 2000 2500 3000 3500 4000 4500 5000

LAPACK

Recursive LU

LAPACK

Dual-processor

Uniprocessor

02/09/2006 CS267 Lecture 8 36

Review: BLAS 3 (Blocked) GEPP

for ib = 1 to n-1 step b … Process matrix b columns at a time end = ib + b-1 … Point to end of block of b columns apply BLAS2 version of GEPP to get A(ib:n , ib:end) = P’ * L’ * U’ … let LL denote the strict lower triangular part of A(ib:end , ib:end) + I

A(ib:end , end+1:n) = LL-1 * A(ib:end , end+1:n) … update next b rows of U A(end+1:n , end+1:n ) = A(end+1:n , end+1:n ) - A(end+1:n , ib:end) * A(ib:end , end+1:n) … apply delayed updates with single matrix-multiply … with inner dimension b

BLAS 3

02/09/2006 CS267 Lecture 8 37

Review: Row and Column Block Cyclic Layout

processors and matrix blocksare distributed in a 2d array

pcol-fold parallelismin any column, and calls to the BLAS2 and BLAS3 on matrices of size brow-by-bcol

serial bottleneck is eased

need not be symmetric in rows andcolumns

02/09/2006 CS267 Lecture 8 38

Distributed GE with a 2D Block Cyclic Layout

block size b in the algorithm and the block sizes brow and bcol in the layout satisfy b=brow=bcol.

shaded regions indicate busy processors or communication performed.

unnecessary to have a barrier between each step of the algorithm, e.g.. step 9, 10, and 11 can be pipelined

02/09/2006 CS267 Lecture 8 39

Distributed GE with a 2D Block Cyclic Layout

02/09/2006 CS267 Lecture 8 40

02/09/2006 CS267 Lecture 8 41

PDGESV = ScaLAPACK parallel LU routine

Since it can run no faster than its inner loop (PDGEMM), we measure:Efficiency = Speed(PDGESV)/Speed(PDGEMM)

Observations: Efficiency well above 50% for large enough problems For fixed N, as P increases, efficiency decreases (just as for PDGEMM) For fixed P, as N increases efficiency increases (just as for PDGEMM) From bottom table, cost of solving Ax=b about half of matrix multiply for large enough matrices. From the flop counts we would

expect it to be (2*n3)/(2/3*n3) = 3 times faster, but communication makes it a little slower.

02/09/2006 CS267 Lecture 8 42

02/09/2006 CS267 Lecture 8 43

Scales well, nearly full machine speed

02/09/2006 CS267 Lecture 8 44

Old version,pre 1998 Gordon Bell Prize

Still have ideas to accelerateProject Available!

Old Algorithm, plan to abandon

02/09/2006 CS267 Lecture 8 45

Have good ideas to speedupProject available!

Hardest of all to parallelizeHave alternative, and would like to compareProject available!

02/09/2006 CS267 Lecture 8 46

Out-of-core means matrix lives on disk; too big for main mem

Much harder to hide latency of disk

QR much easier than LU because no pivoting needed for QR

Moral: use QR to solve Ax=b

Projects available (perhaps very hard…)

02/09/2006 CS267 Lecture 8 47

A small software project ...

02/09/2006 CS267 Lecture 8 48

Work-Depth Model of Parallelism

• The work depth model:• The simplest model is used• For algorithm design, independent of a machine

• The work, W, is the total number of operations• The depth, D, is the longest chain of dependencies• The parallelism, P, is defined as W/D

• Specific examples include:• circuit model, each input defines a graph with ops at

nodes• vector model, each step is an operation on a vector of

elements• language model, where set of operations defined by

language

02/09/2006 CS267 Lecture 8 49

Latency Bandwidth Model

• Network of fixed number P of processors• fully connected• each with local memory

• Latency ()• accounts for varying performance with number of

messages• gap (g) in logP model may be more accurate cost if

messages are pipelined• Inverse bandwidth ()

• accounts for performance varying with volume of data• Efficiency (in any model):

• serial time / (p * parallel time)• perfect (linear) speedup efficiency = 1

02/09/2006 CS267 Lecture 8 50

Initial Step to Skew Matrices in Cannon

• Initial blocked input

• After skewing before initial block multiplies

A(0,1) A(0,2)

A(1,0)

A(2,0)

A(1,1) A(1,2)

A(2,1)A(2,2)

A(0,0)

B(0,1) B(0,2)

B(1,0)

B(2,0)

B(1,1) B(1,2)

B(2,1) B(2,2)

B(0,0)A(0,1) A(0,2)

A(1,0)

A(2,0)

A(1,1) A(1,2)

A(2,1) A(2,2)

A(0,0)

B(0,1)

B(0,2)B(1,0)

B(2,0)

B(1,1)

B(1,2)

B(2,1)

B(2,2)B(0,0)

02/09/2006 CS267 Lecture 8 51

Skewing Steps in Cannon

• First step

• Second

• Third

A(0,1) A(0,2)

A(1,0)

A(2,0)

A(1,1) A(1,2)

A(2,1)A(2,2)

A(0,0)

B(0,1)

B(0,2)B(1,0)

B(2,0)

B(1,1)

B(1,2)

B(2,1)

B(2,2)B(0,0)

A(0,1) A(0,2)

A(1,0)

A(2,0)

A(1,2)

A(2,1)

B(0,1)

B(0,2)B(1,0)

B(2,0)

B(1,1)

B(1,2)

B(2,1)

B(2,2)B(0,0)

A(0,1)A(0,2)

A(1,0)

A(2,0)

A(1,1) A(1,2)

A(2,1) A(2,2)

A(0,0) B(0,1)

B(0,2)B(1,0)

B(2,0)

B(1,1)

B(1,2)

B(2,1)

B(2,2)B(0,0)

A(1,1)

A(2,2)

A(0,0)

2/25/2009 CS267 Lecture 8 52

Motivation (1)

3 Basic Linear Algebra Problems1. Linear Equations: Solve Ax=b for x

2. Least Squares: Find x that minimizes ||r||2 ri2

where r=Ax-b• Statistics: Fitting data with simple functions

3a. Eigenvalues: Find and x where Ax = x• Vibration analysis, e.g., earthquakes, circuits

3b. Singular Value Decomposition: ATAx=2x• Data fitting, Information retrieval

Lots of variations depending on structure of A• A symmetric, positive definite, banded, …

2/25/2009 CS267 Lecture 8 53

Motivation (2)

• Why dense A, as opposed to sparse A?• Many large matrices are sparse, but …• Dense algorithms easier to understand • Some applications yields large dense matrices

• LINPACK Benchmark (www.top500.org)• “How fast is your computer?” =

“How fast can you solve dense Ax=b?”

• Large sparse matrix algorithms often yield smaller (but still large) dense problems

• Do ParLab Apps most use small dense matrices?

02/25/2009 CS267 Lecture 8

Algorithms for 2D (3D) Poisson Equation (N = n2 (n3) vars)

Algorithm Serial PRAM Memory #Procs• Dense LU N3 N N2 N2

• Band LUN2 (N7/3) N N3/2 (N5/3) N (N4/3)• Jacobi N2 (N5/3) N (N2/3) N N• Explicit Inv. N2 log N N2 N2

• Conj.Gradients N3/2 (N4/3) N1/2(1/3) *log N N N• Red/Black SOR N3/2 (N4/3) N1/2 (N1/3) N N• Sparse LU N3/2 (N2) N1/2 N*log N (N4/3) N• FFT N*log N log N N N• Multigrid N log2 N N N• Lower bound N log N N

PRAM is an idealized parallel model with zero cost communicationReference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997

(Note: corrected complexities for 3D case from last lecture!).

Lessons and Questions (1)

• Structure of the problem matters• Cost of solution can vary dramatically (n3 to n)• Many other examples

• Some structure can be figured out automatically• “A\b” can figure out symmetry, some sparsity

• Some structures known only to (smart) user• If performance not critical, user may be happy to settle for A\b

• How much of this goes into the motifs?• How much should we try to help user choose?• Tuning, but at algorithmic choice level (SALSA)

• Motifs overlap• Dense, sparse, (un)structured grids, spectral

Organizing Linear Algebra (1)

• By Operations• Low level (eg mat-mul: BLAS)• Standard level (eg solve Ax=b, Ax=λx: Sca/LAPACK)• Applications level (eg systems & control: SLICOT)

• By Performance/accuracy tradeoffs• “Direct methods” with guarantees vs “iterative methods” that

may work faster and accurately enough• By Structure

• Storage• Dense

– columnwise, rowwise, 2D block cyclic, recursive space-filling curves• Banded, sparse (many flavors), black-box, …

• Mathematical• Symmetries, positive definiteness, conditioning, …• As diverse as the world being modeled

• By Data Type• Real vs Complex• Floating point (fixed or varying length), other

• By Target Platform• Serial, manycore, GPU, distributed memory, out-of-

DRAM, Grid, …• By programming interface

• Language bindings• “A\b” versus access to details

For all linear algebra problems:Ex: LAPACK Table of Contents

• Linear Systems• Least Squares

• Overdetermined, underdetermined• Unconstrained, constrained, weighted

• Eigenvalues and vectors of Symmetric Matrices• Standard (Ax = λx), Generallzed (Ax=λxB)

• Eigenvalues and vectors of Unsymmetric matrices• Eigenvalues, Schur form, eigenvectors, invariant subspaces• Standard, Generalized

• Singular Values and vectors (SVD)• Standard, Generalized

• Level of detail• Simple Driver• Expert Drivers with error bounds, extra-precision, other options• Lower level routines (“apply certain kind of orthogonal transformation”)

For all matrix/problem structures:Ex: LAPACK Table of Contents

• BD – bidiagonal• GB – general banded• GE – general • GG – general , pair• GT – tridiagonal• HB – Hermitian banded• HE – Hermitian• HG – upper Hessenberg, pair• HP – Hermitian, packed• HS – upper Hessenberg• OR – (real) orthogonal• OP – (real) orthogonal, packed• PB – positive definite, banded• PO – positive definite• PP – positive definite, packed• PT – positive definite, tridiagonal

• SB – symmetric, banded• SP – symmetric, packed• ST – symmetric, tridiagonal • SY – symmetric • TB – triangular, banded• TG – triangular, pair• TB – triangular, banded• TP – triangular, packed• TR – triangular• TZ – trapezoidal• UN – unitary• UP – unitary packed

• BD – bidiagonal• GB – general banded• GE – general • GG – general, pair • GT – tridiagonal• HB – Hermitian banded• HE – Hermitian• HG – upper Hessenberg, pair• HP – Hermitian, packed• HS – upper Hessenberg• OR – (real) orthogonal• OP – (real) orthogonal, packed• PB – positive definite, banded• PO – positive definite• PP – positive definite, packed• PT – positive definite, tridiagonal

For all data types:Ex: LAPACK Table of Contents

• Real and complex• Single and double precision

• Arbitrary precision in progress

www.netlib.org/lapack www.netlib.org/scalapack

www.cs.utk.edu/~dongarra/etemplateswww.netlib.org/templates

gams.nist.gov

2/27/08 CS267 Guest Lecture 1 68

Review of the BLAS

BLAS level Ex. # mem refs # flops q

1 “Axpy”, Dot prod

3n 2n1 2/3

2 Matrix-vector mult

n2 2n2 2

3 Matrix-matrix mult

4n2 2n3 n/2

• Building blocks for all linear algebra• Parallel versions call serial versions on each processor

• So they must be fast!• Define q = # flops / # mem refs = “computational intensity”

• The larger is q, the faster the algorithm can go in the presence of memory hierarchy

• “axpy”: y = *x + y, where scalar, x and y vectors

2/25/2009 CS267 Lecture 10 69

Outline

• History and motivation• Structure of the Dense Linear Algebra motif• Parallel Matrix-matrix multiplication• Parallel Gaussian Elimination (next time)

2/25/2009 CS267 Lecture 10 70

Outline

Motifs

The Motifs (formerly “Dwarfs”) from “The Berkeley View” (Asanovic et al.)

Motifs form key computational patterns

A brief history of (Dense) Linear Algebra software (1/6)

• Libraries like EISPACK (for eigenvalue problems)• Then the BLAS (1) were invented (1973-1977)

• Standard library of 15 operations (mostly) on vectors• “AXPY” ( y = α·x + y ), dot product, scale (x = α·x ), etc• Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC

• Goals• Common “pattern” to ease programming, readability, self-

documentation• Robustness, via careful coding (avoiding over/underflow)• Portability + Efficiency via machine specific implementations

• Why BLAS 1 ? They do O(n1) ops on O(n1) data• Used in libraries like LINPACK (for linear systems)

• Source of the name “LINPACK Benchmark” (not the code!)2/25/2009 CS267 Lecture 10 72

• In the beginning was the do-loop…

2/25/2009 CS267 Lecture 10 73

Current Records for Solving Dense Systems (11/2008)

GigaflopsMachine n=100 n=1000 Any n Peak

“Roadrunner” (LANL) 1.1M 1.5M (1.1 Petaflops, 130K cores, 2.5 MW, n=2.3M)

www.netlib.org, click on Performance Database Server

Palm Pilot III .00000169 (1.69 Kiloflops)

NEC SX 8 (fastest in 2005) (8 proc, 2 GHz) 75.1 128 (1 proc, 2 GHz) 2.2 15.0 16

• But the BLAS-1 weren’t enough• Consider AXPY ( y = α·x + y ): 2n flops on 3n read/writes • Computational intensity = (2n)/(3n) = 2/3• Too low to run near peak speed (read/write dominates)• Hard to vectorize (“SIMD’ize”) on supercomputers of the

day (1980s)• So the BLAS-2 were invented (1984-1986)

• Standard library of 25 operations (mostly) on matrix/vector pairs

• “GEMV”: y = α·A·x + β·x, “GER”: A = A + α·x·yT, x = T-1·x• Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC

• Why BLAS 2 ? They do O(n2) ops on O(n2) data• So computational intensity still just ~(2n2)/(n2) = 2

• OK for vector machines, but not for machine with caches2/25/2009 CS267 Lecture 10 74

• The next step: BLAS-3 (1987-1988)• Standard library of 9 operations (mostly) on matrix/matrix

pairs• “GEMM”: C = α·A·B + β·C, C = α·A·AT + β·C, B = T-1·B• Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC

• Why BLAS 3 ? They do O(n3) ops on O(n2) data• So computational intensity (2n3)/(4n2) = n/2 – big at last!

• Good for machines with caches, other mem. hierarchy levels

• How much BLAS1/2/3 code so far (all at www.netlib.org/blas)• Source: 142 routines, 31K LOC, Testing: 28K LOC• Reference (unoptimized) implementation only

• Ex: 3 nested loops for GEMM

• Lots more optimized code (eg Homework 1)• Motivates “automatic tuning” of the BLAS (details later)

2/25/2009 CS267 Lecture 10 75

02/25/2009 CS267 Lecture 8 76

A brief history of (Dense) Linear Algebra software (4/6)• LAPACK – “Linear Algebra PACKage” - uses BLAS-3 (1989 – now)

• Ex: Obvious way to express Gaussian Elimination (GE) is adding multiples of one row to other rows – BLAS-1

• How do we reorganize GE to use BLAS-3 ? (details later)

• Contents of LAPACK (summary)• Algorithms we can turn into (nearly) 100% BLAS 3

– Linear Systems: solve Ax=b for x

– Least Squares: choose x to minimize ||r||2 ri2 where r=Ax-b

• Algorithms we can only make up to 50% BLAS 3 (so far)– “Eigenproblems”: Find and x where Ax = x

– Singular Value Decomposition (SVD): ATAx=2x • Error bounds for everything• Lots of variants depending on A’s structure (banded, A=AT, etc)

• How much code? (Release 3.2, Nov 2008) (www.netlib.org/lapack)• Source: 1582 routines, 490K LOC, Testing: 352K LOC

• Ongoing development (at UCB and elsewhere) (class projects!)

2/25/2009 CS267 Lecture 10 77

• Is LAPACK parallel?• Only if the BLAS are parallel (possible in shared memory)

• ScaLAPACK – “Scalable LAPACK” (1995 – now)• For distributed memory – uses MPI• More complex data structures, algorithms than LAPACK

• Only (small) subset of LAPACK’s functionality available• Details later (class projects!)

• All at www.netlib.org/scalapack

2/25/2009 CS267 Lecture 10 78

2/25/2009 CS267 Lecture 10 79

Success Stories for Sca/LAPACK

Cosmic Microwave Background Analysis, BOOMERanG

collaboration, MADCAP code (Apr. 27, 2000).

ScaLAPACK

• Widely used• Adopted by Mathworks, Cray,

Fujitsu, HP, IBM, IMSL, Intel, NAG, NEC, SGI, …

• >96M web hits(in 2009, 56M in 2006) @ Netlib (incl. CLAPACK, LAPACK95)

• New Science discovered through the solution of dense matrix systems

• Nature article on the flat universe used ScaLAPACK

• Other articles in Physics Review B that also use it

• 1998 Gordon Bell Prize• www.nersc.gov/news/reports/

newNERSCresults050703.pdf

• PLASMA and MAGMA (now)• Planned extensions to Multicore/GPU/Heterogeneous• Can one software infrastructure accommodate all

algorithms and platforms of current (future) interest?• How much code generation and tuning can we automate?

• Details later (Class projects!)• Other related projects

• BLAST Forum (www.netlib.org/blas/blast-forum)• Attempt to extend BLAS to other languages, add some new

functions, sparse matrices, extra-precision, interval arithmetic• Only partly successful (extra-precise BLAS used in latest LAPACK)

• FLAME (www.cs.utexas.edu/users/flame/)• Formal Linear Algebra Method Environment• Attempt to automate code generation across multiple platforms

2/25/2009 CS267 Lecture 10 80

2/25/2009 CS267 Lecture 10 81

Outline

What could go into the linear algebra motif(s)?

For all linear algebra problems

For all matrix/problem structures

For all data types

For all programming interfaces

Produce best algorithm(s) w.r.t. performance and accuracy (including condition estimates, etc)

For all architectures and networks

Need to prioritize, automate!

For all linear algebra problems:Ex: LAPACK Table of Contents

• Linear Systems• Least Squares

• Overdetermined, underdetermined• Unconstrained, constrained, weighted

• Eigenvalues and vectors of Symmetric Matrices• Standard (Ax = λx), Generalized (Ax=λxB)

• Eigenvalues and vectors of Unsymmetric matrices• Eigenvalues, Schur form, eigenvectors, invariant subspaces• Standard, Generalized

• Singular Values and vectors (SVD)• Standard, Generalized

• Level of detail• Simple Driver• Expert Drivers with error bounds, extra-precision, other options• Lower level routines (“apply certain kind of orthogonal transformation”)

Organizing Linear Algebra – in books

www.netlib.org/lapack www.netlib.org/scalapack

www.cs.utk.edu/~dongarra/etemplateswww.netlib.org/templates

gams.nist.gov

2/25/2009 CS267 Lecture 10 92

Outline

2/25/2009CS267 Lecture 101 Parallelism and Locality in Matrix Computations demmel/cs267_Spr09 Dense...

Documents

02/11/2009CS267 Lecture 71 CS 267 Sources of Parallelism and Locality in Simulation – Part 2 James Demmel demmel/cs267_Spr09

Parallel Spectral Methods: Fast Fourier Transform (FFTs) with Applications James Demmel demmel/cs267_Spr12

04/27/2009CS267 Lecture 241 Parallel Sorting James Demmel demmel/cs267_Spr09

Automatic Performance Tuning Sparse Matrix Algorithms James Demmel demmel/cs267_Spr06

1 CS267/E233 Applications of Parallel Computers demmel/cs267_Spr09 Lecture 1: Introduction Jim Demmel EECS & Math Departments demmel@cs.berkeley.edu

Automatic Performance Tuning and Sparse-Matrix-Vector-Multiplication (SpMV) James Demmel demmel/cs267_Spr10

Automatic Performance Tuning Sparse Matrix Kernels James Demmel demmel/cs267_Spr05

Introduction to Communication-Avoiding Algorithms demmel/SC11_tutorial Jim Demmel EECS & Math Departments UC Berkeley

Jim Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

1 02/11/2009CS267 Lecture 7+ CS 267 Tricks with Trees James Demmel demmel/cs267_Spr09

04/13/2009CS267 Lecture 20 CS 267: Applications of Parallel Computers Lecture 20 -- Structured Grids Horst D. Simon hdsimon@lbl.gov demmel/cs267_Spr09

Computational Science and Engineering at Berkeley Jim Demmel EECS & Math Departments demmel 20 Jan 2009

James Demmel cs.berkeley/~demmel/cs267_Spr99

CS267 L15 Graph Partitioning II.1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 15: Graph Partitioning - II James Demmel demmel/cs267_Spr99

01/24/2006CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel@cs.berkeley.edu demmel/cs267_Spr06

01/19/2005CS267-Lecture 11 CS267/E233 Applications of Parallel Computers Lecture 1: Introduction James Demmel demmel@cs.berkeley.edu demmel/cs267_Spr05

CS267 Lecture 61 Shared Memory Programming: Threads and OpenMP Lecture 6 James Demmel demmel/cs267_Spr14/ demmel/cs267_Spr14

CS267 Poisson - 1.1 Demmel Fall 2002 CS 267 Applications of Parallel Computers Solving Linear Systems arising from PDEs - I James Demmel demmel/CS267_2002_Poisson_1

CS267 Dense Linear Algebra I.1 Demmel Fa 2002 CS 267 Applications of Parallel Computers Dense Linear Algebra James Demmel demmel/cs267_171002.ppt

Introduction to Communication-Avoiding Algorithms demmel/SC12_tutorial Jim Demmel EECS & Math Departments UC Berkeley