Lecture 13: Dense Linear Algebra II - cs.cornell.edubindel/class/cs5220-s10/slides/lec13.pdf ·...

Lecture 13: Dense Linear Algebra II

David Bindel

8 Mar 2010

Logistics

- Tell me your project idea today (if you haven’t already)!
- HW 2 extension to Friday
  - Meant to provide more flexibility, not more work!
  - See comments at start of last time about expectation
- HW 2 common issues
  - Segfault in binning probably means particle out of range
  - Particles too close together means either an interaction skipped or a time step too short

Review: Parallel matmul

- Basic operation: $C = C + AB$
- Computation: $2n^3$ flops
- Goal: $2n^3/p$ flops per processor, minimal communication

1D layout

- Block MATLAB notation: A(:, j) means the jth block column
- Processor j owns A(:, j), B(:, j), C(:, j)
- C(:, j) depends on all of A, but only B(:, j)
- How do we communicate pieces of A?

1D layout on bus (no broadcast)

- Everyone computes local contributions first
- P0 sends A(:,0) to each processor j in turn; processor j receives, computes A(:,0)B(0, j)
- P1 sends A(:,1) to each processor j in turn; processor j receives, computes A(:,1)B(1, j)
- P2 sends A(:,2) to each processor j in turn; processor j receives, computes A(:,2)B(2, j)

[Figure labels: Self, A(:,0), A(:,1), A(:,2)]

C(:,myproc) += A(:,myproc)*B(myproc,myproc)
for i = 0:p-1
  for j = 0:p-1
    if (i == j) continue;
    if (myproc == i)
      isend A(:,i) to processor j
    if (myproc == j)
      receive A(:,i) from i
      C(:,myproc) += A(:,i)*B(i,myproc)
  end
end

Performance model?

No overlapping communications, so in a simple $\alpha$-$\beta$ model:
- $p(p-1)$ messages
- Each message involves $n^2/p$ data
- Communication cost: $p(p-1)\alpha + (p-1)n^2\beta$

1D layout on ring

- Every process j can send data to j + 1 simultaneously
- Pass slices of A around the ring until everyone sees the whole matrix ($p - 1$ phases).

1D layout on ring

tmp = A(myproc)
C(myproc) += tmp*B(myproc,myproc)
for j = 1 to p-1
  sendrecv tmp to myproc+1 mod p,
           from myproc-1 mod p
  C(myproc) += tmp*B(myproc-j mod p, myproc)
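The ring pseudocode above can be checked with a small serial simulation. This is only a sketch under stated assumptions: the p "processes" are list entries rather than real MPI ranks, matrices are plain lists of rows, p divides n, and the names `matmul` and `ring_matmul` are illustrative, not from the slides.

```python
def matmul(X, Y):
    """Product of two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def ring_matmul(A, B, p):
    """Serially simulate C = A*B for a 1D block-column layout on a ring.

    Assumes A and B are n x n lists of rows and that p divides n.
    """
    n = len(A)
    w = n // p  # width of one block column

    def block_col(M, j):
        # M(:, j): an n x w slice
        return [row[j * w:(j + 1) * w] for row in M]

    def block(M, i, j):
        # M(i, j): a w x w block
        return [row[j * w:(j + 1) * w] for row in M[i * w:(i + 1) * w]]

    tmp = [block_col(A, q) for q in range(p)]            # tmp held by process q
    C = [[[0] * w for _ in range(n)] for _ in range(p)]  # C(:, q) on process q

    for phase in range(p):
        if phase > 0:
            # sendrecv: process q receives the slice held by process q-1
            tmp = [tmp[(q - 1) % p] for q in range(p)]
        for q in range(p):
            src = (q - phase) % p  # tmp on process q is now A(:, src)
            update = matmul(tmp[q], block(B, src, q))
            C[q] = [[c + u for c, u in zip(crow, urow)]
                    for crow, urow in zip(C[q], update)]

    # Concatenate the block columns C(:, 0), ..., C(:, p-1) row by row.
    return [[x for q in range(p) for x in C[q][r]] for r in range(n)]
```

Each process performs p local multiplies but only p - 1 exchanges, matching the phase count stated for the ring.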

Performance model?

1D layout on ring

In a simple $\alpha$-$\beta$ model, at each processor:
- $p - 1$ message sends (and simultaneous receives)
- Each message involves $n^2/p$ data
- Communication cost: $(p-1)\alpha + (1 - 1/p)n^2\beta$

Outer product algorithm

Serial: Recall outer product organization:

for k = 0:s-1
  C += A(:,k)*B(k,:);

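Parallel: Assume $p = s^2$ processors, block $s \times s$ matrices. For a $2 \times 2$ example:

$$
\begin{bmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{bmatrix}
=
\begin{bmatrix} A_{00}B_{00} & A_{00}B_{01} \\ A_{10}B_{00} & A_{10}B_{01} \end{bmatrix}
+
\begin{bmatrix} A_{01}B_{10} & A_{01}B_{11} \\ A_{11}B_{10} & A_{11}B_{11} \end{bmatrix}
$$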

- Processor for each (i, j) ⟹ parallel work for each k!
- Note everyone in row i uses A(i, k) at once, and everyone in column j uses B(k, j) at once.

Parallel outer product (SUMMA)

for k = 0:s-1
  for each i in parallel
    broadcast A(i,k) to row
  for each j in parallel
    broadcast B(k,j) to col
  On processor (i,j), C(i,j) += A(i,k)*B(k,j);
end

If we have a tree along each row/column, then
- $\log(s)$ messages per broadcast
- $\alpha + \beta n^2/s^2$ per message
- $2\log(s)(\alpha s + \beta n^2/s)$ total communication
- Compare to 1D ring: $(p-1)\alpha + (1 - 1/p)n^2\beta$

Note: Same ideas work with block size b < n/s
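A serial simulation of the SUMMA loop may make the data flow concrete. It is a sketch, not a distributed implementation: the broadcasts are modeled by handing the same block to every processor in a row or column, matrices are lists of rows, s is assumed to divide n, and the function names are illustrative.

```python
def matmul(X, Y):
    """Product of two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def summa(A, B, s):
    """Serially simulate SUMMA on an s x s processor grid (s divides n)."""
    n = len(A)
    b = n // s  # block size

    def block(M, i, j):
        # M(i, j): a b x b block
        return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]

    # C[i][j] is the block accumulated on processor (i, j).
    C = [[[[0] * b for _ in range(b)] for _ in range(s)] for _ in range(s)]

    for k in range(s):
        # "Broadcast": every processor in row i sees A(i,k),
        # and every processor in column j sees B(k,j).
        a_col = [block(A, i, k) for i in range(s)]
        b_row = [block(B, k, j) for j in range(s)]
        for i in range(s):
            for j in range(s):
                update = matmul(a_col[i], b_row[j])
                C[i][j] = [[c + u for c, u in zip(crow, urow)]
                           for crow, urow in zip(C[i][j], update)]

    # Assemble the global matrix from the per-processor blocks.
    return [[x for j in range(s) for x in C[i][j][r]]
            for i in range(s) for r in range(b)]
```

Step k is exactly one block outer product A(:,k)*B(k,:), with each processor computing its own piece of it.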

Cannon’s algorithm

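$$
\begin{bmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{bmatrix}
=
\begin{bmatrix} A_{00}B_{00} & A_{01}B_{11} \\ A_{11}B_{10} & A_{10}B_{01} \end{bmatrix}
+
\begin{bmatrix} A_{01}B_{10} & A_{00}B_{01} \\ A_{10}B_{00} & A_{11}B_{11} \end{bmatrix}
$$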

Idea: Reindex products in block matrix multiply

$$
C(i,j) = \sum_{k=0}^{s-1} A(i,k)\,B(k,j)
       = \sum_{k=0}^{s-1} A(i,\; k+i+j \bmod s)\; B(k+i+j \bmod s,\; j)
$$

For a fixed k, a given block of A (or B) is needed for the contribution to exactly one C(i, j).

Cannon’s algorithm

% Skew A: processor (i,j) ends up holding A(i,i+j)
for i = 0 to s-1
  cycle A(i,:) left by i

% Skew B: processor (i,j) ends up holding B(i+j,j)
for j = 0 to s-1
  cycle B(:,j) up by j

for k = 0 to s-1
  in parallel;
    C(i,j) = C(i,j) + A(i,j)*B(i,j);
  cycle A(i,:) left by 1
  cycle B(:,j) up by 1
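The skew-then-shift pattern of Cannon's algorithm can be sanity-checked with a serial simulation. This is a sketch under assumptions: the processor grid is a nested list, the cyclic shifts are list reindexing rather than messages, s divides n, and the names are illustrative.

```python
def matmul(X, Y):
    """Product of two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def cannon(A, B, s):
    """Serially simulate Cannon's algorithm on an s x s grid (s divides n)."""
    n = len(A)
    b = n // s  # block size

    def block(M, i, j):
        # M(i, j): a b x b block
        return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]

    # Initial skew: processor (i, j) holds A(i, i+j) and B(i+j, j).
    a = [[block(A, i, (i + j) % s) for j in range(s)] for i in range(s)]
    bl = [[block(B, (i + j) % s, j) for j in range(s)] for i in range(s)]
    C = [[[[0] * b for _ in range(b)] for _ in range(s)] for _ in range(s)]

    for _ in range(s):
        # Every processor multiplies the two blocks it currently holds.
        for i in range(s):
            for j in range(s):
                update = matmul(a[i][j], bl[i][j])
                C[i][j] = [[c + u for c, u in zip(crow, urow)]
                           for crow, urow in zip(C[i][j], update)]
        # Cycle each row of A left by 1 and each column of B up by 1.
        a = [[a[i][(j + 1) % s] for j in range(s)] for i in range(s)]
        bl = [[bl[(i + 1) % s][j] for j in range(s)] for i in range(s)]

    # Assemble the global matrix from the per-processor blocks.
    return [[x for j in range(s) for x in C[i][j][r]]
            for i in range(s) for r in range(b)]
```

After t shifts, processor (i, j) holds A(i, i+j+t) and B(i+j+t, j), so the s phases visit every term of the reindexed sum exactly once.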

Cost of Cannon

- Assume 2D torus topology
- Initial cyclic shifts: $\le s$ messages each ($\le 2s$ total)
- For each phase: 2 messages each ($2s$ total)
- Each message is size $n^2/s^2$
- Communication cost: $4s(\alpha + \beta n^2/s^2) = 4(\alpha s + \beta n^2/s)$
- This communication cost is optimal! ... but SUMMA is simpler, more flexible, almost as good
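To compare the four communication models from these slides numerically, the formulas can be coded directly. This is a sketch: the function names are illustrative, p is assumed to be a perfect square for the 2D algorithms, the tree broadcast is assumed to use log base 2, and the α and β values in the usage lines are made-up inputs, not measurements.

```python
from math import isqrt, log2


def cost_bus_1d(n, p, alpha, beta):
    """1D layout on a bus: p(p-1) messages of n^2/p words each."""
    return p * (p - 1) * alpha + (p - 1) * n**2 * beta


def cost_ring_1d(n, p, alpha, beta):
    """1D layout on a ring: p-1 messages of n^2/p words per processor."""
    return (p - 1) * alpha + (1 - 1 / p) * n**2 * beta


def cost_summa(n, p, alpha, beta):
    """SUMMA with tree broadcasts on a sqrt(p) x sqrt(p) grid."""
    s = isqrt(p)  # assumes p is a perfect square
    return 2 * log2(s) * (alpha * s + beta * n**2 / s)


def cost_cannon(n, p, alpha, beta):
    """Cannon's algorithm on a sqrt(p) x sqrt(p) torus."""
    s = isqrt(p)  # assumes p is a perfect square
    return 4 * (alpha * s + beta * n**2 / s)


# Hypothetical machine parameters, chosen only for illustration.
n, p, alpha, beta = 1024, 64, 1.0, 1.0
for name, cost in [("bus", cost_bus_1d), ("ring", cost_ring_1d),
                   ("SUMMA", cost_summa), ("Cannon", cost_cannon)]:
    print(f"{name:7s} {cost(n, p, alpha, beta):.0f}")
```

With these inputs the ordering is Cannon < SUMMA < ring < bus, though for very small p the constant factors can reverse it.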

Speedup and efficiency

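Recall

$$
\text{Speedup} := t_{\text{serial}}/t_{\text{parallel}}, \qquad
\text{Efficiency} := \text{Speedup}/p
$$

Assuming no overlap of communication and computation, efficiencies are

- 1D layout: $\left(1 + O\left(\frac{p}{n}\right)\right)^{-1}$
- SUMMA: $\left(1 + O\left(\frac{\sqrt{p}\,\log p}{n}\right)\right)^{-1}$
- Cannon: $\left(1 + O\left(\frac{\sqrt{p}}{n}\right)\right)^{-1}$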
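One way to see where these efficiency estimates come from is to combine the definitions with the communication costs derived earlier. The derivation below is a sketch with two assumptions not stated on the slides: $\gamma$ denotes the time per flop, and only the bandwidth ($\beta$) term of each communication cost is kept, with $s = \sqrt{p}$.

```latex
\begin{align*}
E &= \frac{t_{\mathrm{serial}}}{p\, t_{\mathrm{parallel}}}
   = \frac{2n^3\gamma}{p\left(2n^3\gamma/p + t_{\mathrm{comm}}\right)}
   = \left(1 + \frac{p\, t_{\mathrm{comm}}}{2n^3\gamma}\right)^{-1} \\[4pt]
\text{1D ring:}\quad
  t_{\mathrm{comm}} &\approx n^2\beta
  &&\Rightarrow\quad \frac{p\, t_{\mathrm{comm}}}{2n^3\gamma}
     = O\!\left(\frac{p}{n}\right) \\
\text{SUMMA:}\quad
  t_{\mathrm{comm}} &\approx \frac{2\log(\sqrt{p})\, n^2\beta}{\sqrt{p}}
  &&\Rightarrow\quad \frac{p\, t_{\mathrm{comm}}}{2n^3\gamma}
     = O\!\left(\frac{\sqrt{p}\,\log p}{n}\right) \\
\text{Cannon:}\quad
  t_{\mathrm{comm}} &\approx \frac{4 n^2\beta}{\sqrt{p}}
  &&\Rightarrow\quad \frac{p\, t_{\mathrm{comm}}}{2n^3\gamma}
     = O\!\left(\frac{\sqrt{p}}{n}\right)
\end{align*}
```

In each case the efficiency approaches 1 as n grows with p fixed, and the 2D algorithms tolerate a much larger p for the same n.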

Reminder: Why matrix multiply?

[Figure: LAPACK structure]

Build fast serial linear algebra (LAPACK) on top of BLAS 3.

Reminder: Why matrix multiply?

[Figure: ScaLAPACK structure, with boxes labeled ScaLAPACK, LAPACK, BLAS, and MPI]

ScaLAPACK builds additional layers on same idea.

Reminder: Evolution of LU

On board...

Blocked GEPP

Steps shown in the slide figures:
1. Find pivot
2. Swap pivot row
3. Update within block
4. Delayed update (at end of block)

Big idea

- Delayed update strategy lets us do LU fast
  - Could have also delayed application of pivots
- Same idea with other one-sided factorizations (QR)
- Can get decent multi-core speedup with parallel BLAS! ... assuming n sufficiently large.

There are still some issues left over (block size? pivoting?)...
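The delayed update strategy can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the LAPACK routine: the matrix is a square, nonsingular list of rows, pivots are applied to whole rows immediately rather than delayed, the triple loops stand in for what would be BLAS 3 calls, and the names `blocked_lu` and `lu_residual` are illustrative.

```python
def blocked_lu(A, nb):
    """Blocked GEPP sketch: returns (LU, perm) with P*A = L*U.

    LU packs the unit lower triangle L (below the diagonal) and U (on and
    above it); perm[i] is the original index of row i of P*A.
    Assumes A is square and nonsingular; nb is the block size.
    """
    n = len(A)
    LU = [list(map(float, row)) for row in A]
    perm = list(range(n))
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        for k in range(k0, k1):
            # Find pivot: largest magnitude entry in column k, rows k..n-1.
            piv = max(range(k, n), key=lambda r: abs(LU[r][k]))
            # Swap pivot row (applied to the whole row right away).
            LU[k], LU[piv] = LU[piv], LU[k]
            perm[k], perm[piv] = perm[piv], perm[k]
            # Update within the block column only.
            for i in range(k + 1, n):
                LU[i][k] /= LU[k][k]
                for j in range(k + 1, k1):
                    LU[i][j] -= LU[i][k] * LU[k][j]
        # Delayed update, part 1: block row U12 = L11^{-1} A12.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    LU[i][j] -= LU[i][k] * LU[k][j]
        # Delayed update, part 2: A22 -= L21 * U12 (a matrix multiply).
        for i in range(k1, n):
            for j in range(k1, n):
                LU[i][j] -= sum(LU[i][k] * LU[k][j] for k in range(k0, k1))
    return LU, perm


def lu_residual(A, LU, perm):
    """Largest entry of |P*A - L*U| for the packed factors."""
    n = len(A)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            lu_ij = sum(LU[i][k] * LU[k][j] for k in range(min(i, j + 1)))
            if i <= j:
                lu_ij += LU[i][j]  # unit diagonal of L times U[i][j]
            worst = max(worst, abs(A[perm[i]][j] - lu_ij))
    return worst
```

Part 2 of the delayed update is where almost all of the flops land for large n, which is why building on fast matrix multiply pays off.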

Explicit parallelization of GE

What to do:
- Decompose into work chunks
- Assign work to threads in a balanced way
- Orchestrate the communication and synchronization
- Map which processors execute which threads

Possible matrix layouts

1D column blocked: bad load balance

0 0 0 1 1 1 2 2 2
0 0 0 1 1 1 2 2 2
0 0 0 1 1 1 2 2 2
0 0 0 1 1 1 2 2 2
0 0 0 1 1 1 2 2 2
0 0 0 1 1 1 2 2 2
0 0 0 1 1 1 2 2 2
0 0 0 1 1 1 2 2 2
0 0 0 1 1 1 2 2 2

1D column cyclic: hard to use BLAS2/3

0 1 2 0 1 2 0 1 2
0 1 2 0 1 2 0 1 2
0 1 2 0 1 2 0 1 2
0 1 2 0 1 2 0 1 2
0 1 2 0 1 2 0 1 2
0 1 2 0 1 2 0 1 2
0 1 2 0 1 2 0 1 2
0 1 2 0 1 2 0 1 2
0 1 2 0 1 2 0 1 2

1D column block cyclic: block column factorization a bottleneck

0 0 1 1 2 2 0 0 1 1
0 0 1 1 2 2 0 0 1 1
0 0 1 1 2 2 0 0 1 1
0 0 1 1 2 2 0 0 1 1
0 0 1 1 2 2 0 0 1 1
0 0 1 1 2 2 0 0 1 1
0 0 1 1 2 2 0 0 1 1
0 0 1 1 2 2 0 0 1 1
0 0 1 1 2 2 0 0 1 1
0 0 1 1 2 2 0 0 1 1

Block skewed: indexing gets messy

0 0 0 1 1 1 2 2 2
0 0 0 1 1 1 2 2 2
0 0 0 1 1 1 2 2 2
2 2 2 0 0 0 1 1 1
2 2 2 0 0 0 1 1 1
2 2 2 0 0 0 1 1 1
1 1 1 2 2 2 0 0 0
1 1 1 2 2 2 0 0 0
1 1 1 2 2 2 0 0 0

2D block cyclic:

0 0 1 1 0 0 1 1
0 0 1 1 0 0 1 1
2 2 3 3 2 2 3 3
2 2 3 3 2 2 3 3
0 0 1 1 0 0 1 1
0 0 1 1 0 0 1 1
2 2 3 3 2 2 3 3
2 2 3 3 2 2 3 3

- 1D column blocked: bad load balance
- 1D column cyclic: hard to use BLAS2/3
- 1D column block cyclic: factoring column is a bottleneck
- Block skewed (a la Cannon): just complicated
- 2D row/column block: bad load balance
- 2D row/column block cyclic: win!
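The layouts pictured above are all instances of one index map from a matrix entry to its owning processor. The sketch below assumes processors are numbered row by row on a pr x pc grid; the function names and parameters are illustrative and not taken from ScaLAPACK.

```python
def owner_block_cyclic(i, j, b, pr, pc):
    """Rank of the processor owning entry (i, j).

    b is the block size; the pr x pc processor grid is numbered row by row.
    """
    return ((i // b) % pr) * pc + ((j // b) % pc)


def layout(n_rows, n_cols, b, pr, pc):
    """Owner of every entry, as a list of rows."""
    return [[owner_block_cyclic(i, j, b, pr, pc) for j in range(n_cols)]
            for i in range(n_rows)]


# 2D block cyclic layout with 2 x 2 blocks on a 2 x 2 processor grid.
for row in layout(8, 8, b=2, pr=2, pc=2):
    print(" ".join(map(str, row)))
```

Setting pr = 1 recovers the 1D column layouts: b = 1 gives column cyclic, and a block size of n/pc gives column blocked.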

Distributed GEPP

Steps shown in the slide figures:
1. Find pivot (column broadcast)
2. Swap pivot row within block column + broadcast pivot
3. Update within block column
4. At end of block, broadcast swap info along rows
5. Apply all row swaps to other columns
6. Broadcast block $L_{II}$ right
7. Update remainder of block row
8. Broadcast rest of block row down
9. Broadcast rest of block col right
10. Update of trailing submatrix

Cost of ScaLAPACK GEPP

Communication costs:
- Lower bound: $O(n^2/\sqrt{P})$ words, $O(\sqrt{P})$ messages
- ScaLAPACK:
  - $O(n^2 \log P/\sqrt{P})$ words sent
  - $O(n \log P)$ messages
  - Problem: reduction to find pivot in each column
- Recent research on stable variants without partial pivoting

Onward!

Next up: Sparse linear algebra and iterative solvers!
