
Page 1: 2/19/97 Horst D. Simon cs.berkeley/cs267


CS 267 Applications of Parallel Processors

Lecture 9: Computational Electromagnetics - Large Dense Linear Systems

2/19/97

Horst D. Simon

http://www.cs.berkeley.edu/cs267

Page 2: 2/19/97 Horst D. Simon cs.berkeley/cs267


Outline - Lecture 9

- Computational Electromagnetics

- Sources of large dense linear systems

- Review of solution of linear systems with Gaussian elimination

- BLAS and memory hierarchy for linear algebra kernels

Page 3: 2/19/97 Horst D. Simon cs.berkeley/cs267


Outline - Lecture 10

- Layout of matrices on distributed memory machines

- Distributed Gaussian elimination

- Speeding up with advanced algorithms

- LINPACK and LAPACK

- LINPACK benchmark

- Tflops result

Page 4: 2/19/97 Horst D. Simon cs.berkeley/cs267


Outline - Lecture 11

- Designing portable libraries for parallel machines

- BLACS

- ScaLAPACK for dense linear systems

- other linear algebra algorithms in ScaLAPACK

Page 5: 2/19/97 Horst D. Simon cs.berkeley/cs267


Computational Electromagnetics

- developed during the 1980s, driven by defense applications

- determine the RCS (radar cross section) of an airplane

- reduce the signature of a plane (stealth technology)

- other applications include antenna design and medical equipment

- two fundamental numerical approaches: the method of moments (MOM, frequency domain) and finite differences (time domain)

Page 6: 2/19/97 Horst D. Simon cs.berkeley/cs267


Computational Electromagnetics

image: NW Univ. Comp. Electromagnetics Laboratory http://nueml.ece.nwu.edu/

- discretize the surface into triangular facets using standard modeling tools

- the amplitudes of the currents on the surface are the unknowns

- the integral equation is discretized into a set of linear equations

Page 7: 2/19/97 Horst D. Simon cs.berkeley/cs267


Computational Electromagnetics (MOM)

After discretization, the integral equation has the form

Z J = V

where Z is the impedance matrix, J is the unknown vector of amplitudes, and V is the excitation vector.

Z is given as a four-dimensional integral.

(see Cwik, Patterson, and Scott, "Electromagnetic Scattering on the Intel Touchstone Delta," IEEE Supercomputing ’92, pp. 538-542)

Page 8: 2/19/97 Horst D. Simon cs.berkeley/cs267


Computational Electromagnetics (MOM)

The main steps in the solution process are:

A) computing the matrix elements

B) factoring the dense matrix

C) solving for one or more excitations

D) computing the fields scattered from the object
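Steps B and C have a factor-once, solve-many structure: the dense impedance matrix is factored a single time and the triangular factors are reused for every excitation vector. A minimal Python/SciPy sketch of that structure (the matrix and excitations below are random stand-ins, not a real MOM discretization):

import numpy as np
from scipy.linalg import lu_factor, lu_solve

n = 500                                    # number of surface unknowns (illustrative)
rng = np.random.default_rng(0)

# Stand-in for step A: a dense complex "impedance" matrix and a few excitation vectors.
Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
excitations = [rng.standard_normal(n) + 0j for _ in range(4)]

lu, piv = lu_factor(Z)                                    # step B: O(n**3) dense factorization
currents = [lu_solve((lu, piv), V) for V in excitations]  # step C: O(n**2) per excitation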

Page 9: 2/19/97 Horst D. Simon cs.berkeley/cs267


Analysis of MOM for Parallel Implementation

Task          Work      Parallelism        Parallel Speed
Fill          O(n**2)   embarrassing       low
Factor        O(n**3)   moderately diff.   very high
Solve         O(n**2)   moderately diff.   high
Field Calc.   O(n)      embarrassing       high

For most scientific applications the biggest gain in performance can be obtained by focusing on one task.

Page 10: 2/19/97 Horst D. Simon cs.berkeley/cs267


Results for Parallel Implementation on Delta

Task          Time (hours)   Performance (Gflop/s)
Fill              9.20              ~1.0
Factor            8.25             10.35
Solve             2.17                -
Field Calc.       0.12              3.0

The problem solved was for a matrix of size 48,672. (The world record in 1991.)

Page 11: 2/19/97 Horst D. Simon cs.berkeley/cs267


Current Records for Solving Dense Systems

Year     System Size   Machine
1950's   O(100)
1991     55,296        CM-2
1992     75,264        Intel
1993     75,264        Intel
1994     76,800        CM-5
1995     128,600       Paragon XP
1996     215,000       ASCI Red

source: Alan Edelman http://www-math.mit.edu/~edelman/records.html

Page 12: 2/19/97 Horst D. Simon cs.berkeley/cs267


Sources for large dense linear systems

- not many outside CEM

- even within the CEM community, alternatives such as FD-TD are heavily debated

In many instances, choices of algorithms or methods in existing scientific codes or applications are not the result of careful planning and design. At best they reflect the state of the art at the time; at worst they are purely coincidental.

Page 13: 2/19/97 Horst D. Simon cs.berkeley/cs267


Review of Gaussian Elimination

see Demmel http://HTTP.CS.Berkeley.EDU/~demmel/cs267/lecture12/lecture12.html

Gaussian elimination to solve Ax = b:
- start with a dense matrix
- add multiples of each row to subsequent rows in order to create zeros below the diagonal
- end up with an upper triangular matrix U
- solve the linear system with U by substitution, starting with the last variable

Page 14: 2/19/97 Horst D. Simon cs.berkeley/cs267


Review of Gaussian Elimination (cont.)

... for each column i,
... zero it out below the diagonal by
... adding multiples of row i to later rows
for i = 1 to n-1
   ... for each row j below row i
   for j = i+1 to n
      ... add a multiple of row i to row j
      for k = i to n
         A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

Page 15: 2/19/97 Horst D. Simon cs.berkeley/cs267


Review of Gaussian Elimination (cont.)

Page 16: 2/19/97 Horst D. Simon cs.berkeley/cs267


Review of Gaussian Elimination (cont.)

... for each column i,
... zero it out below the diagonal by
... adding multiples of row i to later rows
for i = 1 to n-1
   ... for each row j below row i
   for j = i+1 to n
      ... add a multiple of row i to row j
      for k = i to n
         A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

The quotient A(j,i)/A(i,i) is the multiplier m.

Page 17: 2/19/97 Horst D. Simon cs.berkeley/cs267


Review of Gaussian Elimination (cont.)

for i = 1 to n-1
   for j = i+1 to n
      m = A(j,i)/A(i,i)
      for k = i+1 to n
         A(j,k) = A(j,k) - m * A(i,k)

The inner loop over k now starts at i+1, avoiding computation of the matrix entry in column i that is already known to become zero.

Page 18: 2/19/97 Horst D. Simon cs.berkeley/cs267


Review of Gaussian Elimination (cont.)

It will be convenient to store the multipliers m in the implicitly created zeros below the diagonal, so we can use them later to transform the right hand side b:

for i = 1 to n-1
   for j = i+1 to n
      A(j,i) = A(j,i)/A(i,i)
   for j = i+1 to n
      for k = i+1 to n
         A(j,k) = A(j,k) - A(j,i) * A(i,k)

Page 19: 2/19/97 Horst D. Simon cs.berkeley/cs267


Review of Gaussian Elimination (cont.)

Now we use Matlab (data parallel) notation to express

the algorithm even more compactly:

for i = 1 to n-1
   A(i+1:n, i) = A(i+1:n, i) / A(i,i)
   A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i)*A(i, i+1:n)

The inner loop consists of one vector operation and one matrix-vector (rank-one update) operation. Note that the loop looks elegant, but is no longer intuitive.
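For readers who prefer running code, a direct NumPy transcription of this vectorized loop (no pivoting, purely illustrative; the function name is mine):

import numpy as np

def lu_in_place(A):
    # Unpivoted Gaussian elimination; multipliers overwrite the zeroed entries.
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                              # vector operation
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])  # rank-one update
    return A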

Page 20: 2/19/97 Horst D. Simon cs.berkeley/cs267


Review of Gaussian Elimination (cont.)

Page 21: 2/19/97 Horst D. Simon cs.berkeley/cs267


Review of Gaussian Elimination (cont.)

Lemma. (LU Factorization). If the above algorithm terminates (i.e. it did not try to divide by zero) then A = L*U.

Now we can state our complete algorithm for solving A*x = b:

1) Factorize A = L*U.
2) Solve L*y = b for y by forward substitution.
3) Solve U*x = y for x by backward substitution.

Then x is the solution we seek because A*x = L*(U*x) = L*y = b.
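Steps 2 and 3 written out as explicit substitution loops, assuming L is unit lower triangular and U is upper triangular (a teaching sketch; library solvers are blocked and BLAS-based):

import numpy as np

def forward_sub(L, b):
    # Solve L*y = b for y, with L unit lower triangular.
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = b[i] - L[i, :i] @ y[:i]
    return y

def backward_sub(U, y):
    # Solve U*x = y for x, with U upper triangular.
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x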

Page 22: 2/19/97 Horst D. Simon cs.berkeley/cs267


Review of Gaussian Elimination (cont.)

Here are some obvious problems with this algorithm, which we need to address:

- If A(i,i) is zero, the algorithm cannot proceed. If A(i,i) is tiny, we will also have numerical problems.

- The majority of the work is done by a rank-one update, which does not exploit a memory hierarchy as well as an operation like matrix-matrix multiplication.

Page 23: 2/19/97 Horst D. Simon cs.berkeley/cs267


Pivoting for Small A(i,i)

Why is pivoting needed?

A = [ 0  1 ]
    [ 1  0 ]

Even if A(i,i) is tiny but not zero, difficulties can arise (see the example in Jim Demmel’s lecture notes).

This problem is resolved by partial pivoting.

Page 24: 2/19/97 Horst D. Simon cs.berkeley/cs267


Partial Pivoting

Reorder the rows of A so that A(i,i) is large at each step of the algorithm: at step i, row i is swapped with row k > i if |A(k,i)| is the largest entry among |A(i:n,i)|.

for i = 1 to n-1
   find and record k where |A(k,i)| = max_{i<=j<=n} |A(j,i)|
   if |A(k,i)| = 0, exit with a warning that A is singular, or nearly so
   if i != k, swap rows i and k of A
   A(i+1:n, i) = A(i+1:n, i) / A(i,i)   ... each quotient lies in [-1,1]
   A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i)*A(i, i+1:n)
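The same algorithm as a runnable NumPy function (a sketch; it swaps rows explicitly and records the swaps rather than using indirect addressing, and raises an error instead of printing a warning):

import numpy as np

def lu_partial_pivoting(A):
    n = A.shape[0]
    swaps = []
    for i in range(n - 1):
        k = i + int(np.argmax(np.abs(A[i:, i])))       # pivot row
        if A[k, i] == 0:
            raise ValueError("A is singular, or nearly so")
        if k != i:
            A[[i, k], :] = A[[k, i], :]                # swap rows i and k
            swaps.append((i, k))
        A[i+1:, i] /= A[i, i]                          # each quotient lies in [-1, 1]
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
    return A, swaps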

Page 25: 2/19/97 Horst D. Simon cs.berkeley/cs267


Partial Pivoting (cont.)

- for the 2-by-2 example, we get a very accurate answer
- there are several choices as to when to swap rows i and k
- we could use indirect addressing and not swap rows at all, but this would be slow
- if we keep the permutation, then solving A*x=b only requires the additional step of permuting b

Page 26: 2/19/97 Horst D. Simon cs.berkeley/cs267


Fast linear algebra kernels: BLAS

- Simple linear algebra kernels such as matrix-matrix multiply (exercise) can be performed fast on memory hierarchies.
- More complicated algorithms can be built from some very basic building blocks and kernels.
- The interfaces of these kernels have been standardized as the Basic Linear Algebra Subprograms, or BLAS.
- Early agreement on a standard interface (around 1980) led to portable libraries for vector and shared-memory parallel machines.
- The BLAS are classified into three categories: level 1, 2, and 3.

see Demmel http://HTTP.CS.Berkeley.EDU/~demmel/cs267/lecture02.html

Page 27: 2/19/97 Horst D. Simon cs.berkeley/cs267


Level 1 BLAS

Operate mostly on vectors (1D arrays), or pairs of vectors; perform O(n) operations; return either a vector or a scalar. Examples:

saxpy: y(i) = a * x(i) + y(i), for i = 1 to n. Saxpy is an acronym for the operation: s stands for single precision; daxpy is the double precision version, caxpy complex, and zaxpy double complex.

sscal: y = a * x

srot: replaces vectors x and y by c*x + s*y and -s*x + c*y, where c and s are typically a cosine and sine.

sdot: computes s = sum_{i=1}^n x(i)*y(i)
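What these level 1 kernels compute, written as plain NumPy expressions rather than calls into a BLAS library (illustrative only):

import numpy as np

n = 5
a, c, s = 2.0, 0.8, 0.6
x, y = np.arange(1.0, n + 1), np.ones(n)

y = a * x + y                            # saxpy: y <- a*x + y
x = a * x                                # sscal: scale a vector by a
x, y = c * x + s * y, -s * x + c * y     # srot:  plane rotation of the pair (x, y)
dot = float(x @ y)                       # sdot:  sum_i x(i)*y(i)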

Page 28: 2/19/97 Horst D. Simon cs.berkeley/cs267


Level 2 BLAS

operate mostly on a matrix (2D array) and a vector; return a matrix or a vector; O(n^2) operations.Examples.

sgemv: matrix-vector multiplication; computes y = y + A*x, where A is m-by-n, x is n-by-1, and y is m-by-1.

sger: rank-one update; computes A = A + y*x', where A is m-by-n, y is m-by-1, x is n-by-1, and x' is the transpose of x. This is a short way of saying A(i,j) = A(i,j) + y(i)*x(j) for all i,j.

strsv: triangular solve; solves T*x = y for x, where T is a triangular matrix.
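The same level 2 operations as NumPy/SciPy stand-ins (solve_triangular plays the role of strsv; purely illustrative):

import numpy as np
from scipy.linalg import solve_triangular

m, n = 4, 3
A = np.ones((m, n))
x = np.arange(1.0, n + 1)
y = np.arange(1.0, m + 1)

y = y + A @ x                                 # sgemv: matrix-vector multiply and add
A = A + np.outer(y, x)                        # sger:  rank-one update
T = np.triu(np.ones((n, n))) + np.eye(n)      # upper triangular T
xs = solve_triangular(T, np.ones(n))          # strsv: solve T*x = y for x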

Page 29: 2/19/97 Horst D. Simon cs.berkeley/cs267


Level 3 BLAS

operate on pairs or triples of matrices, returning a matrix; complexity is O(n**3).

Examples:

sgemm: matrix-matrix multiplication; computes C = C + A*B, where C is m-by-n, A is m-by-k, and B is k-by-n.

strsm: multiple triangular solve; solves T*X = Y for X, where T is a triangular matrix and X is a rectangular matrix.
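And the level 3 operations, again as plain NumPy/SciPy stand-ins for what sgemm and strsm compute:

import numpy as np
from scipy.linalg import solve_triangular

m, n, k = 4, 3, 5
A, B, C = np.ones((m, k)), np.ones((k, n)), np.zeros((m, n))

C = C + A @ B                                          # sgemm: matrix-matrix multiply and add
T = np.tril(np.ones((m, m))) + np.eye(m)               # lower triangular T
X = solve_triangular(T, np.ones((m, n)), lower=True)   # strsm: solve T*X = Y for the matrix X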

Page 30: 2/19/97 Horst D. Simon cs.berkeley/cs267


Performance of BLAS

[Figure: measured Mflop rates for Level 1, Level 2, and Level 3 BLAS routines]

Page 31: 2/19/97 Horst D. Simon cs.berkeley/cs267


Performance of BLAS (cont.)

- The BLAS are specially optimized by the vendor (IBM) to take advantage of all features of the RS 6000/590.
- Potentially a big speed advantage if an algorithm can be expressed in terms of the BLAS3 instead of BLAS2 or BLAS1.
- The top speed of the BLAS3, about 250 Mflops, is very close to the peak machine speed of 266 Mflops.
- We will reorganize algorithms, like Gaussian elimination, so that they use BLAS3 rather than BLAS1 or BLAS2.

Page 32: 2/19/97 Horst D. Simon cs.berkeley/cs267


Explanation of Performance of BLAS

m = number of memory references to slow memory (read + write)
f = number of floating point operations
q = f/m = average number of flops per slow memory reference

          m           justification for m               f        q
-------------------------------------------------------------------
saxpy     3*n         read x(i), y(i); write y(i)       2*n      2/3
sgemv     n^2+O(n)    read each A(i,j) once             2*n^2    2
sgemm     4*n^2       read A(i,j), B(i,j), C(i,j);      2*n^3    n/2
                      write C(i,j) once

Page 33: 2/19/97 Horst D. Simon cs.berkeley/cs267


CS 267 Applications of Parallel Processors

Lecture 10: Large Dense Linear Systems - Distributed Implementations

2/21/97

Horst D. Simon

http://www.cs.berkeley.edu/cs267

Page 34: 2/19/97 Horst D. Simon cs.berkeley/cs267


Review - Lecture 9

- computational electromagnetics and linear systems

- rewrote Gaussian elimination as vector and matrix-vector operations (level 2 BLAS)

- discussed the efficiency of level 3 BLAS in terms of reducing the number of memory accesses

Page 35: 2/19/97 Horst D. Simon cs.berkeley/cs267


Outline - Lecture 10

- Layout of matrices on distributed memory machines

- Distributed Gaussian elimination

- Speeding up with advanced algorithms

- LINPACK and LAPACK

- LINPACK benchmark

- Tflops result

Page 36: 2/19/97 Horst D. Simon cs.berkeley/cs267


Review of Gaussian Elimination

Now we use Matlab (data parallel) notation to express

the algorithm even more compactly:

for i = 1 to n-1
   A(i+1:n, i) = A(i+1:n, i) / A(i,i)
   A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i)*A(i, i+1:n)

The inner loop consists of one vector operation and one matrix-vector (rank-one update) operation. Note that the loop looks elegant, but is no longer intuitive.

Page 37: 2/19/97 Horst D. Simon cs.berkeley/cs267


Review of Gaussian Elimination (cont.)

Page 38: 2/19/97 Horst D. Simon cs.berkeley/cs267


Partial Pivoting

Reorder the rows of A so that A(i,i) is large at each step of the algorithm: at step i, row i is swapped with row k > i if |A(k,i)| is the largest entry among |A(i:n,i)|.

for i = 1 to n-1
   find and record k where |A(k,i)| = max_{i<=j<=n} |A(j,i)|
   if |A(k,i)| = 0, exit with a warning that A is singular, or nearly so
   if i != k, swap rows i and k of A
   A(i+1:n, i) = A(i+1:n, i) / A(i,i)   ... each quotient lies in [-1,1]
   A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i)*A(i, i+1:n)

Page 39: 2/19/97 Horst D. Simon cs.berkeley/cs267


How to Use Level 3 BLAS ?

The current algorithm only uses level 1 and level 2 BLAS.

Want to use level 3 BLAS because of higher performance.

The standard technique is called blocking or delayed updating.

We want to save up a sequence of level 2 operations and do them all at once.

Page 40: 2/19/97 Horst D. Simon cs.berkeley/cs267


How to Use Level 3 BLAS in LU Decomposition

- process the matrix in blocks of b columns at a time

- b is called the block size.

- do a complete LU decomposition just of the b columns in the current block, essentially using the above BLAS2 code

- then update the remainder of the matrix by doing the b rank-one updates all at once, which turns out to be a single matrix-matrix multiplication with inner dimension b

Page 41: 2/19/97 Horst D. Simon cs.berkeley/cs267


Block GE with Level 3 BLAS

Page 42: 2/19/97 Horst D. Simon cs.berkeley/cs267


Block GE with Level 3 BLAS

Gaussian elimination with Partial Pivoting, BLAS3 implementation

... process matrix b columns at a time
for ib = 1 to n-1 step b
   ... point to end of block of b columns
   end = min(ib+b-1, n)

   ... LU factorize A(ib:n, ib:end) with BLAS2
   for i = ib to end
      find and record k where |A(k,i)| = max_{i<=j<=n} |A(j,i)|
      if |A(k,i)| = 0, exit with a warning that A is singular, or nearly so
      if i != k, swap rows i and k of A
      A(i+1:n, i) = A(i+1:n, i) / A(i,i)
      ... only update columns i+1 to end
      A(i+1:n, i+1:end) = A(i+1:n, i+1:end) - A(i+1:n, i)*A(i, i+1:end)
   endfor

Page 43: 2/19/97 Horst D. Simon cs.berkeley/cs267


Block GE with Level 3 BLAS (cont.)

   ... Let LL be the b-by-b lower triangular matrix whose
   ... subdiagonal entries are stored in A(ib:end, ib:end),
   ... with 1s on the diagonal. Do the delayed update of
   ... A(ib:end, end+1:n) by solving n-end triangular systems
   A(ib:end, end+1:n) = LL \ A(ib:end, end+1:n)

   ... do the delayed update of the rest of the matrix
   ... using matrix-matrix multiplication
   A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end)*A(ib:end, end+1:n)

endfor
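Both halves of the pseudocode above fit in a compact NumPy/SciPy sketch, assuming the matrix lives in one node's memory; solve_triangular stands in for strsm and the final @ is the single large matrix-matrix multiply (sgemm):

import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, b=64):
    # Blocked right-looking LU with partial pivoting; L and U overwrite A.
    # A sketch of the algorithm on these slides, not LAPACK's getrf.
    n = A.shape[0]
    swaps = []
    for ib in range(0, n, b):
        end = min(ib + b, n)                  # one past the last column of this block
        # LU factorize the panel A[ib:n, ib:end] with the BLAS2-style algorithm.
        for i in range(ib, end):
            k = i + int(np.argmax(np.abs(A[i:, i])))
            if A[k, i] == 0:
                raise ValueError("A is singular, or nearly so")
            if k != i:
                A[[i, k], :] = A[[k, i], :]   # swap full rows i and k
                swaps.append((i, k))
            A[i+1:, i] /= A[i, i]
            A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])  # update panel columns only
        if end < n:
            # Delayed update of the block row: solve LL * X = A[ib:end, end:n] (strsm-like).
            A[ib:end, end:] = solve_triangular(A[ib:end, ib:end], A[ib:end, end:],
                                               lower=True, unit_diagonal=True)
            # Delayed update of the trailing matrix: one big matrix-matrix multiply (sgemm-like).
            A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
    return A, swaps

With b = 1 this collapses to the unblocked code; for n much larger than b nearly all of the floating point work lands in the final multiply.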

Page 44: 2/19/97 Horst D. Simon cs.berkeley/cs267


Block GE with Level 3 BLAS (cont.)

- LU factorization of A(ib:n, ib:end) uses the same algorithm as before (level 2 BLAS).
- Solving a system of n-end equations with triangular coefficient matrix LL is a single call to a BLAS3 subroutine (strsm) designed for that purpose.
- No work or data motion is required to refer to LL; it is done with a pointer.
- When n >> b, almost all the work is done in the final line, which multiplies an (n-end)-by-b matrix times a b-by-(n-end) matrix in a single BLAS3 call (to sgemm).

Page 45: 2/19/97 Horst D. Simon cs.berkeley/cs267


How to select b?

b will be chosen in a machine-dependent way to maximize performance. A good value of b will have the following properties:

- b is small enough so that the b columns currently being LU-factorized fit in the fast memory (cache, say) of the machine.

- b is large enough to make matrix-matrix multiplication fast.

Page 46: 2/19/97 Horst D. Simon cs.berkeley/cs267


LINPACK - LAPACK - ScaLAPACK

LINPACK - linear systems and least squares problems; level 1 BLAS - late 70s

LAPACK - redesign of LINPACK to include eigenvalue software; level 3 BLAS, for vector and shared-memory parallel machines - late 80s

ScaLAPACK - scalable LAPACK, based on the BLACS for communication; for distributed-memory machines - mid 90s

Page 47: 2/19/97 Horst D. Simon cs.berkeley/cs267


Efficiency on Cray C90

Page 48: 2/19/97 Horst D. Simon cs.berkeley/cs267


Comparison of Different Machines

Machine                #Procs   Clock (MHz)   Peak (Mflops)   Block Size b
---------------------------------------------------------------------------
Convex C4640              1         135             810            64
Convex C4640              4         135            3240            64
Cray C90                  1         240             952           128
Cray C90                 16         240           15238           128
DEC Alpha 3000-500X       1         200             200            32
IBM RS 6000/590           1          66             264            64
SGI Power Challenge       1          75             300            64

Page 49: 2/19/97 Horst D. Simon cs.berkeley/cs267


Efficiency of LAPACK LU, for n=1000

Page 50: 2/19/97 Horst D. Simon cs.berkeley/cs267


Efficiency of LAPACK LU, for n=1000

LU factorization is almost as efficient as matrix-matrix multiply on most machines, except on the C90 (16 processors). (Why?)

LAPACK LU is almost as good as the best vendor effort. There is a trade-off between performance and portability.

Vendors place a premium on LU performance - why?

Page 51: 2/19/97 Horst D. Simon cs.berkeley/cs267


LINPACK Benchmark

- named after the LINPACK package
- originally consisted of timings for 100-by-100 matrices; no vendor optimization (code changes) permitted
- an interesting historical record, with literally every machine for the last 2 decades listed in decreasing order of speed, from the largest supercomputers to a hand-held calculator
- as machines grew faster, 1000-by-1000 matrices were introduced (all code changes allowed)
- a third benchmark was added for large parallel machines, which measured their speed on the largest linear system that would fit in memory, as well as the size of the system required to get half the Mflop rate of the largest matrix

Page 52: 2/19/97 Horst D. Simon cs.berkeley/cs267


LINPACK Benchmark

Computer                                      Num_Procs   Rmax(GFlops)   Nmax(order)   N1/2(order)   Rpeak(GFlops)
-------------------------------------------------------------------------------------------------------------------
Intel ASCI Option Red (200 MHz Pentium Pro)      7264         1068.         215000        53400          1453
CP-PACS* (150 MHz PA-RISC based CPU)             2048          368.2        103680        30720           614
Intel Paragon XP/S MP (50 MHz OS=SUNMOS)         6768          281.1        128600        25700           338
Intel Paragon XP/S MP (50 MHz OS=SUNMOS)         6144          256.2        122500        24300           307
Numerical Wind Tunnel* (9.5 ns)                   167          229.7         66132        18018           281
Intel Paragon XP/S MP (50 MHz OS=SUNMOS)         5376          223.6        114500        22900           269
HITACHI SR2201/1024 (150 MHz)                    1024          220.4        138240        34560           307
Fujitsu VPP500/153 (10 nsec)                      153          200.6         62730        17000           245
Numerical Wind Tunnel* (9.5 ns)                   140          195.0         60480        15730           236
Intel Paragon XP/S MP (50 MHz OS=SUNMOS)         4608          191.5        106000        21000           230
Numerical Wind Tunnel* (9.5 ns)                   128          179.2         56832        14800           216

Page 53: 2/19/97 Horst D. Simon cs.berkeley/cs267


Efficiency of LAPACK LU, for n=100

Page 54: 2/19/97 Horst D. Simon cs.berkeley/cs267


Data Layouts for Distributed Memory Machines

The two main issues in choosing a data layout for Gaussian elimination are

1) load balance, or splitting the work reasonably evenly among the processors

2) the ability to use the BLAS3 during computations on a single processor, to account for the memory hierarchy on each processor

Several layouts will be discussed here. All of these are part of HPF. Solving linear systems served as a prototype for these designs.

Page 55: 2/19/97 Horst D. Simon cs.berkeley/cs267


Gaussian Elimination using BLAS 3

Page 56: 2/19/97 Horst D. Simon cs.berkeley/cs267


Column Blocked

column i is stored on processor floor(i/c), where c = ceiling(n/p) is the maximum number of columns stored per processor

does not permit good load balancing: after the first c columns have been computed, processor 0 is idle

the row blocked layout has a similar problem

(figure: n=16 and p=4)

Page 57: 2/19/97 Horst D. Simon cs.berkeley/cs267


Column Cyclic

each processor owns approximately 1/p-th of the square southeast corner of the matrix

good load balance

because single columns are stored rather than blocks, we cannot use the BLAS3 to update

the transpose of this layout, the Row Cyclic Layout, has a similar problem

Page 58: 2/19/97 Horst D. Simon cs.berkeley/cs267


Column Block Cyclic

choose a block size b, divide the columns into groups of size b, distribute these groups cyclically

for b > 1, slightly worse load balance than the Column Cyclic Layout; can use the BLAS2 and BLAS3

for b < c, better load balance than the Column Blocked Layout, but the BLAS can only be called on smaller subproblems, taking less advantage of the local memory hierarchy

disadvantage: the factorization of A(ib:n, ib:end) may take place on just one processor, a possible serial bottleneck

(figure: n=16, p=4, and b=2; b is not necessarily the BLAS3 block size)
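The three 1D column layouts differ only in which processor owns column i. A small sketch (0-based indices; p processors, n columns, block size b; function names are mine):

from math import ceil

def owner_blocked(i, n, p):
    c = ceil(n / p)              # columns per processor
    return i // c

def owner_cyclic(i, p):
    return i % p

def owner_block_cyclic(i, p, b):
    return (i // b) % p

# n=16, p=4, b=2, as in the slide figures:
#   blocked      -> 0000 1111 2222 3333
#   cyclic       -> 0123 0123 0123 0123
#   block cyclic -> 0011 2233 0011 2233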

Page 59: 2/19/97 Horst D. Simon cs.berkeley/cs267


Row and Column Block Cyclic

processors and matrix blocks are distributed in a 2D array

pcol-fold parallelism in any column, and calls to the BLAS2 and BLAS3 on matrices of size brow-by-bcol

the serial bottleneck is eased

the layout need not be symmetric in rows and columns
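The 2D layout applies the same block cyclic map independently to rows and columns. A sketch of the owner of entry (i, j) on a prow-by-pcol processor grid (0-based; the helper function is hypothetical):

def owner_2d(i, j, brow, bcol, prow, pcol):
    # Processor grid coordinates owning matrix entry (i, j) in a 2D block cyclic layout.
    return ((i // brow) % prow, (j // bcol) % pcol)

# Example: a 2-by-2 processor grid with 2-by-2 blocks.
print(owner_2d(5, 6, brow=2, bcol=2, prow=2, pcol=2))   # (0, 1)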

Page 60: 2/19/97 Horst D. Simon cs.berkeley/cs267


Skewered Block

each row and each column is shared among all p processors

so p-fold parallelism is available for any row operation or any column operation

in contrast, the 2D block cyclic layout can have at most sqrt(p)-fold parallelism in all the rows and all the columns

not useful for Gaussian elimination, but useful in a variety of other matrix operations

Page 61: 2/19/97 Horst D. Simon cs.berkeley/cs267


Distributed GE with a 2D Block Cyclic Layout

block size b in the algorithm and the block sizes brow and bcol in the layout satisfy b=brow=bcol.

shaded regions indicate busy processors or communication performed.

it is unnecessary to have a barrier between each step of the algorithm; e.g., steps 9, 10, and 11 can be pipelined

Page 62: 2/19/97 Horst D. Simon cs.berkeley/cs267


Distributed GE with a 2D Block Cyclic Layout

Page 63: 2/19/97 Horst D. Simon cs.berkeley/cs267


Page 64: 2/19/97 Horst D. Simon cs.berkeley/cs267


ScaLAPACK LU Performance Results

Page 65: 2/19/97 Horst D. Simon cs.berkeley/cs267


Teraflop/s Performance Result

“Sorry for the delay in responding. The system had about 7000 200Mhz Pentium Pro Processors. It solved a 64bit real matrix of size 216000. It did not use Strassen. The algorithm was basically the same that Robert van de Geijn used on the Delta years ago. It does a 2D block cyclic map of the matrix and requires a power of 2 number of nodes in the vertical direction. The basic block size was 64x64. A custom dual processor matrix multiply was written for the DGEMM call. It took a little less than 2 hours to run.”