CS 484
Dense Matrix Algorithms
There are two types of matrices: dense (full) and sparse.
We will consider matrices that are dense and square.
Mapping Matrices
How do we partition a matrix for parallel processing?
There are two basic ways: striped partitioning and block partitioning.
Striped Partitioning
[Figure: an 8×8 matrix, rows 0–7, divided among processors P0–P3 in two ways. Left: each processor owns two contiguous rows. Right: rows are dealt out round-robin as P0, P1, P2, P3, P0, P1, P2, P3.]
Block striping Cyclic striping
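The two striping schemes can be captured by a row-to-processor ownership map. The helper names below are hypothetical (not from the slides); this is a minimal sketch assuming n rows divide evenly among p processors:

```python
# Hypothetical helpers: which processor owns row i of an n-row matrix.

def block_stripe_owner(i, n, p):
    # Block striping: p contiguous bands of n/p rows each.
    return i // (n // p)

def cyclic_stripe_owner(i, p):
    # Cyclic striping: rows dealt out round-robin.
    return i % p

# With n = 8 rows and p = 4 processors, as in the figure:
print([block_stripe_owner(i, 8, 4) for i in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
print([cyclic_stripe_owner(i, 4) for i in range(8)])    # [0, 1, 2, 3, 0, 1, 2, 3]
```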
Block Partitioning
[Figure: left, block checkerboard — a 2×2 process grid (P0 P1 / P2 P3), each process owning one contiguous sub-block. Right, cyclic checkerboard — sub-blocks assigned cyclically over a 2×4 process grid P0–P7.]
Block checkerboard Cyclic checkerboard
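The checkerboard schemes map each entry (i, j) to a process on a q × q grid. Again a hedged sketch with hypothetical helper names, assuming n divides evenly by q and processes numbered row-major:

```python
# Hypothetical helpers: owner of entry (i, j) on a q x q process grid.

def block_checkerboard_owner(i, j, n, q):
    # Block checkerboard: each process owns an (n/q) x (n/q) block;
    # process (r, c) is numbered r*q + c.
    b = n // q
    return (i // b) * q + (j // b)

def cyclic_checkerboard_owner(i, j, q):
    # Cyclic checkerboard: entries dealt out cyclically in both dimensions.
    return (i % q) * q + (j % q)
```

For a 4×4 matrix on a 2×2 grid, entry (2, 3) lands in the bottom-right block, owned by process 3 under block checkerboarding.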
Block vs. Striped Partitioning
Scalability? Striping is limited to n processors; checkerboarding is limited to n × n processors.
Complexity? Striping is easy; block partitioning can introduce more dependencies.
Matrix Multiplication
One Dimensional Decomposition
Each processor "owns" the black portion. To compute the owned portion of the answer, each processor requires all of A.
T = (P − 1)(ts + tw·N²/P)
Two Dimensional Decomposition
Requires less data per processor. The algorithm can be performed stepwise.
Fox's algorithm:
Broadcast an A sub-matrix to the other processors in its row.
Compute.
Rotate the B sub-matrix upwards.
Algorithm:
set B' = B_local
for j = 0 to sqrt(P) - 1
    in each row i, the [(i + j) mod sqrt(P)]th task broadcasts A' = A_local to the other tasks in the row
    accumulate A' * B'
    send B' to upward neighbor
done
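The loop above can be simulated serially as a sanity check. This is an illustrative sketch (not from the slides), using one matrix element per "process" so each block operation becomes a scalar multiply:

```python
def fox_multiply(A, B):
    # Serial simulation of Fox's algorithm on a q x q grid,
    # with 1x1 blocks for clarity.
    q = len(A)
    C = [[0] * q for _ in range(q)]
    Bp = [row[:] for row in B]            # B' starts as the local block
    for s in range(q):
        for i in range(q):
            a = A[i][(i + s) % q]         # block broadcast along row i
            for j in range(q):
                C[i][j] += a * Bp[i][j]   # accumulate A' * B'
        # rotate B' upward: row i receives row i+1 (wrapping around)
        Bp = [Bp[(i + 1) % q] for i in range(q)]
    return C

print(fox_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

After s rotations, position (i, j) holds B[(i+s) mod q][j], so each step accumulates A[i][(i+s) mod q] · B[(i+s) mod q][j], and the sum over all s is the full product.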
T = √P·log √P·(ts + tw·N²/P) + (√P − 1)(ts + tw·N²/P)
Cannon’s Algorithm
Broadcasting a submatrix to all who need it is costly. Suggestion: shift both submatrices.
T = 2(√P − 1)(ts + tw·N²/P)
Blocks Need to Be Aligned
[Figure: a 4×4 grid of processes; process (i, j) holds blocks Aij and Bij, drawn as two triangles.]
Each triangle represents a matrix block.
Only same-color triangles should be multiplied.
Rearrange Blocks
[Figure: the same 4×4 grid after alignment, showing blocks A00–A33 and B00–B33 in their shifted positions.]
Block Aij cycles left i positions.
Block Bij cycles up j positions.
Consider Process P1,2
After alignment, P1,2 holds A13 and B32.
Step 1: multiply A13 * B32, then shift (A blocks left, B blocks up).
Step 2: multiply A10 * B02.
Step 3: multiply A11 * B12.
Step 4: multiply A12 * B22.
After the four steps, P1,2 has accumulated C12 = A10*B02 + A11*B12 + A12*B22 + A13*B32.
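The full algorithm — alignment, then repeated multiply-and-shift — can be simulated serially. A hedged sketch (not from the slides), again with 1×1 blocks so each "process" holds one element:

```python
def cannon_multiply(A, B):
    # Serial simulation of Cannon's algorithm on a q x q grid.
    q = len(A)
    # Alignment: block A[i][j] cycles left i positions,
    # block B[i][j] cycles up j positions.
    Ap = [[A[i][(i + j) % q] for j in range(q)] for i in range(q)]
    Bp = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[0] * q for _ in range(q)]
    for _ in range(q):
        # Each process multiplies its currently held blocks.
        for i in range(q):
            for j in range(q):
                C[i][j] += Ap[i][j] * Bp[i][j]
        # Shift A blocks left by one and B blocks up by one.
        Ap = [[Ap[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bp = [[Bp[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C
```

With q = 4, process (1, 2) sees exactly the schedule above: A13·B32, then A10·B02, A11·B12, A12·B22.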
Complexity Analysis
The algorithm has √p iterations.
During each iteration, a process multiplies two (n/√p) × (n/√p) matrices: Θ(n³/p^(3/2)) work.
Computational complexity: Θ(n³/p).
During each iteration, a process also sends and receives two blocks of size (n/√p) × (n/√p).
Communication complexity: Θ(n²/√p).
Divide and Conquer
A = [ App Apq ; Aqp Aqq ]    B = [ Bpp Bpq ; Bqp Bqq ]

P0 = App * Bpp    P1 = Apq * Bqp
P2 = App * Bpq    P3 = Apq * Bqq
P4 = Aqp * Bpp    P5 = Aqq * Bqp
P6 = Aqp * Bpq    P7 = Aqq * Bqq

A x B = [ P0 + P1   P2 + P3 ; P4 + P5   P6 + P7 ]
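The eight products map directly onto a recursive implementation. A minimal sketch (the function name and helpers are assumptions, not from the slides), restricted to power-of-two sizes:

```python
def dc_multiply(A, B):
    # Recursive block multiplication following the eight products above.
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, r, c):          # extract an h x h quadrant
        return [row[c:c + h] for row in M[r:r + h]]
    def add(X, Y):              # element-wise block addition
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    App, Apq = quad(A, 0, 0), quad(A, 0, h)
    Aqp, Aqq = quad(A, h, 0), quad(A, h, h)
    Bpp, Bpq = quad(B, 0, 0), quad(B, 0, h)
    Bqp, Bqq = quad(B, h, 0), quad(B, h, h)
    P0 = dc_multiply(App, Bpp); P1 = dc_multiply(Apq, Bqp)
    P2 = dc_multiply(App, Bpq); P3 = dc_multiply(Apq, Bqq)
    P4 = dc_multiply(Aqp, Bpp); P5 = dc_multiply(Aqq, Bqp)
    P6 = dc_multiply(Aqp, Bpq); P7 = dc_multiply(Aqq, Bqq)
    # Stitch the four result quadrants back together.
    top = [x + y for x, y in zip(add(P0, P1), add(P2, P3))]
    bot = [x + y for x, y in zip(add(P4, P5), add(P6, P7))]
    return top + bot
```

The eight recursive products are independent, which is what makes this decomposition natural for eight processors.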
Systems of Linear Equations
A linear equation in n variables has the form
A set of linear equations is called a system. A solution exists for a system iff it satisfies all equations in the system. Many scientific and engineering problems take this form.
a0x0 + a1x1 + … + an-1xn-1 = b
Solving Systems of Equations
Many such systems are large. Thousands of equations and unknowns
a0,0x0 + a0,1x1 + … + a0,n-1xn-1 = b0
a1,0x0 + a1,1x1 + … + a1,n-1xn-1 = b1
an-1,0x0 + an-1,1x1 + … + an-1,n-1xn-1 = bn-1
Solving Systems of Equations
A linear system of equations can be represented in matrix form
| a0,0    a0,1    …  a0,n-1   | | x0   |   | b0   |
| a1,0    a1,1    …  a1,n-1   | | x1   | = | b1   |
|   …       …           …     | |  …   |   |  …   |
| an-1,0  an-1,1  …  an-1,n-1 | | xn-1 |   | bn-1 |
Ax = b
Solving Systems of Equations
Solving a system of linear equations is done in two steps: reduce the system to upper-triangular form, then use back-substitution to find the solution.
These steps are performed on the system in matrix form. Gaussian Elimination, etc.
Solving Systems of Equations
Reduce the system to upper-triangular form
Use back-substitution
| a0,0  a0,1  …  a0,n-1   | | x0   |   | b0   |
| 0     a1,1  …  a1,n-1   | | x1   | = | b1   |
| …       …         …     | |  …   |   |  …   |
| 0     0     …  an-1,n-1 | | xn-1 |   | bn-1 |
Reducing the System
Gaussian elimination systematically eliminates variable x[k] from equations k+1 to n-1, reducing those coefficients to zero.
This is done by subtracting an appropriate multiple of the kth equation from each of equations k+1 to n-1.
Procedure GaussianElimination(A, b, y)
  for k = 0 to n-1
    /* Division step */
    for j = k+1 to n-1
      A[k,j] = A[k,j] / A[k,k]
    endfor
    y[k] = b[k] / A[k,k]
    A[k,k] = 1
    /* Elimination step */
    for i = k+1 to n-1
      for j = k+1 to n-1
        A[i,j] = A[i,j] - A[i,k] * A[k,j]
      endfor
      b[i] = b[i] - A[i,k] * y[k]
      A[i,k] = 0
    endfor
  endfor
end
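The pseudocode transcribes directly into Python. This sketch also adds a back-substitution helper (the slides defer that step), and, like the pseudocode, assumes A[k][k] ≠ 0 — i.e. no pivoting:

```python
def gaussian_eliminate(A, b):
    # Direct transcription of the pseudocode above (no pivoting).
    # Mutates A into unit-diagonal upper-triangular form; returns (A, y).
    n = len(A)
    y = [0.0] * n
    for k in range(n):
        # Division step: normalize row k.
        for j in range(k + 1, n):
            A[k][j] /= A[k][k]
        y[k] = b[k] / A[k][k]
        A[k][k] = 1.0
        # Elimination step: zero out column k below the diagonal.
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
            b[i] -= A[i][k] * y[k]
            A[i][k] = 0.0
    return A, y

def back_substitute(U, y):
    # Solve Ux = y for unit-diagonal upper-triangular U.
    n = len(U)
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = y[k] - sum(U[k][j] * x[j] for j in range(k + 1, n))
    return x

# Example: 2x + y = 5, x + 3y = 10 has solution x = 1, y = 3.
U, y = gaussian_eliminate([[2.0, 1.0], [1.0, 3.0]], [5.0, 10.0])
print(back_substitute(U, y))  # [1.0, 3.0]
```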
Parallelizing Gaussian Elim.
Use domain decomposition: rowwise striping.
The division step requires no communication. The elimination step requires a one-to-all broadcast for each equation. No agglomeration. Initially, map one row to each processor.
Communication Analysis
Consider the algorithm step by step. The division step requires no communication. The elimination step requires a one-to-all broadcast, but only to the other active processors, and only of the active elements.
Final computation requires no communication.
Communication Analysis
One-to-all broadcast: log2 q communications, where q = n - k - 1 is the number of active processors.
Message size: q active elements required.
T = (ts + tw·q)·log2 q
Computation Analysis
Division step: q divisions.
Elimination step: q multiplications and q subtractions.
Assuming equal time per operation: 3q operations.
Computation Analysis
In each step, the active processor set is reduced by one resulting in:
2/)1(3
11
0
nnCompTime
knCompTimen
k
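The closed form is easy to verify numerically. A tiny sketch (helper name is an assumption):

```python
# Hypothetical helper: total operation count summed over all n steps,
# where step k has q = n - k - 1 active rows costing 3q operations each.
def comp_time(n):
    return sum(3 * (n - k - 1) for k in range(n))

# The sum telescopes to the arithmetic series 3 * (0 + 1 + ... + (n-1)).
for n in (4, 10, 100):
    assert comp_time(n) == 3 * n * (n - 1) // 2
```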
Can we do better?
The previous version is synchronous, and parallelism is reduced at each step. Pipeline the algorithm and run the result on a linear array of processors. Communication is nearest-neighbor. This results in O(n) steps of O(n) operations.
Pipelined Gaussian Elim.
Basic assumption: a processor does not need to wait until all processors have received a value before proceeding.
Algorithm:
If processor p has data for other processors, send the data to processor p+1.
If processor p can do some computation using the data it has, do it.
Otherwise, wait to receive data from processor p-1.
Conclusion
Using a striped partitioning method, it is natural to pipeline the Gaussian elimination algorithm to achieve the best performance. Pipelined algorithms work best on a linear array of processors, or on something that can be linearly mapped.
Would it be better to block partition? How would it affect the algorithm?
Row Ordering
When dealing with a sparse matrix, operations can sometimes cause a zero entry in the matrix to become non-zero (fill-in).
Nested Dissection Ordering
Complete these slides using notes in the black binder.