CS 484
Dense Matrix Algorithms
There are two types of matrices: dense (full) and sparse.
We will consider matrices that are dense and square.
Mapping Matrices
How do we partition a matrix for parallel processing?
There are two basic ways: striped partitioning and block partitioning.
Striped Partitioning
[Figure: an 8×8 matrix, rows 0–7, divided among processors P0–P3 in two ways. Left: each processor owns two contiguous rows. Right: rows are dealt out round-robin as P0, P1, P2, P3, P0, P1, P2, P3.]
Block striping Cyclic striping
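The two striping schemes can be captured by a row-to-processor ownership map. The helper names below are hypothetical (not from the slides); this is a minimal sketch assuming n rows divide evenly among p processors:

```python
# Hypothetical helpers: which processor owns row i of an n-row matrix.

def block_stripe_owner(i, n, p):
    # Block striping: p contiguous bands of n/p rows each.
    return i // (n // p)

def cyclic_stripe_owner(i, p):
    # Cyclic striping: rows dealt out round-robin.
    return i % p

# With n = 8 rows and p = 4 processors, as in the figure:
print([block_stripe_owner(i, 8, 4) for i in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
print([cyclic_stripe_owner(i, 4) for i in range(8)])    # [0, 1, 2, 3, 0, 1, 2, 3]
```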
Block Partitioning
[Figure: left, block checkerboard — a 2×2 process grid (P0 P1 / P2 P3), each process owning one contiguous sub-block. Right, cyclic checkerboard — sub-blocks assigned cyclically over a 2×4 process grid P0–P7.]
Block checkerboard Cyclic checkerboard
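The checkerboard schemes map each entry (i, j) to a process on a q × q grid. Again a hedged sketch with hypothetical helper names, assuming n divides evenly by q and processes numbered row-major:

```python
# Hypothetical helpers: owner of entry (i, j) on a q x q process grid.

def block_checkerboard_owner(i, j, n, q):
    # Block checkerboard: each process owns an (n/q) x (n/q) block;
    # process (r, c) is numbered r*q + c.
    b = n // q
    return (i // b) * q + (j // b)

def cyclic_checkerboard_owner(i, j, q):
    # Cyclic checkerboard: entries dealt out cyclically in both dimensions.
    return (i % q) * q + (j % q)
```

For a 4×4 matrix on a 2×2 grid, entry (2, 3) lands in the bottom-right block, owned by process 3 under block checkerboarding.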
Block vs. Striped Partitioning
Scalability? Striping is limited to n processors; checkerboarding is limited to n × n processors.
Complexity? Striping is easy; block partitioning can introduce more dependencies.
Matrix Multiplication
One Dimensional Decomposition
Each processor "owns" the black portion. To compute the owned portion of the answer, each processor requires all of A.
T = (P − 1)(ts + tw·N²/P)
Two Dimensional Decomposition
Requires less data per processor. The algorithm can be performed stepwise.
Fox's algorithm:
Broadcast an A sub-matrix to the other processors in its row.
Compute.
Rotate the B sub-matrix upwards.
Algorithm:
set B' = B_local
for j = 0 to sqrt(P) - 1
    in each row i, the [(i + j) mod sqrt(P)]th task broadcasts A' = A_local to the other tasks in the row
    accumulate A' * B'
    send B' to upward neighbor
done
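The loop above can be simulated serially as a sanity check. This is an illustrative sketch (not from the slides), using one matrix element per "process" so each block operation becomes a scalar multiply:

```python
def fox_multiply(A, B):
    # Serial simulation of Fox's algorithm on a q x q grid,
    # with 1x1 blocks for clarity.
    q = len(A)
    C = [[0] * q for _ in range(q)]
    Bp = [row[:] for row in B]            # B' starts as the local block
    for s in range(q):
        for i in range(q):
            a = A[i][(i + s) % q]         # block broadcast along row i
            for j in range(q):
                C[i][j] += a * Bp[i][j]   # accumulate A' * B'
        # rotate B' upward: row i receives row i+1 (wrapping around)
        Bp = [Bp[(i + 1) % q] for i in range(q)]
    return C

print(fox_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

After s rotations, position (i, j) holds B[(i+s) mod q][j], so each step accumulates A[i][(i+s) mod q] · B[(i+s) mod q][j], and the sum over all s is the full product.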
T = √P·log √P·(ts + tw·N²/P) + (√P − 1)(ts + tw·N²/P)
Cannon’s Algorithm
Broadcasting a submatrix to all who need it is costly. Suggestion: shift both submatrices.
T = 2(√P − 1)(ts + tw·N²/P)
Blocks Need to Be Aligned
[Figure: a 4×4 grid of processes; process (i, j) holds blocks Aij and Bij, drawn as two triangles.]
Each triangle represents a matrix block.
Only same-color triangles should be multiplied.
Rearrange Blocks
[Figure: the same 4×4 grid after alignment, showing blocks A00–A33 and B00–B33 in their shifted positions.]
Block Aij cycles left i positions.
Block Bij cycles up j positions.
Consider Process P1,2
After alignment, P1,2 holds A13 and B32.
Step 1: multiply A13 * B32, then shift (A blocks left, B blocks up).
Step 2: multiply A10 * B02.
Step 3: multiply A11 * B12.
Step 4: multiply A12 * B22.
After the four steps, P1,2 has accumulated C12 = A10*B02 + A11*B12 + A12*B22 + A13*B32.
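The full algorithm — alignment, then repeated multiply-and-shift — can be simulated serially. A hedged sketch (not from the slides), again with 1×1 blocks so each "process" holds one element:

```python
def cannon_multiply(A, B):
    # Serial simulation of Cannon's algorithm on a q x q grid.
    q = len(A)
    # Alignment: block A[i][j] cycles left i positions,
    # block B[i][j] cycles up j positions.
    Ap = [[A[i][(i + j) % q] for j in range(q)] for i in range(q)]
    Bp = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[0] * q for _ in range(q)]
    for _ in range(q):
        # Each process multiplies its currently held blocks.
        for i in range(q):
            for j in range(q):
                C[i][j] += Ap[i][j] * Bp[i][j]
        # Shift A blocks left by one and B blocks up by one.
        Ap = [[Ap[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bp = [[Bp[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C
```

With q = 4, process (1, 2) sees exactly the schedule above: A13·B32, then A10·B02, A11·B12, A12·B22.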
Complexity Analysis
The algorithm has √p iterations.
During each iteration, a process multiplies two (n/√p) × (n/√p) matrices: Θ(n³/p^(3/2)) work.
Computational complexity: Θ(n³/p).
During each iteration, a process also sends and receives two blocks of size (n/√p) × (n/√p).
Communication complexity: Θ(n²/√p).
Divide and Conquer
A = [ App Apq ; Aqp Aqq ]    B = [ Bpp Bpq ; Bqp Bqq ]

P0 = App * Bpp    P1 = Apq * Bqp
P2 = App * Bpq    P3 = Apq * Bqq
P4 = Aqp * Bpp    P5 = Aqq * Bqp
P6 = Aqp * Bpq    P7 = Aqq * Bqq

A x B = [ P0 + P1   P2 + P3 ; P4 + P5   P6 + P7 ]
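The eight products map directly onto a recursive implementation. A minimal sketch (the function name and helpers are assumptions, not from the slides), restricted to power-of-two sizes:

```python
def dc_multiply(A, B):
    # Recursive block multiplication following the eight products above.
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, r, c):          # extract an h x h quadrant
        return [row[c:c + h] for row in M[r:r + h]]
    def add(X, Y):              # element-wise block addition
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    App, Apq = quad(A, 0, 0), quad(A, 0, h)
    Aqp, Aqq = quad(A, h, 0), quad(A, h, h)
    Bpp, Bpq = quad(B, 0, 0), quad(B, 0, h)
    Bqp, Bqq = quad(B, h, 0), quad(B, h, h)
    P0 = dc_multiply(App, Bpp); P1 = dc_multiply(Apq, Bqp)
    P2 = dc_multiply(App, Bpq); P3 = dc_multiply(Apq, Bqq)
    P4 = dc_multiply(Aqp, Bpp); P5 = dc_multiply(Aqq, Bqp)
    P6 = dc_multiply(Aqp, Bpq); P7 = dc_multiply(Aqq, Bqq)
    # Stitch the four result quadrants back together.
    top = [x + y for x, y in zip(add(P0, P1), add(P2, P3))]
    bot = [x + y for x, y in zip(add(P4, P5), add(P6, P7))]
    return top + bot
```

The eight recursive products are independent, which is what makes this decomposition natural for eight processors.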
Systems of Linear Equations
A linear equation in n variables has the form
A set of linear equations is called a system. A solution exists for a system iff it satisfies all equations in the system. Many scientific and engineering problems take this form.
a0x0 + a1x1 + … + an-1xn-1 = b
Solving Systems of Equations
Many such systems are large. Thousands of equations and unknowns
a0,0x0 + a0,1x1 + … + a0,n-1xn-1 = b0
a1,0x0 + a1,1x1 + … + a1,n-1xn-1 = b1
an-1,0x0 + an-1,1x1 + … + an-1,n-1xn-1 = bn-1
Solving Systems of Equations
A linear system of equations can be represented in matrix form
| a0,0    a0,1    …  a0,n-1   | | x0   |   | b0   |
| a1,0    a1,1    …  a1,n-1   | | x1   | = | b1   |
|   …       …           …     | |  …   |   |  …   |
| an-1,0  an-1,1  …  an-1,n-1 | | xn-1 |   | bn-1 |
Ax = b
Solving Systems of Equations
Solving a system of linear equations is done in two steps: reduce the system to upper-triangular form, then use back-substitution to find the solution.
These steps are performed on the system in matrix form. Gaussian Elimination, etc.
Solving Systems of Equations
Reduce the system to upper-triangular form
Use back-substitution
| a0,0  a0,1  …  a0,n-1   | | x0   |   | b0   |
| 0     a1,1  …  a1,n-1   | | x1   | = | b1   |
| …       …         …     | |  …   |   |  …   |
| 0     0     …  an-1,n-1 | | xn-1 |   | bn-1 |
Reducing the System
Gaussian elimination systematically eliminates variable x[k] from equations k+1 to n-1, reducing those coefficients to zero.
This is done by subtracting an appropriate multiple of the kth equation from each of equations k+1 to n-1.
Procedure GaussianElimination(A, b, y)
  for k = 0 to n-1
    /* Division step */
    for j = k+1 to n-1
      A[k,j] = A[k,j] / A[k,k]
    endfor
    y[k] = b[k] / A[k,k]
    A[k,k] = 1
    /* Elimination step */
    for i = k+1 to n-1
      for j = k+1 to n-1
        A[i,j] = A[i,j] - A[i,k] * A[k,j]
      endfor
      b[i] = b[i] - A[i,k] * y[k]
      A[i,k] = 0
    endfor
  endfor
end
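The pseudocode transcribes directly into Python. This sketch also adds a back-substitution helper (the slides defer that step), and, like the pseudocode, assumes A[k][k] ≠ 0 — i.e. no pivoting:

```python
def gaussian_eliminate(A, b):
    # Direct transcription of the pseudocode above (no pivoting).
    # Mutates A into unit-diagonal upper-triangular form; returns (A, y).
    n = len(A)
    y = [0.0] * n
    for k in range(n):
        # Division step: normalize row k.
        for j in range(k + 1, n):
            A[k][j] /= A[k][k]
        y[k] = b[k] / A[k][k]
        A[k][k] = 1.0
        # Elimination step: zero out column k below the diagonal.
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
            b[i] -= A[i][k] * y[k]
            A[i][k] = 0.0
    return A, y

def back_substitute(U, y):
    # Solve Ux = y for unit-diagonal upper-triangular U.
    n = len(U)
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = y[k] - sum(U[k][j] * x[j] for j in range(k + 1, n))
    return x

# Example: 2x + y = 5, x + 3y = 10 has solution x = 1, y = 3.
U, y = gaussian_eliminate([[2.0, 1.0], [1.0, 3.0]], [5.0, 10.0])
print(back_substitute(U, y))  # [1.0, 3.0]
```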
Parallelizing Gaussian Elim.
Use domain decomposition: rowwise striping.
The division step requires no communication. The elimination step requires a one-to-all broadcast for each equation. No agglomeration. Initially, map one row to each processor.
Communication Analysis
Consider the algorithm step by step. The division step requires no communication. The elimination step requires a one-to-all broadcast, but only to the other active processors, and only of the active elements.
Final computation requires no communication.
Communication Analysis
One-to-all broadcast: log2 q communications, where q = n - k - 1 is the number of active processors.
Message size: q active elements required.
T = (ts + tw·q)·log2 q
Computation Analysis
Division step: q divisions.
Elimination step: q multiplications and q subtractions.
Assuming equal time per operation: 3q operations.
Computation Analysis
In each step, the active processor set is reduced by one resulting in:
2/)1(3
11
0
nnCompTime
knCompTimen
k
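The closed form is easy to verify numerically. A tiny sketch (helper name is an assumption):

```python
# Hypothetical helper: total operation count summed over all n steps,
# where step k has q = n - k - 1 active rows costing 3q operations each.
def comp_time(n):
    return sum(3 * (n - k - 1) for k in range(n))

# The sum telescopes to the arithmetic series 3 * (0 + 1 + ... + (n-1)).
for n in (4, 10, 100):
    assert comp_time(n) == 3 * n * (n - 1) // 2
```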
Can we do better?
The previous version is synchronous, and parallelism is reduced at each step. Pipeline the algorithm and run the result on a linear array of processors. Communication is nearest-neighbor. This results in O(n) steps of O(n) operations.
Pipelined Gaussian Elim.
Basic assumption: a processor does not need to wait until all processors have received a value before proceeding.
Algorithm:
If processor p has data for other processors, send the data to processor p+1.
If processor p can do some computation using the data it has, do it.
Otherwise, wait to receive data from processor p-1.
Conclusion
Using a striped partitioning method, it is natural to pipeline the Gaussian elimination algorithm to achieve the best performance. Pipelined algorithms work best on a linear array of processors, or on something that can be linearly mapped.
Would it be better to block partition? How would it affect the algorithm?
Row Ordering
When dealing with a sparse matrix, operations can sometimes cause a zero entry in the matrix to become non-zero (fill-in).
Nested Dissection Ordering
Complete these slides using notes in the black binder.