Chapter 7: Matrix Multiplication, from the book Parallel Computing by Michael J. Quinn


8/11/2019 · Chapter 7: Matrix Multiplication, from the book Parallel Computing by Michael J. Quinn

    MATRIX MULTIPLICATION

    (Part b)

By: Shahrzad Abedi

    Professor: Dr. Haj Seyed Javadi


    Matrix Multiplication

    Approaches: SIMD and MIMD; MIMD machines divide into multiprocessors and multicomputers.

    Chapter 7: Matrix Multiplication, Parallel Computing: Theory and Practice, Michael J. Quinn


    Matrix Multiplication Algorithms for Multiprocessors

    [Figure: an n x n matrix partitioned among processors p1-p4, once by rows and once by columns.]


    Matrix Multiplication Algorithm for a UMA Multiprocessor

    [Figure: the rows of the result matrix assigned in contiguous bands to processors p1-p4.]


    Matrix Multiplication Algorithm for a UMA Multiprocessor

    Example: n = 8, p = 2, so each process computes n/p = 4 rows of C = A x B.

    Each process must read n/p rows of A, and must read every element of B n/p times.


    Matrix Multiplication Algorithms for Multiprocessors

    Question: Which loop should be made parallel in the sequential matrix multiplication algorithm?

    Grain size: the amount of work performed between processor interactions.

    Ratio of computation time to communication time: computation time / communication time.


    Sequential Matrix Multiplication Algorithm

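The slide image of the sequential algorithm is not reproduced in this transcription. As a sketch, the standard triple-loop algorithm it refers to looks like this in Python (the function name is mine):

```python
def matmul_seq(a, b):
    """Standard O(n^3) triple-loop multiplication: c[i][j] = sum_k a[i][k]*b[k][j]."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):          # rows of A
        for j in range(n):      # columns of B
            for k in range(n):  # inner-product dimension
                c[i][j] += a[i][k] * b[k][j]
    return c

print(matmul_seq([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # prints [[19, 22], [43, 50]]
```

The i, j, and k loops of this version are the candidates for parallelization discussed on the following slides.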


    Matrix Multiplication Algorithms for Multiprocessors

    Design strategy: if load balancing is not a problem, maximize grain size.

    Question: Which loop should be made parallel? i, j, or k?

    The k loop has a data dependency (every iteration accumulates into the same element of C), so it cannot be parallelized directly.

    If j is parallelized: grain size = O(n^3/(np)) = O(n^2/p)  -- too fine.
    If i is parallelized: grain size = O(n^3/p)  -- maximal.


    Matrix Multiplication Algorithm for a UMA Multiprocessor: Parallelizing the i Loop

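A minimal sketch of the parallelized i loop, using Python threads to stand in for the shared-memory processes of the slide (names are mine; Python's GIL prevents real speedup here, so this only illustrates the decomposition):

```python
import threading

def matmul_rows(a, b, c, rows):
    """Work of one process: compute its assigned rows of C.  It touches only
    those rows of A, but reads every element of B."""
    m, n = len(b), len(b[0])
    for i in rows:
        for j in range(n):
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(m))

def matmul_parallel_i(a, b, p):
    """Parallelize the i loop: process q gets a contiguous block of n/p rows."""
    n = len(a)
    c = [[0] * len(b[0]) for _ in range(n)]
    chunk = n // p
    threads = [threading.Thread(target=matmul_rows,
                                args=(a, b, c, range(q * chunk, (q + 1) * chunk)))
               for q in range(p)]
    for t in threads:
        t.start()
    for t in threads:   # joining all threads is the final synchronization step
        t.join()
    return c
```

Each "process" writes a disjoint band of C, so no locking is needed on the result.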


    Matrix Multiplication Algorithm for a UMA Multiprocessor

    Each process computes n/p rows, each costing Θ(n^2): n/p x n^2 = Θ(n^3/p).

    Synchronization overhead: Θ(p).

    Overall complexity: Θ(n^3/p + p).



    Matrix Multiplication in Loosely Coupled Multiprocessors

    Some matrix elements may be much easier to access than others, so it is important to keep as many memory references local as possible.

    In the previous UMA algorithm, every process must access n/p rows of matrix A and every element of B, n/p times.

    Only a single addition and a single multiplication occur for every element of B fetched. This is not a good ratio! Implementing this algorithm on NUMA multiprocessors yields poor speedup.


    Matrix Multiplication in Loosely Coupled Multiprocessors

    Another method must be found to partition the problem. An attractive method: block matrix multiplication.


    Block Matrix Multiplication

    A and B are both n x n matrices, n = 2k. A and B can be thought of as conglomerates of 4 smaller matrices, each of size k x k.

    Given this partitioning of A and B into blocks, C is defined as follows:

        C(i,j) = A(i,1) B(1,j) + A(i,2) B(2,j)    for i, j in {1, 2}


    Block Matrix Multiplication

    For example, if there are p processes, then matrix multiplication is done by dividing A and B into p blocks of size k x k (so k = n/√p).
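A sequential sketch of this block decomposition (function names are mine; in the parallel version each of the p processes would own one k x k block of C):

```python
import math

def get_block(mat, bi, bj, k):
    """Extract the k x k submatrix at block coordinates (bi, bj)."""
    return [row[bj * k:(bj + 1) * k] for row in mat[bi * k:(bi + 1) * k]]

def matmul_blocked(a, b, p):
    """Block matrix multiplication: partition the n x n matrices into a
    sqrt(p) x sqrt(p) grid of k x k blocks (k = n/sqrt(p)) and run the
    scalar algorithm with block multiply-and-add in place of * and +."""
    n = len(a)
    q = math.isqrt(p)          # sqrt(p) blocks per dimension
    k = n // q
    c = [[0] * n for _ in range(n)]
    for bi in range(q):        # C(bi,bj) = sum over bk of A(bi,bk) B(bk,bj)
        for bj in range(q):
            for bk in range(q):
                ab = get_block(a, bi, bk, k)
                bb = get_block(b, bk, bj, k)
                for i in range(k):
                    for j in range(k):
                        c[bi * k + i][bj * k + j] += sum(
                            ab[i][t] * bb[t][j] for t in range(k))
    return c
```

The result is identical to the scalar algorithm; only the order of the operations changes, which is what improves locality.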


    Block Matrix Multiplication

    STEP 1: compute C(i,j) = A(i,1) B(1,j)

    [Figure: the blocks of A and B used by processes P1-P4 in step 1.]


    Block Matrix Multiplication

    STEP 2: compute C(i,j) = C(i,j) + A(i,2) B(2,j)

    [Figure: the blocks of A and B used by processes P1-P4 in step 2.]


    Block Matrix Multiplication

    Each block multiplication requires 2k^2 memory fetches, k^3 additions, and k^3 multiplications.

    The number of arithmetic operations per memory access has therefore risen from 2 in the previous algorithm to 2k^3 / 2k^2 = k.


    Matrix Multiplication Algorithm for NUMA Multiprocessors

    Try to resolve memory contention as much as possible.

    Increase the locality of memory references to reduce memory access time.

    Design strategy: reduce average memory latency by increasing locality.



    Algorithms for Multicomputers: Row-Column Oriented Algorithm

    Partition matrix A into rows and B into columns (n is a power of 2, and we are executing the algorithm on an n-processor hypercube).

    One imaginable parallelization: parallelize the outer loop (i). All parallel processes access column 0 of B, then column 1 of B, etc. This results in a sequence of broadcast steps, each taking Θ(log n) on an n-processor hypercube (refer to Chapter 6, p. 170).

    In the case of a multiprocessor, too much contention for the same memory bank is called a hot spot.


    Row-Column Oriented Algorithm

    Design strategy: eliminate contention for shared resources by changing the temporal order of data accesses.

    New solution for a multicomputer: change the order in which the algorithm computes the elements of each row of C. Processes are organized as a ring. After each process has used its current column of B, it fetches the next column of B from its successor on the ring.
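A serial simulation of this ring scheme (a sketch; names and data layout are mine). Process q owns rows q·n/p through (q+1)·n/p − 1 of A and starts with the matching block of columns of B; after each of the p steps the column blocks rotate one position around the ring:

```python
def matmul_ring(a, b, p):
    """Simulate the row-column ring algorithm for n x n matrices on p
    processes.  held[q] is the index of the column block of B currently
    resident at process q; after every step each process takes the next
    block from its successor on the ring."""
    n = len(a)
    chunk = n // p                     # rows/columns per process
    c = [[0] * n for _ in range(n)]
    held = list(range(p))              # initially process q holds block q
    for _ in range(p):
        for q in range(p):             # all processes compute "in parallel"
            cb = held[q]               # current column block of B
            for i in range(q * chunk, (q + 1) * chunk):
                for j in range(cb * chunk, (cb + 1) * chunk):
                    c[i][j] = sum(a[i][k] * b[k][j] for k in range(n))
        # rotate: each process fetches the block held by its successor
        held = [held[(q + 1) % p] for q in range(p)]
    return c
```

After p rotations every process has seen all p column blocks exactly once, so C is complete with no broadcast and no hot spot.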


    Row-Column Oriented Algorithm

    We embed a ring in a hypercube with dilation 1 using Gray codes, so each message can be sent in time Θ(1).

    [Figure: a ring of 8 nodes embedded in a 3-dimensional hypercube, nodes labeled in Gray-code order.]
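The Gray-code embedding can be sketched as follows: ring position i maps to hypercube node gray(i), and because consecutive Gray codes differ in exactly one bit, ring neighbours are hypercube neighbours (dilation 1):

```python
def gray(i):
    """i-th reflected binary Gray code value."""
    return i ^ (i >> 1)

def ring_in_hypercube(d):
    """Map a ring of 2^d processors onto a d-dimensional hypercube with
    dilation 1: ring position i is placed on hypercube node gray(i)."""
    return [gray(i) for i in range(2 ** d)]

ring = ring_in_hypercube(3)
print(ring)    # prints [0, 1, 3, 2, 6, 7, 5, 4]
# Every hop on the ring, including the wraparound, crosses exactly one
# hypercube link, so each message takes Theta(1) time.
assert all(bin(ring[i] ^ ring[(i + 1) % 8]).count("1") == 1 for i in range(8))
```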


    Row-Column Oriented Algorithm

    Example: use 4 processes to multiply two matrices A (4x4) and B (4x4).

    [Figures: the four successive iterations; in each, every process P1-P4 multiplies its row of A by its current column block of B, then passes the block along the ring.]



    Row-Column Oriented Algorithm

    Generalizing the algorithm :

    Multiplying l x m and m x n matrices on p processors where p


    Row-Column Oriented Algorithm

    Total communication time. The standard assumption: sending and receiving a message costs the message latency plus the message transmission time times the number of values sent.

        λ : message latency
        β : transmission time per value

    Every iteration has communication time 2(λ + βm(n/p)).

    Over p iterations, the total communication time is 2p(λ + βm(n/p)).
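With λ the message latency and β the per-value transmission time, the total can be computed directly. This is a sketch of the reconstructed cost model, not code from the book, and it assumes p divides n as on the slides:

```python
def ring_comm_time(m, n, p, lam, beta):
    """Total communication time of the row-column ring algorithm: each of
    the p iterations does one send and one receive of an m x (n/p) column
    block, costing 2 * (lam + beta * m * (n // p)).
    lam = message latency, beta = per-value transmission time."""
    per_iteration = 2 * (lam + beta * m * (n // p))
    return p * per_iteration

print(ring_comm_time(4, 4, 4, 1.0, 1.0))   # prints 40.0
```

Note the latency term 2pλ grows with p, which is what the block-oriented algorithm later reduces.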


    Algorithms for Multicomputers: Block-Oriented Algorithm

    We want to maximize the number of multiplications performed per iteration.

    Multiply an l x m matrix A by an m x n matrix B, where l, m, and n are integer multiples of √p and p is an even power of 2.

    Organize the processors as a two-dimensional mesh with wraparound connections. Give each processor a subsection of A and a subsection of B.


    Block-Oriented Algorithm

    The new matrix multiplication algorithm is a corollary of two results shown earlier:

    1. Block matrix multiplication is performed analogously to scalar matrix multiplication: each occurrence of a scalar multiplication is replaced by a matrix multiplication.

    2. The algorithm previously used on a 2-dimensional mesh of processors relied on a staggering technique. The same staggering technique is used to position the blocks of A and B, so that every processor multiplies two submatrices in every iteration.
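A serial simulation of the staggered block algorithm (a sketch under my own naming; the real algorithm runs the bi, bj loops on separate mesh nodes and replaces the array shifts with messages between neighbours):

```python
import math

def blk(mat, bi, bj, k):
    """k x k block of mat at block coordinates (bi, bj)."""
    return [row[bj * k:(bj + 1) * k] for row in mat[bi * k:(bi + 1) * k]]

def block_oriented(a, b, p):
    """Block-oriented multiplication of n x n matrices on a simulated
    sqrt(p) x sqrt(p) mesh with wraparound.  Phase 1 staggers the blocks
    (block row i of A shifts left by i, block column j of B shifts up
    by j); each of the sqrt(p) iterations multiplies the two resident
    blocks and then rotates A one position left and B one position up."""
    n = len(a)
    q = math.isqrt(p)
    k = n // q
    # resident blocks on node (i, j) after the staggering phase
    ablk = [[blk(a, i, (i + j) % q, k) for j in range(q)] for i in range(q)]
    bblk = [[blk(b, (i + j) % q, j, k) for j in range(q)] for i in range(q)]
    c = [[0] * n for _ in range(n)]
    for _ in range(q):
        for bi in range(q):
            for bj in range(q):
                ab, bb = ablk[bi][bj], bblk[bi][bj]
                for i in range(k):
                    for j in range(k):
                        c[bi * k + i][bj * k + j] += sum(
                            ab[i][t] * bb[t][j] for t in range(k))
        # wraparound shifts: every node passes its A block left, B block up
        ablk = [[ablk[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        bblk = [[bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return c
```

Because of the staggering, node (i, j) always holds a matching pair A(i, s), B(s, j), so every processor does useful block multiplication in every iteration.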


    Block-Oriented Algorithm

    Phase 1: staggering the block submatrices of matrix A is done in both directions: left and right.


    Block-Oriented Algorithm

    Phase 1: staggering the block submatrices of matrix B is done in both directions: up and down.




    Block-Oriented Algorithm

    From one processor's point of view:

        C(1,2) = A(1,0) B(0,2) + A(1,1) B(1,2) + A(1,2) B(2,2) + A(1,3) B(3,2)

    [Figure panels (1) and (2): the first two iterations.]


    Block-Oriented Algorithm

    [Figure panels (3) and (4): the last two iterations.]


    Block-Oriented Algorithm

    There are √p iterations in which every processor sends and receives a portion of matrices A and B.

    Number of computation steps: √p iterations, each multiplying an (l/√p) x (m/√p) block by an (m/√p) x (n/√p) block, for a total of Θ(lmn/p).

    The staggering and unstaggering phases take √p − 1 steps instead of the p − 1 steps of Gentleman's algorithm. How?

    Total communication time for transferring an A block and a B block each iteration:

        2(λ + (lm/p)β) + 2(λ + (mn/p)β)


    The Two Multicomputer Algorithms

    Both the block-oriented algorithm and the row-column algorithm have the same number of computation steps: Θ(lmn/p).

    When does the second algorithm require less communication time? Assume that we are multiplying two n x n matrices, where n is an integer multiple of p.


    The Two Multicomputer Algorithms

    Thus the block-oriented algorithm is uniformly superior to the row-column algorithm when the number of processors is an even power of 2 greater than or equal to 16.


    Questions?
