Chapter 7: Matrix Multiplication, from the book Parallel Computing by Michael J. Quinn


8/11/2019 · Chapter 7: Matrix Multiplication, from the book Parallel Computing by Michael J. Quinn

    MATRIX MULTIPLICATION

    (Part b)

By: Shahrzad Abedi

    Professor: Dr. Haj Seyed Javadi


    Matrix Multiplication

    Approaches: SIMD and MIMD; MIMD machines divide into multiprocessors and multicomputers.

    Chapter 7: Matrix Multiplication, Parallel Computing: Theory and Practice, Michael J. Quinn


    Matrix Multiplication Algorithms for Multiprocessors

    [Figure: an n x n matrix partitioned among processors p1-p4, once by rows and once by columns.]


    Matrix Multiplication Algorithm for a UMA Multiprocessor

    [Figure: the rows of the result matrix assigned in contiguous bands to processors p1-p4.]


    Matrix Multiplication Algorithm for a UMA Multiprocessor

    Example: n = 8, p = 2, so each process computes n/p = 4 rows of C = A x B.

    Each process must read n/p rows of A, and must read every element of B n/p times.


    Matrix Multiplication Algorithms for Multiprocessors

    Question: Which loop should be made parallel in the sequential matrix multiplication algorithm?

    Grain size: the amount of work performed between processor interactions.

    Ratio of computation time to communication time: computation time / communication time.


    Sequential Matrix Multiplication Algorithm

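The slide image of the sequential algorithm is not reproduced in this transcription. As a sketch, the standard triple-loop algorithm it refers to looks like this in Python (the function name is mine):

```python
def matmul_seq(a, b):
    """Standard O(n^3) triple-loop multiplication: c[i][j] = sum_k a[i][k]*b[k][j]."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):          # rows of A
        for j in range(n):      # columns of B
            for k in range(n):  # inner-product dimension
                c[i][j] += a[i][k] * b[k][j]
    return c

print(matmul_seq([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # prints [[19, 22], [43, 50]]
```

The i, j, and k loops of this version are the candidates for parallelization discussed on the following slides.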


    Matrix Multiplication Algorithms for Multiprocessors

    Design strategy: if load balancing is not a problem, maximize grain size.

    Question: Which loop should be made parallel? i, j, or k?

    The k loop has a data dependency (every iteration accumulates into the same element of C), so it cannot be parallelized directly.

    If j is parallelized: grain size = O(n^3/(np)) = O(n^2/p)  -- too fine.
    If i is parallelized: grain size = O(n^3/p)  -- maximal.


    Matrix Multiplication Algorithm for a UMA Multiprocessor: Parallelizing the i Loop

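A minimal sketch of the parallelized i loop, using Python threads to stand in for the shared-memory processes of the slide (names are mine; Python's GIL prevents real speedup here, so this only illustrates the decomposition):

```python
import threading

def matmul_rows(a, b, c, rows):
    """Work of one process: compute its assigned rows of C.  It touches only
    those rows of A, but reads every element of B."""
    m, n = len(b), len(b[0])
    for i in rows:
        for j in range(n):
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(m))

def matmul_parallel_i(a, b, p):
    """Parallelize the i loop: process q gets a contiguous block of n/p rows."""
    n = len(a)
    c = [[0] * len(b[0]) for _ in range(n)]
    chunk = n // p
    threads = [threading.Thread(target=matmul_rows,
                                args=(a, b, c, range(q * chunk, (q + 1) * chunk)))
               for q in range(p)]
    for t in threads:
        t.start()
    for t in threads:   # joining all threads is the final synchronization step
        t.join()
    return c
```

Each "process" writes a disjoint band of C, so no locking is needed on the result.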


    Matrix Multiplication Algorithm for a UMA Multiprocessor

    Each process computes n/p rows, each costing Θ(n^2): n/p x n^2 = Θ(n^3/p).

    Synchronization overhead: Θ(p).

    Overall complexity: Θ(n^3/p + p).



    Matrix Multiplication in Loosely Coupled Multiprocessors

    Some matrix elements may be much easier to access than others, so it is important to keep as many memory references local as possible.

    In the previous UMA algorithm, every process must access n/p rows of matrix A and every element of B, n/p times.

    Only a single addition and a single multiplication occur for every element of B fetched. This is not a good ratio! Implementing this algorithm on NUMA multiprocessors yields poor speedup.


    Matrix Multiplication in Loosely Coupled Multiprocessors

    Another method must be found to partition the problem. An attractive method: block matrix multiplication.


    Block Matrix Multiplication

    A and B are both n x n matrices, n = 2k. A and B can be thought of as conglomerates of 4 smaller matrices, each of size k x k.

    Given this partitioning of A and B into blocks, C is defined as follows:

        C(i,j) = A(i,1) B(1,j) + A(i,2) B(2,j)    for i, j in {1, 2}


    Block Matrix Multiplication

    For example, if there are p processes, then matrix multiplication is done by dividing A and B into p blocks of size k x k (so k = n/√p).
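A sequential sketch of this block decomposition (function names are mine; in the parallel version each of the p processes would own one k x k block of C):

```python
import math

def get_block(mat, bi, bj, k):
    """Extract the k x k submatrix at block coordinates (bi, bj)."""
    return [row[bj * k:(bj + 1) * k] for row in mat[bi * k:(bi + 1) * k]]

def matmul_blocked(a, b, p):
    """Block matrix multiplication: partition the n x n matrices into a
    sqrt(p) x sqrt(p) grid of k x k blocks (k = n/sqrt(p)) and run the
    scalar algorithm with block multiply-and-add in place of * and +."""
    n = len(a)
    q = math.isqrt(p)          # sqrt(p) blocks per dimension
    k = n // q
    c = [[0] * n for _ in range(n)]
    for bi in range(q):        # C(bi,bj) = sum over bk of A(bi,bk) B(bk,bj)
        for bj in range(q):
            for bk in range(q):
                ab = get_block(a, bi, bk, k)
                bb = get_block(b, bk, bj, k)
                for i in range(k):
                    for j in range(k):
                        c[bi * k + i][bj * k + j] += sum(
                            ab[i][t] * bb[t][j] for t in range(k))
    return c
```

The result is identical to the scalar algorithm; only the order of the operations changes, which is what improves locality.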


    Block Matrix Multiplication

    STEP 1: compute C(i,j) = A(i,1) B(1,j)

    [Figure: the blocks of A and B used by processes P1-P4 in step 1.]


    Block Matrix Multiplication

    STEP 2: compute C(i,j) = C(i,j) + A(i,2) B(2,j)

    [Figure: the blocks of A and B used by processes P1-P4 in step 2.]


    Block Matrix Multiplication

    Each block multiplication requires 2k^2 memory fetches, k^3 additions, and k^3 multiplications.

    The number of arithmetic operations per memory access has therefore risen from 2 in the previous algorithm to 2k^3 / 2k^2 = k.


    Matrix Multiplication Algorithm for NUMA Multiprocessors

    Try to resolve memory contention as much as possible.

    Increase the locality of memory references to reduce memory access time.

    Design strategy: reduce average memory latency by increasing locality.



    Algorithms for Multicomputers: Row-Column Oriented Algorithm

    Partition matrix A into rows and B into columns (n is a power of 2, and we are executing the algorithm on an n-processor hypercube).

    One imaginable parallelization: parallelize the outer loop (i). All parallel processes access column 0 of B, then column 1 of B, etc. This results in a sequence of broadcast steps, each taking Θ(log n) on an n-processor hypercube (refer to Chapter 6, p. 170).

    In the case of a multiprocessor, too much contention for the same memory bank is called a hot spot.


    Row-Column Oriented Algorithm

    Design strategy: eliminate contention for shared resources by changing the temporal order of data accesses.

    New solution for a multicomputer: change the order in which the algorithm computes the elements of each row of C. Processes are organized as a ring. After each process has used its current column of B, it fetches the next column of B from its successor on the ring.
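A serial simulation of this ring scheme (a sketch; names and data layout are mine). Process q owns rows q·n/p through (q+1)·n/p − 1 of A and starts with the matching block of columns of B; after each of the p steps the column blocks rotate one position around the ring:

```python
def matmul_ring(a, b, p):
    """Simulate the row-column ring algorithm for n x n matrices on p
    processes.  held[q] is the index of the column block of B currently
    resident at process q; after every step each process takes the next
    block from its successor on the ring."""
    n = len(a)
    chunk = n // p                     # rows/columns per process
    c = [[0] * n for _ in range(n)]
    held = list(range(p))              # initially process q holds block q
    for _ in range(p):
        for q in range(p):             # all processes compute "in parallel"
            cb = held[q]               # current column block of B
            for i in range(q * chunk, (q + 1) * chunk):
                for j in range(cb * chunk, (cb + 1) * chunk):
                    c[i][j] = sum(a[i][k] * b[k][j] for k in range(n))
        # rotate: each process fetches the block held by its successor
        held = [held[(q + 1) % p] for q in range(p)]
    return c
```

After p rotations every process has seen all p column blocks exactly once, so C is complete with no broadcast and no hot spot.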


    Row-Column Oriented Algorithm

    We embed a ring in a hypercube with dilation 1 using Gray codes, so each message can be sent in time Θ(1).

    [Figure: a ring of 8 nodes embedded in a 3-dimensional hypercube, nodes labeled in Gray-code order.]
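The Gray-code embedding can be sketched as follows: ring position i maps to hypercube node gray(i), and because consecutive Gray codes differ in exactly one bit, ring neighbours are hypercube neighbours (dilation 1):

```python
def gray(i):
    """i-th reflected binary Gray code value."""
    return i ^ (i >> 1)

def ring_in_hypercube(d):
    """Map a ring of 2^d processors onto a d-dimensional hypercube with
    dilation 1: ring position i is placed on hypercube node gray(i)."""
    return [gray(i) for i in range(2 ** d)]

ring = ring_in_hypercube(3)
print(ring)    # prints [0, 1, 3, 2, 6, 7, 5, 4]
# Every hop on the ring, including the wraparound, crosses exactly one
# hypercube link, so each message takes Theta(1) time.
assert all(bin(ring[i] ^ ring[(i + 1) % 8]).count("1") == 1 for i in range(8))
```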


    Row-Column Oriented Algorithm

    Example: use 4 processes to multiply two matrices A (4x4) and B (4x4).

    [Figures: the four successive iterations; in each, every process P1-P4 multiplies its row of A by its current column block of B, then passes the block along the ring.]



    Row-Column Oriented Algorithm

    Generalizing the algorithm :

    Multiplying l x m and m x n matrices on p processors where p


    Row-Column Oriented Algorithm

    Total communication time. The standard assumption: sending and receiving a message costs the message latency plus the message transmission time times the number of values sent.

        λ : message latency
        β : transmission time per value

    Every iteration has communication time 2(λ + βm(n/p)).

    Over p iterations, the total communication time is 2p(λ + βm(n/p)).
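With λ the message latency and β the per-value transmission time, the total can be computed directly. This is a sketch of the reconstructed cost model, not code from the book, and it assumes p divides n as on the slides:

```python
def ring_comm_time(m, n, p, lam, beta):
    """Total communication time of the row-column ring algorithm: each of
    the p iterations does one send and one receive of an m x (n/p) column
    block, costing 2 * (lam + beta * m * (n // p)).
    lam = message latency, beta = per-value transmission time."""
    per_iteration = 2 * (lam + beta * m * (n // p))
    return p * per_iteration

print(ring_comm_time(4, 4, 4, 1.0, 1.0))   # prints 40.0
```

Note the latency term 2pλ grows with p, which is what the block-oriented algorithm later reduces.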


    Algorithms for Multicomputers: Block-Oriented Algorithm

    We want to maximize the number of multiplications performed per iteration.

    Multiply an l x m matrix A by an m x n matrix B, where l, m, and n are integer multiples of √p and p is an even power of 2.

    Organize the processors as a two-dimensional mesh with wraparound connections. Give each processor a subsection of A and a subsection of B.


    Block-Oriented Algorithm

    The new matrix multiplication algorithm is a corollary of two results shown earlier:

    1. Block matrix multiplication is performed analogously to scalar matrix multiplication: each occurrence of a scalar multiplication is replaced by a matrix multiplication.

    2. The algorithm previously used on a 2-dimensional mesh of processors relied on a staggering technique. The same staggering technique is used to position the blocks of A and B, so that every processor multiplies two submatrices in every iteration.
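A serial simulation of the staggered block algorithm (a sketch under my own naming; the real algorithm runs the bi, bj loops on separate mesh nodes and replaces the array shifts with messages between neighbours):

```python
import math

def blk(mat, bi, bj, k):
    """k x k block of mat at block coordinates (bi, bj)."""
    return [row[bj * k:(bj + 1) * k] for row in mat[bi * k:(bi + 1) * k]]

def block_oriented(a, b, p):
    """Block-oriented multiplication of n x n matrices on a simulated
    sqrt(p) x sqrt(p) mesh with wraparound.  Phase 1 staggers the blocks
    (block row i of A shifts left by i, block column j of B shifts up
    by j); each of the sqrt(p) iterations multiplies the two resident
    blocks and then rotates A one position left and B one position up."""
    n = len(a)
    q = math.isqrt(p)
    k = n // q
    # resident blocks on node (i, j) after the staggering phase
    ablk = [[blk(a, i, (i + j) % q, k) for j in range(q)] for i in range(q)]
    bblk = [[blk(b, (i + j) % q, j, k) for j in range(q)] for i in range(q)]
    c = [[0] * n for _ in range(n)]
    for _ in range(q):
        for bi in range(q):
            for bj in range(q):
                ab, bb = ablk[bi][bj], bblk[bi][bj]
                for i in range(k):
                    for j in range(k):
                        c[bi * k + i][bj * k + j] += sum(
                            ab[i][t] * bb[t][j] for t in range(k))
        # wraparound shifts: every node passes its A block left, B block up
        ablk = [[ablk[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        bblk = [[bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return c
```

Because of the staggering, node (i, j) always holds a matching pair A(i, s), B(s, j), so every processor does useful block multiplication in every iteration.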


    Block-Oriented Algorithm

    Phase 1: staggering the block submatrices of matrix A is done in both directions: left and right.


    Block-Oriented Algorithm

    Phase 1: staggering the block submatrices of matrix B is done in both directions: up and down.




    Block-Oriented Algorithm

    From one processor's point of view:

        C(1,2) = A(1,0) B(0,2) + A(1,1) B(1,2) + A(1,2) B(2,2) + A(1,3) B(3,2)

    [Figure panels (1) and (2): the first two iterations.]


    Block-Oriented Algorithm

    [Figure panels (3) and (4): the last two iterations.]


    Block-Oriented Algorithm

    There are √p iterations in which every processor sends and receives a portion of matrices A and B.

    Number of computation steps: √p iterations, each multiplying an (l/√p) x (m/√p) block by an (m/√p) x (n/√p) block, for a total of Θ(lmn/p).

    The staggering and unstaggering phases take √p − 1 steps instead of the p − 1 steps of Gentleman's algorithm. How?

    Total communication time for transferring an A block and a B block each iteration:

        2(λ + (lm/p)β) + 2(λ + (mn/p)β)


    The Two Multicomputer Algorithms

    Both the block-oriented algorithm and the row-column algorithm have the same number of computation steps: Θ(lmn/p).

    When does the second algorithm require less communication time? Assume that we are multiplying two n x n matrices, where n is an integer multiple of p.


    The Two Multicomputer Algorithms

    Thus the block-oriented algorithm is uniformly superior to the row-column algorithm when the number of processors is an even power of 2 greater than or equal to 16.


    Questions?
