b56a2Presentation2


  • 8/3/2019 b56a2Presentation2

    1/49


    Topics Covered

    An Overview of Parallel Processing

    Parallelism in Uniprocessor Systems

Organization of Multiprocessor Systems

Flynn's Classification

    System Topologies

    MIMD System Architectures


An Overview of Parallel Processing

What is parallel processing?

Parallel processing is a method of improving computer system performance by executing two or more instructions simultaneously.

The goals of parallel processing:

One goal is to reduce the wall-clock time, the amount of real time that you need to wait for a problem to be solved.

Another goal is to solve bigger problems that might not fit in the limited memory of a single CPU.


    An Analogy of Parallelism

The task of ordering a shuffled deck of cards by suit and then by rank can be done faster if it is carried out by two or more people. By splitting up the deck, performing the instructions simultaneously, and then combining the partial solutions at the end, you have performed parallel processing.
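This analogy can be sketched in code: split the deck among workers, sort the parts concurrently, then merge the partial results. This is a hypothetical illustration using Python threads; the slide itself names no implementation.

```python
import heapq
import random
from concurrent.futures import ThreadPoolExecutor

SUITS = ["clubs", "diamonds", "hearts", "spades"]

def sort_key(card):
    # Order by suit first, then by rank, as in the slide.
    rank, suit = card
    return (SUITS.index(suit), rank)

def parallel_sort(deck, workers=2):
    # Split the shuffled deck among the workers ("people"),
    # sort the parts simultaneously, then combine the partial
    # solutions with a final merge.
    chunks = [deck[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda part: sorted(part, key=sort_key),
                                 chunks))
    return list(heapq.merge(*partials, key=sort_key))

deck = [(rank, suit) for suit in SUITS for rank in range(1, 14)]
random.seed(0)
random.shuffle(deck)
print(parallel_sort(deck)[:3])  # [(1, 'clubs'), (2, 'clubs'), (3, 'clubs')]
```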


    Another Analogy of Parallelism

Another analogy is having several students grade quizzes simultaneously. Quizzes are distributed to a few students, and different problems are graded by each student at the same time. After they are completed, the graded quizzes are gathered and the scores are recorded.


    Parallelism in Uniprocessor Systems

It is possible to achieve parallelism with a uniprocessor system.

Some examples are the instruction pipeline, the arithmetic pipeline, and the I/O processor.

Note that a system that performs different operations on the same instruction is not considered parallel.

Only if the system processes two different instructions simultaneously can it be considered parallel.


Parallelism in a Uniprocessor System

A reconfigurable arithmetic pipeline is an example of parallelism in a uniprocessor system.

Each stage of a reconfigurable arithmetic pipeline has a multiplexer at its input. The multiplexer may pass input data, or the data output from other stages, to the stage inputs. The control unit of the CPU sets the select signals of the multiplexer to control the flow of data, thus configuring the pipeline.
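The stage-by-stage behavior can be sketched in software. The following is a hypothetical Python model of the configured pipeline; the slides describe hardware, not code.

```python
# A software model of a reconfigurable arithmetic pipeline
# configured to compute A[i] = B[i] * C[i] + D[i].

def run_pipeline(b, c, d):
    # Stage 1: the multiplexers select the data inputs B[i] and C[i];
    # the multiplier computes their product, which is latched.
    stage1 = b * c
    # Stage 2: one multiplexer selects the latched product, the other
    # selects the data input D[i]; the adder's result is latched.
    stage2 = stage1 + d
    # Stage 3: the sum passes through to memory and registers as A[i].
    return stage2

B, C, D = [2, 3, 4], [5, 6, 7], [1, 1, 1]
A = [run_pipeline(*args) for args in zip(B, C, D)]
print(A)  # [11, 19, 29]
```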


A Reconfigurable Pipeline With Data Flow for the Computation A[i] = B[i] * C[i] + D[i]

[Figure: four multiplexers (inputs 0-3, select signals S1 S0) route the data inputs into a multiplier stage, an adder stage, and a pass-through stage, each followed by a latch; the final latch output goes to memory and registers. For this computation the select signals are set to 0 0, x x, 0 1, 1 1.]


Although arithmetic pipelines can perform many iterations of the same operation in parallel, they cannot perform different operations simultaneously. To perform different arithmetic operations in parallel, a CPU may include a vectored arithmetic unit.


    Vector Arithmetic Unit

A vector arithmetic unit contains multiple functional units that perform addition, subtraction, and other functions. The control unit routes input values to the different functional units to allow the CPU to execute multiple instructions simultaneously.

For the operations A = B + C and D = E - F, the CPU would route B and C to an adder and E and F to a subtractor for simultaneous execution.
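A minimal sketch of this routing, as a hypothetical Python model: the "control unit" dispatches each operand pair to its functional unit, and the units run side by side on a thread pool.

```python
import operator
from concurrent.futures import ThreadPoolExecutor

# Functional units of a (hypothetical) vector arithmetic unit.
UNITS = {"+": operator.add, "-": operator.sub,
         "*": operator.mul, "%": operator.mod}

def dispatch(instructions):
    # The control unit routes each operand pair to the matching
    # functional unit; the units then execute simultaneously.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(UNITS[op], x, y) for op, x, y in instructions]
        return [f.result() for f in futures]

# A = B + C and D = E - F executed at the same time.
A, D = dispatch([("+", 10, 4), ("-", 9, 2)])
print(A, D)  # 14 7
```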


    A Vectored Arithmetic Unit

    Data

    Input

    Connections

    Data

    Input

    Connections

    *

    +

    -

    %

    Data

    Inputs

    AB+C

    D

    E-F


Organization of Multiprocessor Systems

Flynn's Classification

Was proposed by researcher Michael J. Flynn in 1966.

It is the most commonly accepted taxonomy of computer organization.

In this classification, computers are classified by whether they process a single instruction at a time or multiple instructions simultaneously, and whether they operate on one or multiple data sets.


Taxonomy of Computer Architectures

The 4 categories of Flynn's classification of multiprocessor systems, by their instruction and data streams:

Architecture Categories: SISD, SIMD, MISD, MIMD


Single Instruction, Single Data (SISD)

SISD machines execute a single instruction on individual data values using a single processor.

Based on the traditional von Neumann uniprocessor architecture, instructions are executed sequentially or serially, one step after the next.

Until recently, most computers have been of the SISD type.


SISD

Simple Diagrammatic Representation

[Figure: a control unit (C) sends an instruction stream (IS) to a processor (P), which exchanges a data stream (DS) with memory (M).]


Single Instruction, Multiple Data (SIMD)

An SIMD machine executes a single instruction on multiple data values simultaneously using many processors.

Since there is only one instruction, each processor does not have to fetch and decode each instruction. Instead, a single control unit does the fetching and decoding for all processors.

SIMD architectures include array processors.
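The SIMD idea, one decoded instruction applied to many data values, can be sketched conceptually (this is my own illustration in plain Python, not a real SIMD machine):

```python
# One control unit fetches and decodes a single instruction;
# every processing element then applies it to its own data value.
def simd_execute(instruction, data):
    return [instruction(x) for x in data]  # all PEs do the same operation

double = lambda x: 2 * x   # the single decoded instruction
print(simd_execute(double, [1, 2, 3, 4]))  # [2, 4, 6, 8]
```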


Multiple Instruction, Multiple Data (MIMD)

MIMD machines are usually referred to as multiprocessors or multicomputers.

They may execute multiple instructions simultaneously, contrary to SIMD machines.

Each processor must include its own control unit, which will assign parts of a task or a separate task to the processor.

It has two subclasses: shared memory and distributed memory.


MIMD

[Figure: two control units (C) each send an instruction stream (IS) to their own processor (P); both processors exchange data streams (DS) with a shared memory (M).]


Multiple Instruction, Single Data (MISD)

This category does not actually exist. It was included in the taxonomy for the sake of completeness.


MISD

[Figure: multiple control units (C) each send an instruction stream (IS) to a processor (P); the processors operate on a single data stream (DS) from memory (M).]


Analogy of Flynn's Classifications

An analogy of Flynn's classification is the check-in desk at an airport:

SISD: a single desk

SIMD: many desks and a supervisor with a megaphone giving instructions that every desk obeys

MIMD: many desks working at their own pace, synchronized through a central database


System Topologies

A system may also be classified by its topology.

A topology is the pattern of connections between processors.

The cost-performance tradeoff determines which topologies to use for a multiprocessor system.


Topology Classification

A topology is characterized by its diameter, total bandwidth, and bisection bandwidth.

Diameter: the maximum distance between two processors in the computer system.

Total bandwidth: the capacity of a communications link multiplied by the number of such links in the system.

Bisection bandwidth: the maximum data transfer that could occur at the bottleneck in the topology.
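As a concrete example (my own illustration, with an assumed per-link capacity of 1), these three measures can be computed for an n-processor ring:

```python
def ring_metrics(n, link_bandwidth=1):
    # Diameter: in a ring, the farthest pair of processors is
    # half-way around, so the maximum distance is n // 2 hops.
    diameter = n // 2
    # Total bandwidth: capacity of one link times the number of
    # links; an n-processor ring has n links.
    total_bw = link_bandwidth * n
    # Bisection bandwidth: cutting a ring into two halves severs
    # exactly 2 links, the bottleneck of the topology.
    bisection_bw = 2 * link_bandwidth
    return diameter, total_bw, bisection_bw

print(ring_metrics(8))  # (4, 8, 2)
```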


System Topologies

Shared Bus Topology

Processors communicate with each other via a single bus that can only handle one data transmission at a time.

In most shared buses, processors directly communicate with their own local memory.

[Figure: several processor (P) and local memory (M) pairs attached to a shared bus, along with a global memory.]


System Topologies

Ring Topology

Uses direct connections between processors instead of a shared bus.

Allows communication links to be active simultaneously, but data may have to travel through several processors to reach its destination.

[Figure: six processors (P) connected in a ring.]


System Topologies

Tree Topology

Uses direct connections between processors, each having up to three connections.

There is only one unique path between any pair of processors.

[Figure: seven processors (P) connected in a binary tree.]


System Topologies

Mesh Topology

In the mesh topology, every processor connects to the processors above and below it, and to its right and left.

[Figure: nine processors (P) arranged in a 3 x 3 grid.]


System Topologies

Hypercube Topology

Is a multiple mesh topology.

Each processor connects to all other processors whose binary addresses differ by one bit. For example, processor 0 (0000) connects to processor 1 (0001) and processor 2 (0010).

[Figure: sixteen processors (P) connected in a 4-dimensional hypercube.]
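The one-bit rule can be checked in code: a processor's neighbors are found by flipping each address bit in turn with XOR. This small Python illustration reproduces the slide's example.

```python
def hypercube_neighbors(node, dimensions):
    # Flipping each of the `dimensions` address bits yields an
    # address that differs from `node` in exactly one bit.
    return sorted(node ^ (1 << bit) for bit in range(dimensions))

# Processor 0 (0000) in a 4-dimensional (16-processor) hypercube
# connects to processors 1 (0001), 2 (0010), 4 (0100), and 8 (1000):
print(hypercube_neighbors(0, 4))  # [1, 2, 4, 8]
```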


System Topologies

Completely Connected Topology

Every processor has n-1 connections, one to each of the other processors.

There is an increase in complexity as the system grows, but this offers maximum communication capabilities.

[Figure: eight processors (P), each directly connected to every other processor.]


    MIMD System Architectures

Finally, the architecture of a MIMD system, in contrast to its topology, refers to its connections to system memory.

Systems may also be classified by their architectures. Two of these are:

Uniform memory access (UMA)

Nonuniform memory access (NUMA)


Vector computers have multiple vector pipelines.

Two families of pipelined vector processors:

Memory-to-memory

Register-to-register


    Development layers

    Applications

    Programming environment

    Languages supported

    Communication model

    Addressing space

    Hardware architecture


System attributes to performance

Clock rate and CPI

The clock has cycle time τ.

Clock rate f = 1/τ

The size of a program is its instruction count (Ic).

Cycles per instruction (CPI): the number of clock cycles needed to execute each instruction.


Performance factors

CPU time T needed to execute the program:

T = Ic x CPI x τ

or, T = Ic x (p + m x k) x τ

where
p = number of processor cycles per instruction
m = number of memory references per instruction
k = ratio between the memory cycle time and the processor cycle time
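Plugging numbers into these formulas shows how the two forms agree. The values below are assumed for illustration only.

```python
def cpu_time(ic, cpi, tau):
    # T = Ic x CPI x tau
    return ic * cpi * tau

def cpu_time_detailed(ic, p, m, k, tau):
    # T = Ic x (p + m*k) x tau: base processor cycles plus the
    # processor-cycle cost of the memory references, per instruction.
    return ic * (p + m * k) * tau

tau = 2e-9                 # assumed 500 MHz clock, so 2 ns cycle time
T = cpu_time(1_000_000, 2.0, tau)
print(T)                   # about 0.004 seconds
# With p = 1.2, m = 0.2, k = 4: CPI = 1.2 + 0.2 * 4 = 2.0, same T.
print(cpu_time_detailed(1_000_000, 1.2, 0.2, 4, tau))
```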


    System attributes

The performance factors (Ic, p, m, k, τ) are influenced by

    Instruction set architecture

    Compiler technology

    CPU implementation and control

    Cache and memory hierarchy


MIPS rate

MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6) = (f x Ic) / (C x 10^6)

where C is the total number of clock cycles needed to execute a given program
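A quick check, with assumed figures, that the three forms of the formula agree:

```python
def mips(ic, t):
    # MIPS rate = Ic / (T x 10^6)
    return ic / (t * 1e6)

f = 400e6              # assumed 400 MHz clock
ic = 2_000_000         # assumed instruction count
cpi = 2.5
c = ic * cpi           # C: total clock cycles for the program
t = c / f              # execution time in seconds
print(mips(ic, t))     # matches f / (CPI x 10^6) = 160.0
```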


Throughput rate Ws, i.e., how many programs a system can execute per unit time:

Ws = f / (Ic x CPI)
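In code, with the same kind of assumed figures as above:

```python
def throughput(f, ic, cpi):
    # Ws = f / (Ic x CPI): programs completed per second.
    return f / (ic * cpi)

# A 400 MHz machine running programs of 2 million instructions
# at CPI = 2.5 (assumed values):
print(throughput(400e6, 2_000_000, 2.5))  # 80.0 programs per second
```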


    Uniform memory access (UMA)

The UMA is a type of symmetric multiprocessor, or SMP, that has two or more processors that perform symmetric functions.

UMA gives all CPUs equal (uniform) access to all memory locations in shared memory. They interact with shared memory through some communications mechanism, such as a simple bus or a complex multistage interconnection network.


Uniform memory access (UMA) Architecture

[Figure: processors 1 through n connected through a communications mechanism to a shared memory.]


Nonuniform memory access (NUMA)

NUMA architectures, unlike UMA architectures, do not allow uniform access to all shared memory locations. This architecture still allows all processors to access all shared memory locations, but in a nonuniform way: each processor can access its local shared memory more quickly than the memory modules not next to it.


Nonuniform memory access (NUMA) Architecture

[Figure: processors 1 through n, each paired with its own local memory module, linked by a communications mechanism.]


    The COMA model

Cache-only memory architecture (COMA) is a computer memory organization for use in multiprocessors in which the local memories (typically DRAM) at each node are used as cache. This is in contrast to using the local memories as actual main memory, as in NUMA organizations.


The COMA model is a special case of a NUMA machine in which the distributed memories are converted into caches.

There is no memory hierarchy at each processor node.

All caches form a global address space.

Remote cache access is assisted by the distributed cache directories.

Depending on the interconnection network used, hierarchical directories may sometimes be used to help locate copies of cache blocks.

Initial data placement is not critical because data will eventually migrate to where it will be used.


The COMA model

[Figure: processor nodes, each containing a processor (P), a directory (D), and a cache (C), connected by an interconnection network.]


    Vector Supercomputers

    Epitomized by Cray-1, 1976:

    Scalar Unit + Vector Extensions

    Load/Store Architecture

Vector Registers

Vector Instructions

    Hardwired Control

    Highly Pipelined Functional Units

    Interleaved Memory System

    No Data Caches

    No Virtual Memory


    THE END