CP0804_06-Apr-2011_RM01_unit 5



    UNIT-5

    PIPELINE AND VECTOR PROCESSING


    PIPELINING AND VECTOR PROCESSING

    Introduction to pipelining and pipeline hazards

    Design issues of pipeline architecture

    Instruction level parallelism and advanced issues

    Parallel processing concepts

    Vector processing

    Array processors

    CISC

    RISC

    VLIW


    PARALLEL PROCESSING

    1. Parallel processing provides simultaneous data-processing tasks for the
    purpose of increasing the computational speed of a computer system.

    Example: while one instruction is being executed in the ALU, the next
    instruction can be read from memory.

    2. Parallel processing increases hardware complexity and cost.


    Multiple functional units

    Adder-subtractor

    Integer multiply

    Logic unit

    Shift unit

    Incrementer

    Floating-point add-subtract

    Floating-point multiply

    Floating-point divide


    PROCESSOR WITH MULTIPLE FUNCTIONAL UNITS

    [Figure: processor registers feeding eight parallel functional units
    (adder-subtractor, integer multiply, logic unit, shift unit, incrementer,
    floating-point add-subtract, floating-point multiply, floating-point
    divide), connected to memory]


    PARALLEL COMPUTERS

    Architectural Classification

    Flynn's classification is based on the multiplicity of instruction streams
    and data streams:

    Instruction stream: sequence of instructions read from memory
    Data stream: operations performed on the data in the processor

                                Number of Data Streams
                                Single      Multiple
    Number of       Single      SISD        SIMD
    Instruction
    Streams         Multiple    MISD        MIMD


    SISD

    Instructions are executed sequentially; the system may or may not have
    internal parallel-processing capabilities.

    SIMD

    Many processing units operate under the supervision of a common control
    unit. All processors receive the same instruction from the control unit
    but operate on different items of data.

    MISD: of theoretical interest only; not practically implemented.

    MIMD: several programs are processed at the same time.


    Parallel processing techniques

    Pipeline processing

    Vector processing

    Array processing


    Pipeline processing:

    Arithmetic sub-operations or the phases of the computer instruction
    cycle overlap in execution.

    Vector processing:

    Deals with computations involving large vectors and matrices.

    Array processing:

    Performs computations on large arrays of data.


    PIPELINING

    A technique of decomposing a sequential process into sub-operations, with
    each sub-process executed in a special dedicated segment that operates
    concurrently with all other segments.

    The result obtained from the computation in each segment is transferred
    to the next segment in the pipeline.

    Overlapping of computation.

    A register holds the data and a combinational circuit performs the
    sub-operation in each segment.


    WHAT IS PIPELINING?

    Pipelining is an implementation technique where multiple instructions are
    overlapped in execution to make fast CPUs.

    It is an implementation technique which exploits parallelism among the
    instructions in a sequential instruction stream.


    THE METHODOLOGY

    In a pipeline, each step is called a pipe stage (or pipe segment) and
    completes a part of an instruction.

    The stages are connected to one another to form a pipe.

    Instructions enter at one end, progress through each stage, and exit at
    the other end.


    PIPELINING

    Example of pipeline processing: Ai * Bi + Ci for i = 1, 2, 3, ..., 7

    R1 ← Ai, R2 ← Bi          Load Ai and Bi
    R3 ← R1 * R2, R4 ← Ci     Multiply and load Ci
    R5 ← R3 + R4              Add

    [Figure: Ai, Bi, Ci come from memory; segment 1 loads R1 and R2;
    segment 2 feeds R1, R2 through the multiplier into R3 and loads Ci into
    R4; segment 3 feeds R3, R4 through the adder into R5]


    OPERATIONS IN EACH PIPELINE STAGE

    Clock Pulse | Segment 1 (R1, R2) | Segment 2 (R3, R4) | Segment 3 (R5)
         1      | A1, B1             |                    |
         2      | A2, B2             | A1*B1, C1          |
         3      | A3, B3             | A2*B2, C2          | A1*B1 + C1
         4      | A4, B4             | A3*B3, C3          | A2*B2 + C2
         5      | A5, B5             | A4*B4, C4          | A3*B3 + C3
         6      | A6, B6             | A5*B5, C5          | A4*B4 + C4
         7      | A7, B7             | A6*B6, C6          | A5*B5 + C5
         8      |                    | A7*B7, C7          | A6*B6 + C6
         9      |                    |                    | A7*B7 + C7

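The clock-pulse table above can be checked with a short simulation. This is a minimal sketch (not from the slides): each loop iteration is one clock pulse, and the segment registers shift one stage per pulse.

```python
def pipeline_abc(A, B, C):
    """Compute [A[i]*B[i] + C[i]] through a 3-segment pipeline."""
    n = len(A)
    seg1 = None   # contents of R1, R2 after segment 1
    seg2 = None   # contents of R3, R4 after segment 2
    out = []      # values written to R5 by segment 3
    for clock in range(n + 2):                 # k + n - 1 = 3 + n - 1 pulses
        if seg2 is not None:                   # segment 3: R5 <- R3 + R4
            out.append(seg2[0] + seg2[1])
        # segment 2: R3 <- R1 * R2, R4 <- Ci (Ci pairs with last pulse's Ai, Bi)
        seg2 = (seg1[0] * seg1[1], C[clock - 1]) if seg1 is not None else None
        if clock < n:                          # segment 1: R1 <- Ai, R2 <- Bi
            seg1 = (A[clock], B[clock])
        else:
            seg1 = None
    return out

# seven tasks, as in the table: result i appears at clock pulse i + 2
res = pipeline_abc([1, 2, 3, 4, 5, 6, 7], [1] * 7, [1, 2, 3, 4, 5, 6, 7])
```

With seven tasks the loop runs 9 clock pulses, matching the 9 rows of the table.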

    GENERAL PIPELINE

    General structure of a 4-segment pipeline: Input → S1/R1 → S2/R2 →
    S3/R3 → S4/R4, with all registers driven by a common clock.

    Space-Time Diagram (tasks T1..T6 through the 4 segments):

    Clock cycle | 1   2   3   4   5   6   7   8   9
    Segment 1   | T1  T2  T3  T4  T5  T6
    Segment 2   |     T1  T2  T3  T4  T5  T6
    Segment 3   |         T1  T2  T3  T4  T5  T6
    Segment 4   |             T1  T2  T3  T4  T5  T6


    PIPELINE SPEEDUP

    n: number of tasks to be performed

    Conventional machine (non-pipelined)
    tn: clock cycle
    t1: time required to complete the n tasks
    t1 = n * tn

    Pipelined machine (k stages)
    tp: clock cycle (time to complete each sub-operation)
    tk: time required to complete the n tasks
    tk = (k + n - 1) * tp

    Speedup:
    Sk = n*tn / ((k + n - 1)*tp)

    As n → ∞:  Sk → tn/tp   ( = k, if tn = k * tp )

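The speedup formula is easy to evaluate numerically. A small sketch (not from the slides), using the symbols defined above:

```python
def pipeline_speedup(n, k, tn, tp):
    """Sk = n*tn / ((k + n - 1)*tp), the pipeline speedup formula."""
    return (n * tn) / ((k + n - 1) * tp)

# the slides' later example: k = 4 stages, tp = 20 ns, n = 100 tasks,
# tn = 4 * 20 = 80 ns per task on the non-pipelined machine
s100 = pipeline_speedup(100, 4, 80, 20)     # 8000 / 2060 ≈ 3.88
# as n grows, Sk approaches tn/tp = k = 4
s_big = pipeline_speedup(10**6, 4, 80, 20)
```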

    PIPELINE AND MULTIPLE FUNCTIONAL UNITS

    [Figure: four functional units P1..P4 executing instructions Ii, Ii+1,
    Ii+2, Ii+3 in parallel]

    Example: 4-stage pipeline
    - sub-operation in each stage: tp = 20 ns
    - 100 tasks to be executed
    - 1 task in the non-pipelined system: 4 * 20 = 80 ns

    Pipelined system: (k + n - 1)*tp = (4 + 99) * 20 = 2060 ns

    Non-pipelined system: n*k*tp = 100 * 80 = 8000 ns

    Speedup: Sk = 8000 / 2060 = 3.88

    A 4-stage pipeline is basically equivalent to a system with 4 identical
    functional units.


    Disadvantage of pipeline

    Different segments may take different times to complete their
    sub-operations.

    The clock cycle must be chosen to equal the time delay of the segment
    with the maximum propagation time.

    This causes all other segments to waste time waiting for the next clock.

    Two areas of computer design where pipeline organization is applicable:

    1. Arithmetic pipeline: divides an arithmetic operation into
    sub-operations for execution in the pipeline segments.

    2. Instruction pipeline: operates on a stream of instructions by
    overlapping the fetch, decode, and execute phases of the instruction
    cycle.


    PIPELINE HAZARDS

    WHAT ARE PIPELINE HAZARDS?

    Hazards are situations that prevent the next instruction in the
    instruction stream from executing during its designated clock cycle.
    They reduce the performance from the ideal speedup gained by pipelining.


    CLASSIFICATION OF HAZARDS

    Structural hazards: arise from resource conflicts when the hardware
    cannot support all possible combinations of instructions in simultaneous
    overlapped execution.

    Data hazards: arise when an instruction depends upon the result of a
    previous instruction in a way that is exposed by the overlapping of
    instructions in the pipeline.


    CLASSIFICATION OF HAZARDS

    Control Hazards : arise from the pipelining

    of branches and other instructions that

    change the PC


    STRUCTURAL HAZARDS

    For a system to be free from structural hazards, functional units must be
    pipelined and resources duplicated enough to allow all possible
    combinations of instructions in the pipeline.

    Structural hazards arise for the following reasons:


    STRUCTURAL HAZARDS

    When a functional unit is not fully pipelined, the sequence of
    instructions using that unit cannot proceed at the rate of one per clock
    cycle.

    When a resource is not duplicated enough to allow all possible
    combinations of instructions.

    Example: a machine may have one register-file write port, but may need
    to perform 2 writes during the same clock cycle.


    STRUCTURAL HAZARDS

    Consider a machine with a single memory shared between data and
    instructions. An instruction containing a data-memory reference will
    conflict with the instruction fetch of a later instruction.

    This is resolved by stalling the pipeline for one clock cycle when the
    data-memory access occurs.


    DATA HAZARDS

    Data hazards occur when the pipeline changes the order of read/write
    accesses to operands so that the order differs from the order seen by
    sequentially executing instructions on an unpipelined machine.


    CLASSIFICATION OF DATA HAZARDS

    RAW (read after write): consider two instructions i and j, with i
    occurring before j.

    j tries to read a source before i actually writes it; as a result, j
    gets the old value.

    Example (R1, written by ADD, is read by the following instructions):

    ADD R1,R2,R3
    SUB R4,R1,R5
    AND R6,R1,R7
    OR  R8,R1,R9
    XOR R10,R1,R11


    CLASSIFICATION OF DATA HAZARDS

    This hazard is overcome by a simple hardware technique called forwarding.

    In forwarding, the ALU result from the EX/MEM register is always fed
    back into the ALU input latches.

    If the forwarding hardware detects that the previous ALU operation has
    written the register corresponding to a source for the current ALU
    operation, the control logic selects the forwarded result as the ALU
    input rather than the value read from the register file.
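The forwarding decision described above can be sketched as follows. This is an illustrative model, not a real pipeline: the register names and the dictionary-based EX/MEM latch are invented for the example.

```python
def alu_input(src_reg, regfile, ex_mem):
    """Select the forwarded ALU result when the previous instruction
    wrote src_reg; otherwise read the register file."""
    if ex_mem is not None and ex_mem["dest"] == src_reg:
        return ex_mem["value"]            # forward from the EX/MEM latch
    return regfile[src_reg]               # no dependence: use register file

regs = {"R1": 0, "R2": 3, "R3": 4, "R5": 10}
# ADD R1,R2,R3 has just finished EX; its result (7) sits in EX/MEM
# and has not yet been written back to R1
ex_mem = {"dest": "R1", "value": regs["R2"] + regs["R3"]}
# SUB R4,R1,R5 reads R1 in the very next cycle:
fwd = alu_input("R1", regs, ex_mem)       # forwarding supplies 7, not the stale 0
other = alu_input("R5", regs, ex_mem)     # R5 has no conflict: read normally
```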


    CLASSIFICATION OF DATA HAZARDS

    WAW (write after write):

    j tries to write an operand before it is written by i. The writes are
    performed in the wrong order, leaving the value written by i as the
    final value.

    This hazard is present in pipelines that write in more than one pipe
    stage. In DLX this isn't a hazard, as it writes only in the WB stage.


    CLASSIFICATION OF DATA HAZARDS

    Example (both instructions write R1):

    LW  R1,0(R2)
    ADD R1,R2,R3


    CLASSIFICATION OF DATA HAZARDS

    WAR (write after read):

    j tries to write a destination before it is read by i.

    This doesn't happen in DLX, as all reads occur early (in the ID stage)
    and all writes occur late (in the WB stage).

    Example:

    SW  0(R1),R2
    ADD R2,R3,R4


    CONTROL HAZARDS

    Control hazards cause a greater performance loss than data hazards.

    The simplest method of dealing with branches is to stall the pipeline as
    soon as the branch is detected in the ID stage, until the new PC is
    finally determined in the MEM stage.


    CONTROL HAZARDS

    Each branch causes a 3-cycle stall in the DLX pipeline, a significant
    loss given that about 30% of the instructions executed are branches.

    The branch stall is reduced by testing the branch condition in the ID
    stage and computing the destination address there using a separate
    adder.

    Thus there is only a one-clock-cycle stall on branches.


    WHAT MAKES PIPELINING HARD TO IMPLEMENT?

    EXCEPTIONAL SITUATIONS: situations in which the normal order of
    execution is changed. They are caused by instructions that raise
    exceptions and may force the machine to abort the instructions in the
    pipeline before they complete.


    WHAT MAKES PIPELINING HARD TO IMPLEMENT?

    Some of the exceptions include:

    o Integer arithmetic overflow/underflow
    o Power failure
    o Hardware malfunctions
    o I/O device requests


    Arithmetic pipeline

    Usually found in high-speed computers.

    Used to implement floating-point operations and multiplication of
    fixed-point numbers.

    Example: floating-point addition and subtraction.

    A and B are fractions representing the mantissas, and a and b are the
    exponents.

    The sub-operations performed in the four segments are:

    [1] Compare the exponents
    [2] Align the mantissas
    [3] Add/subtract the mantissas
    [4] Normalize the result


    ARITHMETIC PIPELINE

    Floating-point adder:  X = A x 2^a,  Y = B x 2^b

    Segment 1: Compare the exponents (by subtraction) and choose the larger
    exponent
    Segment 2: Align the mantissas using the exponent difference
    Segment 3: Add or subtract the mantissas
    Segment 4: Normalize the result and adjust the exponent

    [Figure: the four segments in sequence, separated by registers R that
    latch the exponents and mantissas between stages]


    Instruction pipeline

    An instruction pipeline reads consecutive instructions from memory while
    previous instructions are being executed in other segments.

    This causes the instruction fetch and execute phases to overlap and
    perform simultaneous operations.


    INSTRUCTION CYCLE

    Six phases in an instruction cycle:

    [1] Fetch an instruction from memory
    [2] Decode the instruction
    [3] Calculate the effective address of the operand
    [4] Fetch the operands from memory
    [5] Execute the operation
    [6] Store the result in the proper place

    Some instructions skip some phases:
    * Effective-address calculation can be done as part of the decoding phase
    * Storage of the operation result into a register is done automatically
    in the execution phase

    ==> 4-Stage Pipeline

    [1] FI: Fetch an instruction from memory
    [2] DA: Decode the instruction and calculate the effective address of
    the operand
    [3] FO: Fetch the operand
    [4] EX: Execute the operation


    INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE

    Four-segment CPU pipeline:

    Segment 1: Fetch instruction from memory
    Segment 2: Decode instruction and calculate effective address; if the
    instruction is a branch, empty the pipe and update the PC
    Segment 3: Fetch operand from memory
    Segment 4: Execute instruction; if an interrupt is pending, empty the
    pipe, handle the interrupt, and update the PC


    Timing of instruction pipeline

    Step:           1   2   3   4   5   6   7   8   9   10  11  12  13
    Instruction 1:  FI  DA  FO  EX
                2:      FI  DA  FO  EX
       (Branch) 3:          FI  DA  FO  EX
                4:              FI  -   -   FI  DA  FO  EX
                5:                              FI  DA  FO  EX
                6:                                  FI  DA  FO  EX
                7:                                      FI  DA  FO  EX

    The fetch of instruction 4 at step 4 is discarded when the branch in
    instruction 3 is detected; fetching resumes from the branch target after
    the branch executes.


    Major difficulties

    Resource conflicts:

    Caused by access to memory by two segments at the same time. These
    conflicts can be resolved by using separate instruction and data
    memories.

    Data dependency conflicts:

    Arise when an instruction depends on the result of a previous
    instruction, but that result is not yet available.

    Branch difficulties:

    Arise from branch and other instructions that change the value of the PC.


    Data dependency

    A data dependency occurs when an instruction needs data that are not yet
    available. Solutions:

    1. Hardware interlocks: a circuit detects instructions whose source
    operands are destinations of instructions farther up in the pipeline.
    An instruction whose source is not available is delayed by enough clock
    cycles to resolve the conflict.

    2. Operand forwarding: special hardware detects a conflict and then
    avoids it by routing the data through special paths between pipeline
    segments.

    3. Delayed load: the compiler reorders the instructions as necessary to
    delay the loading of conflicting data by inserting no-operation
    instructions.


    Handling of branch instructions

    Prefetch target instruction:

    Fetch instructions in both streams, branch-not-taken and branch-taken.

    Both are saved until the branch is executed; then select the right
    instruction stream and discard the wrong one.

    Branch target buffer (BTB; associative memory):

    Each entry holds the address of a previously executed branch, its
    target instruction, and the next few instructions.

    When fetching an instruction, search the BTB.

    If found, fetch the instruction stream in the BTB; if not, fetch a new
    stream and update the BTB.
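A toy model of the BTB lookup just described, sketched in Python. The addresses and the dictionary-based table are illustrative, not a real hardware design:

```python
class BranchTargetBuffer:
    """Associative table: branch PC -> previously taken target PC."""
    def __init__(self):
        self.table = {}

    def next_fetch(self, pc, fallthrough):
        # BTB hit: fetch the recorded stream; miss: fetch sequentially
        return self.table.get(pc, fallthrough)

    def update(self, pc, target):
        # record (or refresh) the taken target for this branch
        self.table[pc] = target

btb = BranchTargetBuffer()
miss = btb.next_fetch(100, 104)   # first encounter: miss, fall through to 104
btb.update(100, 400)              # the branch at 100 was taken to 400
hit = btb.next_fetch(100, 104)    # next encounter: hit, fetch the BTB stream
```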


    Loop buffer (high-speed register file):

    Stores an entire loop, allowing it to execute without accessing memory.

    Branch prediction:

    Guess the branch condition and fetch an instruction stream based on the
    guess. A correct guess eliminates the branch penalty.

    Delayed branch:

    The compiler detects the branch and rearranges the instruction sequence
    by inserting useful instructions that keep the pipeline busy in the
    presence of a branch instruction.


    RISC pipeline

    RISC: a machine with a very fast clock cycle that executes at the rate
    of one instruction per cycle.


    RISC PIPELINE

    Instruction cycles of the three-stage instruction pipeline:

    Data manipulation instructions
    I: Instruction fetch
    A: Decode, read registers, ALU operation
    E: Write a register

    Load and store instructions
    I: Instruction fetch
    A: Decode, evaluate effective address
    E: Register-to-memory or memory-to-register transfer

    Program control instructions
    I: Instruction fetch
    A: Decode, evaluate branch address
    E: Write register (PC)


    DELAYED LOAD

    Three-segment pipeline timing

    LOAD:  R1 ← M[address 1]
    LOAD:  R2 ← M[address 2]
    ADD:   R3 ← R1 + R2
    STORE: M[address 3] ← R3

    Pipeline timing with data conflict:

    clock cycle  1  2  3  4  5  6
    Load R1      I  A  E
    Load R2         I  A  E
    Add R1+R2          I  A  E
    Store R3              I  A  E

    Pipeline timing with delayed load:

    clock cycle  1  2  3  4  5  6  7
    Load R1      I  A  E
    Load R2         I  A  E
    NOP                I  A  E
    Add R1+R2             I  A  E
    Store R3                 I  A  E

    Advantage: the data dependency is taken care of by the compiler rather
    than the hardware.

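The compiler-side fix above — inserting a NOP into the load delay slot — can be sketched as a small rewriting pass. The `(op, dest, sources)` tuples are an invented representation for illustration:

```python
def insert_load_delays(program):
    """Insert a NOP whenever an instruction uses the register loaded
    by the immediately preceding LOAD."""
    out = []
    for instr in program:
        op, dest, srcs = instr
        if out:
            prev_op, prev_dest, _ = out[-1]
            if prev_op == "LOAD" and prev_dest in srcs:
                out.append(("NOP", None, ()))   # fill the load delay slot
        out.append(instr)
    return out

prog = [("LOAD", "R1", ()),
        ("LOAD", "R2", ()),
        ("ADD", "R3", ("R1", "R2")),    # uses R2 right after LOAD R2
        ("STORE", None, ("R3",))]
fixed = insert_load_delays(prog)        # NOP lands between LOAD R2 and ADD
```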

    DELAYED BRANCH

    Using no-operation instructions:

    Clock cycles:   1  2  3  4  5  6  7  8  9  10
    1. Load A       I  A  E
    2. Increment       I  A  E
    3. Add                I  A  E
    4. Subtract              I  A  E
    5. Branch to X              I  A  E
    6. NOP                         I  A  E
    7. NOP                            I  A  E
    8. Instr. in X                       I  A  E

    Rearranging the instructions:

    Clock cycles:   1  2  3  4  5  6  7  8
    1. Load A       I  A  E
    2. Increment       I  A  E
    3. Branch to X        I  A  E
    4. Add                   I  A  E
    5. Subtract                 I  A  E
    6. Instr. in X                 I  A  E

    The compiler analyzes the instructions before and after the branch and
    rearranges the program sequence by inserting useful instructions in the
    delay steps.


    CISC (COMPLEX INSTRUCTION SET COMPUTING)

    A CISC is a computer in which a single instruction can execute several
    low-level operations and which is capable of multi-step operations or
    addressing modes within a single instruction.

    Some complex instructions are difficult or impossible to execute in one
    cycle through the pipeline.

    Many different addressing modes; instructions of different lengths.


    Implementing a CISC Architecture

    There are simple and complex instructions in a CISC architecture. One
    approach:

    Adapt the RISC pipeline. Execute the simple, frequently used CISC
    instructions as in RISC.

    For the more complex instructions, use microinstructions: a sequence of
    microinstructions is stored in ROM for each complex CISC instruction.
    Complex instructions often involve multiple microoperations or memory
    accesses in sequence.

    When a complex instruction is decoded in the DOF stage of the pipeline,
    the microcode address and control are given to the microcode counter.
    Microinstructions are executed until the instruction is completed.

    Each microoperation is simply a set of control input signals.

    Example: a certain CISC instruction I has microcode written in the
    microcode ROM at address A. When I is decoded in the main pipeline, the
    main pipeline is stalled and control is given to the microcode control
    (MC). At each subsequent clock cycle, the MC increments and executes the
    next microoperation (a control word that controls the datapath). The
    last microoperation in the sequence gives control back to the main
    pipeline and un-stalls it.


    CISC Approach

    The primary goal of CISC architecture is to complete a task in as few
    lines of assembly as possible.

    This is achieved by building processor hardware that is capable of
    understanding and executing a series of operations.

    EX: MULT 2:3,5:2

    For this task a CISC processor comes prepared with a specific
    instruction. This instruction loads the two values into separate
    registers, multiplies the operands in the execution unit, and stores the
    product in the appropriate register.

    The entire task of multiplying two numbers can be completed with one
    instruction.

    MULT ----- complex instruction.

    CISC minimizes the number of instructions per program, sacrificing the
    number of cycles per instruction.


    VECTOR PROCESSING

    Vector Processing Applications

    Problems that can be efficiently formulated in terms of vectors:

    Long-range weather forecasting
    Petroleum exploration
    Seismic data analysis
    Medical diagnosis
    Aerodynamics and space-flight simulations
    Artificial intelligence and expert systems
    Image processing

    Vector processor (computer):

    Has the ability to process vectors, and related data structures such as
    matrices and multi-dimensional arrays, much faster than conventional
    computers.

    Vector processors may also be pipelined.


    VECTOR PROGRAMMING

    DO 20 I = 1, 100
    20 C(I) = B(I) + A(I)

    Conventional computer:

       Initialize I = 0
    20 Read A(I)
       Read B(I)
       Store C(I) = A(I) + B(I)
       Increment I = I + 1
       If I <= 100 go to 20

    Vector computer:

    C(1:100) = A(1:100) + B(1:100)
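The two styles can be mimicked in Python — the element-by-element loop of the conventional computer versus a single whole-vector operation:

```python
A = list(range(100))          # A(1:100)
B = list(range(100, 200))     # B(1:100)

# conventional computer: explicit loop, one element per iteration
C_loop = []
for i in range(100):
    C_loop.append(A[i] + B[i])

# vector computer: one whole-array operation, C(1:100) = A(1:100) + B(1:100)
C_vec = [a + b for a, b in zip(A, B)]
```

Both produce the same result; the vector form expresses the entire computation as one operation, which is what a vector instruction does in hardware.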


    VECTOR INSTRUCTIONS

    f1: V → V
    f2: V → S
    f3: V x V → V
    f4: V x S → V

    V: vector operand, S: scalar operand

    Type  Mnemonic  Description (I = 1, ..., n)
    f1    VSQR      Vector square root     B(I) ← SQR(A(I))
    f1    VSIN      Vector sine            B(I) ← sin(A(I))
    f1    VCOM      Vector complement      A(I) ← complement of A(I)
    f2    VSUM      Vector summation       S ← Σ A(I)
    f2    VMAX      Vector maximum         S ← max{A(I)}
    f3    VADD      Vector add             C(I) ← A(I) + B(I)
    f3    VMPY      Vector multiply        C(I) ← A(I) * B(I)
    f3    VAND      Vector AND             C(I) ← A(I) . B(I)
    f3    VLAR      Vector larger          C(I) ← max(A(I), B(I))
    f3    VTGE      Vector test >=         C(I) ← 0 if A(I) < B(I)
                                           C(I) ← 1 if A(I) >= B(I)
    f4    SADD      Vector-scalar add      B(I) ← S + A(I)
    f4    SDIV      Vector-scalar divide   B(I) ← A(I) / S


    VECTOR INSTRUCTION FORMAT

    | Operation code | Base address source 1 | Base address source 2 |
    Base address destination | Vector length |

    Pipeline for inner product:

    [Figure: source A and source B feed a multiplier pipeline, whose
    products feed an adder pipeline]


    MULTIPLE MEMORY MODULE AND INTERLEAVING

    Multiple-module memory

    Address interleaving: different sets of addresses are assigned to
    different memory modules.

    [Figure: four modules M0..M3, each a memory array with its own address
    register (AR) and data register (DR), all connected to a common address
    bus and data bus]


    Pipeline and vector processors often require simultaneous access to
    memory from two or more sources.

    An instruction pipeline may require fetching an instruction and an
    operand at the same time from two different segments.

    Memory can be partitioned into a number of modules connected to common
    memory address and data buses. A memory module is a memory array
    together with its own address and data registers.

    One module can initiate a memory access while other modules are in the
    process of reading or writing, and each module can honor a memory
    request independent of the state of the other modules.


    Advantage of modular memory

    Interleaving:

    In an interleaved memory, different sets of addresses are assigned to
    different memory modules.

    A vector processor that uses n-way interleaved memory can fetch n
    operands from n different modules.

    Example: in a two-module memory system, the even addresses may be in
    one module and the odd addresses in the other.

    A CPU with an instruction pipeline can take advantage of multiple
    memory modules so that each segment in the pipeline can access memory
    independent of memory accesses from other segments.
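The interleaved address mapping can be sketched as follows — a minimal model in which `addr mod n` selects the module and `addr div n` the word within it:

```python
def module_of(addr, n_modules):
    """Map an address to (module number, offset within the module)."""
    return addr % n_modules, addr // n_modules

# two-module example from the text: even addresses in M0, odd in M1
m_even = module_of(6, 2)    # -> (0, 3)
m_odd = module_of(7, 2)     # -> (1, 3)

# n consecutive addresses hit n distinct modules in an n-way system,
# so an n-way interleaved memory can supply n operands at once
modules = [module_of(a, 4)[0] for a in range(100, 104)]
```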


    Array processors

    An array processor is a processor that performs computations on large
    arrays of data.

    Attached array processor:

    An auxiliary processor attached to a general-purpose computer.

    SIMD array processor:

    A processor with a single-instruction, multiple-data organization. It
    manipulates vector instructions by means of multiple functional units
    responding to a common instruction.


    Attached array processor

    Enhances the performance of a computer by providing vector processing
    for complex scientific applications.

    [Figure: a general-purpose computer (with main memory) connected through
    an input-output interface to the attached array processor (with local
    memory) over a high-speed memory-to-memory bus]


    SIMD Array processors

    SIMD array processor organization

    [Figure: a master control unit and main memory drive processing elements
    PE1..PEn, each paired with a local memory M1..Mn]


    A SIMD array processor is a computer with multiple processing units
    operating in parallel.

    It consists of a set of identical processing elements, each having a
    local memory.

    Each processing element includes an ALU, a floating-point arithmetic
    unit, and working registers.

    The master control unit controls the operations in the processing
    elements.

    Main memory is used for storage of the program.

    The function of the master control unit is to decode the instructions
    and determine how they are to be executed.

    Vector instructions are broadcast to all PEs simultaneously.

    Vector operands are distributed to the local memories prior to parallel
    execution of the instruction.
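A toy model of this broadcast: one instruction is applied by every PE to the operands in its own local memory. The dictionary-based local memories are an invented representation for illustration:

```python
def simd_broadcast(instruction, local_memories):
    """Master control unit broadcasts one instruction; every PE applies
    it to its own local operands and produces its own result."""
    return [instruction(mem["A"], mem["B"]) for mem in local_memories]

# distribute the vector operands to the PEs' local memories...
pes = [{"A": 1, "B": 10}, {"A": 2, "B": 20}, {"A": 3, "B": 30}]
# ...then broadcast a single ADD instruction to all PEs at once
sums = simd_broadcast(lambda a, b: a + b, pes)
```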
