MODULE - 5 INTRODUCTION TO PARALLEL PROCESSING


Page 1: CO Module 5

MODULE - 5 INTRODUCTION TO PARALLEL PROCESSING

Page 2: CO Module 5

Module 5 Parallel Processing Slide 2

Contents:

1. Parallel Processing
2. Architectural Classification
3. Pipeline Computers
4. Arithmetic Pipeline
5. Instruction Pipeline
6. Array Processors
7. Vector Processing
8. Multiprocessors
9. Comparison of RISC & CISC

Page 3: CO Module 5

Module 5 Parallel Processing Slide 3

5.1. Parallel Processing

• Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process.

• Concurrency implies:

1. Parallelism: parallel events may occur in multiple resources during the same time interval.
2. Simultaneity: simultaneous events may occur at the same time instant.
3. Pipelining: pipelined events may occur in overlapped time spans.

• Parallel processing demands concurrent execution of many programs in the computer.

Page 4: CO Module 5

Module 5 Parallel Processing Slide 4

Advantages and Disadvantages

• The purpose of parallel processing is to speed up the computer processing capability and increase its throughput.

• Throughput means the amount of processing that can be accomplished during a given interval of time.

• The amount of hardware increases with parallel processing, and so the cost of the system increases.

Page 5: CO Module 5

Module 5 Parallel Processing Slide 5

[Figure: Processor with multiple functional units operating in parallel. The processor registers, which also connect to memory, feed the following units: adder-subtractor, integer multiply, logic unit, shift unit, incrementer, floating-point add-subtract, floating-point multiply, and floating-point divide.]

Page 6: CO Module 5

Module 5 Parallel Processing Slide 6

• The operands in the registers are applied to one of the units depending on the operation specified by the instruction associated with the operands.

• The operation performed in each functional unit is indicated in each block of the diagram.

• The adder and integer multiplier perform the arithmetic operations with integer numbers.

• The floating point operations are separated into 3 circuits operating in parallel.

• The logic, shift, and increment operations can be performed concurrently on different data.

• All units are independent of each other, so one number can be shifted while another number is being incremented.

Page 7: CO Module 5

Module 5 Parallel Processing Slide 7

5.2 Architectural classification schemes

• Three computer architectural classification schemes:

1. Flynn’s classification is based on the multiplicity of instruction streams and data streams in a computer system.

2. Feng’s scheme is based on serial versus parallel processing.

3. Handler’s classification is determined by the degree of parallelism and pipelining in various levels.

Page 8: CO Module 5

Module 5 Parallel Processing Slide 8

Flynn’s Classification

Page 9: CO Module 5

Module 5 Parallel Processing Slide 9

The four categories:

1. Single instruction stream-single data stream (SISD)
2. Single instruction stream-multiple data stream (SIMD)
3. Multiple instruction stream-single data stream (MISD)
4. Multiple instruction stream-multiple data stream (MIMD)
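The four categories depend only on whether there is one or more than one instruction stream and data stream. A minimal sketch of that rule follows; the function name `flynn_class` and its integer arguments are illustrative assumptions, not part of Flynn's scheme.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category for the given stream counts (hypothetical helper)."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"   # single or multiple instruction stream
    d = "S" if data_streams == 1 else "M"          # single or multiple data stream
    return f"{i}I{d}D"


print(flynn_class(1, 1))   # SISD
print(flynn_class(1, 8))   # SIMD
print(flynn_class(4, 4))   # MIMD
```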

Page 10: CO Module 5

Module 5 Parallel Processing Slide 10

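5.2.1. SISD computer organization

[Figure: SISD organization. The control unit (CU) sends an instruction stream (IS) to the processor unit (PU); the PU exchanges a data stream (DS) with the memory module (MM), which supplies the IS to the CU.]

• This represents the organization of a single computer containing a control unit (CU), a processor unit (PU), and a memory module (MM).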

• Instructions are executed sequentially and the system may or may not have internal parallel processing capabilities.

• Parallel processing may be achieved by means of multiple functional units or by pipeline processing.

Page 11: CO Module 5

Module 5 Parallel Processing Slide 11

5.2.2. SIMD computer organization

• This class corresponds to array processors.

• There are multiple processing elements supervised by the same control unit.

• All PEs receive the same instruction broadcast from the CU but operate on different data sets from distinct data streams.

• The shared memory subsystem may contain multiple modules.

Page 12: CO Module 5

Module 5 Parallel Processing Slide 12

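[Figure: SIMD organization. A single control unit (CU) broadcasts one instruction stream (IS) to processor units PU1, PU2, ..., PUn. Each PUi exchanges its own data stream DSi with memory module MMi of the shared memory.]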

Page 13: CO Module 5

Module 5 Parallel Processing Slide 13

5.2.3. MISD Computer organization

• There are n processor units, each receiving distinct instructions operating over the same data stream and its derivatives.

• The result of one processor becomes the input of the next processor in the macro pipe.

• MISD structure is only of theoretical interest since no practical system has been constructed using this organization.

Page 14: CO Module 5

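[Figure: MISD organization. Control units CU1, CU2, ..., CUn issue instruction streams IS1, IS2, ..., ISn to processor units PU1, PU2, ..., PUn, which all operate on a single data stream (DS) from the shared memory (SM) made up of modules MM1, MM2, ..., MMm.]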

Page 15: CO Module 5

Module 5 Parallel Processing Slide 15

5.2.4. MIMD computer organization

• MIMD organization refers to a computer system capable of processing several programs at the same time.

• Most multiprocessor systems and multiple computer systems can be classified into this category.

• An intrinsic MIMD computer implies interactions among the n processors because all memory streams are derived from the same data space shared by all processors.

Page 16: CO Module 5

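[Figure: MIMD organization. Control units CU1, CU2, ..., CUn issue instruction streams IS1, IS2, ..., ISn to processor units PU1, PU2, ..., PUn. Each PUi exchanges its own data stream DSi with the shared memory (SM) made up of modules MM1, MM2, ..., MMm.]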

Page 17: CO Module 5

Module 5 Parallel Processing Slide 17

Parallel Computer Structures

• Parallel computers are those systems that emphasize parallel processing. Parallel computers are divided into three architectural configurations:

1. Pipeline computers
2. Array processors
3. Multiprocessor systems

– A pipeline computer performs overlapped computations to exploit temporal parallelism.

– An array processor uses multiple synchronized arithmetic logic units to achieve spatial parallelism.

– A multiprocessor system achieves asynchronous parallelism through a set of interactive processors with shared resources.

• The difference between an array processor and a multiprocessor is that the processing elements in an array processor operate synchronously, whereas the processors in a multiprocessor system may operate asynchronously.

Page 18: CO Module 5

Module 5 Parallel Processing Slide 18

5.3 Pipeline Computers

• Pipelining is a technique of decomposing a sequential process into suboperations, with each suboperation being executed in a special dedicated segment that operates concurrently with all other segments.

• Each segment performs partial processing dictated by the way the task is partitioned.

• The result obtained from the computation in each segment is transferred to the next segment in the pipeline.

• The final result is obtained after the data have passed through all segments.

• The overlapping of computation is made possible by associating a register with each segment in the pipeline.

Page 19: CO Module 5

Module 5 Parallel Processing Slide 19

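Basic Ideas

• Parallel processing: each processor P1 to P4 handles one data stream (P1: a1 a2 a3 a4, P2: b1 b2 b3 b4, P3: c1 c2 c3 c4, P4: d1 d2 d3 d4) and performs every type of operation on it. Less inter-processor communication; complicated processor hardware.

• Pipelined processing: each processor P1 to P4 performs one type of operation on every data stream (P1: a1 b1 c1 d1, P2: a2 b2 c2 d2, P3: a3 b3 c3 d3, P4: a4 b4 c4 d4), with the processors staggered in time. More inter-processor communication; simpler processor hardware.

[Figure: timing diagrams of P1 to P4 against time for both schemes. Colors: different types of operations performed. a, b, c, d: different data streams processed.]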

Page 20: CO Module 5

Module 5 Parallel Processing Slide 20

Data Dependence

• Parallel processing requires NO data dependence between processors.

• Pipelined processing will involve inter-processor communication.

[Figure: timing diagrams of processors P1 to P4 against time, for parallel processing and for pipelined processing.]

Page 21: CO Module 5

Module 5 Parallel Processing Slide 21

• The process of executing an instruction involves 4 major steps: instruction fetch (IF), instruction decode (ID), operand fetch (OF), and execution (EX).

• In a nonpipelined computer, these 4 steps must be completed before the next instruction can be issued.

• In a pipelined computer, successive instructions are executed in an overlapped fashion.

[Figure: Pipelined processor with stages S1 (IF), S2 (ID), S3 (OF), and S4 (EX).]

Page 22: CO Module 5

Module 5 Parallel Processing Slide 22

• The general structure of a 4-segment pipeline:

[Figure: Input, then S1, R1, S2, R2, S3, R3, S4, R4 in sequence; a common clock drives the registers R1 to R4.]

• The operands pass through all 4 segments in a fixed sequence. Each segment consists of a combinational circuit Si that performs a suboperation over the data stream flowing through the pipe.

• The segments are separated by registers Ri that hold the intermediate results between the stages.

• The behavior of a pipeline can be illustrated with a space-time diagram.

Page 23: CO Module 5

Module 5 Parallel Processing Slide 23

Example:

• Suppose we want to perform Ai * Bi + Ci for i = 1, 2, 3, ..., 7.

• Each suboperation is to be implemented in a segment within a pipeline.

• Each segment has one or two registers and a combinational circuit.

• R1 through R5 are registers that receive new data with every clock pulse.

• The suboperations performed in each segment of the pipeline are:

– Segment 1: R1 ← Ai ; R2 ← Bi
– Segment 2: R3 ← R1 * R2 ; R4 ← Ci
– Segment 3: R5 ← R3 + R4

[Figure: Ai and Bi load R1 and R2, which feed the multiplier; the multiplier output loads R3 while Ci loads R4; R3 and R4 feed the adder, whose output loads R5.]

Page 24: CO Module 5

Module 5 Parallel Processing Slide 24

• The first clock pulse transfers A1 & B1 into R1 & R2.

• The second clock pulse transfers the product of R1 & R2 into R3 and C1 into R4. The same clock pulse transfers A2 and B2 into R1 & R2.

• The third clock pulse operates on all 3 segments simultaneously. It places A3 & B3 into R1 & R2, transfers R1 * R2 into R3, transfers C2 into R4, and places the sum of R3 & R4 into R5.

• It takes 3 clock pulses to fill up the pipe and retrieve the first output from R5. From there on, each clock produces a new output and moves the data one step down the pipeline.
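The clock-by-clock behaviour described above can be traced with a small simulation. This is only a sketch under stated assumptions: the function name `run_pipeline`, the use of `None` for an empty register, and the sample data are illustrative choices, not part of the original slides.

```python
def run_pipeline(A, B, C):
    """Simulate R1..R5 computing Ai * Bi + Ci; returns (clock, R5) for each output."""
    n = len(A)
    R1 = R2 = R3 = R4 = R5 = None            # None marks an empty register
    outputs = []
    for clock in range(1, n + 3):            # 3 segments -> n + 2 clock pulses
        # All registers load on the same clock pulse, so compute the new
        # values from the old register contents before overwriting them.
        new_R5 = R3 + R4 if R3 is not None else None               # segment 3
        new_R3 = R1 * R2 if R1 is not None else None               # segment 2
        new_R4 = C[clock - 2] if 2 <= clock <= n + 1 else None     # segment 2
        if clock <= n:                                             # segment 1
            new_R1, new_R2 = A[clock - 1], B[clock - 1]
        else:
            new_R1 = new_R2 = None
        R1, R2, R3, R4, R5 = new_R1, new_R2, new_R3, new_R4, new_R5
        if R5 is not None:
            outputs.append((clock, R5))
    return outputs


A = [1, 2, 3, 4, 5, 6, 7]
B = [2] * 7
C = [1] * 7
for clock, value in run_pipeline(A, B, C):
    print(f"clock {clock}: R5 = {value}")
```

With seven operand sets the first result appears on clock pulse 3 and the last on clock pulse 9, one result per pulse in between.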

Page 25: CO Module 5

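[Figure: Space-time diagrams for a processor with stages IF, ID, OF, and EX. a) Pipelined: instructions I1 to I5 overlap in the four stages, and once the pipe is full an output (O/P) is produced on every cycle. b) Nonpipelined: each instruction passes through all four stages before the next one starts, so an output (O/P) is produced every four cycles.]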

Page 26: CO Module 5

Module 5 Parallel Processing Slide 26

• An instruction cycle consists of multiple pipeline cycles.

• The pipeline cycle can be set equal to the delay of the slowest stage.

• The flow of data from stage to stage is triggered by a common clock of the pipeline.

• For the nonpipelined computer, it takes four pipeline cycles to complete one instruction.

• Once a pipeline is filled up, an output result is produced from the pipeline on each cycle.

• The effective instruction cycle is thus reduced to one-fourth of the original cycle time.

• Due to overlapped instruction and arithmetic execution, pipeline machines are better tuned to perform the same operations repeatedly.

Page 27: CO Module 5

Module 5 Parallel Processing Slide 27

• To complete n tasks with clock time $t_p$ using a k-segment pipeline requires k + (n - 1) clock cycles.

• For example, consider 4 segments and 5 tasks. The time required to complete all the operations is 4 + (5 - 1) = 8 clock cycles.

• Consider a nonpipeline unit that performs the same operation and takes a time $t_n$ to complete each task. The total time required for n tasks is $n t_n$.

• The speedup of pipeline processing over equivalent nonpipeline processing is defined as

$S = \dfrac{n t_n}{(k + n - 1)\, t_p}$
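The cycle count and the speedup ratio can be checked numerically. The sketch below is illustrative: the function names are hypothetical, and the sample values assume that the nonpipelined unit takes as long as all k segments together (tn = k * tp), which the slides do not state.

```python
def pipeline_cycles(k: int, n: int) -> int:
    """Clock cycles needed to complete n tasks in a k-segment pipeline."""
    return k + (n - 1)


def speedup(k: int, n: int, tp: float, tn: float) -> float:
    """Speedup S = n*tn / ((k + n - 1) * tp) of a pipeline over a nonpipelined unit."""
    return (n * tn) / (pipeline_cycles(k, n) * tp)


print(pipeline_cycles(4, 5))                 # 4 segments, 5 tasks
print(speedup(4, 5, tp=1.0, tn=4.0))         # assumes tn = k * tp
print(speedup(4, 1_000_000, tp=1.0, tn=4.0))
```

Under the tn = k * tp assumption the speedup approaches k as the number of tasks grows, which is why a 4-segment pipeline is at best about four times faster.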

Page 28: CO Module 5

Module 5 Parallel Processing Slide 28

5.4 Arithmetic Pipeline

• An arithmetic pipeline divides an arithmetic operation into sub operations for execution in the pipeline segments.

• They are used to implement floating point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems.

Page 29: CO Module 5

Module 5 Parallel Processing Slide 29

Example

• The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers:

– $X = A \times 2^{a}$
– $Y = B \times 2^{b}$

• A & B are two fractions that represent the mantissas and a & b are the exponents. The floating-point addition and subtraction can be performed in 4 segments.

• The registers labeled R are placed between the segments to store intermediate results.

• The suboperations that are performed in the 4 segments are:

1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.

Page 30: CO Module 5

Module 5 Parallel Processing Slide 30

• Exponents are compared by subtracting them to determine their difference.

• The larger exponent is chosen as the exponent of the result.

• The exponent difference determines how many times the mantissa associated with the smaller exponent must be shifted to the right. This produces an alignment of the 2 mantissas.

• The 2 mantissas are added or subtracted in segment 3. The result is normalized in segment 4.

• When an overflow occurs, the mantissa is shifted right and the exponent is incremented by one.

• If an underflow occurs, the number of leading zeros in the mantissa determines the number of left shifts in the mantissa and the number that must be subtracted from the exponent.

Page 31: CO Module 5

31

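[Figure: Pipeline for floating-point addition and subtraction. Inputs: exponents a, b and mantissas A, B. Seg 1: compare exponents by subtraction. Seg 2: choose exponent and align mantissas using the exponent difference. Seg 3: add or subtract mantissas. Seg 4: normalize result and adjust exponent. Registers (R) separate the segments.]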

Page 32: CO Module 5

Module 5 Parallel Processing Slide 32

• Consider 2 normalized floating-point numbers:

– $X = 0.9504 \times 10^{3}$ ; $Y = 0.8200 \times 10^{2}$

• The 2 exponents are subtracted in segment 1 to obtain 3 - 2 = 1. The larger exponent 3 is chosen as the exponent of the result.

• The next segment shifts the mantissa of Y to the right to obtain

– $X = 0.9504 \times 10^{3}$ ; $Y = 0.0820 \times 10^{3}$

• This aligns the 2 mantissas under the same exponent.

• The addition of the 2 mantissas in segment 3 produces the sum

– $Z = 1.0324 \times 10^{3}$

• The sum is adjusted by normalizing the result so that it has a fraction with a nonzero first digit. This is done by shifting the mantissa once to the right and incrementing the exponent by one to obtain the normalized sum.

– $Z = 0.10324 \times 10^{4}$
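The four segments of the worked example can be written out as four steps of one function. This is a sketch only: it uses decimal mantissas to mirror the example above (the hardware described works on binary numbers), and the name `fp_add` and its argument order are assumptions made for illustration.

```python
from decimal import Decimal


def fp_add(mx: Decimal, ex: int, my: Decimal, ey: int):
    """Add (mx * 10**ex) + (my * 10**ey); returns a normalized (mantissa, exponent)."""
    # Segment 1: compare the exponents by subtraction and choose the larger one.
    diff = ex - ey
    exp = max(ex, ey)
    # Segment 2: align by shifting the mantissa with the smaller exponent right.
    if diff >= 0:
        my = my.scaleb(-diff)
    else:
        mx = mx.scaleb(diff)
    # Segment 3: add the aligned mantissas.
    mz = mx + my
    # Segment 4: normalize so the fraction has a nonzero first digit.
    while abs(mz) >= 1:                           # overflow: shift right
        mz = mz.scaleb(-1)
        exp += 1
    while mz != 0 and abs(mz) < Decimal("0.1"):   # underflow: shift left
        mz = mz.scaleb(1)
        exp -= 1
    return mz, exp


mantissa, exponent = fp_add(Decimal("0.9504"), 3, Decimal("0.8200"), 2)
print(mantissa, exponent)
```

For the two numbers in the example, the function returns a mantissa numerically equal to 0.10324 with exponent 4.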

Page 33: CO Module 5

Module 5 Parallel Processing Slide 33

5.5 Instruction Pipeline

• An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time).

• The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step.

• An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments.

• This causes the instruction fetch and execute phases to overlap and perform simultaneous operations.

Page 34: CO Module 5

Module 5 Parallel Processing Slide 34

• In the most general case, the computer needs to process each instruction with the following sequence of steps:

1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.

– Two or more segments may require memory access at the same time, causing one segment to wait until another is finished with the memory.

– Memory access conflicts can be resolved by using 2 memory buses for accessing data & instructions.

Page 35: CO Module 5

Module 5 Parallel Processing Slide 35

Example

• While an instruction is being executed in segment 4, the next instruction in sequence is busy fetching an operand from memory in segment 3.

• The EA may be calculated in a separate arithmetic circuit for the 3rd instruction, and whenever the memory is available, the 4th and all subsequent instructions can be fetched and placed in an instruction FIFO.

• Thus up to 4 sub operations in the instruction cycle can overlap and up to 4 different instructions can be in progress of being processed at the same time.

Page 36: CO Module 5

[Figure: Four-segment CPU pipeline (flowchart). Seg 1: fetch instruction from memory. Seg 2: decode instruction and calculate EA; if the instruction is a branch, update PC and empty the pipe. Seg 3: fetch operand from memory. Seg 4: execute instruction; if an interrupt is pending, perform interrupt handling, update PC, and empty the pipe.]

Page 37: CO Module 5

Module 5 Parallel Processing Slide 37

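Timing of instruction pipeline

Instruction     Step
                1   2   3   4   5   6   7   8   9   10  11  12  13
1               FI  DA  FO  EX
2                   FI  DA  FO  EX
3 (branch)              FI  DA  FO  EX
4                           FI  -   -   FI  DA  FO  EX
5                               -   -   -   FI  DA  FO  EX
6                                               FI  DA  FO  EX
7                                                   FI  DA  FO  EX

FI: fetch instruction; DA: decode instruction and calculate effective address; FO: fetch operand; EX: execute instruction.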

Page 38: CO Module 5

Module 5 Parallel Processing Slide 38

• It is assumed that the processor has separate instruction and data memories so that the operations in FI & FO can proceed at the same time.

• In the absence of a branch instruction, each segment operates on a different instruction.

• Thus in step 4, instruction 1 is being executed in Seg 4; the operand for instruction 2 is being fetched in Seg 3; instruction 3 is being decoded in Seg 2; and instruction 4 is being fetched from memory in Seg 1.

• Assume that instruction 3 is a branch instruction.

– As soon as it is decoded in Seg 2 in step 4, the transfer from FI to DA of other instructions is halted until the branch instruction is executed in step 6.

– If the branch is taken, a new instruction is fetched in step 7.

– If the branch is not taken, the instruction fetched previously in step 4 can be used.

– The pipeline then continues until a new branch instruction is encountered.
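The timing described above can be reproduced with a short schedule generator. It is a simplified sketch: the function name `schedule` is hypothetical, it models only the case where the instruction after the branch is fetched again in step 7, and it does not record the discarded fetch in step 4.

```python
STAGES = ["FI", "DA", "FO", "EX"]


def schedule(num_instructions: int, branch_at=None):
    """Map each instruction number to {step: stage} for a 4-segment pipeline.

    Instructions after a branch at position `branch_at` start only once the
    branch has left EX, which delays them by len(STAGES) - 1 steps.
    """
    table = {}
    for i in range(1, num_instructions + 1):
        delay = len(STAGES) - 1 if branch_at is not None and i > branch_at else 0
        start = i + delay
        table[i] = {start + offset: stage for offset, stage in enumerate(STAGES)}
    return table


for instr, steps in schedule(7, branch_at=3).items():
    row = " ".join(f"{step}:{stage}" for step, stage in steps.items())
    print(f"instruction {instr}: {row}")
```

With instruction 3 as the branch, seven instructions finish in 13 steps instead of the 10 steps an uninterrupted pipeline would need.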

Page 39: CO Module 5

Module 5 Parallel Processing

Major Difficulties

1. Resource conflicts

– Access to memory by two segments at the same time.

– Solution: separate memories for instructions and data.

2. Data dependency

– An instruction depends on the result of a previous instruction.

3. Branch difficulty

Page 40: CO Module 5

Module 5 Parallel Processing

Data Dependency

• It occurs when an instruction needs data that are not yet available.

• For example, an instruction in the FO segment may need to fetch an operand that is being generated at the same time by the previous instruction in segment EX. Therefore, the second instruction must wait until the first instruction makes the data available.

Page 41: CO Module 5

Module 5 Parallel Processing

Solutions

1. Hardware interlocks

• A circuit that detects instructions whose source operands are destinations of instructions farther up in the pipeline.

• Detection of this situation causes the instruction whose source is not available to be delayed by enough clock cycles to resolve the conflict.

2. Operand forwarding

• Special hardware detects a conflict and then avoids it by routing data through special paths between pipeline segments.

• Example: the ALU result is forwarded directly to the ALU input.

3. Delayed load

• The problem is solved in the compilation process itself.

• The compiler delays the use of loaded data by inserting no-operation instructions.
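The delayed-load idea can be illustrated with a toy compiler pass. Everything in this sketch is an assumption made for illustration: the `(opcode, destination, sources)` tuple format, the opcode names, and the choice of a one-instruction delay after a load are not taken from the slides.

```python
NOP = ("NOP", None, ())


def insert_nops(program):
    """Insert a NOP after any LOAD whose destination is used by the next instruction."""
    result = []
    for index, instruction in enumerate(program):
        result.append(instruction)
        opcode, dest, _ = instruction
        if opcode == "LOAD" and index + 1 < len(program):
            _, _, next_sources = program[index + 1]
            if dest in next_sources:        # the loaded value is needed too early
                result.append(NOP)
    return result


program = [
    ("LOAD", "R1", ("A",)),
    ("LOAD", "R2", ("B",)),
    ("ADD", "R3", ("R1", "R2")),
    ("STORE", "C", ("R3",)),
]
for instruction in insert_nops(program):
    print(instruction)
```

Here the ADD needs R2 immediately after it is loaded, so one no-operation is placed between the second LOAD and the ADD.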

Page 42: CO Module 5

Module 5 Parallel Processing

Handling Branch Instructions

1. Pre-fetch target instruction

– Pre-fetch the target instruction in addition to the instruction following the branch.

– Both are saved until the branch is executed.

– If the branch condition is successful, the pipeline continues from the branch target instruction.

2. Branch target buffer (BTB)

– It is an associative memory included in the fetch segment of the pipeline.

– Each entry consists of the address of a previously executed branch instruction and the target instruction for the branch.

– It also stores the next few instructions after the branch target instruction.

– The advantage is that branch instructions that have occurred previously are readily available in the pipeline without interruption.

Page 43: CO Module 5

Module 5 Parallel Processing

3. Loop buffer

• A small, very high speed register file maintained by the instruction fetch segment of the pipeline.

• When a loop is detected, it is stored in the loop buffer in its entirety, including all the branches.

• The loop can be executed directly without having to access memory until the final branch out of the loop.

Page 44: CO Module 5

Module 5 Parallel Processing

5.6 Vector Processing

• Applies to scientific and engineering data processing that requires a vast number of computations.

• Vector processing can apply to

– Long-range weather forecasting

– Image processing

– Medical Diagnostics

– Flight simulations

– AI & Expert Systems

– Petroleum Explorations

Page 45: CO Module 5

Module 5 Parallel Processing

• Vector Operations

– A vector is an ordered set of data items arranged as a one-dimensional array.

– It is represented as a row vector by V = [V1, V2, ..., Vn] of length n.

– Operations can be broken down into single computations with subscripted variables.

– The element Vi of vector V is written as V(I), and the index I refers to a memory address or register where the number is stored.

– Vector processing eliminates the overhead of fetching and executing the instructions for each step.

Page 46: CO Module 5

Module 5 Parallel Processing

• A vector processor allows operations to be specified with a single vector instruction.

• The vector instruction includes the initial address of the operands, the length of the vectors, and the operation to be performed, all in one composite instruction.

• Instruction format:

| Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length |

• This is a 3-address instruction with 3 fields specifying the base addresses of the operands and an additional field that gives the length of the data items in the vectors. This assumes that the vector operands reside in memory.

Page 47: CO Module 5

Module 5 Parallel Processing

Matrix Multiplication

• Matrix multiplication is one of the computationally intensive operations performed in computers with vector processors.

• The multiplication of two n x n matrices consists of $n^2$ inner products or $n^3$ multiply-add operations.

• An n x m matrix may be considered as constituting a set of n row vectors or a set of m column vectors.

• Each multiplication and addition can be implemented using a floating-point pipeline.

Page 48: CO Module 5

Module 5 Parallel Processing

• Consider the product matrix C (3 x 3) = A x B, whose elements are related to the elements of A & B by the inner product:

$c_{ij} = \sum_{k=1}^{3} a_{ik} \times b_{kj}$

• For example, for i = 1 and j = 1:

$c_{11} = a_{11} b_{11} + a_{12} b_{21} + a_{13} b_{31}$

• This requires 3 multiplications and (after initializing $c_{11}$ to 0) 3 additions.

• The total number of multiplications or additions required to compute the matrix product is 9 x 3 = 27.

• If we consider the linked multiply-add operation c + a x b as a cumulative operation, the product of two n x n matrices requires $n^3$ multiply-add operations. The computation consists of $n^2$ inner products, with each inner product requiring n multiply-add operations.

• In general, the inner product consists of the sum of k product terms of the form:

$C = A_1 B_1 + A_2 B_2 + A_3 B_3 + \cdots + A_k B_k$
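The operation count can be confirmed by counting multiply-add steps in a straightforward matrix product. This is a plain sequential sketch, not a vector-processor implementation; the function name `matmul_count` and the sample matrices are illustrative assumptions.

```python
def matmul_count(A, B):
    """Multiply two n x n matrices; returns (C, number of multiply-add operations)."""
    n = len(A)
    C = [[0] * n for _ in range(n)]          # every c_ij is initialized to 0
    multiply_adds = 0
    for i in range(n):
        for j in range(n):                   # one inner product per (i, j)
            for k in range(n):
                C[i][j] = C[i][j] + A[i][k] * B[k][j]   # c + a x b
                multiply_adds += 1
    return C, multiply_adds


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
C, ops = matmul_count(A, I)
print(C)
print(ops)
```

For 3 x 3 matrices the counter reaches 27, matching the 9 inner products of 3 multiply-add operations each.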

Page 49: CO Module 5

Module 5 Parallel Processing

Pipeline for calculating an inner product

• Values of A & B are either in memory or in processor registers.

• The floating point multiplier pipeline and the floating point adder pipeline are assumed to have 4 segments each.

• All segment registers in the multiplier and adder are initialized to 0.

• Therefore, the output of the adder is 0 for the first 8 cycles until both pipes are full.

• Ai & Bi pairs are brought in and multiplied at a rate of one pair per cycle.

[Figure: Pipeline for calculating an inner product. Source A and Source B feed the multiplier pipeline, whose output feeds the adder pipeline.]

Page 50: CO Module 5

Module 5 Parallel Processing

• After the first 4 cycles, the products begin to be added to the output of the adder.

• During the next 4 cycles, 0 is added to the products entering the adder pipeline.

• At the end of the 8th cycle, the first 4 products A1B1 through A4B4 are in the 4 adder segments, and the next 4 products, A5B5 through A8B8, are in the multiplier segments.

• At the beginning of the 9th cycle, the output of the adder is A1B1 and the output of the multiplier is A5B5, so the addition A1B1 + A5B5 starts in the adder pipeline.

• The 10th cycle starts the addition A2B2 + A6B6, and so on.

Page 51: CO Module 5

Module 5 Parallel Processing

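That is, the products are accumulated into four partial sums:

C = A1B1 + A5B5 + A9B9 + A13B13 + ...
  + A2B2 + A6B6 + A10B10 + A14B14 + ...
  + A3B3 + A7B7 + A11B11 + A15B15 + ...
  + A4B4 + A8B8 + A12B12 + A16B16 + ...

• When there are no more product terms to be added, the system inserts 4 zeros into the multiplier pipeline. The adder pipeline then holds one partial sum in each of its 4 segments, and the four partial sums are added to form the final sum.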
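The grouping into four partial sums follows from the adder pipeline having 4 segments: product i meets the adder output that left four cycles earlier. The sketch below reproduces only that grouping and is not a cycle-accurate model; the function name `inner_product_four_way` and the sample vectors are assumptions for illustration.

```python
def inner_product_four_way(A, B, adder_segments=4):
    """Accumulate products Ai*Bi into one partial sum per adder segment.

    Returns (partial_sums, final_sum). Product number i (0-based) is added
    to partial sum i % adder_segments, as in the four rows of the expansion.
    """
    partial = [0] * adder_segments
    for i, (a, b) in enumerate(zip(A, B)):
        partial[i % adder_segments] += a * b
    return partial, sum(partial)             # final step adds the partial sums


A = list(range(1, 17))                       # A1..A16 = 1..16
B = [1] * 16
partial, total = inner_product_four_way(A, B)
print(partial)
print(total)
```

The final result equals the ordinary inner product; only the order of the additions differs.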

Page 52: CO Module 5

Module 5 Parallel Processing

• Memory Interleaving

– Pipeline and vector processors require simultaneous access to memory from 2 or more sources.

– Instead of using 2 memory buses for simultaneous access, the memory can be partitioned into modules connected to a common address and data bus.

– A memory module is a memory array together with its own address register and data register.

– Each module can service a request independently, so several memory operations can be in progress at the same time in different modules.

Page 53: CO Module 5

Module 5 Parallel Processing

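[Figure: Multiple-module memory organization. Four memory arrays, each with its own address register (AR) and data register (DR); the ARs connect to a common address bus and the DRs to a common data bus.]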

Page 54: CO Module 5

Module 5 Parallel Processing

• Each memory array has its own AR & DR. The AR receives information from a common address bus and the DR communicates with a bidirectional data bus.

• The two least significant bits of the address can be used to distinguish between the 4 modules.

• In an interleaved memory, different sets of addresses are assigned to different memory modules. For example, in a 2-module memory system, the even addresses may be in one module and the odd addresses in the other.

• When the number of modules is a power of 2, the least significant bits of the address select a memory module and the remaining bits designate the specific location to be accessed within the selected module.

• A vector processor that uses an n-way interleaved memory can fetch n operands from n different modules.
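The address split used for interleaving can be shown in a few lines. This is a minimal sketch assuming the number of modules is a power of 2; the function name `interleave` and the returned `(module, location)` pair are illustrative choices.

```python
def interleave(address: int, num_modules: int = 4):
    """Split an address into (module number, location within the module)."""
    if num_modules < 1 or num_modules & (num_modules - 1) != 0:
        raise ValueError("num_modules must be a power of 2")
    select_bits = num_modules.bit_length() - 1     # log2(num_modules)
    module = address & (num_modules - 1)           # least significant bits
    location = address >> select_bits              # remaining bits
    return module, location


for address in range(8):
    print(address, interleave(address))
```

Consecutive addresses fall in consecutive modules, which is what allows a vector processor to fetch successive operands from different modules.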

Page 55: CO Module 5

Module 5 Parallel Processing

• Supercomputers

– A supercomputer is a computer with vector instructions and pipelined floating-point arithmetic operations.

– Internal components are packed tightly together to minimize the distance that electronic signals have to travel.

– Special techniques are used to remove heat from the circuits.

– Performance is measured in terms of the number of floating-point operations per second (FLOPS).

– The CRAY-1, an early supercomputer, uses vector processing with 12 distinct functional units in parallel.

Page 56: CO Module 5

Module 5 Parallel Processing Slide 56

5.7 Array Processors

• An array processor is a synchronous parallel computer with multiple ALUs, called processing elements (PEs), that can operate in lock-step fashion.

• The PEs are synchronized to perform the same function at the same time.

• Scalar and control type instructions are directly executed in the CU.

• Each PE consists of an ALU with registers and a local memory.

Page 57: CO Module 5

Module 5 Parallel Processing

SIMD Array Processor

• An SIMD array processor is a computer with multiple processing units operating in parallel.

• The PUs are synchronized to perform the same operation under the control of a common CU, thus providing a single instruction stream, multiple data stream (SIMD) organization.

• It contains a set of PEs, each having a local memory M. Each PE includes an ALU, a floating-point arithmetic unit, and working registers.

• The main memory is used for storage of the program.

• The CU controls the operations in the PEs. It decodes the instructions and determines how each instruction is to be executed.

• Vector instructions are broadcast to all PEs simultaneously. Each PE uses operands stored in its local memory. Vector operands are distributed to the local memories prior to the parallel execution of the instruction.

Page 58: CO Module 5

Module 5 Parallel Processing

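[Figure: SIMD array processor organization. A master control unit, connected to the main memory, controls processing elements PE1, PE2, PE3, ..., PEn, each with its own local memory M1, M2, M3, ..., Mn.]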

Page 59: CO Module 5

Module 5 Parallel Processing Slide 59

• Example: C = A + B

– The control unit first stores the ith components ai & bi of A & B in local memory Mi for i = 1, 2, 3, ..., n. It then broadcasts the floating-point add instruction ci = ai + bi to all PEs, causing the addition to take place simultaneously.

– The components ci are stored in fixed locations in each local memory. This produces the desired vector sum in one add cycle.

• The best known SIMD array processor is the ILLIAC IV computer.

• SIMD processors are highly specialized computers which are suited primarily for numerical problems that can be expressed in vector or matrix form.
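The C = A + B example can be mimicked in software to show the roles of the control unit and the PEs. This is a sequential sketch of a parallel machine: the class name `ProcessingElement`, the `"ADD"` opcode string, and the memory keys are all assumptions, and the loop over PEs stands in for a broadcast that real hardware performs simultaneously.

```python
class ProcessingElement:
    """One PE with a small local memory (hypothetical model)."""

    def __init__(self):
        self.local_memory = {}

    def execute(self, opcode: str) -> None:
        if opcode == "ADD":                  # ci = ai + bi on local operands
            m = self.local_memory
            m["c"] = m["a"] + m["b"]
        else:
            raise ValueError(f"unsupported opcode: {opcode}")


def vector_add(A, B):
    """Control-unit view: distribute operands, broadcast ADD, collect results."""
    if len(A) != len(B):
        raise ValueError("vectors must have the same length")
    pes = [ProcessingElement() for _ in A]
    for pe, a, b in zip(pes, A, B):          # store ai, bi in local memory Mi
        pe.local_memory["a"] = a
        pe.local_memory["b"] = b
    for pe in pes:                           # same instruction to every PE
        pe.execute("ADD")
    return [pe.local_memory["c"] for pe in pes]


print(vector_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```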

Page 60: CO Module 5

Module 5 Parallel Processing Slide 60

5.8. Multiprocessor systems

• The system contains two or more processors of approximately comparable capabilities.

• All processors share access to common sets of memory modules, I/O channels ,and peripheral devices.

• The entire system must be controlled by a single integrated OS providing interactions between processors and their programs.

• Each processor may also have its own local memory and I/O devices.

• A multiprocessor system is an interconnection of 2 or more CPUs with memory and input-output devices.

• Multiprocessors are MIMD systems.

Page 61: CO Module 5

Module 5 Parallel Processing Slide 61

• If a fault causes one processor to fail, a second processor can be assigned to perform the functions of the disabled processor.

• The system as a whole can continue to function correctly with perhaps some loss in efficiency.

• The advantage of a multiprocessor organization is improved system performance and reliability.

Page 62: CO Module 5

Module 5 Parallel Processing Slide 62

Architectural Aspects

• A multiprocessor system contains a number of CPUs and a number of input-output processors (IOPs), with many I/O devices and a memory unit connected to them. The interconnection scheme must be chosen to obtain maximum performance with minimum complexity.

• Three different interconnections:

1. Time-shared common bus

2. Crossbar switch network

3. Multiport memories

Page 63: CO Module 5

Module 5 Parallel Processing

Time-shared common bus

• A common bus multiprocessor system consists of a number of processors connected through a common path to a memory unit.

• In a time-shared common bus, only one processor can communicate with the memory or another processor at any given time.

[Figure: Time-shared common bus. The CPUs, the I/O processors (IOPs), and the memory unit are all attached to one common bus.]

Page 64: CO Module 5

Module 5 Parallel Processing

• Transfer operations are conducted by the processor that is in control of the bus at the time.

• Any other processor wishing to initiate a transfer must first determine the availability status of the bus, and only after the bus becomes available can the processor address the destination unit to initiate the transfer.

• A command is issued to inform the destination unit what operation is to be performed.

• The receiving unit recognizes its address in the bus and responds to the control signals from the sender, after which the transfer is initiated.

• Single common bus system is restricted to one transfer at a time.

Page 65: CO Module 5

Module 5 Parallel Processing

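Multiport Memory

• A multiport memory system employs separate buses between each memory module and each CPU.

[Figure: Multiport memory organization. CPU 1, CPU 2, CPU 3, and CPU 4 each have their own bus connected to every memory module MM 1, MM 2, MM 3, and MM 4.]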

Page 66: CO Module 5

Module 5 Parallel Processing

• Each processor bus is connected to each MM.

• A processor bus consists of address, data and control lines required to communicate with memory.

• The MM is said to have 4 ports and each port accommodates one of the buses.

• The module must have internal control logic to determine which port will have access to memory at any given time.

• Memory access conflicts are resolved by assigning fixed priorities to each memory port.

• The priority for memory access associated with each processor may be established by the physical port position that its bus occupies in each module.

• Thus CPU 1 will have priority over CPU 2, and CPU 4 will have the lowest priority.
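The fixed-priority rule amounts to granting the module to the lowest-numbered requesting CPU. A minimal sketch follows, assuming CPUs are identified by their port numbers 1 to 4; the function name `grant_port` is a hypothetical helper, not a description of real memory control logic.

```python
def grant_port(requesting_cpus):
    """Return the CPU granted access to a memory module, or None if idle.

    Priority is fixed by port position: a lower CPU number wins.
    """
    return min(requesting_cpus, default=None)


print(grant_port([3, 1, 4]))   # CPU 1 has the highest priority
print(grant_port([4, 2]))      # CPU 2 wins over CPU 4
print(grant_port([]))          # no request pending
```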

Page 67: CO Module 5

Module 5 Parallel Processing

• The advantage is the high transfer rate that can be achieved because of the multiple paths between processors and memory.

• The disadvantage is that it requires expensive memory control logic and a large number of cables and connectors.

Page 68: CO Module 5

Module 5 Parallel Processing

Crossbar Switch Organization

• The crossbar switch organization consists of a number of crosspoints that are placed at intersections between processor buses and memory module paths.

• The small square in each crosspoint is a switch that determines the path from a processor to a memory module.

• Each switch point has control logic to set up the transfer path between a processor and memory.

• It examines the address that is placed in the bus to determine whether its particular module is being addressed.

• It also resolves multiple requests for access to the same memory module on a predetermined priority basis.

[Figure: Crossbar Switch. CPU 1, CPU 2 and CPU 3 are connected through crosspoints to memory modules MM 1, MM 2 and MM 3.]

• The circuit consists of multiplexers that select the data, address and control lines from one CPU for communication with the memory module.

• Priority levels are established by the arbitration logic to select one CPU when two or more CPUs attempt to access the same memory module.

• A crossbar switch organization supports simultaneous transfers from all memory modules because there is a separate path associated with each module.
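A small Python sketch can illustrate one cycle of crossbar operation. It is a simplified model, not a hardware description: the function name `crossbar_cycle` is hypothetical, and "lower CPU number wins" is an assumed priority rule.

```python
def crossbar_cycle(requests):
    """Resolve one cycle of requests on a crossbar switch.

    `requests` maps a CPU number to the memory module it wants.
    Returns (granted, waiting): `granted` maps module -> CPU, and
    `waiting` lists the CPUs that lost arbitration. Assumption: the
    lower-numbered CPU has the higher priority.
    """
    granted = {}
    waiting = []
    for cpu in sorted(requests):
        module = requests[cpu]
        if module in granted:
            waiting.append(cpu)      # module already in use this cycle
        else:
            granted[module] = cpu    # close the crosspoint (cpu, module)
    return granted, waiting


# Three CPUs address three different modules: all transfers proceed together.
print(crossbar_cycle({1: 3, 2: 1, 3: 2}))  # ({3: 1, 1: 2, 2: 3}, [])
# CPU 1 and CPU 3 both address MM 2: CPU 3 must wait.
print(crossbar_cycle({1: 2, 2: 1, 3: 2}))  # ({2: 1, 1: 2}, [3])
```

The first call shows the advantage over a single common bus, where the same three transfers would need three separate bus cycles.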

Comparison of RISC and CISC

Multiplying Two Numbers in Memory

• The main memory is divided into locations numbered from (row) 1: (column) 1 to (row) 6: (column) 4.

• The execution unit is responsible for carrying out all computations.

• However, the execution unit can only operate on data that has been loaded into one of the six registers (A, B, C, D, E, or F).

• The task is to find the product of two numbers, one stored in location 2:3 and another stored in location 5:2, and then store the product back in location 2:3.

The CISC (Complex Instruction Set Computer) Approach

• The primary goal of CISC architecture is to complete a task in as few lines of assembly as possible.

• This is achieved by building processor hardware that is capable of understanding and executing a series of operations.

• For a particular task, a CISC processor would come prepared with a specific instruction (say "MULT").

• When executed, this instruction loads the two values into separate registers, multiplies the operands in the execution unit, and then stores the product in the appropriate register.

• Thus, the entire task of multiplying two numbers can be completed with one instruction:

MULT 2:3, 5:2

• MULT is what is known as a "complex instruction."

• It operates directly on the computer's memory banks and does not require the programmer to explicitly call any loading or storing functions.

• It closely resembles a command in a higher level language.

• For instance, if we let "a" represent the value of 2:3 and "b" represent the value of 5:2, then this command is identical to the C statement "a = a * b."

• One of the primary advantages of this system is that the compiler has to do very little work to translate a high-level language statement into assembly.

• Because the length of the code is relatively short, very little RAM is required to store instructions.

• The emphasis is put on building complex instructions directly into the hardware.

The RISC (Reduced Instruction Set Computer) Approach

• RISC processors only use simple instructions that can be executed within one clock cycle.

• Thus, the "MULT" command described above could be divided into three separate commands: "LOAD," which moves data from the memory bank to a register, "PROD," which finds the product of two operands located within the registers, and "STORE," which moves data from a register to the memory banks.

• In order to perform the exact series of steps described in the CISC approach, a programmer would need to code four lines of assembly:

LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
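The two approaches can be compared on a toy model of the machine described above. This Python sketch is only an illustration: the helper names are hypothetical, the initial memory values 6 and 7 are made up, and `mult` assumes the product is written back to memory so that it matches "a = a * b".

```python
# Toy model: memory cells addressed "row:column" and six registers A-F.
# The initial values 6 and 7 are made-up example data.
memory = {"2:3": 6, "5:2": 7}
registers = dict.fromkeys("ABCDEF", 0)


def load(reg, addr):       # LOAD reg, addr
    registers[reg] = memory[addr]


def prod(dst, src):        # PROD dst, src  (dst = dst * src)
    registers[dst] *= registers[src]


def store(addr, reg):      # STORE addr, reg
    memory[addr] = registers[reg]


def mult(addr_a, addr_b):  # CISC-style MULT: the same steps inside one instruction
    load("A", addr_a)
    load("B", addr_b)
    prod("A", "B")
    store(addr_a, "A")


# RISC version: four explicit instructions.
load("A", "2:3")
load("B", "5:2")
prod("A", "B")
store("2:3", "A")
print(memory["2:3"])  # 42
```

After the RISC sequence, register B still holds 7, so a later instruction could reuse that operand without another LOAD.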

• At first, this may seem like a much less efficient way of completing the operation.

– Because there are more lines of code, more RAM is needed to store the assembly-level instructions.

– The compiler must also perform more work to convert a high-level language statement into code of this form.

• However, the RISC strategy also brings some very important advantages.

– Because each instruction requires only one clock cycle to execute, the entire program will execute in approximately the same amount of time as the multi-cycle "MULT" command.

– These RISC "reduced instructions" require fewer transistors of hardware space than the complex instructions, leaving more room for general-purpose registers.

– Because all of the instructions execute in a uniform amount of time (i.e. one clock), pipelining is possible.

• Separating the "LOAD" and "STORE" instructions actually reduces the amount of work that the computer must perform.

• After a CISC-style "MULT" command is executed, the processor automatically erases the registers. If one of the operands needs to be used for another computation, the processor must re-load the data from the memory bank into a register.

• In RISC, the operand will remain in the register until another value is loaded in its place.

• The Performance Equation: time/program = (time/cycle) × (cycles/instruction) × (instructions/program)

• The CISC approach attempts to minimize the number of instructions per program, sacrificing the number of cycles per instruction.

• RISC does the opposite, reducing the cycles per instruction at the cost of the number of instructions per program.
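The trade-off can be checked by plugging numbers into the performance equation, which gives the time per program as the product of instructions per program, cycles per instruction, and time per cycle. The figures in this Python sketch are assumptions chosen for illustration (a 1 ns clock period, a 4-cycle MULT, and four 1-cycle RISC instructions), not measurements of any real processor.

```python
def execution_time(instructions, cycles_per_instruction, seconds_per_cycle):
    """time/program = instructions/program * cycles/instruction * time/cycle."""
    return instructions * cycles_per_instruction * seconds_per_cycle


CYCLE = 1e-9  # assumed clock period of 1 ns for both machines

# Assumed counts: one 4-cycle MULT versus four 1-cycle RISC instructions.
cisc = execution_time(instructions=1, cycles_per_instruction=4, seconds_per_cycle=CYCLE)
risc = execution_time(instructions=4, cycles_per_instruction=1, seconds_per_cycle=CYCLE)
print(cisc == risc)  # True
```

With these assumed numbers the two programs take the same time, which is the sense in which the four RISC instructions run in about the same time as the single multi-cycle "MULT".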

What is RISC?

• RISC, or Reduced Instruction Set Computer, is a type of microprocessor architecture that utilizes a small, highly optimized set of instructions, rather than the more specialized set of instructions often found in other types of architectures.

CISC versus RISC

CISC | RISC
Emphasis on hardware | Emphasis on software
Includes multi-clock complex instructions | Single-clock, reduced instructions only
Memory-to-memory: "LOAD" and "STORE" incorporated in instructions | Register-to-register: "LOAD" and "STORE" are independent instructions
Small code sizes, high cycles per instruction | Low cycles per instruction, large code sizes
Transistors used for storing complex instructions | Spends more transistors on memory registers
Variable-format instructions | Fixed-format instructions
Single register set | Multiple register sets

RISC Pipelines

• A RISC processor pipeline operates in much the same way, although the stages in the pipeline are different.

• The stages are:
1. fetch instructions from memory
2. read registers and decode the instruction
3. execute the instruction or calculate an address
4. access an operand in data memory
5. write the result into a register

• Because RISC instructions are simpler than those used in CISC processors, they are more conducive to pipelining.

• While CISC instructions vary in length, RISC instructions are all the same length and can be fetched in a single operation.

• Ideally, each of the stages in a RISC processor pipeline should take 1 clock cycle so that the processor finishes an instruction each clock cycle and averages one cycle per instruction (CPI).

Pipeline Problems

• In practice, however, RISC processors operate at more than one cycle per instruction. The processor might occasionally stall as a result of data dependencies and branch instructions.

E.g.: 3-segment instruction pipeline:

• The instruction cycle can be divided into 3 suboperations and implemented in 3 segments:
1. I: Instruction fetch
2. A: ALU operation
3. E: Execute instruction

• Segment I fetches the instruction from program memory.

• The instruction is decoded and an ALU operation is performed in the A segment.

• Segment E directs the output of the ALU to one of 3 destinations, depending on the decoded instruction: the register file (the result of an ALU operation), memory (the effective address, EA, for loading or storing), or the PC (the branch address).

Delayed Load:

• Consider the following sequence of instructions:

1. LOAD: R1 <- M[address1]
2. LOAD: R2 <- M[address2]
3. ADD: R3 <- R1 + R2
4. STORE: M[address3] <- R3

• If the 3-segment pipeline proceeds without interruptions, there will be a data conflict in instruction 3 because the operand in R2 is not yet available in the A segment.

Pipeline timing with data conflict

Clock cycles:     1  2  3  4  5  6
1. Load R1        I  A  E
2. Load R2           I  A  E
3. Add R1+R2            I  A  E
4. Store R3                I  A  E

Pipeline timing with delayed load

Clock cycles:     1  2  3  4  5  6  7
1. Load R1        I  A  E
2. Load R2           I  A  E
3. No-operation         I  A  E
4. Add R1+R2               I  A  E
5. Store R3                   I  A  E

• In the timing with data conflict, the E segment in clock cycle 4 is in the process of placing the memory data into R2.

• The A segment in clock cycle 4 is using the data from R2, but the value in R2 will not be the correct value since it has not yet been transferred from memory.

• It is up to the compiler to make sure that the instruction following the load instruction does not use the data being fetched from memory.

• If the compiler cannot find a useful instruction to put after the load, it inserts a no-operation instruction, thus wasting a clock cycle.

• This concept of delaying the use of the data loaded from memory is referred to as delayed load.
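The delayed-load rule can be sketched in Python for the 3-segment (I, A, E) pipeline. This is a minimal illustration, not a real compiler pass: the function names and the tuple format (opcode, destination, sources) are hypothetical, and it handles only the load-use case, always inserting a no-operation rather than searching for a useful instruction to move into the slot.

```python
def insert_delay_slots(program):
    """Insert a NOP after any load whose result the next instruction reads.

    Each instruction is a tuple (opcode, destination, sources).
    """
    scheduled = []
    for i, instr in enumerate(program):
        scheduled.append(instr)
        opcode, dest, _ = instr
        nxt = program[i + 1] if i + 1 < len(program) else None
        if opcode == "LOAD" and nxt is not None and dest in nxt[2]:
            scheduled.append(("NOP", None, ()))
    return scheduled


def timing(program):
    """Return one (opcode, row) pair per instruction; "." marks an idle cycle."""
    total_cycles = len(program) + 2           # the last instruction still needs I, A, E
    rows = []
    for start, (opcode, _, _) in enumerate(program):
        row = ["."] * total_cycles
        row[start:start + 3] = ["I", "A", "E"]  # one new instruction enters per cycle
        rows.append((opcode, " ".join(row)))
    return rows


program = [
    ("LOAD", "R1", ()),
    ("LOAD", "R2", ()),
    ("ADD", "R3", ("R1", "R2")),
    ("STORE", None, ("R3",)),
]
for opcode, row in timing(insert_delay_slots(program)):
    print(f"{opcode:<6}{row}")
```

Running it on the four-instruction example yields five rows over seven clock cycles, with the no-operation placed between the second LOAD and the ADD.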