b56a2Presentation2


  • 8/3/2019 b56a2Presentation2

    1/49


    Topics Covered

    An Overview of Parallel Processing

    Parallelism in Uniprocessor Systems

Organization of Multiprocessor Systems

Flynn's Classification

    System Topologies

    MIMD System Architectures


An Overview of Parallel Processing

What is parallel processing?

Parallel processing is a method of improving computer system performance by executing two or more instructions simultaneously.

The goals of parallel processing:

One goal is to reduce the wall-clock time, the amount of real time that you need to wait for a problem to be solved.

Another goal is to solve bigger problems that might not fit in the limited memory of a single CPU.


    An Analogy of Parallelism

The task of ordering a shuffled deck of cards by suit and then by rank can be done faster if it is carried out by two or more people. By splitting up the deck, performing the instructions simultaneously, and then combining the partial solutions at the end, you have performed parallel processing.
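This analogy can be sketched in code: split the deck among workers, sort the parts concurrently, then merge the partial results. This is a hypothetical illustration using Python threads; the slide itself names no implementation.

```python
import heapq
import random
from concurrent.futures import ThreadPoolExecutor

SUITS = ["clubs", "diamonds", "hearts", "spades"]

def sort_key(card):
    # Order by suit first, then by rank, as in the slide.
    rank, suit = card
    return (SUITS.index(suit), rank)

def parallel_sort(deck, workers=2):
    # Split the shuffled deck among the workers ("people"),
    # sort the parts simultaneously, then combine the partial
    # solutions with a final merge.
    chunks = [deck[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda part: sorted(part, key=sort_key),
                                 chunks))
    return list(heapq.merge(*partials, key=sort_key))

deck = [(rank, suit) for suit in SUITS for rank in range(1, 14)]
random.seed(0)
random.shuffle(deck)
print(parallel_sort(deck)[:3])  # [(1, 'clubs'), (2, 'clubs'), (3, 'clubs')]
```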


    Another Analogy of Parallelism

Another analogy is having several students grade quizzes simultaneously. Quizzes are distributed to a few students, and different problems are graded by each student at the same time. After they are completed, the graded quizzes are gathered and the scores are recorded.


    Parallelism in Uniprocessor Systems

It is possible to achieve parallelism with a uniprocessor system.

Some examples are the instruction pipeline, the arithmetic pipeline, and the I/O processor.

Note that a system that performs different operations on the same instruction is not considered parallel.

Only if the system processes two different instructions simultaneously can it be considered parallel.


Parallelism in a Uniprocessor System

A reconfigurable arithmetic pipeline is an example of parallelism in a uniprocessor system.

Each stage of a reconfigurable arithmetic pipeline has a multiplexer at its input. The multiplexer may pass input data, or the data output from other stages, to the stage inputs. The control unit of the CPU sets the select signals of the multiplexer to control the flow of data, thus configuring the pipeline.
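The stage-by-stage behavior can be sketched in software. The following is a hypothetical Python model of the configured pipeline; the slides describe hardware, not code.

```python
# A software model of a reconfigurable arithmetic pipeline
# configured to compute A[i] = B[i] * C[i] + D[i].

def run_pipeline(b, c, d):
    # Stage 1: the multiplexers select the data inputs B[i] and C[i];
    # the multiplier computes their product, which is latched.
    stage1 = b * c
    # Stage 2: one multiplexer selects the latched product, the other
    # selects the data input D[i]; the adder's result is latched.
    stage2 = stage1 + d
    # Stage 3: the sum passes through to memory and registers as A[i].
    return stage2

B, C, D = [2, 3, 4], [5, 6, 7], [1, 1, 1]
A = [run_pipeline(*args) for args in zip(B, C, D)]
print(A)  # [11, 19, 29]
```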


A Reconfigurable Pipeline With Data Flow for the Computation A[i] = B[i] * C[i] + D[i]

[Figure: four multiplexers (inputs 0-3, select signals S1 S0) route the data inputs into a multiplier stage, an adder stage, and a pass-through stage, each followed by a latch; the final latch output goes to memory and registers. For this computation the select signals are set to 0 0, x x, 0 1, 1 1.]


Although arithmetic pipelines can perform many iterations of the same operation in parallel, they cannot perform different operations simultaneously. To perform different arithmetic operations in parallel, a CPU may include a vectored arithmetic unit.


    Vector Arithmetic Unit

A vector arithmetic unit contains multiple functional units that perform addition, subtraction, and other functions. The control unit routes input values to the different functional units to allow the CPU to execute multiple instructions simultaneously.

For the operations A = B + C and D = E - F, the CPU would route B and C to an adder and E and F to a subtractor for simultaneous execution.
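A minimal sketch of this routing, as a hypothetical Python model: the "control unit" dispatches each operand pair to its functional unit, and the units run side by side on a thread pool.

```python
import operator
from concurrent.futures import ThreadPoolExecutor

# Functional units of a (hypothetical) vector arithmetic unit.
UNITS = {"+": operator.add, "-": operator.sub,
         "*": operator.mul, "%": operator.mod}

def dispatch(instructions):
    # The control unit routes each operand pair to the matching
    # functional unit; the units then execute simultaneously.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(UNITS[op], x, y) for op, x, y in instructions]
        return [f.result() for f in futures]

# A = B + C and D = E - F executed at the same time.
A, D = dispatch([("+", 10, 4), ("-", 9, 2)])
print(A, D)  # 14 7
```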


    A Vectored Arithmetic Unit

    Data

    Input

    Connections

    Data

    Input

    Connections

    *

    +

    -

    %

    Data

    Inputs

    AB+C

    D

    E-F


Organization of Multiprocessor Systems

Flynn's Classification

Was proposed by researcher Michael J. Flynn in 1966.

It is the most commonly accepted taxonomy of computer organization.

In this classification, computers are classified by whether they process a single instruction at a time or multiple instructions simultaneously, and whether they operate on one or multiple data sets.


Taxonomy of Computer Architectures

The 4 categories of Flynn's classification of multiprocessor systems, by their instruction and data streams:

Architecture Categories: SISD, SIMD, MISD, MIMD


Single Instruction, Single Data (SISD)

SISD machines execute a single instruction on individual data values using a single processor.

Based on the traditional von Neumann uniprocessor architecture, instructions are executed sequentially or serially, one step after the next.

Until recently, most computers have been of the SISD type.


SISD

Simple Diagrammatic Representation

[Figure: a control unit (C) sends an instruction stream (IS) to a processor (P), which exchanges a data stream (DS) with memory (M).]


Single Instruction, Multiple Data (SIMD)

An SIMD machine executes a single instruction on multiple data values simultaneously using many processors.

Since there is only one instruction, each processor does not have to fetch and decode each instruction. Instead, a single control unit does the fetching and decoding for all processors.

SIMD architectures include array processors.
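The SIMD idea, one decoded instruction applied to many data values, can be sketched conceptually (this is my own illustration in plain Python, not a real SIMD machine):

```python
# One control unit fetches and decodes a single instruction;
# every processing element then applies it to its own data value.
def simd_execute(instruction, data):
    return [instruction(x) for x in data]  # all PEs do the same operation

double = lambda x: 2 * x   # the single decoded instruction
print(simd_execute(double, [1, 2, 3, 4]))  # [2, 4, 6, 8]
```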


Multiple Instruction, Multiple Data (MIMD)

MIMD machines are usually referred to as multiprocessors or multicomputers.

They may execute multiple instructions simultaneously, contrary to SIMD machines.

Each processor must include its own control unit, which will assign parts of a task or a separate task to the processor.

It has two subclasses: shared memory and distributed memory.


MIMD

[Figure: two control units (C) each send an instruction stream (IS) to their own processor (P); both processors exchange data streams (DS) with a shared memory (M).]


Multiple Instruction, Single Data (MISD)

This category does not actually exist. It was included in the taxonomy for the sake of completeness.


MISD

[Figure: multiple control units (C) each send an instruction stream (IS) to a processor (P); the processors operate on a single data stream (DS) from memory (M).]


Analogy of Flynn's Classifications

An analogy of Flynn's classification is the check-in desk at an airport:

SISD: a single desk

SIMD: many desks and a supervisor with a megaphone giving instructions that every desk obeys

MIMD: many desks working at their own pace, synchronized through a central database


System Topologies

A system may also be classified by its topology.

A topology is the pattern of connections between processors.

The cost-performance tradeoff determines which topologies to use for a multiprocessor system.


Topology Classification

A topology is characterized by its diameter, total bandwidth, and bisection bandwidth.

Diameter: the maximum distance between two processors in the computer system.

Total bandwidth: the capacity of a communications link multiplied by the number of such links in the system.

Bisection bandwidth: the maximum data transfer that could occur at the bottleneck in the topology.
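As a concrete example (my own illustration, with an assumed per-link capacity of 1), these three measures can be computed for an n-processor ring:

```python
def ring_metrics(n, link_bandwidth=1):
    # Diameter: in a ring, the farthest pair of processors is
    # half-way around, so the maximum distance is n // 2 hops.
    diameter = n // 2
    # Total bandwidth: capacity of one link times the number of
    # links; an n-processor ring has n links.
    total_bw = link_bandwidth * n
    # Bisection bandwidth: cutting a ring into two halves severs
    # exactly 2 links, the bottleneck of the topology.
    bisection_bw = 2 * link_bandwidth
    return diameter, total_bw, bisection_bw

print(ring_metrics(8))  # (4, 8, 2)
```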


System Topologies

Shared Bus Topology

Processors communicate with each other via a single bus that can only handle one data transmission at a time.

In most shared buses, processors directly communicate with their own local memory.

[Figure: several processor (P) and local memory (M) pairs attached to a shared bus, along with a global memory.]


System Topologies

Ring Topology

Uses direct connections between processors instead of a shared bus.

Allows communication links to be active simultaneously, but data may have to travel through several processors to reach its destination.

[Figure: six processors (P) connected in a ring.]


System Topologies

Tree Topology

Uses direct connections between processors, each having up to three connections.

There is only one unique path between any pair of processors.

[Figure: seven processors (P) connected in a binary tree.]


System Topologies

Mesh Topology

In the mesh topology, every processor connects to the processors above and below it, and to its right and left.

[Figure: nine processors (P) arranged in a 3 x 3 grid.]


System Topologies

Hypercube Topology

Is a multiple mesh topology.

Each processor connects to all other processors whose binary addresses differ by one bit. For example, processor 0 (0000) connects to processor 1 (0001) and processor 2 (0010).

[Figure: sixteen processors (P) connected in a 4-dimensional hypercube.]
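The one-bit rule can be checked in code: a processor's neighbors are found by flipping each address bit in turn with XOR. This small Python illustration reproduces the slide's example.

```python
def hypercube_neighbors(node, dimensions):
    # Flipping each of the `dimensions` address bits yields an
    # address that differs from `node` in exactly one bit.
    return sorted(node ^ (1 << bit) for bit in range(dimensions))

# Processor 0 (0000) in a 4-dimensional (16-processor) hypercube
# connects to processors 1 (0001), 2 (0010), 4 (0100), and 8 (1000):
print(hypercube_neighbors(0, 4))  # [1, 2, 4, 8]
```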


System Topologies

Completely Connected Topology

Every processor has n-1 connections, one to each of the other processors.

There is an increase in complexity as the system grows, but this offers maximum communication capabilities.

[Figure: eight processors (P), each directly connected to every other processor.]


    MIMD System Architectures

Finally, the architecture of a MIMD system, in contrast to its topology, refers to its connections to system memory.

Systems may also be classified by their architectures. Two of these are:

Uniform memory access (UMA)

Nonuniform memory access (NUMA)


Vector computers have multiple vector pipelines.

Two families of pipelined vector processors:

Memory-to-memory

Register-to-register


    Development layers

    Applications

    Programming environment

    Languages supported

    Communication model

    Addressing space

    Hardware architecture


System attributes to performance

Clock rate and CPI

The clock has cycle time τ.

Clock rate f = 1/τ

The size of a program is its instruction count (Ic).

Cycles per instruction (CPI): the number of clock cycles needed to execute each instruction.


Performance factors

CPU time T needed to execute the program:

T = Ic x CPI x τ

or, T = Ic x (p + m x k) x τ

where
p = number of processor cycles per instruction
m = number of memory references per instruction
k = ratio between the memory cycle time and the processor cycle time
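Plugging numbers into these formulas shows how the two forms agree. The values below are assumed for illustration only.

```python
def cpu_time(ic, cpi, tau):
    # T = Ic x CPI x tau
    return ic * cpi * tau

def cpu_time_detailed(ic, p, m, k, tau):
    # T = Ic x (p + m*k) x tau: base processor cycles plus the
    # processor-cycle cost of the memory references, per instruction.
    return ic * (p + m * k) * tau

tau = 2e-9                 # assumed 500 MHz clock, so 2 ns cycle time
T = cpu_time(1_000_000, 2.0, tau)
print(T)                   # about 0.004 seconds
# With p = 1.2, m = 0.2, k = 4: CPI = 1.2 + 0.2 * 4 = 2.0, same T.
print(cpu_time_detailed(1_000_000, 1.2, 0.2, 4, tau))
```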


    System attributes

The performance factors (Ic, p, m, k, τ) are influenced by

    Instruction set architecture

    Compiler technology

    CPU implementation and control

    Cache and memory hierarchy


MIPS rate

MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6) = (f x Ic) / (C x 10^6)

where C is the total number of clock cycles needed to execute a given program
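A quick check, with assumed figures, that the three forms of the formula agree:

```python
def mips(ic, t):
    # MIPS rate = Ic / (T x 10^6)
    return ic / (t * 1e6)

f = 400e6              # assumed 400 MHz clock
ic = 2_000_000         # assumed instruction count
cpi = 2.5
c = ic * cpi           # C: total clock cycles for the program
t = c / f              # execution time in seconds
print(mips(ic, t))     # matches f / (CPI x 10^6) = 160.0
```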


Throughput rate Ws, i.e., how many programs a system can execute per unit time:

Ws = f / (Ic x CPI)
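In code, with the same kind of assumed figures as above:

```python
def throughput(f, ic, cpi):
    # Ws = f / (Ic x CPI): programs completed per second.
    return f / (ic * cpi)

# A 400 MHz machine running programs of 2 million instructions
# at CPI = 2.5 (assumed values):
print(throughput(400e6, 2_000_000, 2.5))  # 80.0 programs per second
```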


    Uniform memory access (UMA)

The UMA is a type of symmetric multiprocessor, or SMP, that has two or more processors that perform symmetric functions.

UMA gives all CPUs equal (uniform) access to all memory locations in shared memory. They interact with shared memory through some communications mechanism, such as a simple bus or a complex multistage interconnection network.


Uniform memory access (UMA) Architecture

[Figure: processors 1 through n connected through a communications mechanism to a shared memory.]


Nonuniform memory access (NUMA)

NUMA architectures, unlike UMA architectures, do not allow uniform access to all shared memory locations. This architecture still allows all processors to access all shared memory locations, but in a nonuniform way: each processor can access its local shared memory more quickly than the memory modules not next to it.


Nonuniform memory access (NUMA) Architecture

[Figure: processors 1 through n, each paired with its own local memory module, linked by a communications mechanism.]


    The COMA model

Cache-only memory architecture (COMA) is a computer memory organization for use in multiprocessors in which the local memories (typically DRAM) at each node are used as cache. This is in contrast to using the local memories as actual main memory, as in NUMA organizations.


The COMA model is a special case of a NUMA machine in which the distributed memories are converted into caches.

There is no memory hierarchy at each processor node.

All caches form a global address space.

Remote cache access is assisted by the distributed cache directories.

Depending on the interconnection network used, hierarchical directories may sometimes be used to help locate copies of cache blocks.

Initial data placement is not critical because data will eventually migrate to where it will be used.


The COMA model

[Figure: processor nodes, each containing a processor (P), a directory (D), and a cache (C), connected by an interconnection network.]


    Vector Supercomputers

    Epitomized by Cray-1, 1976:

    Scalar Unit + Vector Extensions

    Load/Store Architecture

Vector Registers

Vector Instructions

    Hardwired Control

    Highly Pipelined Functional Units

    Interleaved Memory System

    No Data Caches

    No Virtual Memory


    THE END