
Introduction to Parallel Processing Debbie Hui CS 147 – Prof. Sin-Min Lee 7 / 11 / 2001


Page 1: Introduction to Parallel Processing Debbie Hui CS 147 – Prof. Sin-Min Lee 7 / 11 / 2001

Introduction to Parallel Processing

Debbie Hui

CS 147 – Prof. Sin-Min Lee

7 / 11 / 2001

Page 2:

Parallel Processing

Parallelism in Uniprocessor Systems

Organization of Multiprocessor Systems

Page 3:

Parallelism in Uniprocessor Systems

• A computer achieves parallelism when it performs two or more unrelated tasks simultaneously

Page 4:

Uniprocessor Systems

A uniprocessor system may incorporate parallelism using:

• an instruction pipeline
• a fixed or reconfigurable arithmetic pipeline
• I/O processors
• vector arithmetic units
• multiport memory

Page 5:

Uniprocessor Systems

Instruction pipeline:

• Overlaps the fetching, decoding, and execution of instructions

• Allows the CPU to complete up to one instruction per clock cycle
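The payoff of overlapping can be seen with a quick cycle count. The sketch below (illustrative Python, not from the slides) assumes an ideal 3-stage fetch/decode/execute pipeline with one clock per stage and no stalls:

```python
STAGES = 3  # fetch, decode, execute

def unpipelined_cycles(n_instructions):
    # Without overlap, each instruction occupies all three stages in turn.
    return STAGES * n_instructions

def pipelined_cycles(n_instructions):
    # Filling the pipeline costs STAGES cycles; afterwards one
    # instruction completes per clock cycle.
    return STAGES + (n_instructions - 1)

n = 100
print(unpipelined_cycles(n))   # 300
print(pipelined_cycles(n))     # 102
print(round(unpipelined_cycles(n) / pipelined_cycles(n), 2))  # 2.94
```

In steady state the pipeline retires one instruction per cycle; the fill cost of the first few cycles is why the speedup only approaches, and never reaches, the stage count.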

Page 6:

Uniprocessor Systems

Reconfigurable Arithmetic Pipeline:
• Better suited for general-purpose computing
• Each stage has a multiplexer at its input
• The control unit of the CPU sets the select signals of the multiplexers to configure the pipeline
• Problem: Although arithmetic pipelines can perform many iterations of the same operation in parallel, they cannot perform different operations simultaneously.

Page 7:

Uniprocessor Systems

Vector Arithmetic Unit:

• Provides a solution to the reconfigurable arithmetic pipeline problem

• Purpose: to perform different arithmetic operations in parallel
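One way to picture this is a bank of functional units with routing switches in front and behind. The Python sketch below is purely illustrative; the routing table standing in for the input/output switches, and all names, are my own, not from the slides:

```python
import operator

# Each functional unit performs one kind of arithmetic operation.
FUNCTIONAL_UNITS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
}

def vector_unit(ops, a, b):
    """Route each element pair to a (possibly different) functional unit,
    mimicking different operations carried out in parallel."""
    return [FUNCTIONAL_UNITS[op](x, y) for op, x, y in zip(ops, a, b)]

# Three different operations "in flight" at once on three element pairs.
print(vector_unit(["add", "sub", "mul"], [1, 2, 3], [10, 20, 30]))
# [11, -18, 90]
```

In real hardware the control unit sets the switches once per vector operation; here the `ops` list plays that role.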

Page 8:

Uniprocessor Systems

Vector Arithmetic Unit (cont.):
• Contains multiple functional units
  - Some perform addition, subtraction, etc.
• Input and output switches are needed to route the proper data to their proper destinations
  - Switches are set by the control unit

Page 9:

Uniprocessor Systems

Vector Arithmetic Unit (cont.):

How do we get all that data to the vector arithmetic unit?

By transferring several data values simultaneously using:
- Multiple buses
- Very wide data buses

Page 10:

Uniprocessor Systems

Improve performance by:

• Allowing multiple, simultaneous memory accesses
  - Requires multiple address, data, and control buses (one set for each simultaneous memory access)
  - The memory chip has to be able to handle multiple transfers simultaneously

Page 11:

Uniprocessor Systems

Multiport Memory:

• Has two sets of address, data, and control pins to allow simultaneous data transfers to occur

• CPU and DMA controller can transfer data concurrently

• A system with more than one CPU could handle simultaneous requests from two different processors

Page 12:

Uniprocessor Systems

Multiport Memory (cont.):

Can:
- Handle two requests to read data from the same location at the same time

Cannot:
- Process two simultaneous requests to write data to the same memory location
- Process requests to read from and write to the same memory location simultaneously
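These rules reduce to a small predicate. The sketch below models a two-port memory's arbitration decision; the function name and request format are illustrative assumptions, not from the slides:

```python
def ports_compatible(req1, req2):
    """Two simultaneous requests, each (operation, address).
    Conflicting pairs are any two accesses to the same address
    unless both are reads."""
    op1, addr1 = req1
    op2, addr2 = req2
    if addr1 != addr2:
        return True                          # different locations: always fine
    return op1 == "read" and op2 == "read"   # same location: only read/read

# The "can" and "cannot" cases from the slide:
print(ports_compatible(("read", 0x10), ("read", 0x10)))    # True
print(ports_compatible(("write", 0x10), ("write", 0x10)))  # False
print(ports_compatible(("read", 0x10), ("write", 0x10)))   # False
print(ports_compatible(("write", 0x10), ("write", 0x20)))  # True
```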

Page 13:

Organization of Multiprocessor Systems

Three different ways to organize/classify systems:

• Flynn’s Classification

• System Topologies

• MIMD System Architectures

Page 14:

Multiprocessor Systems
Flynn’s Classification

Flynn’s Classification:
• Based on the flow of instructions and data processing

• A computer is classified by:
  - whether it processes a single instruction at a time or multiple instructions simultaneously
  - whether it operates on one or multiple data sets

Page 15:

Multiprocessor Systems
Flynn’s Classification

Four Categories of Flynn’s Classification:

• SISD  Single instruction, single data
• SIMD  Single instruction, multiple data
• MISD  Multiple instruction, single data **
• MIMD  Multiple instruction, multiple data

** The MISD classification is not practical to implement. In fact, no significant MISD computers have ever been built. It is included only for completeness.
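The four categories fall out mechanically from the two yes/no questions above. A minimal sketch with illustrative naming:

```python
def flynn(instruction_streams, data_streams):
    """Classify a machine by its number of simultaneous instruction
    streams and data streams."""
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"

print(flynn(1, 1))    # SISD: classic uniprocessor
print(flynn(1, 64))   # SIMD: e.g. an array processor
print(flynn(16, 16))  # MIMD: e.g. a multiprocessor or multicomputer
print(flynn(16, 1))   # MISD: included only for completeness
```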

Page 16:

Multiprocessor Systems
Flynn’s Classification

Single instruction, single data (SISD):

• Consists of a single CPU executing individual instructions on individual data values

Page 17:

Multiprocessor Systems
Flynn’s Classification

Single instruction, multiple data (SIMD):

[Diagram: a control unit fetches instructions from main memory and sends control signals to several processors, each with its own memory module, joined by a communications network]

• Executes a single instruction on multiple data values simultaneously using many processors
• Since only one instruction is processed at any given time, it is not necessary for each processor to fetch and decode the instruction
• This task is handled by a single control unit that sends the control signals to each processor
• Example: Array processor
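The control-unit broadcast can be sketched in a few lines: one instruction is issued, and every "processor" applies it to its own local data. All names here are illustrative:

```python
# One local data set per processor, as in an array processor.
local_data = [[1, 2], [3, 4], [5, 6]]

def broadcast(instruction, data_sets):
    """The control unit issues a single instruction; every processor
    executes it on its own data set."""
    return [[instruction(x) for x in data] for data in data_sets]

doubled = broadcast(lambda x: 2 * x, local_data)
print(doubled)  # [[2, 4], [6, 8], [10, 12]]
```

Only the data differs from processor to processor; no processor ever fetches or decodes the instruction itself.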

Page 18:

Multiprocessor Systems
Flynn’s Classification

Multiple instruction, multiple data (MIMD):

• Executes different instructions simultaneously
• Each processor must include its own control unit
• The processors can be assigned to parts of the same task or to completely separate tasks
• Example: Multiprocessors, multicomputers

Page 19:

Multiprocessor Systems
System Topologies

System Topologies:
• The topology of a multiprocessor system refers to the pattern of connections between its processors
• Quantified by standard metrics:

  Diameter: The maximum distance between two processors in the computer system

  Bandwidth: The capacity of a communications link multiplied by the number of such links in the system (best case)

  Bisection Bandwidth: The total bandwidth of the links connecting the two halves of the processors when the system is split so that the number of links between the two halves is minimized (worst case)
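As a concrete example of the diameter metric, the sketch below computes it by breadth-first search over a topology graph, here for a ring of n processors (the function names are my own):

```python
from collections import deque

def ring_neighbors(p, n):
    # In a ring, each processor connects to its two adjacent processors.
    return [(p - 1) % n, (p + 1) % n]

def diameter(n, neighbors):
    """Longest shortest path between any two processors."""
    longest = 0
    for start in range(n):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbors(u, n):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        longest = max(longest, max(dist.values()))
    return longest

print(diameter(8, ring_neighbors))  # 4, i.e. n / 2 for a ring
```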

Page 20:

Multiprocessor Systems
System Topologies

Six Categories of System Topologies:

• Shared bus

• Ring

• Tree

• Mesh

• Hypercube

• Completely Connected

Page 21:

Multiprocessor Systems
System Topologies

Shared bus:
• The simplest topology
• Processors communicate with each other exclusively via this bus
• Can handle only one data transmission at a time
• Can be easily expanded by connecting additional processors to the shared bus, along with the necessary bus arbitration circuitry

[Diagram: processors (P), each with a local memory (M), and a global memory all attached to a single shared bus]

Page 22:

Multiprocessor Systems
System Topologies

Ring:
• Uses direct dedicated connections between processors
• Allows all communication links to be active simultaneously
• A piece of data may have to travel through several processors to reach its final destination
• All processors must have two communication links

[Diagram: six processors (P) connected in a ring]

Page 23:

Multiprocessor Systems
System Topologies

Tree topology:
• Uses direct connections between processors
• Each processor has three connections
• Its primary advantage is its relatively low diameter
• Example: DADO Computer

[Diagram: processors (P) connected in a binary tree]

Page 24:

Multiprocessor Systems
System Topologies

Mesh topology:
• Every processor connects to the processors above, below, left, and right
• Left-to-right and top-to-bottom wraparound connections may or may not be present

[Diagram: a 3 × 3 mesh of processors (P)]

Page 25:

Multiprocessor Systems
System Topologies

Hypercube:

• A multidimensional mesh

• Has n processors, each with lg n connections
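Hypercube addressing makes the lg n figure easy to verify: label the n = 2^d processors with d-bit numbers, and each link flips exactly one bit. A short illustrative sketch:

```python
def hypercube_neighbors(node, d):
    """Neighbors of a node in a d-dimensional hypercube: flip each of
    the d address bits in turn."""
    return [node ^ (1 << bit) for bit in range(d)]

d = 4  # n = 2**4 = 16 processors
for node in range(2 ** d):
    # Every node has exactly lg n = d connections.
    assert len(hypercube_neighbors(node, d)) == d

print(sorted(hypercube_neighbors(0b0000, d)))  # [1, 2, 4, 8]
```

The shortest path between two nodes is the Hamming distance of their labels, which is why the diameter is lg n.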

Page 26:

Multiprocessor Systems
System Topologies

Completely Connected:

• Every processor has n – 1 connections, one to each of the other processors
• The complexity of the processors increases as the system grows
• Offers maximum communication capabilities

Page 27:

Multiprocessor Systems
System Topologies

TOPOLOGY     DIAMETER    BANDWIDTH           BISECTION BANDWIDTH
Shared       1           1 * l               1 * l
Ring         n / 2       n * l               2 * l
Tree         2 lg n      (n – 1) * l         1 * l
Mesh *       2√n         (2n – 2√n) * l      √n * l
Mesh **      √n          2n * l              2√n * l
Hypercube    lg n        (n/2) * lg n * l    (n/2) * l
Comp. Con.   1           (n/2) * (n–1) * l   (n/2) * (n/2) * l

* Without wraparound
** With wraparound

l = bandwidth of the bus
n = number of processors
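To make the formulas concrete, the sketch below evaluates a few rows of the metrics table for n = 16 and l = 1 (a power of two, so the lg n rows apply; the shared-bus diameter is taken as 1). This is an illustration, not part of the original slides:

```python
import math

n, l = 16, 1
lg_n = int(math.log2(n))

# (diameter, bandwidth, bisection bandwidth) per topology
metrics = {
    "shared bus": (1,      1 * l,                   1 * l),
    "ring":       (n // 2, n * l,                   2 * l),
    "hypercube":  (lg_n,   (n // 2) * lg_n * l,     (n // 2) * l),
    "complete":   (1,      (n // 2) * (n - 1) * l,  (n // 2) * (n // 2) * l),
}

for name, (dia, bw, bbw) in metrics.items():
    print(f"{name:10} diameter={dia:2}  bandwidth={bw:3}  bisection={bbw:3}")
```

For 16 processors the trade-off is visible at a glance: the completely connected system has the smallest diameter and largest bandwidth, but at the cost of 15 links per processor.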

Page 28:

Multiprocessor Systems
MIMD System Architecture

MIMD System Architecture:

• The architecture of an MIMD system refers to its connections with respect to system memory

• Multiprocessors
• Multicomputers

Page 29:

Multiprocessor Systems
MIMD System Architecture

Symmetric multiprocessor (SMP):

• A computer system that has two or more processors with comparable capabilities

• Four different types:
  - Uniform memory access (UMA)
  - Nonuniform memory access (NUMA)
  - Cache coherent NUMA (CC-NUMA)
  - Cache only memory access (COMA)

Page 30:

Multiprocessor Systems
MIMD System Architecture

Uniform memory access (UMA):
• Gives all CPUs equal (uniform) access to all shared memory locations
• Each processor may have its own cache memory, not directly accessible by the other processors

[Diagram: Processor 1 through Processor n accessing one shared memory through a communications mechanism]

Page 31:

Multiprocessor Systems
MIMD System Architecture

Nonuniform memory access (NUMA):
• Does not allow uniform access to all shared memory locations
• It still allows all processors to access all shared memory locations; however, each processor can access the memory module closest to it faster than other modules

[Diagram: Processor 1 through Processor n, each with its own memory module (Memory 1 through Memory n), connected by a communications mechanism]

Page 32:

Multiprocessor Systems
MIMD System Architecture

Cache Coherent NUMA (CC-NUMA):
• Similar to NUMA, except each processor includes cache memory
• The cache can buffer data from memory modules that are not local to the processor, which can reduce the access time of the memory transfers
• Creates a problem when two or more caches hold the same piece of data
• A solution to this problem is cache only memory access (COMA)

Page 33:

Multiprocessor Systems
MIMD System Architecture

Cache Only Memory Access (COMA):

• Each processor’s local memory is treated as a cache

• When the processor requests data that is not in its cache (local memory), the system loads that data into local memory as part of the memory operation

Page 34:

Multiprocessor Systems
MIMD System Architecture

Multicomputer:
• An MIMD machine in which all processors are not under the control of one operating system
• Each processor or group of processors is under the control of a different operating system, or a different instantiation of the same operating system

• Two different types:
  - Network or cluster of workstations (NOW or COW)
  - Massively parallel processor (MPP)

Page 35:

Multiprocessor Systems
MIMD System Architecture

Network of workstations (NOW) or
Cluster of workstations (COW):

• More than a group of workstations on a local area network (LAN)

• Has a master scheduler, which matches tasks and processors together

Page 36:

Multiprocessor Systems
MIMD System Architecture

Massively Parallel Processor (MPP):

• Consists of many self-contained nodes, each having a processor, memory, and hardware for implementing internal communications

• The processors communicate with each other using shared memory

• Example: IBM’s Blue Gene Computer

Page 37:

Thank you!

Any Questions???