Parallel Processing
• Large class of techniques used to provide simultaneous data processing tasks
• Purpose: Increase computational speed of the computer
• A parallel processing system is able to process multiple tasks simultaneously
Parallel Processing
• While one instruction executes in the ALU, the next instruction is read from memory
• 2 or more ALUs, 2 or more processors
• Speedup and throughput - the amount of processing that can be done in a given amount of time (a worked example follows below)
• As the amount of hardware increases, cost increases and complexity increases
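The speedup/throughput trade-off can be made concrete with a small worked example. All the numbers below (task count, serial time, coordination overhead factor) are invented for illustration; the overhead factor reflects the point that more hardware also brings more cost and complexity:

```python
# Throughput: the amount of processing done in a given amount of time.
tasks = 1000        # units of work (assumed)
t_serial = 50.0     # seconds on one processor (assumed)

processors = 4
overhead = 1.25     # assumed coordination cost of the extra hardware
t_parallel = t_serial / processors * overhead   # 15.625 s

print("serial throughput:  ", tasks / t_serial)     # 20.0 tasks/s
print("parallel throughput:", tasks / t_parallel)   # 64.0 tasks/s
print("speedup:", t_serial / t_parallel)            # 3.2 (< 4 due to overhead)
```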
Parallel Processing
• Viewed at various levels of complexity
• Lowest - distinguish between serial and parallel load registers
• Higher level - multiple functional units (FUs)
– Arithmetic
• Adder-subtractor, integer multiplier
– Logic
• Logic unit, incrementer, shifter
– Floating point
• Add-subtract, multiply, divide
Parallel Processing Classification
• Internal organization of processors
• Interconnection structure between processors
• Flow of information through the system
• Organization of computer system by number of instructions and data items that are manipulated simultaneously
Classifications
• Normal operation of a computer is to fetch instructions from memory, then execute them in the processor
• The sequence of instructions read from memory is the instruction stream
• The operations performed on the data in the processor are the data stream
• Parallel processing may occur in the instruction stream, in the data stream, or in both
4 Major Groups
• SISD - Single Instruction, Single Data
• SIMD - Single Instruction, Multiple Data
• MISD - Multiple Instruction, Single Data
• MIMD - Multiple Instruction, Multiple Data
SISD
• Single computer containing a
– Control Unit
– Processing Unit
– Memory Unit
• Instructions are executed sequentially
• System may or may not have internal parallel processing capabilities
– Multiple FUs or pipelining
SIMD
• Organization including many processing units under the supervision of a common control unit
• All processors receive the same instruction from the control unit
• Operate on different items of data
• Shared memory unit must contain multiple modules so that it can communicate with all processors simultaneously
• Example: array processor (a software analogy follows below)
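As a software analogy (not the array-processor hardware itself), a vectorized NumPy operation expresses the SIMD idea: one instruction applied to many data items at once. This assumes NumPy is available:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# SISD style: the add instruction is issued once per data item.
sisd = [float(a[i] + b[i]) for i in range(len(a))]

# SIMD style: a single "add" is expressed over all elements at once;
# NumPy can dispatch it to vector hardware where available.
simd = a + b

print(sisd)   # [11.0, 22.0, 33.0, 44.0]
print(simd)   # [11. 22. 33. 44.]
```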
MISD
• Only of theoretical interest
MIMD
• Computer system capable of processing several programs at the same time
• Most multiprocessor and multicomputer systems are in this category
• Flynn’s classification depends on the distinction between the performance of the control unit and the data processing unit
• Emphasizes behavioral characteristics of the computer system rather than its operational structures and interconnections
Pipelining
• Pipelining does not fit into Flynn’s parallel processing classification scheme
• Only 2 of the 4 categories are commonly used: SIMD and MIMD
Multiprocessors
• A multiprocessor system is an interconnection of 2 or more CPUs with memory and input-output equipment
• ‘Processor’ in multiprocessor can mean either a central processing unit (CPU) or an input-output processor (IOP)
• A system with a single CPU and multiple IOPs is usually not considered a multiprocessor
Multiprocessors / Multicomputers
• Both support concurrent operations
• Computers are interconnected with each other by means of communication lines to form a computer network
– Consists of several autonomous computers that may or may not communicate with each other
• A multiprocessor system is controlled by one operating system that provides interaction between processors; all components in the system cooperate to solve the problem at hand
Multiprocessors
• Major motivation for using microprocessors - cheap, small
• VLSI helps make it possible too
• Improves reliability
– Like mutual funds, risk is spread across units; some loss of efficiency
• Benefits
– Improved system performance
– Computations can proceed in parallel in 2 ways
• Multiple independent jobs run in parallel
• A single job can be partitioned into multiple parallel tasks
Multiprocessors
• Overall functions can be partitioned into several tasks
• System tasks can be allocated to specialized processors
– Designed for optimal performance
– Example: One processor performs standard tasks for an industrial process while others sense and control various parameters such as temperature and flow rate
– Example: One processor takes care of high-speed floating point operations while another handles standard operations and tasks
Performance Improvement
• Decompose the problem into multiple discrete tasks
• The user can explicitly direct the computer to split tasks (a sketch follows below)
• Provide a compiler that automatically detects when parts of a program can be split
– Parallelizing compiler
• Multiprocessors are classified by the way memory is organized
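A minimal sketch of explicit task splitting, using Python's multiprocessing to partition one job (a large sum) into independent tasks; the chunking scheme and the partial_sum helper are illustrative, not from the slides:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """One discrete task: sum a slice of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(100_000))
    n_tasks = 4
    step = len(data) // n_tasks
    chunks = [data[i * step:(i + 1) * step] for i in range(n_tasks)]

    # Explicitly direct the machine to run the tasks in parallel,
    # one worker process per chunk.
    with Pool(processes=n_tasks) as pool:
        results = pool.map(partial_sum, chunks)

    print(sum(results))  # same answer as sum(data), computed in parallel
```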
Tightly Coupled
• A multiprocessor system with common shared memory
– Called a shared-memory or tightly coupled multiprocessor
• Does not preclude each processor from having its own local memory
• Most commercial tightly coupled systems provide cache memory for each CPU
• In addition, a global common memory is provided that all CPUs can access
Loosely Coupled
• Distributed memory = Loosely coupled
• Each processing element (PE) in a loosely coupled system has its own local memory
• Processors are tied together by a switching scheme designed to route information between processors through a message-passing scheme
• Programs and data are relayed in packets consisting of an address, the data, and error detection codes (a sketch of such a packet follows below)
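A minimal sketch of such a packet, assuming a toy byte-sum error detection code; the field names and the Packet, make_packet, and is_intact helpers are illustrative:

```python
from dataclasses import dataclass

def checksum(payload: bytes) -> int:
    """Toy error-detection code: sum of payload bytes modulo 256."""
    return sum(payload) % 256

@dataclass
class Packet:
    dest: int        # address of the destination PE
    payload: bytes   # program or data being relayed
    check: int       # error detection code

def make_packet(dest: int, payload: bytes) -> Packet:
    return Packet(dest, payload, checksum(payload))

def is_intact(pkt: Packet) -> bool:
    # The receiver recomputes the code to detect corruption in transit.
    return checksum(pkt.payload) == pkt.check

pkt = make_packet(dest=3, payload=b"result: 42")
assert is_intact(pkt)
```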
Loosely Coupled
• Packets are either destined for a specific processor or grabbed by the first processor that finds them, depending on the communication system design
• Most efficient when interaction between tasks is minimal
• Tightly coupled systems can tolerate a higher degree of interaction between tasks
Interconnection Structures
• The components forming a multiprocessor are
– CPUs
– IOPs
– A memory unit (may be partitioned into separate modules)
• Interconnections can have different physical configurations
– Depending on the number of transfer paths available between processors and memory in a shared memory system
– Depending on the number of transfer paths among PEs in a loosely coupled system
Physical Forms
• Time-Shared Common Bus
• Multiported Memory
• Crossbar Switch
• Multistage Switching Network
• Hypercube System
Time-Shared Common Bus
• N processors connected through a common bus to a memory unit
• Only 1 processor can access (communicate with) the memory unit or another processor at a time
• Transfer operations are conducted by the processor that is in control of the bus
• Other processors must wait, checking the bus for availability
• A command is issued to inform the destination that communication is requested
– What operation, and from where
• The destination responds and the transfer begins
Common Bus
• Bus contention
• Resolved by including a bus controller (a software sketch follows below)
– Priorities
• Restricted to a single transfer at a time
– When one processor is transferring to/from memory, the other processors are either busy with internal processing or idle, waiting
• The system's overall transfer rate is limited by the speed of the bus
• Multiple buses are possible, but you pay a penalty ($$)
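A minimal software sketch of the single-transfer-at-a-time property: a lock stands in for the bus, so only the thread holding it may transfer while the others idle waiting. Priority arbitration by the bus controller is omitted; the memory layout is invented for illustration:

```python
import threading

bus = threading.Lock()    # the single shared bus
memory = [0] * 8          # the common memory unit

def processor(pid: int):
    for step in range(3):
        # Only the processor in control of the bus may transfer;
        # the others block here, idle, waiting for availability.
        with bus:
            memory[pid % len(memory)] += 1   # one transfer at a time

threads = [threading.Thread(target=processor, args=(p,)) for p in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(memory)   # [3, 3, 3, 3, 0, 0, 0, 0]
```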
Dual Buses
• Not more economical
• Local buses, local memory
• The system bus controller is the big coordinator
• Local memory can be cache memory
– Coherency problems possible
Multiported Memory
• Separate buses between each memory module (MM) and each processor
• Each processor bus is connected to each MM
• A processor bus consists of
– Address
– Data
– Control lines
• Each MM has 4 ports, 1 for each bus
Multiported Memory
• The MM must have internal logic to determine which bus has control
• Fixed priorities are assigned to each memory port (1, 2, 3, 4), as in the sketch below
• Advantage: high transfer rate
• Disadvantages:
– Expensive memory control logic
– Many cables and connectors
• Usually only appropriate for a small number of processors
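A minimal sketch of fixed-priority port arbitration inside one memory module; the grant helper and the 4-port assumption mirror the slide's example but are otherwise illustrative:

```python
def grant(requests):
    """Fixed-priority arbitration for a multiport memory module.

    requests: one boolean per port, True if that port's processor is
    requesting the module this cycle. Port 0 has the highest priority.
    """
    for port, requesting in enumerate(requests):
        if requesting:
            return port      # highest-priority requester wins
    return None              # no requests this cycle

print(grant([False, True, False, True]))  # -> 1 (port 1 beats port 3)
```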
Crossbar Switch
• Crosspoints placed at intersections of processor buses and memory buses
• See figure 13-4 on page 495
• Each switch determines the path (control logic)
– Examines the address on the bus
– Resolves conflicts by a predetermined, hardcoded priority
• See figure 13-5 on page 495
– Data flows in both directions
– Multiplexers select the data (remember select lines??)
Crossbar Switch
• Supports simultaneous transfers from all MMs
– A separate path is associated with each MM
• Hardware can be large and complex
• Number of switches needed is Processors × MMs (see the worked example below)
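Since one switch sits at every processor-bus/memory-bus intersection, the switch count grows with the product of the two, which is why large crossbars get expensive. A quick illustration:

```python
def crosspoints(processors: int, memory_modules: int) -> int:
    # One switch at every processor-bus / memory-bus intersection.
    return processors * memory_modules

for n in (4, 16, 64):
    print(f"{n} x {n} crossbar needs {crosspoints(n, n)} switches")
# 4 x 4 crossbar needs 16 switches
# 16 x 16 crossbar needs 256 switches
# 64 x 64 crossbar needs 4096 switches
```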
Multistage Switching Network
• The basic component is a 2-input, 2-output interchange switch
• See figure 13-6 on page 496
• The switch can arbitrate between conflicting requests
• Can be used to build a switching network (a model of the switch follows below)
• See figure 13-7 on page 497
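A minimal model of the interchange switch's two basic settings; a real interchange switch may also support broadcast settings, which are omitted here:

```python
def interchange(control: int, in0, in1):
    """2-input, 2-output interchange switch.

    control = 0: straight  (in0 -> out0, in1 -> out1)
    control = 1: exchange  (in0 -> out1, in1 -> out0)
    """
    return (in0, in1) if control == 0 else (in1, in0)

print(interchange(0, "A", "B"))  # ('A', 'B')  straight through
print(interchange(1, "A", "B"))  # ('B', 'A')  crossed
```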
Patterns & Omega
• Not all patterns are always available to all processors
• If P1 is accessing 0xx, then P2 can only access 1xx
• Used in both tightly and loosely coupled systems
• Omega Switching Network - see figure 13-8 on page 498
– Exactly 1 path from each source to each MM
– Some patterns cannot be connected simultaneously (e.g., 000 and 001)
• 1 switch carries 1 signal at a time (routing is sketched below)
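A minimal sketch of the destination-tag routing rule used in omega networks: at each stage, one bit of the destination address (most significant bit first) selects the upper or lower switch output. The omega_route helper is illustrative:

```python
def omega_route(dest: int, stages: int):
    """Destination-tag routing through an omega network.

    At stage i the switch examines one destination bit (MSB first):
    0 selects the upper switch output, 1 selects the lower output.
    """
    path = []
    for i in reversed(range(stages)):      # walk the bits MSB -> LSB
        bit = (dest >> i) & 1
        path.append("upper" if bit == 0 else "lower")
    return path

# Route to MM 5 (binary 101) through a 3-stage (8x8) omega network:
print(omega_route(5, 3))   # ['lower', 'upper', 'lower']
```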
Omega Network
• Tightly coupled systems
– Sources - processors
– Destinations - MMs
• Loosely coupled systems
– Source - a processor
– Destination - a processor
Hypercube
• Hypercube or binary n-cube
• Loosely coupled system composed of N = 2^n processors interconnected in an n-dimensional binary cube
• Each node contains a CPU, local memory, and I/O interfaces
• Direct communication paths to n other nodes (1 hop)
• There are 2^n distinct n-bit binary addresses to be assigned to the processors
• Each neighboring processor's address differs in exactly 1 bit position
• See figure 13-9 on page 499
Routing Messages
• A message takes from 1 to n hops (the maximum from source to destination)
• Routing procedure (sketched below)
– XOR the source and destination addresses
• The result shows on which axes the addresses differ
– Send along any indicated axis
– Repeat until arrival at the destination
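A minimal sketch of the XOR routing procedure, assuming node addresses are plain integers; the route helper and the bit trick for choosing an axis are illustrative:

```python
def route(src: int, dest: int):
    """XOR routing in an n-cube: flip one differing address bit per hop."""
    path = [src]
    node = src
    while node != dest:
        diff = node ^ dest                        # axes where addresses differ
        axis = (diff & -diff).bit_length() - 1    # pick the lowest such axis
        node ^= 1 << axis                         # one hop along that axis
        path.append(node)
    return path

# In a 3-cube, the neighbors of node 000 differ in exactly 1 bit:
print([format(1 << i, "03b") for i in range(3)])        # ['001', '010', '100']

# Routing 010 -> 111 takes at most n = 3 hops:
print([format(n, "03b") for n in route(0b010, 0b111)])  # ['010', '011', '111']
```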