SE363
Computer Architecture
MIMD Parallel Processors
John Morris
[Title slide photo: Iolanthe II racing in Waitemata Harbour]
MIMD Systems
• Recipe
  • Buy a few high-performance commercial PEs
    • DEC Alpha
    • MIPS R10000
    • UltraSPARC
    • Pentium?
  • Put them together with some memory and peripherals on a common bus
    → instant parallel processor!
• How to program it?
Programming Model
• Problem not unique to MIMD
  • Even sequential machines need one
    • von Neumann (stored program) model
• Parallel - splitting the workload
  • Data
    • Distribute data to PEs
  • Instructions
    • Distribute tasks to PEs
  • Synchronization
    • Having divided the data & tasks, how do we synchronize the tasks?
Programming Model
• Shared Memory Model
  • Flavour of the year
  • Generally thought to be simplest to manage
  • All PEs see a common (virtual) address space
  • PEs communicate by writing into the common address space
Data Distribution
• Trivial
  • All the data sits in the common address space
  • Any PE can access it!
• Uniform Memory Access (UMA) systems
  • All PEs access all data with the same access time, t_acc
• Non-UMA (NUMA) systems
  • Memory is physically distributed
  • Some PEs are “closer” to some addresses
  • More later!
Synchronisation
• Reading static shared data
  • No problem!
• Update problem
  • PE0 writes x
  • PE1 reads x
  • How to ensure that PE1 reads the last value written by PE0?
• Semaphores
  • Lock resources (memory areas or ...) while they are being updated by one PE
Synchronisation
• Semaphore
  • Data structure in memory
    • Count of waiters
      • -1 = resource free
      • >= 0 = resource in use (value = number of waiters)
    • Pointer to a list of waiters
  • Two operations (sketched in C below)
    • Wait
      • Proceed immediately if the resource is free (waiter count = -1)
      • Otherwise join the wait list and block
    • Notify
      • Advise the semaphore that you have finished with the resource
      • Decrement the waiter count
      • The first waiter will be given control
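A minimal C sketch of this semaphore (illustrative only: tcb, enqueue, dequeue, block and wakeup stand for hypothetical kernel primitives; the read-modify-writes of count are deliberately left unprotected - the next slide shows why that is a problem):

    struct tcb;                               /* task control block (hypothetical) */
    typedef struct waiter {
        struct tcb    *task;
        struct waiter *next;
    } waiter_t;

    typedef struct {
        int       count;                      /* -1 = free, >= 0 = in use */
        waiter_t *waiters;                    /* list of blocked tasks */
    } semaphore_t;

    extern void enqueue(waiter_t **list, struct tcb *t);   /* hypothetical kernel helpers */
    extern struct tcb *dequeue(waiter_t **list);
    extern void block(struct tcb *t);
    extern void wakeup(struct tcb *t);

    void sem_wait(semaphore_t *s, struct tcb *self) {
        s->count++;                           /* read-modify-write: must be atomic! */
        if (s->count == 0)
            return;                           /* resource was free: proceed */
        enqueue(&s->waiters, self);           /* otherwise queue this task's TCB */
        block(self);                          /* ... and wait to be woken */
    }

    void sem_notify(semaphore_t *s) {
        s->count--;                           /* also a read-modify-write */
        if (s->count >= 0)
            wakeup(dequeue(&s->waiters));     /* first waiter is given control */
    }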
Semaphores - Implementation
• Scenario
  • Semaphore free (count = -1)
  • PE0: wait ...
    • Resource is free, so PE0 uses it (sets count to 0)
  • PE1: wait ...
    • Reads the count (0)
    • Starts to increment it ...
  • PE0: notify ...
    • Gets the bus and writes -1
  • PE1: (finishing its wait)
    • Adds 1 to the 0 it read, writes 1 to the count, adds PE1's TCB to the wait list
• Stalemate! Who issues the notify to free the resource?
Atomic Operations
• Problem
  • PE0 wrote a new value (-1) after PE1 had read the counter
  • PE1 increments the value it read (0) and writes it back
• Solution
  • PE1’s read and update must be atomic
    • No other PE may gain access to the counter while PE1 is updating it
• Usually an architecture will provide
  • A test-and-set instruction
    • Read a memory location, test it; if it’s 0, write a new value, else do nothing
    • Atomic or indivisible: no other PE can access the value until the operation is complete
Atomic Operations
• Test & Set
  • Read a memory location, test it; if it’s 0, write a new value, else do nothing
• Can be used to guard a resource
  • When the location contains 0, access to the resource is allowed
  • A non-zero value means the resource is locked
• Semaphores
  • Simple semaphore (no wait list)
    • Implement directly with test & set
    • A waiter “backs off” and tries again (rather than being queued), as in the sketch below
  • Complex semaphore (with wait list)
    • Test & set guards the wait counter
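For illustration, a “simple semaphore” lock of this kind built on C11's standard atomic test-and-set (a sketch, not the lecture's code; the polarity matches the slide: clear/0 = free, set = locked):

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;   /* clear (0) = resource free */

    void acquire(void) {
        /* Atomically set the flag and get its OLD value: if it was already
         * set, the resource is locked, so back off and try again. */
        while (atomic_flag_test_and_set(&lock))
            ;                                     /* spin */
    }

    void release(void) {
        atomic_flag_clear(&lock);                 /* 0 again: resource free */
    }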
Atomic Operations
• Processor must provide an atomic operation for
  • Multi-tasking or multi-threading on a single PE
    • Multiple processes
    • Interrupts occur at arbitrary points in time
      • including timer interrupts signalling the end of a time-slice
    • Any process can be interrupted in the middle of a read-modify-write sequence
  • Shared-memory multiprocessors
    • One PE can lose control of the bus after the read of a read-modify-write
• Cache?
  • Later!
Atomic Operations
• Variations
  • Provide equivalent capability
  • Sometimes appear in strange guises!
• Read-modify-write bus transactions
  • A memory location is read, modified and written back as a single, indivisible operation
• Test and exchange
  • Check a register’s value; if 0, exchange it with memory
• Reservation register (PowerPC)
  • lwarx - load word and reserve indexed
  • stwcx. - store word conditional indexed
  • The reservation register stores the address of the reserved word
  • The reservation and its use can be separated by a sequence of instructions (see the C analogy below)
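The effect of a lwarx/stwcx. pair can be sketched in portable C11 as a compare-exchange retry loop (an analogy, not actual PowerPC code):

    #include <stdatomic.h>

    void atomic_increment(_Atomic int *counter) {
        int old = atomic_load(counter);   /* ~ lwarx: load and "reserve" the word */
        /* ~ stwcx.: the store succeeds only if no other PE has written the
         * word since the load; on failure `old` is reloaded and we retry. */
        while (!atomic_compare_exchange_weak(counter, &old, old + 1))
            ;                             /* reservation lost: retry */
    }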
Barriers
• In a shared memory environment
  • PEs must know when another PE has produced a result
• Simplest case: a barrier for all PEs (see the sketch below)
  • Must be inserted by the programmer
• Potentially expensive
  • All PEs stall and waste time in the barrier
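A sketch of barrier use with POSIX threads (compute_my_slice and use_other_pes_results are hypothetical placeholders for the real work):

    #include <pthread.h>

    #define NPES 4
    pthread_barrier_t barrier;  /* once: pthread_barrier_init(&barrier, NULL, NPES); */

    extern void compute_my_slice(void *arg);        /* hypothetical */
    extern void use_other_pes_results(void *arg);   /* hypothetical */

    void *worker(void *arg) {
        compute_my_slice(arg);             /* each PE produces its partial result */
        pthread_barrier_wait(&barrier);    /* every PE stalls here until all arrive */
        use_other_pes_results(arg);        /* now safe: all results are visible */
        return NULL;
    }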
Cache?
• What happens to cached locations?
Multiple Caches - Inconsistent States
• Coherence
  • PEA reads location x (say, 200) from memory → copy in cache A
  • PEB reads location x from memory → copy in cache B
  • PEA adds 1
    • A’s copy is now 201
  • PEB reads location x
    • reads 200 from cache B
• The caches and memory are now inconsistent, or not coherent
Cache - Maintaining Coherence
• Invalidate on write
  • PEA reads location x from memory → copy in cache A
  • PEB reads location x from memory → copy in cache B
  • PEA adds 1
    • A’s copy is now 201; A issues “invalidate x”
    • Cache B marks x invalid
• An invalidate is an address-only bus transaction
Cache - Maintaining Coherence
• Reading the new value
  • PEB reads location x
    • Main memory is wrong too
  • PEA snoops the read and realises it has the valid copy
  • PEA issues a retry
  • PEA writes x back
    • Memory is now correct
  • PEB reads location x again
    • Reads the latest version
Coherent Cache - Snooping
• The SIU “snoops” the bus for transactions
  • Addresses are compared with the local cache
• Matches
  • Initiate retries
    • Local copy is modified
    • Local copy is written to the bus
  • Invalidate local copies
    • Another PE is writing
  • Mark local copies shared
    • A second PE is reading the same value
Coherent Cache - MESI protocol
• A cache line has 4 states
  • Invalid
  • Modified
    • Only valid copy
    • Memory copy is invalid
  • Exclusive
    • Only cached copy
    • Memory copy is valid
  • Shared
    • Multiple cached copies
    • Memory copy is valid
MESI State Diagram
• Note the number of bus transactions needed!
[MESI state diagram figure. Legend:]
  WH  - Write Hit
  WM  - Write Miss
  RH  - Read Hit
  RMS - Read Miss, Shared
  RME - Read Miss, Exclusive
  SHW - Snoop Hit on a Write
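The same transitions can be summarised in code; a sketch of the standard MESI rules (write_back and bus_invalidate are hypothetical bus helpers):

    typedef enum { INVALID, MODIFIED, EXCLUSIVE, SHARED } mesi_t;

    extern void write_back(void);       /* hypothetical: push out the dirty line */
    extern void bus_invalidate(void);   /* hypothetical: address-only transaction */

    /* Local write (WH on M/E/S, WM on I): all paths end in MODIFIED.
     * Only S needs a bus transaction (I must first fetch the line). */
    mesi_t on_local_write(mesi_t s) {
        if (s == SHARED) bus_invalidate();   /* other copies must be invalidated */
        return MODIFIED;
    }

    /* Local read miss: RMS if another cache holds the line, else RME. */
    mesi_t on_read_miss(int held_elsewhere) {
        return held_elsewhere ? SHARED : EXCLUSIVE;
    }

    /* Snoop hit on another PE's write (SHW): push out if dirty, then invalidate. */
    mesi_t on_snoop_write(mesi_t s) {
        if (s == MODIFIED) write_back();
        return INVALID;
    }

    /* Snoop hit on another PE's read: supply the dirty data, then share. */
    mesi_t on_snoop_read(mesi_t s) {
        if (s == MODIFIED) write_back();
        return (s == INVALID) ? INVALID : SHARED;
    }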
Coherent Cache - The Cost
• Cache coherency transactions
  • Additional bus transactions are needed
• Shared
  • Write hit
    • Other caches must be notified
• Modified
  • Other PE reads
    • Push-out needed
  • Other PE writes
    • Push-out needed - writing one word of an n-word line
• Invalid - modified in another cache
  • Read or write
    • Wait for the push-out
Clusters
• A bus which is too long becomes slow!
  • e.g. PCI is limited to 10 TTL loads
• Lots of processors on the same bus?
  • Bus speed must be limited → low communication rate → better to use a single PE!
• Clusters
  • ~8 processors on a bus
Clusters
[Figure: ~8 cache-coherent (CC) processors on each bus; ~100? clusters joined by an interconnect network]
Clusters
[Figure: each cluster’s Network Interface Unit (NIU) detects requests from the PEs in its cluster for “remote” memory and despatches a memory-request message to the remote cluster’s NIU. Memory in the local cluster is much “closer” than memory in a remote one.]
Clusters - Shared Memory
• Non-Uniform Memory Access (NUMA)
  • Access time to memory depends on its location!
• Worse: the NIU needs to maintain cache coherence across the entire machine
Clusters - Maintaining Cache Coherence
• The NIU (or equivalent) maintains a directory
• Directory entries
  • All lines from local memory cached elsewhere
• NIU software (firmware)
  • Checks memory requests against the directory
  • Updates the directory
  • Sends invalidate messages to other clusters
  • Fetches modified (dirty) lines from other clusters
• Remote memory access cost
  • 100s of cycles!
Directory (Cluster 2):

  Address   Status   Clusters
  4340      S        1, 3, 8
  5260      E        9
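A sketch of what one such directory entry might look like in C (the names and the 64-cluster bit vector are illustrative assumptions):

    #include <stdint.h>

    typedef enum { DIR_S, DIR_E } dir_state_t;   /* S = shared, E = exclusive */

    typedef struct {
        uint32_t    address;    /* line address, e.g. 4340 */
        dir_state_t status;     /* S or E, as in the table above */
        uint64_t    clusters;   /* bit i set => cluster i holds a copy */
    } dir_entry_t;

    extern void send_invalidate(int cluster, uint32_t address);  /* hypothetical NIU message */

    /* On a write request from cluster `writer`: invalidate every other
     * sharer, then record the writer as the exclusive owner. */
    void directory_write(dir_entry_t *e, int writer) {
        for (int c = 0; c < 64; c++)
            if (((e->clusters >> c) & 1) && c != writer)
                send_invalidate(c, e->address);
        e->clusters = 1ull << writer;
        e->status   = DIR_E;
    }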
Clusters - “Off the shelf”
• Commercial clusters
  • Provide page migration
    • Make a copy of a remote page on the local PE
    • The programmer remains responsible for coherence
  • Don’t provide hardware support for cache coherence (across the network)
    • Fully CC machines may never be available!
• Software systems
  • ...
Shared Memory Systems
• Software systems, e.g. TreadMarks
  • Provide shared memory on a page basis
  • Software
    • detects references to remote pages
    • moves a copy to local memory
  • Reduces shared memory overhead
  • Provides some of the shared memory model’s convenience
    • without swamping the interconnection network with messages
    • the message overhead is too high for a single word - a word basis is too expensive!!
Shared Memory Systems - Granularity
• Granularity
  • A word basis is too expensive!!
    • Sharing data at low granularity
• Fine-grain sharing
  • Access / sharing for individual words
  • Overheads are too high
    • Number of messages
    • Message overhead is high for one word
• Compare: burst access to memory
  • Don’t fetch a single word
    • the overhead (bus protocol) is too high
  • Amortize the cost of an access over multiple words
Shared Memory Systems - Granularity
• Coarse-grain systems
  • Transferring data from cluster to cluster
  • Overhead
    • Messages
    • Updating the directory
  • Amortise the overhead over a whole page → lower relative overhead
• Applies to thread size too
  • Split the program into small threads of control?
Parallel Overhead
• Cost of setting up & starting each thread
• Cost of synchronising at the end of a set of threads
• It can be more efficient to run a single sequential thread!
Coarse Grain Systems
• So far ...
  • Most experiments suggest that fine-grain systems are impractical
  • Larger, coarser grains
    • Blocks of data
    • Threads of computation
    are needed to reduce overall computation time when using multiple processors
• Parallel systems that are too fine-grained can run slower than a single processor!
Parallel Overhead
• Ideal
  • Time = 1/n
• Add overhead
  • Time > optimal
  • No point in using more than 4 PEs!!
[Chart: execution time vs number of PEs (0-12). The “Ideal” curve falls as 1/n; the “+ Parallel O’head” curve reaches its minimum near 4 PEs.]
Parallel Overhead
• Shared memory systems: best results if you
  • Share on a large-block basis, e.g. pages
  • Split the program into coarse-grain (long-running) threads
  • Give away some parallelism to achieve any parallel speedup!
• Coarse grain
  • Data
  • Computation
• There’s parallelism at the instruction level too! The instruction issue unit in a sequential processor is trying to exploit it!
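A simple model (an illustration, not from the slides) of why the execution-time curve in the plot above bottoms out: suppose the work divides perfectly across n PEs but the parallel overhead (thread startup, synchronisation) grows linearly with n,

    T(n) = 1/n + α·n

Setting dT/dn = -1/n² + α = 0 gives the optimum n* = 1/√α; a per-PE overhead of α ≈ 1/16 of the sequential time puts the minimum at n* = 4 PEs, beyond which adding PEs makes the program slower.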
Clusters - Improving multiple PE performance
• Bandwidth to memory
  • A cache reduces dependency on the memory-CPU interface
    • 95% cache hits → only 5% of memory accesses cross the interface
  • But add a few PEs and a few CC transactions:
    even if the interface was coping before, it won’t in a multiprocessor system!
  • A major bottleneck!
Clusters - Improving multiple PE performance
• Bus protocols add to access time
  • Request / grant / release phases are needed
• “Point-to-point” is faster!
  • Cross-bar switch interface to memory
  • No PE contends with any other for a common bus
• Cross-bar? The name is taken from old telephone exchanges!
Clusters - Memory Bandwidth
• Modern clusters
  • Use “point-to-point” cross-bar (X-bar) interfaces to memory to get bandwidth!
• Cache coherence?
  • Now really hard!!
  • How does each cache snoop all transactions?
Programming Model
• Distributed Memory
  • Message passing
  • An alternative to shared memory
  • Each PE has its own address space
  • PEs communicate with messages
  • Messages provide synchronisation
    • A PE can block, or wait for a message
Programming Model - Distributed Memory
• Distributed Memory Systems
  • The hardware is simple!
  • The network can be as simple as Ethernet
  • Networks of Workstations model
    • Commodity (cheap!) PEs
    • Commodity network
      • Standard
        • Ethernet
        • ATM
      • Proprietary
        • Myrinet
        • Achilles (UWA!)
Programming Model - Distributed Memory
• Distributed Memory Systems
  • The software is considered harder
  • The programmer is responsible for
    • Distributing data to individual PEs
    • Explicit thread control
      • Starting, stopping & synchronising
  • At least two commonly available systems
    • Parallel Virtual Machine (PVM)
    • Message Passing Interface (MPI)
  • Built on two operations (sketched below)
    • Send(data, destPE, block | don’t block)
    • Receive(data, srcPE, block | don’t block)
    • Blocking ensures synchronisation
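For example, a minimal MPI sketch of these two operations (rank 0 sends one int to rank 1; the blocking MPI_Recv provides the synchronisation):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* destPE = 1 */
        } else if (rank == 1) {
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                         /* blocks until sent */
            printf("PE 1 received %d\n", data);
        }
        MPI_Finalize();
        return 0;
    }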
Programming Model - Distributed Memory
• Distributed Memory Systems
  • Performance is generally better (versus shared memory)
  • Shared memory has hidden overheads
    • Grain size poorly chosen
      • e.g. data doesn’t fit into pages
    • Unnecessary coherence transactions
      • Shared memory updates a shared region (each page) before the end of the computation
      • A message-passing system waits and sends the page when the computation is complete
Programming Model - Distributed Memory
• Distributed Memory Systems
  • Performance is generally better (versus shared memory)
• False sharing
  • Severely degrades performance
  • May not be apparent from a superficial analysis
[Figure: one memory page; PEa accesses data at one end, PEb accesses data at the other. The whole page “ping-pongs” between PEa and PEb.]
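A C sketch of the pattern in the figure (illustrative only): a and b are logically independent, but because they sit in the same cache line (or, in a page-based DSM, the same page), each PE's writes keep invalidating the other PE's copy:

    #include <pthread.h>

    struct { long a; long b; } shared;   /* a and b share one line/page */

    void *pe_a(void *unused) {           /* PEa only ever touches a ... */
        for (long i = 0; i < 100000000; i++) shared.a++;
        return NULL;
    }

    void *pe_b(void *unused) {           /* ... and PEb only ever touches b, */
        for (long i = 0; i < 100000000; i++) shared.b++;
        return NULL;                     /* yet the line/page ping-pongs */
    }

    /* One cure: pad so each counter owns its cache line (64 bytes assumed). */
    struct { long a; char pad[64 - sizeof(long)]; long b; } padded;

    /* Run e.g. with: pthread_create(&ta, NULL, pe_a, NULL); etc. */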
Distributed Memory - Summary
• Simpler (almost trivial) hardware
• Software
  • More programmer effort
    • Explicit data distribution
    • Explicit synchronisation
• Performance is generally better
  • The programmer knows more about the problem
  • Communicates only when necessary
  • The communication grain size can be optimal
  → lower overheads
Data Flow
• Conventional programming models are control driven
  • The instruction sequence is precisely specified
  • The sequence specifies control
    • i.e. which instruction the CPU will execute next
• Execution rule:
  • Execute an instruction when its predecessor has completed

      s1: r = a*b;
      s2: s = c*d;
      s3: y = r + s;

  • s2 executes when s1 is complete; s3 executes when s2 is complete
Data Flow
• Consider the calculation
  • y = a*b + c*d
• Represent it by a graph
  • Nodes represent computations
  • Data flows along the arcs
• Execution rule:
  • Execute an instruction when its data is available
  • The data-driven rule
[Dataflow graph: a and b feed one × node, c and d feed another; both products flow into a + node, which produces y.]
Data Flow
• Dataflow firing rule
  • An instruction fires (executes) when its data is available
• Exposes all possible parallelism
  • Either multiplication can fire as soon as its data arrives
  • The addition must wait for both products
• Data dependence analysis!
  • Instruction issue units fire (issue) each instruction when its operands (registers) have been written
[Dataflow graph as above: a×b and c×d feed the + node producing y.]
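A toy C sketch of the firing rule for one two-input node (names are illustrative; fire would execute the node's operation and forward the result along its output arc):

    typedef struct node {
        int have_left, have_right;    /* which operands have arrived? */
        int left, right;              /* operand values */
    } node_t;

    extern void fire(node_t *n);      /* hypothetical: execute & forward result */

    /* Called when a value flows down an input arc; the node fires as
     * soon as BOTH operands are present - the data-driven rule. */
    void deliver(node_t *n, int is_left, int value) {
        if (is_left) { n->left  = value; n->have_left  = 1; }
        else         { n->right = value; n->have_right = 1; }
        if (n->have_left && n->have_right)
            fire(n);
    }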