SE363
Computer Architecture
MIMD Parallel Processors
John Morris
[Title slide photo: Iolanthe II racing in Waitemata Harbour]
MIMD Systems
• Recipe
  • Buy a few high-performance commercial PEs
    • DEC Alpha
    • MIPS R10000
    • UltraSPARC
    • Pentium?
  • Put them together with some memory and peripherals on a common bus
    → instant parallel processor!
• How to program it?
Programming Model
• Problem not unique to MIMD
  • Even sequential machines need one
    • von Neumann (stored program) model
• Parallel - splitting the workload
  • Data
    • Distribute data to PEs
  • Instructions
    • Distribute tasks to PEs
  • Synchronization
    • Having divided the data & tasks, how do we synchronize the tasks?
Programming Model
• Shared Memory Model
  • Flavour of the year
  • Generally thought to be simplest to manage
  • All PEs see a common (virtual) address space
  • PEs communicate by writing into the common address space
Data Distribution
• Trivial
  • All the data sits in the common address space
  • Any PE can access it!
• Uniform Memory Access (UMA) systems
  • All PEs access all data with the same access time, t_acc
• Non-UMA (NUMA) systems
  • Memory is physically distributed
  • Some PEs are “closer” to some addresses
  • More later!
Synchronisation
• Reading static shared data
  • No problem!
• Update problem
  • PE0 writes x
  • PE1 reads x
  • How to ensure that PE1 reads the last value written by PE0?
• Semaphores
  • Lock resources (memory areas or ...) while they are being updated by one PE
Synchronisation
• Semaphore
  • Data structure in memory
    • Count of waiters
      • -1 = resource free
      • >= 0 = resource in use (value = number of waiters)
    • Pointer to a list of waiters
  • Two operations (sketched in C below)
    • Wait
      • Proceed immediately if the resource is free (waiter count = -1)
      • Otherwise join the wait list and block
    • Notify
      • Advise the semaphore that you have finished with the resource
      • Decrement the waiter count
      • The first waiter will be given control
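A minimal C sketch of this semaphore (illustrative only: tcb, enqueue, dequeue, block and wakeup stand for hypothetical kernel primitives; the read-modify-writes of count are deliberately left unprotected - the next slide shows why that is a problem):

    struct tcb;                               /* task control block (hypothetical) */
    typedef struct waiter {
        struct tcb    *task;
        struct waiter *next;
    } waiter_t;

    typedef struct {
        int       count;                      /* -1 = free, >= 0 = in use */
        waiter_t *waiters;                    /* list of blocked tasks */
    } semaphore_t;

    extern void enqueue(waiter_t **list, struct tcb *t);   /* hypothetical kernel helpers */
    extern struct tcb *dequeue(waiter_t **list);
    extern void block(struct tcb *t);
    extern void wakeup(struct tcb *t);

    void sem_wait(semaphore_t *s, struct tcb *self) {
        s->count++;                           /* read-modify-write: must be atomic! */
        if (s->count == 0)
            return;                           /* resource was free: proceed */
        enqueue(&s->waiters, self);           /* otherwise queue this task's TCB */
        block(self);                          /* ... and wait to be woken */
    }

    void sem_notify(semaphore_t *s) {
        s->count--;                           /* also a read-modify-write */
        if (s->count >= 0)
            wakeup(dequeue(&s->waiters));     /* first waiter is given control */
    }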
Semaphores - Implementation
• Scenario
  • Semaphore free (count = -1)
  • PE0: wait ...
    • Resource is free, so PE0 uses it (sets count to 0)
  • PE1: wait ...
    • Reads the count (0)
    • Starts to increment it ...
  • PE0: notify ...
    • Gets the bus and writes -1
  • PE1: (finishing its wait)
    • Adds 1 to the 0 it read, writes 1 to the count, adds PE1's TCB to the wait list
• Stalemate! Who issues the notify to free the resource?
Atomic Operations
• Problem
  • PE0 wrote a new value (-1) after PE1 had read the counter
  • PE1 increments the value it read (0) and writes it back
• Solution
  • PE1’s read and update must be atomic
    • No other PE may gain access to the counter while PE1 is updating it
• Usually an architecture will provide
  • A test-and-set instruction
    • Read a memory location, test it; if it’s 0, write a new value, else do nothing
    • Atomic or indivisible: no other PE can access the value until the operation is complete
Atomic Operations
• Test & Set
  • Read a memory location, test it; if it’s 0, write a new value, else do nothing
• Can be used to guard a resource
  • When the location contains 0, access to the resource is allowed
  • A non-zero value means the resource is locked
• Semaphores
  • Simple semaphore (no wait list)
    • Implement directly with test & set
    • A waiter “backs off” and tries again (rather than being queued), as in the sketch below
  • Complex semaphore (with wait list)
    • Test & set guards the wait counter
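For illustration, a “simple semaphore” lock of this kind built on C11's standard atomic test-and-set (a sketch, not the lecture's code; the polarity matches the slide: clear/0 = free, set = locked):

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;   /* clear (0) = resource free */

    void acquire(void) {
        /* Atomically set the flag and get its OLD value: if it was already
         * set, the resource is locked, so back off and try again. */
        while (atomic_flag_test_and_set(&lock))
            ;                                     /* spin */
    }

    void release(void) {
        atomic_flag_clear(&lock);                 /* 0 again: resource free */
    }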
Atomic Operations
• Processor must provide an atomic operation for
  • Multi-tasking or multi-threading on a single PE
    • Multiple processes
    • Interrupts occur at arbitrary points in time
      • including timer interrupts signalling the end of a time-slice
    • Any process can be interrupted in the middle of a read-modify-write sequence
  • Shared-memory multiprocessors
    • One PE can lose control of the bus after the read of a read-modify-write
• Cache?
  • Later!
Atomic Operations
• Variations
  • Provide equivalent capability
  • Sometimes appear in strange guises!
• Read-modify-write bus transactions
  • A memory location is read, modified and written back as a single, indivisible operation
• Test and exchange
  • Check a register’s value; if 0, exchange it with memory
• Reservation register (PowerPC)
  • lwarx - load word and reserve indexed
  • stwcx. - store word conditional indexed
  • The reservation register stores the address of the reserved word
  • The reservation and its use can be separated by a sequence of instructions (see the C analogy below)
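The effect of a lwarx/stwcx. pair can be sketched in portable C11 as a compare-exchange retry loop (an analogy, not actual PowerPC code):

    #include <stdatomic.h>

    void atomic_increment(_Atomic int *counter) {
        int old = atomic_load(counter);   /* ~ lwarx: load and "reserve" the word */
        /* ~ stwcx.: the store succeeds only if no other PE has written the
         * word since the load; on failure `old` is reloaded and we retry. */
        while (!atomic_compare_exchange_weak(counter, &old, old + 1))
            ;                             /* reservation lost: retry */
    }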
Barriers
• In a shared memory environment
  • PEs must know when another PE has produced a result
• Simplest case: a barrier for all PEs (see the sketch below)
  • Must be inserted by the programmer
• Potentially expensive
  • All PEs stall and waste time in the barrier
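A sketch of barrier use with POSIX threads (compute_my_slice and use_other_pes_results are hypothetical placeholders for the real work):

    #include <pthread.h>

    #define NPES 4
    pthread_barrier_t barrier;  /* once: pthread_barrier_init(&barrier, NULL, NPES); */

    extern void compute_my_slice(void *arg);        /* hypothetical */
    extern void use_other_pes_results(void *arg);   /* hypothetical */

    void *worker(void *arg) {
        compute_my_slice(arg);             /* each PE produces its partial result */
        pthread_barrier_wait(&barrier);    /* every PE stalls here until all arrive */
        use_other_pes_results(arg);        /* now safe: all results are visible */
        return NULL;
    }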
Cache?
• What happens to cached locations?
Multiple Caches - Inconsistent States
• Coherence
  • PEA reads location x (say, 200) from memory → copy in cache A
  • PEB reads location x from memory → copy in cache B
  • PEA adds 1
    • A’s copy is now 201
  • PEB reads location x
    • reads 200 from cache B
• The caches and memory are now inconsistent, or not coherent
Cache - Maintaining Coherence
• Invalidate on write
  • PEA reads location x from memory → copy in cache A
  • PEB reads location x from memory → copy in cache B
  • PEA adds 1
    • A’s copy is now 201; A issues “invalidate x”
    • Cache B marks x invalid
• An invalidate is an address-only bus transaction
Cache - Maintaining Coherence
• Reading the new value
  • PEB reads location x
    • Main memory is wrong too
  • PEA snoops the read and realises it has the valid copy
  • PEA issues a retry
  • PEA writes x back
    • Memory is now correct
  • PEB reads location x again
    • Reads the latest version
Coherent Cache - Snooping
• The SIU “snoops” the bus for transactions
  • Addresses are compared with the local cache
• Matches
  • Initiate retries
    • Local copy is modified
    • Local copy is written to the bus
  • Invalidate local copies
    • Another PE is writing
  • Mark local copies shared
    • A second PE is reading the same value
Coherent Cache - MESI protocol
• A cache line has 4 states
  • Invalid
  • Modified
    • Only valid copy
    • Memory copy is invalid
  • Exclusive
    • Only cached copy
    • Memory copy is valid
  • Shared
    • Multiple cached copies
    • Memory copy is valid
MESI State Diagram
• Note the number of bus transactions needed!
[MESI state diagram figure. Legend:]
  WH  - Write Hit
  WM  - Write Miss
  RH  - Read Hit
  RMS - Read Miss, Shared
  RME - Read Miss, Exclusive
  SHW - Snoop Hit on a Write
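The same transitions can be summarised in code; a sketch of the standard MESI rules (write_back and bus_invalidate are hypothetical bus helpers):

    typedef enum { INVALID, MODIFIED, EXCLUSIVE, SHARED } mesi_t;

    extern void write_back(void);       /* hypothetical: push out the dirty line */
    extern void bus_invalidate(void);   /* hypothetical: address-only transaction */

    /* Local write (WH on M/E/S, WM on I): all paths end in MODIFIED.
     * Only S needs a bus transaction (I must first fetch the line). */
    mesi_t on_local_write(mesi_t s) {
        if (s == SHARED) bus_invalidate();   /* other copies must be invalidated */
        return MODIFIED;
    }

    /* Local read miss: RMS if another cache holds the line, else RME. */
    mesi_t on_read_miss(int held_elsewhere) {
        return held_elsewhere ? SHARED : EXCLUSIVE;
    }

    /* Snoop hit on another PE's write (SHW): push out if dirty, then invalidate. */
    mesi_t on_snoop_write(mesi_t s) {
        if (s == MODIFIED) write_back();
        return INVALID;
    }

    /* Snoop hit on another PE's read: supply the dirty data, then share. */
    mesi_t on_snoop_read(mesi_t s) {
        if (s == MODIFIED) write_back();
        return (s == INVALID) ? INVALID : SHARED;
    }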
Coherent Cache - The Cost
• Cache coherency transactions
  • Additional bus transactions are needed
• Shared
  • Write hit
    • Other caches must be notified
• Modified
  • Other PE reads
    • Push-out needed
  • Other PE writes
    • Push-out needed - writing one word of an n-word line
• Invalid - modified in another cache
  • Read or write
    • Wait for the push-out
Clusters
• A bus which is too long becomes slow!
  • e.g. PCI is limited to 10 TTL loads
• Lots of processors on the same bus?
  • Bus speed must be limited → low communication rate → better to use a single PE!
• Clusters
  • ~8 processors on a bus
Clusters
[Figure: ~8 cache-coherent (CC) processors on each bus; ~100? clusters joined by an interconnect network]
Clusters
[Figure: each cluster’s Network Interface Unit (NIU) detects requests from the PEs in its cluster for “remote” memory and despatches a memory-request message to the remote cluster’s NIU. Memory in the local cluster is much “closer” than memory in a remote one.]
Clusters - Shared Memory
• Non-Uniform Memory Access (NUMA)
  • Access time to memory depends on its location!
• Worse: the NIU needs to maintain cache coherence across the entire machine
Clusters - Maintaining Cache Coherence
• The NIU (or equivalent) maintains a directory
• Directory entries
  • All lines from local memory cached elsewhere
• NIU software (firmware)
  • Checks memory requests against the directory
  • Updates the directory
  • Sends invalidate messages to other clusters
  • Fetches modified (dirty) lines from other clusters
• Remote memory access cost
  • 100s of cycles!
Directory (Cluster 2):

  Address   Status   Clusters
  4340      S        1, 3, 8
  5260      E        9
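A sketch of what one such directory entry might look like in C (the names and the 64-cluster bit vector are illustrative assumptions):

    #include <stdint.h>

    typedef enum { DIR_S, DIR_E } dir_state_t;   /* S = shared, E = exclusive */

    typedef struct {
        uint32_t    address;    /* line address, e.g. 4340 */
        dir_state_t status;     /* S or E, as in the table above */
        uint64_t    clusters;   /* bit i set => cluster i holds a copy */
    } dir_entry_t;

    extern void send_invalidate(int cluster, uint32_t address);  /* hypothetical NIU message */

    /* On a write request from cluster `writer`: invalidate every other
     * sharer, then record the writer as the exclusive owner. */
    void directory_write(dir_entry_t *e, int writer) {
        for (int c = 0; c < 64; c++)
            if (((e->clusters >> c) & 1) && c != writer)
                send_invalidate(c, e->address);
        e->clusters = 1ull << writer;
        e->status   = DIR_E;
    }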
Clusters - “Off the shelf”
• Commercial clusters
  • Provide page migration
    • Make a copy of a remote page on the local PE
    • The programmer remains responsible for coherence
  • Don’t provide hardware support for cache coherence (across the network)
    • Fully CC machines may never be available!
• Software systems
  • ...
Shared Memory Systems
• Software systems, e.g. TreadMarks
  • Provide shared memory on a page basis
  • Software
    • detects references to remote pages
    • moves a copy to local memory
  • Reduces shared memory overhead
  • Provides some of the shared memory model’s convenience
    • without swamping the interconnection network with messages
    • the message overhead is too high for a single word - a word basis is too expensive!!
Shared Memory Systems - Granularity
• Granularity
  • A word basis is too expensive!!
    • Sharing data at low granularity
• Fine-grain sharing
  • Access / sharing for individual words
  • Overheads are too high
    • Number of messages
    • Message overhead is high for one word
• Compare: burst access to memory
  • Don’t fetch a single word
    • the overhead (bus protocol) is too high
  • Amortize the cost of an access over multiple words
Shared Memory Systems - Granularity
• Coarse-grain systems
  • Transferring data from cluster to cluster
  • Overhead
    • Messages
    • Updating the directory
  • Amortise the overhead over a whole page → lower relative overhead
• Applies to thread size too
  • Split the program into small threads of control?
Parallel Overhead
• Cost of setting up & starting each thread
• Cost of synchronising at the end of a set of threads
• It can be more efficient to run a single sequential thread!
Coarse Grain Systems
• So far ...
  • Most experiments suggest that fine-grain systems are impractical
  • Larger, coarser grains
    • Blocks of data
    • Threads of computation
    are needed to reduce overall computation time when using multiple processors
• Parallel systems that are too fine-grained can run slower than a single processor!
Parallel Overhead
• Ideal
  • Time = 1/n
• Add overhead
  • Time > optimal
  • No point in using more than 4 PEs!!
[Chart: execution time vs number of PEs (0-12). The “Ideal” curve falls as 1/n; the “+ Parallel O’head” curve reaches its minimum near 4 PEs.]
Parallel Overhead
• Shared memory systems: best results if you
  • Share on a large-block basis, e.g. pages
  • Split the program into coarse-grain (long-running) threads
  • Give away some parallelism to achieve any parallel speedup!
• Coarse grain
  • Data
  • Computation
• There’s parallelism at the instruction level too! The instruction issue unit in a sequential processor is trying to exploit it!
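A simple model (an illustration, not from the slides) of why the execution-time curve in the plot above bottoms out: suppose the work divides perfectly across n PEs but the parallel overhead (thread startup, synchronisation) grows linearly with n,

    T(n) = 1/n + α·n

Setting dT/dn = -1/n² + α = 0 gives the optimum n* = 1/√α; a per-PE overhead of α ≈ 1/16 of the sequential time puts the minimum at n* = 4 PEs, beyond which adding PEs makes the program slower.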
Clusters - Improving multiple PE performance
• Bandwidth to memory
  • A cache reduces dependency on the memory-CPU interface
    • 95% cache hits → only 5% of memory accesses cross the interface
  • But add a few PEs and a few CC transactions:
    even if the interface was coping before, it won’t in a multiprocessor system!
  • A major bottleneck!
Clusters - Improving multiple PE performance
• Bus protocols add to access time
  • Request / grant / release phases are needed
• “Point-to-point” is faster!
  • Cross-bar switch interface to memory
  • No PE contends with any other for a common bus
• Cross-bar? The name is taken from old telephone exchanges!
Clusters - Memory Bandwidth
• Modern clusters
  • Use “point-to-point” cross-bar (X-bar) interfaces to memory to get bandwidth!
• Cache coherence?
  • Now really hard!!
  • How does each cache snoop all transactions?
Programming Model
• Distributed Memory
  • Message passing
  • An alternative to shared memory
  • Each PE has its own address space
  • PEs communicate with messages
  • Messages provide synchronisation
    • A PE can block, or wait for a message
Programming Model - Distributed Memory
• Distributed Memory Systems
  • The hardware is simple!
  • The network can be as simple as Ethernet
  • Networks of Workstations model
    • Commodity (cheap!) PEs
    • Commodity network
      • Standard
        • Ethernet
        • ATM
      • Proprietary
        • Myrinet
        • Achilles (UWA!)
Programming Model - Distributed Memory
• Distributed Memory Systems
  • The software is considered harder
  • The programmer is responsible for
    • Distributing data to individual PEs
    • Explicit thread control
      • Starting, stopping & synchronising
  • At least two commonly available systems
    • Parallel Virtual Machine (PVM)
    • Message Passing Interface (MPI)
  • Built on two operations (sketched below)
    • Send(data, destPE, block | don’t block)
    • Receive(data, srcPE, block | don’t block)
    • Blocking ensures synchronisation
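For example, a minimal MPI sketch of these two operations (rank 0 sends one int to rank 1; the blocking MPI_Recv provides the synchronisation):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* destPE = 1 */
        } else if (rank == 1) {
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                         /* blocks until sent */
            printf("PE 1 received %d\n", data);
        }
        MPI_Finalize();
        return 0;
    }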
Programming Model - Distributed Memory
• Distributed Memory Systems
  • Performance is generally better (versus shared memory)
  • Shared memory has hidden overheads
    • Grain size poorly chosen
      • e.g. data doesn’t fit into pages
    • Unnecessary coherence transactions
      • Shared memory updates a shared region (each page) before the end of the computation
      • A message-passing system waits and sends the page when the computation is complete
Programming Model - Distributed Memory
• Distributed Memory Systems
  • Performance is generally better (versus shared memory)
• False sharing
  • Severely degrades performance
  • May not be apparent from a superficial analysis
[Figure: one memory page; PEa accesses data at one end, PEb accesses data at the other. The whole page “ping-pongs” between PEa and PEb.]
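A C sketch of the pattern in the figure (illustrative only): a and b are logically independent, but because they sit in the same cache line (or, in a page-based DSM, the same page), each PE's writes keep invalidating the other PE's copy:

    #include <pthread.h>

    struct { long a; long b; } shared;   /* a and b share one line/page */

    void *pe_a(void *unused) {           /* PEa only ever touches a ... */
        for (long i = 0; i < 100000000; i++) shared.a++;
        return NULL;
    }

    void *pe_b(void *unused) {           /* ... and PEb only ever touches b, */
        for (long i = 0; i < 100000000; i++) shared.b++;
        return NULL;                     /* yet the line/page ping-pongs */
    }

    /* One cure: pad so each counter owns its cache line (64 bytes assumed). */
    struct { long a; char pad[64 - sizeof(long)]; long b; } padded;

    /* Run e.g. with: pthread_create(&ta, NULL, pe_a, NULL); etc. */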
Distributed Memory - Summary
• Simpler (almost trivial) hardware
• Software
  • More programmer effort
    • Explicit data distribution
    • Explicit synchronisation
• Performance is generally better
  • The programmer knows more about the problem
  • Communicates only when necessary
  • The communication grain size can be optimal
  → lower overheads
Data Flow
• Conventional programming models are control driven
  • The instruction sequence is precisely specified
  • The sequence specifies control
    • i.e. which instruction the CPU will execute next
• Execution rule:
  • Execute an instruction when its predecessor has completed

      s1: r = a*b;
      s2: s = c*d;
      s3: y = r + s;

  • s2 executes when s1 is complete; s3 executes when s2 is complete
Data Flow
• Consider the calculation
  • y = a*b + c*d
• Represent it by a graph
  • Nodes represent computations
  • Data flows along the arcs
• Execution rule:
  • Execute an instruction when its data is available
  • The data-driven rule
[Dataflow graph: a and b feed one × node, c and d feed another; both products flow into a + node, which produces y.]
Data Flow
• Dataflow firing rule
  • An instruction fires (executes) when its data is available
• Exposes all possible parallelism
  • Either multiplication can fire as soon as its data arrives
  • The addition must wait for both products
• Data dependence analysis!
  • Instruction issue units fire (issue) each instruction when its operands (registers) have been written
[Dataflow graph as above: a×b and c×d feed the + node producing y.]
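A toy C sketch of the firing rule for one two-input node (names are illustrative; fire would execute the node's operation and forward the result along its output arc):

    typedef struct node {
        int have_left, have_right;    /* which operands have arrived? */
        int left, right;              /* operand values */
    } node_t;

    extern void fire(node_t *n);      /* hypothetical: execute & forward result */

    /* Called when a value flows down an input arc; the node fires as
     * soon as BOTH operands are present - the data-driven rule. */
    void deliver(node_t *n, int is_left, int value) {
        if (is_left) { n->left  = value; n->have_left  = 1; }
        else         { n->right = value; n->have_right = 1; }
        if (n->have_left && n->have_right)
            fire(n);
    }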