Upload
lavada
View
45
Download
1
Embed Size (px)
DESCRIPTION
Lecture 12: Other Bus-Based Coherence Protocols Spring 2010 E. F. Gehringer, based on slides by Yan Solihin. Lecture 9 Outline. MESI protocol Dragon update-based protocol Impact of protocol optimizations. Lower-Level Protocol Choice. - PowerPoint PPT Presentation
Citation preview
CSC/ECE 506: Architecture of Parallel Computers
Lecture 12: Other Bus-Based Coherence Protocols
Spring 2010
E. F. Gehringer, based on slides by Yan Solihin
1
CSC/ECE 506: Architecture of Parallel Computers
Lecture 9 Outline
• MESI protocol• Dragon update-based protocol• Impact of protocol optimizations
2
CSC/ECE 506: Architecture of Parallel Computers
Lower-Level Protocol Choice
• What transition should occur when a BusRd is observed in state M?• Change to S? Assume I’ll reread it soon
• Good for mostly read data• But what about “migratory” data? …
• Change to I? Assume another will write it (Synapse)
• I read and write, then you read and write, then X reads and writes...
• Sequent Symmetry and MIT Alewife use adaptive protocols
3
CSC/ECE 506: Architecture of Parallel Computers
MESI (4-state) Invalidation Protocol• Here’s a problem with the MSI protocol:
• A {Rd, Wr} sequence causes two bus transactions • even when no one is sharing (e.g., serial program!)• BusRd (I S) followed by BusRdX or BusUpgr (S M)• In general, coherence traffic from serial programs is unacceptable
• To avoid this, add a fourth state, Exclusive:• Invalid• Modified (dirty)• Shared (two or more caches may have copies)• Exclusive (only this cache has clean copy, same value as in memory)
• How does the protocol decide whether I E or I S? • Need to check whether someone else has a copy• “Shared” signal on bus: wired-or line asserted in response to BusRd
4
CSC/ECE 506: Architecture of Parallel Computers
MESI: Processor-Initiated Transactions
5
M
S
E
PrRd/–PrWr/–
PrRd/–
PrWr/–
I
PrRd/BusRd(~S)
PrRd/BusRd(S)
PrWr/BusRdX
PrWr/BusRdX
PrRd/–
Fill in the last two transitions here.
CSC/ECE 506: Architecture of Parallel Computers
MESI: Bus-Initiated Transactions
6
M
I
E
BusRd/–BusRdX/–
S
BusRd/Flush BusRd/FlushBusRdX/Flush
BusRdX/Flush
BusRdX/Flush׳BusRd/Flush׳
Flush׳ means flush only if cache-to-cache sharing is used; only the cache responsible for supplying the data will do a flush.
CSC/ECE 506: Architecture of Parallel Computers
MESI State Transition Diagram
7
• BusRd(S) means shared line asserted on BusRd transaction
PrWr/—
BusRd/Flush
PrRd/
BusRdX/Flush
PrWr/BusRdX
PrWr/—
PrRd/—
PrRd/—BusRd/Flush
E
M
I
S
PrRd
BusRd(S)
BusRdX/Flush
BusRdX/Flush
BusRd/Flush
PrWr/BusRdX
PrRd/BusRd (S)
CSC/ECE 506: Architecture of Parallel Computers
Flush vs. Flush'
• Flush: mandatory• Flush' happens only when
• Cache-to-cache sharing is used, and,• Only one cache flushes data
8
CSC/ECE 506: Architecture of Parallel Computers
MESI Visualization
9
Start state. All caches empty and main memory has A = 1.
P1
CacheSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
10
Processor P1 attempts to read A from its cache.
P1
CacheSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRd AP1 BusRd AMem returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
11
Processor P1 issues a BusRd.
P1
CacheSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRd AP1 BusRd AMem returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
12
Main memory returns data to processor P1 which updates its cache.
P1
CacheA = 1 ESnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRd AP1 BusRd AMem returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
13
Read operation completes. P1
CacheA = 1 ESnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Writes A = 2
14
Processor P1 writes to its cache.
P1
CacheA = 2 MSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrWr A
One less bus requestdue to Exclusive state,esp. for serial programs
M
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Writes A = 2
15
Write operation completes. P1
CacheA = 2 MSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P3 attempts to read A from its cache.
P1
CacheA = 2 MSnooper
P2
Cache
Snooper
P3
Cache
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRd AP3 BusRd AP1 snoops BusRdP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P3 issues a BusRd.
P1
CacheA = 2 MSnooper
P2
Cache
Snooper
P3
Cache
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRd AP3 BusRd AP1 snoops BusRdP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P1 snoops the BusRd from processor P3.
P1
CacheA = 2 SSnooper
P2
Cache
Snooper
P3
Cache
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRd AP3 BusRd AP1 snoops BusRdP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P1 flushes, sending updated data to P3
and main memory.
P1
CacheA = 2 SSnooper
P2
Cache
Snooper
P3
CacheA = 2 SSnooper
Main memory
A = 2Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRd AP3 BusRd AP1 snoops BusRdP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Read operation completes. P1
CacheA = 2 SSnooper
P2
Cache
Snooper
P3
CacheA = 2 SSnooper
Main memory
A = 2Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Writes A = 3
Processor P3 writes to its cache.
P1
CacheA = 2 SSnooper
P2
Cache
Snooper
P3
CacheA = 2 SSnooper
Main memory
A = 2Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrWr AP3 BusUpgrP1 snoops BusUpgr
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Writes A = 3
Processor P3 issues a BusRd request.
P1
CacheA = 2 SSnooper
P2
Cache
Snooper
P3
CacheA = 2 SSnooper
Main memory
A = 2Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrWr AP3 BusUpgrP1 snoops BusUpgr
Note: BusUpgr insteadof BusRdX
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Writes A = 3
Processor P1 snoops the BusRd and invalidates its cache.
P1
CacheA = 2 ISnooper
P2
Cache
Snooper
P3
CacheA = 3 MSnooper
Main memory
A = 2Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrWr AP3 BusUpgrP1 snoops BusUpgr
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Writes A = 3
Write operation completes. P1
CacheA = 2 ISnooper
P2
Cache
Snooper
P3
CacheA = 3 MSnooper
Main memory
A = 2Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
Processor P1 reads from its cache.
P1
CacheA = 2 ISnooper
P2
Cache
Snooper
P3
CacheA = 3 MSnooper
Main memory
A = 2Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRd AP1 BusRd AP3 snoops BusRdP3 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
Processor P1 issues a BusRd request.
P1
CacheA = 2 ISnooper
P2
Cache
Snooper
P3
CacheA = 3 MSnooper
Main memory
A = 2Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRd AP1 BusRd AP3 snoops BusRdP3 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
Processor P3 snoops the BusRd.
P1
CacheA = 2 ISnooper
P2
Cache
Snooper
P3
CacheA = 3 MSnooper
Main memory
A = 2Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRd AP1 BusRd AP3 snoops BusRdP3 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
Processor P3 flushes, updating processor P1, main memory and its own cache state.
P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRd AP1 BusRd AP3 snoops BusRdP3 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
Read operation completes. P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P3 reads from its cache.
P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRd AP3 returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P3 returns valid data from its cache.
P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRd AP3 returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Read operation completes. P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P2 Reads A
Processor P2 reads from its cache.
P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P2 PrRd AP2 BusRd AP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P2 Reads A
Processor P2 issues a BusRd request.
P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P2 PrRd AP2 BusRd AP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P2 Reads A
Main memory controller observes the BusRd.
P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P2 PrRd AP2 BusRd AP1 Flush
A = 3 S
X
Referred to as Cache-to-cache transferin Illinois MESI protocol
CSC/ECE 506: Architecture of Parallel Computers
Processor P2 Reads A
Operation completes. P1
CacheA = 3 SSnooper
P2
CacheA = 3 SSnooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
MESI Example (Cache-to-Cache Transfer)
37* Data from memory if no cache2cache transfer, BusRd/ –
Proc Action State P1 State P2 State P3 Bus Action Data From
R1 E – – BusRd Mem
W1 M – – – Own cache
R3 S – S BusRd/Flush P1 cache
W3 I – M BusRdX Mem
R1 S – S BusRd/Flush P3 cache
R3 S – S – Own cache
R2 S S S BusRd/Flush׳ P1/P3 Cache*
CSC/ECE 506: Architecture of Parallel Computers
MESI Example (Cache-to-Cache Transfer+BusUpgr)
38* Data from memory if no cache2cache transfer, BusRd/ –
Proc Action State P1 State P2 State P3 Bus Action Data From
R1 E - - BusRd Mem
W1 M - - - Own cache
R3 S - S BusRd/Flush P1 cache
W3 I - M BusUpgr Own cache
R1 S - S BusRd/Flush P3 cache
R3 S - S - Own cache
R2 S S S BusRd/Flush׳P1/P3
Cache*
CSC/ECE 506: Architecture of Parallel Computers
Lower-Level Protocol Choices• Who supplies data on miss when not in M state: memory or cache?
• Original, lllinois MESI: cache• assume cache faster than memory (cache-to-cache transfer)• Not necessarily true
• Adds complexity• How does memory know it should supply data? (must wait for
caches)• A selection algorithm is needed if multiple caches have valid data.
• Useful in a distributed-memory system• May be cheaper to obtain from nearby cache than distant memory• Especially when constructed out of SMP nodes (Stanford DASH)
39
CSC/ECE 506: Architecture of Parallel Computers
Lecture 9 Outline
• MESI protocol• Dragon update-based protocol• Impact of protocol optimizations
40
CSC/ECE 506: Architecture of Parallel Computers
Dragon Writeback Update Protocol• Four states
• Exclusive-clean (E): Memory and I have it• Shared clean (Sc): I, others, and maybe memory, but I’m not owner• Shared modified (Sm): I and others but not memory, and I’m the owner
• Sm and Sc can coexist in different caches, with at most one Sm• Modified or dirty (M): I and, no one else• On replacement: Sc can silently drop, Sm has to flush
• No invalid state• If in cache, cannot be invalid• If not present in cache, can view as being in not-present or invalid state
• New processor events: PrRdMiss, PrWrMiss• Introduced to specify actions when block not present in cache
• New bus transaction: BusUpd• Broadcasts single word written on bus; updates other relevant caches
41
CSC/ECE 506: Architecture of Parallel Computers
Dragon: Processor-Initiated Transactions
42
E
M
Sc
Sm
PrRdMiss/BusRd(~S)
PrRd/– PrRd/–
PrWr/BusUpd(S)
PrWr/BusUpd(~S)
PrWrMiss/(BusRd(S);BusUpd)
PrRd/– PrRd/–PrWr/BusUpd(S) PrWr/–
PrWrMiss/BusRd(~S)
PrRdMiss/BusRd(S)
Fill in the last two transitions here.
PrWr/BusUpd(~S)
PrWr/–
CSC/ECE 506: Architecture of Parallel Computers
Dragon: Bus-Initiated Transactions
43
E
M
Sc
Sm
BusRd/–BusUpd/Update
BusRd/–
BusRd/Flush
BusUpd/Update
BusRd/Flush
CSC/ECE 506: Architecture of Parallel Computers
Dragon State Transition Diagram
44
E Sc
Sm M
PrWr/—
PrRd/—
PrRd/—
PrRd/—
PrRdMiss/BusRd(S)
PrRdMiss/BusRd(S) PrWr/—
PrWrMiss/(BusRd(S); BusUpd) PrWrMiss/
BusRd(S)
PrWr/BusUpd(S)
PrWr/BusUpd(S)
BusRd/—
BusRd/Flush
PrRd/— BusUpd/Update
BusUpd/Update
BusRd/Flush
PrWr/BusUpd(S)
PrWr/BusUpd(S)
CSC/ECE 506: Architecture of Parallel Computers
Dragon Visualization
45
Start state. All caches empty and main memory has A = 1.
P1
CacheSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
46
Processor P1 attempts to read A from its cache.
P1
CacheSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRd AP1 BusRd AMem returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
47
Processor P1 issues a BusRd.
P1
CacheSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRd AP1 BusRd AMem returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
48
Main memory returns data to processor P1 which updates its cache.
P1
CacheA = 1 ESnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRd AP1 BusRd AMem returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
49
Read operation completes. P1
CacheA = 1 ESnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Writes A = 2
50
Processor P1 writes to its cache.
P1
CacheA = 2 MSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrWr A
M
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Writes A = 2
51
Write operation completes. P1
CacheA = 2 MSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P3 attempts to read A from its cache.
P1
CacheA = 2 MSnooper
P2
Cache
Snooper
P3
Cache
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRd AP3 BusRd AP1 snoops BusRdP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P3 issues a BusRd.
P1
CacheA = 2 MSnooper
P2
Cache
Snooper
P3
Cache
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRd AP3 BusRd AP1 snoops BusRdP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P1 snoops the BusRd from processor P3.
P1
CacheA = 2 Sm
Snooper
P2
Cache
Snooper
P3
Cache
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRd AP3 BusRd AP1 snoops BusRdP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P1 flushes, sending updated data to P3
and main memory.
P1
CacheA = 2 Sm
Snooper
P2
Cache
Snooper
P3
CacheA = 2 Sc
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRd AP3 BusRd AP1 snoops BusRdP1 Flush
A = 1
Sm
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Read operation completes. P1
CacheA = 2 Sm
Snooper
P2
Cache
Snooper
P3
CacheA = 2 Sc
Snooper
Main memory
A = 2Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
A = 1
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Writes A = 3
Processor P3 writes to its cache.
P1
CacheA = 2 Sm
Snooper
P2
Cache
Snooper
P3
CacheA = 2 Sc
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrWr AP3 BusUpdP1 snoops BusUpd
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Writes A = 3
Processor P3 issues a BusRd request.
P1
CacheA = 2 Sm
Snooper
P2
Cache
Snooper
P3
CacheA = 2 Sc
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrWr AP3 BusUpdP1 snoops BusUpd
Note: BusUpdate insteadof BusUpgr (no inval isperformed)
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Writes A = 3
Processor P1 snoops the BusUpd and updates its cache.
P1
CacheA = 3 Sc
Snooper
P2
Cache
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrWr AP3 BusUpgrP1 snoops BusUpd
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Writes A = 3
Write operation completes. P1
CacheA = 3 Sc
Snooper
P2
Cache
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
Processor P1 reads from its cache.
P1
CacheA = 3 Sc
Snooper
P2
Cache
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRd A
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
Processor P1 reads from its cache.
P1
CacheA = 3 Sc
Snooper
P2
Cache
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRd A
This is a miss in theMESI and MSI protocols
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
Read operation completes. P1
CacheA = 3 Sc
Snooper
P2
Cache
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P3 reads from its cache.
P1
CacheA = 3 Sc
Snooper
P2
Cache
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRd A
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P3 reads from its cache.
P1
CacheA = 3 Sc
Snooper
P2
Cache
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRd A
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Read operation completes. P1
CacheA = 3 Sc
Snooper
P2
Cache
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P2 Reads A
Processor P2 reads from its cache.
P1
CacheA = 3 Sc
Snooper
P2
Cache
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P2 PrRd AP2 BusRd AP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P2 Reads A
Processor P2 issues a BusRd request.
P1
CacheA = 3 Sc
Snooper
P2
Cache
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P2 PrRd AP2 BusRd AP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P2 Reads A
Main memory controller observes the BusRd.
P1
CacheA = 3 Sc
Snooper
P2
Cache
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P2 PrRd AP2 BusRd AP1 Flush
A = 3 Sc
Note: Only the cache inState Sm is responsiblefor cache-to-cache transfer
CSC/ECE 506: Architecture of Parallel Computers
Processor P2 Reads A
Operation completes. P1
CacheA = 3 Sc
Snooper
P2
CacheA = 3 Sc
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
P1 Replaces A
A evicted from P1 P1
CacheA = 3 Sc
Snooper
P2
CacheA = 3 Sc
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
A = 3 Sc
CSC/ECE 506: Architecture of Parallel Computers
P1 Replaces A
A evicted from P3. P1
Cache
Snooper
P2
CacheA = 3 Sc
Snooper
P3
CacheA = 3 Sm
Snooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
A = 3 Sc
P3 replaces XOwner responsiblefor writing back to memory
vs. MSI or MESI wherewrite-back only when the line is in M state
CSC/ECE 506: Architecture of Parallel Computers
Dragon Example
73
Proc Action State P1 State P2 State P3 Bus Action Data from
R1 E – – BusRd Mem
W1 M – – – Own cache
R3 Sm – Sc BusRd/Flush P1 cache
W3 Sc – Sm BusUpd/Upd Own cache
R1 Sc – Sm – Own cacheR3 Sc – Sm – Own cache
R2 Sc Sc Sm BusRd/Flush P3 cache
CSC/ECE 506: Architecture of Parallel Computers
Lower-Level Protocol Choices• Can shared-modified state be eliminated?
• If update memory as well on BusUpd transactions (DEC Firefly)• Dragon protocol doesn’t (assumes DRAM memory slow to update)
• Should replacement of an Sc block be broadcast?• Would allow last copy to go to Exclusive state and not generate updates• Replacement bus transaction is not in critical path, but later update may be
• Shouldn’t update local copy on write hit before controller gets bus• Can mess up serialization
• Coherence, consistency considerations much like write-through case
• In general, many subtle race conditions in protocols• But first, let’s illustrate quantitative assessment at logical level
74
CSC/ECE 506: Architecture of Parallel Computers
Lecture 9 Outline
• MESI protocol• Dragon update-based protocol• Impact of protocol optimizations
75
CSC/ECE 506: Architecture of Parallel Computers
Assessing Protocol Tradeoffs• Methodology:
• Use simulator; default 1MB, 4-way cache, 64-byte block, 16 processors. Some runs use 64K cache.
• Focus on frequencies, not end performance for now• transcends architectural details, but not what we’re really
after• Use idealized memory performance model to avoid changes
of reference interleaving across processors with machine parameters
• Cheap simulation: no need to model contention
76
CSC/ECE 506: Architecture of Parallel Computers
Impact of Protocol Optimizations
77
• MSI = MESI• Upgrades instead of read-exclusive helps• Same story when working sets don’t fit for Ocean, Radix, Raytrace
MESI vs. MSI (w/ BusUpgr) vs. MSI (w/ BusRdX)Traffic (MB/s)
Traffic (MB/s)
x d l t x Ill t Ex0
20
40
60
80
100
120
140
160
180
200
Data busAddress bus
E E0
10
20
30
40
50
60
70
80
Data busAddress bus
Bar
nes/
IIIB
arne
s/3S
tB
arne
s/3S
t-R
dEx LU
/III
Rad
ix/3
St-
RdE
xLU/3
St
LU/3
St-R
dEx
Rad
ix/3
St
Oce
an/II
IO
cean
/3S
Rad
iosi
ty/3
St-R
dEx
Oce
an/3
St-R
dEx
Rad
ix/II
I
Rad
iosi
ty/II
I
Rad
iosi
ty/3
St
Ray
trac
e/III
Ray
trac
e/3S
tR
aytr
ace/
3St-R
dEx
App
l-Cod
e/III
App
l-Cod
e/3S
tA
ppl-C
ode/
3St-R
dEx
App
l-Dat
a/III
App
l-Dat
a/3S
tA
ppl-D
ata/
3St-R
dEx
OS-
Cod
e/III
OS-
Cod
e/3S
t
OS-
Dat
a/3S
tO
S-D
ata/
III
OS-
Cod
e/3S
t-RdE
x
OS-
Dat
a/3S
t-RdE
x
CSC/ECE 506: Architecture of Parallel Computers
Impact of Cache-Line Size• Multiprocessors add new kind of miss to cold, capacity, conflict
• Coherence misses: Due to invalidations• True sharing: Write to same word• False sharing: Write to different words
• How to reduce each kind of miss with an invalidation protocol• Capacity: enlarge cache; increase block size (if spatial locality)• Conflict: increase associativity• Cold and coherence: only block size
• Increasing block size has advantages and disadvantages• Can reduce misses if spatial locality is good• Can hurt too
• increase misses due to false sharing if spatial locality not good• increase misses due to conflicts in fixed-size cache• increase traffic due to fetching unnecessary data and due to false sharing• can increase miss penalty and perhaps hit cost
78
CSC/ECE 506: Architecture of Parallel Computers
Impact of Block Size on Miss Rate
79
• For default problem size: vary block/line size from 8-256 Bytes
• Decreases with larger lines: cold, capacity (due to spatial locality), true sharing (due to spatial locality)• Increases with larger lines: false sharing • Working set doesn’t fit: impact of capacity misses large in Ocean and Radix
Cold
Capacity
True sharing
False sharing
Upgrade
8
0
0.1
0.2
0.3
0.4
0.5
0.6
Cold
Capacity
True sharing
False sharing
Upgrade
8 6 2 4 8 6 80
2
4
6
8
10
12
Mis
s ra
te (%
)
Bar
nes/
8B
arne
s/16
Bar
nes/
32
Bar
nes/
64B
arne
s/12
8
Bar
nes/
256
Lu/8
Lu/1
6Lu
/32
Lu/6
4Lu
/128
Lu/2
56
Rad
iosi
ty/8
Rad
iosi
ty/1
6R
adio
sity
/32
Rad
iosi
ty/6
4
Rad
iosi
ty/1
28R
adio
sity
/256
Mis
s ra
te (%
)
Oce
an/8
Oce
an/1
6
Oce
an/3
2O
cean
/64
Oce
an/1
28
Oce
an/2
56
Rad
ix/8
Rad
ix/1
6
Rad
ix/3
2R
adix
/64
Rad
ix/1
28
Rad
ix/2
56
Ray
trac
e/8
Ray
trac
e/16
Ray
trac
e/32
Ray
trac
e/64
Ray
trac
e/12
8R
aytr
ace/
256
CSC/ECE 506: Architecture of Parallel Computers
Impact of Block Size on Traffic
80
• Results different than for miss rate: traffic almost always increases• When working sets fits, overall traffic still small, except for Radix• Fixed overhead is significant component
• So total traffic often minimized with 16–32 byte block, not smaller• Working set doesn’t fit: even 128-byte good for Ocean due to capacity
• Address bus traffic behaves in opposite way as the data bus traffic
Traffic (bytes/inst) affects performance indirectly through contention
Traf
fic (
byte
s/in
stru
ctio
n)
Traf
fic (
byte
s/FL
OP
)
Data bus
Address busData bus
Address bus
Rad
ix/8
Rad
ix/1
6
Rad
ix/3
2
Rad
ix/6
4
Rad
ix/1
28
Rad
ix/2
56
0
1
2
3
4
5
6
7
8
9
10
LU/8
LU/1
6
LU/3
2
LU/6
4
LU/1
28
LU/2
56
Oce
an/8
Oce
an/1
6
Oce
an/3
2
Oce
an/6
4
Oce
an/1
28
Oce
an/2
56
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2 4 280
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
Data busAddress bus
Bar
nes/
16
Traf
fic (b
ytes
/inst
ruct
ions
)
Bar
nes/
8
Bar
nes/
32
Bar
nes/
64B
arne
s/12
8
Bar
nes/
256
Rad
iosi
ty/8
Rad
iosi
ty/1
6R
adio
sity
/32
Rad
iosi
ty/6
4R
adio
sity
/128
Rad
iosi
ty/2
56
Ray
trac
e/8
Ray
trac
e/16
Ray
trac
e/32
Ray
trac
e/64
Ray
trac
e/12
8R
aytr
ace/
256
CSC/ECE 506: Architecture of Parallel Computers
Firefly Visualization
81
Start state. All caches empty and main memory has A = 1.
P1
CacheSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
82
Processor P1 attempts to read A from its cache.
P1
CacheSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRdMiss( ~S)P1 BusRd AMem returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
83
Processor P1 issues a BusRd.
P1
CacheSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRdMiss( ~S)P1 BusRd AMem returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
84
Main memory returns data to processor P1 which updates its cache.
P1
CacheA = 1 VSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrRdMiss( ~S)P1 BusRd AMem returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
85
Read operation completes. P1
CacheA = 1 VSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Writes A = 2
86
Processor P1 writes to its cache.
P1
CacheA = 2 MSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P1 PrWr A
D
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Writes A = 2
87
Write operation completes. P1
CacheA = 2 DSnooper
P2
CacheSnooper
P3
CacheSnooper
Main memoryA = 1ControllerTraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P3 attempts to read A from its cache.
P1
CacheA = 2 DSnooper
P2
Cache
Snooper
P3
Cache
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRdMiss( S)P3 BusRd AP1 snoops BusRdP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P3 issues a BusRd.
P1
CacheA = 2 DSnooper
P2
Cache
Snooper
P3
Cache
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRdMiss( S)P3 BusRd AP1 snoops BusRdP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P1 snoops the BusRd from processor P3.
P1
CacheA = 2 SSnooper
P2
Cache
Snooper
P3
Cache
Snooper
Main memory
A = 1Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRdMiss( S)P3 BusRd AP1 snoops BusRdP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P1 flushes, sending updated data to P3
and main memory.
P1
CacheA = 2 SSnooper
P2
Cache
Snooper
P3
CacheA = 2 SSnooper
Main memory
A = 2Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRdMiss( S)P3 BusRd AP1 snoops BusRdP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Read operation completes. P1
CacheA = 2 SSnooper
P2
Cache
Snooper
P3
CacheA = 2 SSnooper
Main memory
A = 2Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Writes A = 3
Processor P3 writes to its cache.
P1
CacheA = 2 SSnooper
P2
Cache
Snooper
P3
CacheA = 2 SSnooper
Main memory
A = 2Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrWr AP3 BusUpdP1 snoops BusUpd
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Writes A = 3
Processor P3 issues a BusRd request.
P1
CacheA = 2 SSnooper
P2
Cache
Snooper
P3
CacheA = 2 SSnooper
Main memory
A = 2Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrWr AP3 BusUpdP1 snoops BusUpd
Note: BusUpd
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Writes A = 3
Processor P1 snoops the BusRd and invalidates its cache.
P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrWr AP3 BusUpdP1 snoops BusUpd
CacheA = 3 SSnooper
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Writes A = 3
Write operation completes. P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
Processor P1 reads from its cache.
P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRdHitP3 returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P1 Reads A
Processor P1 reads from its cache.
P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRdHitP3 returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P3 reads from its cache.
P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRdHitP3 returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P3 Reads A
Processor P3 returns valid data from its cache.
P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P3 PrRdHitP3 returns data
CSC/ECE 506: Architecture of Parallel Computers
Processor P2 Reads A
Processor P2 reads from its cache.
P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P2 PrRdMiss(S)P2 BusRd AP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P2 Reads A
Processor P2 issues a BusRd request.
P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P2 PrRdMiss(S)P2 BusRd AP1 Flush
CSC/ECE 506: Architecture of Parallel Computers
Processor P2 Reads A
Main memory controller observes the BusRd.
P1
CacheA = 3 SSnooper
P2
Cache
Snooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
P2 PrRdMis(S)P2 BusRd AP1 Flush
A = 3 S
CSC/ECE 506: Architecture of Parallel Computers
Processor P2 Reads A
Operation completes. P1
CacheA = 3 SSnooper
P2
CacheA = 3 SSnooper
P3
CacheA = 3 SSnooper
Main memory
A = 3Controller
TraceP1 Read AP1 Write A = 2P3 Read AP3 Write A = 3P1 Read AP3 Read AP2 Read A
Bus
CSC/ECE 506: Architecture of Parallel Computers
Firefly Example
105
Proc Action State P1 State P2 State P3 Bus Action Data From
R1 V – – BusRd Mem
W1 D – – – Own cache
R3 S – S BusRd/Flush P1 cache
W3 S – S BusUpd P3 cache
R1 S – S - Own cache
R3 S – S – Own cache
R2 S S S BusRd/Flush P1 Cache