CSC/ECE 506: Architecture of Parallel Computers
Lecture 12: Other Bus-Based Coherence Protocols
Spring 2010
E. F. Gehringer, based on slides by Yan Solihin


Lecture 9 Outline

• MESI protocol
• Dragon update-based protocol
• Impact of protocol optimizations

Lower-Level Protocol Choice

• What transition should occur when a BusRd is observed in state M?
  • Change to S? Assume I’ll reread it soon
    • Good for mostly read data
    • But what about “migratory” data? I read and write, then you read and write, then X reads and writes...
  • Change to I? Assume another will write it (Synapse)
  • Sequent Symmetry and MIT Alewife use adaptive protocols
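The migratory-data tradeoff can be illustrated by counting bus transactions. The following is only a sketch under simplifying assumptions that are not from the slides: a MESI-style cache, a pattern in which each new processor reads and then writes the block, and, under the M → I policy, a requester that loads the block in E because no other cache asserts the shared line. The function name `migratory_bus_transactions` and the policy labels are hypothetical.

```python
def migratory_bus_transactions(policy, handoffs):
    """Count bus transactions when a block migrates between processors.

    policy:   "M_to_S" or "M_to_I", the owner's reaction to a snooped BusRd.
    handoffs: how many times a new processor reads and then writes the block.
    Assumes the previous processor holds the block in M before each handoff.
    """
    if policy not in ("M_to_S", "M_to_I"):
        raise ValueError("unknown policy: " + policy)
    transactions = 0
    for _ in range(handoffs):
        # The new processor's read always misses, so it issues one BusRd.
        transactions += 1
        if policy == "M_to_S":
            # The old owner keeps a shared copy, so the reader loads the block
            # in S and its write needs a second transaction (BusUpgr or BusRdX).
            transactions += 1
        # Under "M_to_I" the old owner drops its copy; assuming the reader then
        # loads the block in E, its write is silent (E -> M).
    return transactions


for policy in ("M_to_S", "M_to_I"):
    print(policy, migratory_bus_transactions(policy, handoffs=3))
```

Under these assumptions the M → I policy halves the bus transactions for a purely migratory block, while for mostly read data it would cost the old owner an extra miss on its next read.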

MESI (4-state) Invalidation Protocol

• Here’s a problem with the MSI protocol:
  • A {Rd, Wr} sequence causes two bus transactions
    • even when no one is sharing (e.g., serial program!)
    • BusRd (I → S) followed by BusRdX or BusUpgr (S → M)
  • In general, coherence traffic from serial programs is unacceptable
• To avoid this, add a fourth state, Exclusive:
  • Invalid
  • Modified (dirty)
  • Shared (two or more caches may have copies)
  • Exclusive (only this cache has clean copy, same value as in memory)
• How does the protocol decide whether I → E or I → S?
  • Need to check whether someone else has a copy
  • “Shared” signal on bus: wired-OR line asserted in response to BusRd
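A minimal sketch of the point above, assuming the requesting cache starts in I and the other caches simply report their states. The helper names `read_miss_state` and `rd_wr_bus_transactions` are hypothetical; the wired-OR shared line is modeled as a logical OR over the other caches.

```python
def read_miss_state(other_states):
    """State loaded on a BusRd: S if any other cache has a valid copy, else E.

    Models the wired-OR "shared" line: it is asserted when at least one
    snooping cache holds the block in a state other than I.
    """
    shared_line = any(state != "I" for state in other_states)
    return "S" if shared_line else "E"


def rd_wr_bus_transactions(protocol, other_states=("I", "I")):
    """Count bus transactions for a {Rd, Wr} sequence starting from state I."""
    transactions = 1  # the read miss always issues a BusRd
    if protocol == "MSI":
        state = "S"   # MSI has no E state, so a read miss always loads S
    elif protocol == "MESI":
        state = read_miss_state(other_states)
    else:
        raise ValueError("unknown protocol: " + protocol)
    if state == "S":
        transactions += 1  # the write needs BusRdX or BusUpgr (S -> M)
    # In E the write is silent (E -> M): no bus transaction.
    return transactions


print(rd_wr_bus_transactions("MSI"))                        # serial program, MSI
print(rd_wr_bus_transactions("MESI"))                       # serial program, MESI
print(rd_wr_bus_transactions("MESI", other_states=("S",)))  # another cache shares
```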

MESI: Processor-Initiated Transactions

Current state   Event/bus action    Next state
I               PrRd/BusRd(~S)      E
I               PrRd/BusRd(S)       S
I               PrWr/BusRdX         M
S               PrRd/–              S
S               PrWr/BusRdX         M
E               PrRd/–              E
E               PrWr/–              M
M               PrRd/–              M
M               PrWr/–              M

(Slide exercise: fill in the last two transitions.)

MESI: Bus-Initiated Transactions

Current state   Event/action        Next state
M               BusRd/Flush         S
M               BusRdX/Flush        I
E               BusRd/Flush         S
E               BusRdX/Flush        I
S               BusRd/Flush'        S
S               BusRdX/Flush'       I
I               BusRd/–             I
I               BusRdX/–            I

Flush' means flush only if cache-to-cache sharing is used; only the cache responsible for supplying the data will do a flush.

MESI State Transition Diagram

• BusRd(S) means shared line asserted on BusRd transaction

[Figure: combined MESI state diagram over states M, E, S, and I, showing the processor-initiated (PrRd, PrWr) and bus-initiated (BusRd, BusRdX) transitions on one graph.]

Flush vs. Flush'

• Flush: mandatory
• Flush' happens only when
  • cache-to-cache sharing is used, and
  • only one cache flushes data

MESI Visualization

Setup: processors P1, P2, and P3, each with a cache and a snooper, share a bus with main memory and its controller.

Start state. All caches empty and main memory has A = 1.

Trace:
• P1 Read A
• P1 Write A = 2
• P3 Read A
• P3 Write A = 3
• P1 Read A
• P3 Read A
• P2 Read A

Processor P1 Reads A

Bus trace: P1 PrRd A; P1 BusRd A; Mem returns data

• Processor P1 attempts to read A from its cache.
• Processor P1 issues a BusRd.
• Main memory returns data to processor P1, which updates its cache.
• Read operation completes.

State afterward: P1 cache A = 1 (E); P2 and P3 caches empty; main memory A = 1.

Processor P1 Writes A = 2

Bus trace: P1 PrWr A

• Processor P1 writes to its cache.
• Write operation completes.

Note: One less bus request due to the Exclusive state, esp. for serial programs.

State afterward: P1 cache A = 2 (M); P2 and P3 caches empty; main memory A = 1.

Processor P3 Reads A

Bus trace: P3 PrRd A; P3 BusRd A; P1 snoops BusRd; P1 Flush

• Processor P3 attempts to read A from its cache.
• Processor P3 issues a BusRd.
• Processor P1 snoops the BusRd from processor P3 and changes state from M to S.
• Processor P1 flushes, sending updated data to P3 and main memory.
• Read operation completes.

State afterward: P1 cache A = 2 (S); P2 cache empty; P3 cache A = 2 (S); main memory A = 2.

Processor P3 Writes A = 3

Bus trace: P3 PrWr A; P3 BusUpgr; P1 snoops BusUpgr

• Processor P3 writes to its cache.
• Processor P3 issues a BusUpgr request. (Note: BusUpgr instead of BusRdX.)
• Processor P1 snoops the BusUpgr and invalidates its cache.
• Write operation completes.

State afterward: P1 cache A = 2 (I); P2 cache empty; P3 cache A = 3 (M); main memory A = 2.

Processor P1 Reads A

Bus trace: P1 PrRd A; P1 BusRd A; P3 snoops BusRd; P3 Flush

• Processor P1 attempts to read A from its cache; its copy is invalid, so the read misses.
• Processor P1 issues a BusRd request.
• Processor P3 snoops the BusRd.
• Processor P3 flushes, updating processor P1, main memory, and its own cache state.
• Read operation completes.

State afterward: P1 cache A = 3 (S); P2 cache empty; P3 cache A = 3 (S); main memory A = 3.

Processor P3 Reads A

Bus trace: P3 PrRd A; P3 returns data

• Processor P3 reads from its cache.
• Processor P3 returns valid data from its cache.
• Read operation completes.

State afterward: unchanged. P1 cache A = 3 (S); P2 cache empty; P3 cache A = 3 (S); main memory A = 3.

Processor P2 Reads A

Bus trace: P2 PrRd A; P2 BusRd A; P1 Flush

• Processor P2 attempts to read A from its cache and misses.
• Processor P2 issues a BusRd request.
• Main memory controller observes the BusRd, but the block (A = 3, S) is supplied by P1's cache rather than by memory. This is referred to as cache-to-cache transfer in the Illinois MESI protocol.
• Operation completes.

State afterward: P1, P2, and P3 caches A = 3 (S); main memory A = 3.

MESI Example (Cache-to-Cache Transfer)

Proc Action   State P1   State P2   State P3   Bus Action      Data From
R1            E          –          –          BusRd           Mem
W1            M          –          –          –               Own cache
R3            S          –          S          BusRd/Flush     P1 cache
W3            I          –          M          BusRdX          Mem
R1            S          –          S          BusRd/Flush     P3 cache
R3            S          –          S          –               Own cache
R2            S          S          S          BusRd/Flush'    P1/P3 cache*

* Data from memory if no cache-to-cache transfer (BusRd/–).

MESI Example (Cache-to-Cache Transfer + BusUpgr)

Proc Action   State P1   State P2   State P3   Bus Action      Data From
R1            E          –          –          BusRd           Mem
W1            M          –          –          –               Own cache
R3            S          –          S          BusRd/Flush     P1 cache
W3            I          –          M          BusUpgr         Own cache
R1            S          –          S          BusRd/Flush     P3 cache
R3            S          –          S          –               Own cache
R2            S          S          S          BusRd/Flush'    P1/P3 cache*

* Data from memory if no cache-to-cache transfer (BusRd/–).
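The state columns of the two example tables can be reproduced with a small simulator. This is a sketch rather than the lecture's own code: the function name `run_mesi`, the `use_upgr` flag, and the single-block model are assumptions, a block that is not present is reported as "I" where the tables print a dash, and data movement (Flush, Flush') is not modeled.

```python
def run_mesi(trace, n_procs=3, use_upgr=True):
    """Replay (op, proc) pairs for one block; op is "R" or "W", proc counts from 1.

    Returns one (action, states, bus) tuple per access.
    """
    state = ["I"] * n_procs
    rows = []
    for op, proc in trace:
        me = proc - 1
        others = [q for q in range(n_procs) if q != me]
        bus = "-"
        if op == "R":
            if state[me] == "I":                      # read miss
                bus = "BusRd"
                shared = any(state[q] != "I" for q in others)
                for q in others:
                    if state[q] != "I":
                        state[q] = "S"                # M, E, and S all end in S
                state[me] = "S" if shared else "E"
        elif op == "W":
            if state[me] == "I":
                bus = "BusRdX"
            elif state[me] == "S":
                bus = "BusUpgr" if use_upgr else "BusRdX"
            if bus != "-":                            # writes in E or M are silent
                for q in others:
                    state[q] = "I"
            state[me] = "M"
        else:
            raise ValueError("unknown operation: " + op)
        rows.append((f"{op}{proc}", tuple(state), bus))
    return rows


TRACE = [("R", 1), ("W", 1), ("R", 3), ("W", 3), ("R", 1), ("R", 3), ("R", 2)]
for action, states, bus in run_mesi(TRACE):
    print(action, *states, bus)
```

Calling `run_mesi(TRACE, use_upgr=False)` gives the first table's variant, in which the write by P3 issues BusRdX instead of BusUpgr.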

Lower-Level Protocol Choices

• Who supplies data on miss when not in M state: memory or cache?
• Original, Illinois MESI: cache
  • Assumes cache is faster than memory (cache-to-cache transfer)
  • Not necessarily true
• Adds complexity
  • How does memory know it should supply data? (It must wait for the caches.)
  • A selection algorithm is needed if multiple caches have valid data.
• Useful in a distributed-memory system
  • May be cheaper to obtain from nearby cache than distant memory
  • Especially when constructed out of SMP nodes (Stanford DASH)


Dragon Writeback Update Protocol

• Four states
  • Exclusive-clean (E): Memory and I have it
  • Shared clean (Sc): I, others, and maybe memory, but I’m not owner
  • Shared modified (Sm): I and others but not memory, and I’m the owner
    • Sm and Sc can coexist in different caches, with at most one Sm
  • Modified or dirty (M): I and no one else
  • On replacement: Sc can silently drop, Sm has to flush
• No invalid state
  • If in cache, cannot be invalid
  • If not present in cache, can view as being in not-present or invalid state
• New processor events: PrRdMiss, PrWrMiss
  • Introduced to specify actions when block not present in cache
• New bus transaction: BusUpd
  • Broadcasts single word written on bus; updates other relevant caches

Dragon: Processor-Initiated Transactions

Current state   Event/bus action                Next state
(not present)   PrRdMiss/BusRd(~S)              E
(not present)   PrRdMiss/BusRd(S)               Sc
(not present)   PrWrMiss/BusRd(~S)              M
(not present)   PrWrMiss/(BusRd(S);BusUpd)      Sm
E               PrRd/–                          E
E               PrWr/–                          M
Sc              PrRd/–                          Sc
Sc              PrWr/BusUpd(S)                  Sm
Sc              PrWr/BusUpd(~S)                 M
Sm              PrRd/–                          Sm
Sm              PrWr/BusUpd(S)                  Sm
Sm              PrWr/BusUpd(~S)                 M
M               PrRd/–                          M
M               PrWr/–                          M

(Slide exercise: fill in the last two transitions.)

Dragon: Bus-Initiated Transactions

Current state   Event/action        Next state
E               BusRd/–             Sc
Sc              BusRd/–             Sc
Sc              BusUpd/Update       Sc
Sm              BusRd/Flush         Sm
Sm              BusUpd/Update       Sc
M               BusRd/Flush         Sm

Dragon State Transition Diagram

[Figure: combined Dragon state diagram over states E, Sc, Sm, and M, showing the processor-initiated (PrRd, PrWr, PrRdMiss, PrWrMiss) and bus-initiated (BusRd, BusUpd) transitions on one graph.]

Dragon Visualization

Setup: processors P1, P2, and P3, each with a cache and a snooper, share a bus with main memory and its controller.

Start state. All caches empty and main memory has A = 1.

Trace:
• P1 Read A
• P1 Write A = 2
• P3 Read A
• P3 Write A = 3
• P1 Read A
• P3 Read A
• P2 Read A

Processor P1 Reads A

Bus trace: P1 PrRd A; P1 BusRd A; Mem returns data

• Processor P1 attempts to read A from its cache.
• Processor P1 issues a BusRd.
• Main memory returns data to processor P1, which updates its cache.
• Read operation completes.

State afterward: P1 cache A = 1 (E); P2 and P3 caches empty; main memory A = 1.

Processor P1 Writes A = 2

Bus trace: P1 PrWr A

• Processor P1 writes to its cache.
• Write operation completes.

State afterward: P1 cache A = 2 (M); P2 and P3 caches empty; main memory A = 1.

Processor P3 Reads A

Bus trace: P3 PrRd A; P3 BusRd A; P1 snoops BusRd; P1 Flush

• Processor P3 attempts to read A from its cache.
• Processor P3 issues a BusRd.
• Processor P1 snoops the BusRd from processor P3 and changes state from M to Sm.
• Processor P1 flushes, sending updated data to P3. Main memory is not updated and still holds A = 1.
• Read operation completes.

State afterward: P1 cache A = 2 (Sm); P2 cache empty; P3 cache A = 2 (Sc); main memory A = 1.

Processor P3 Writes A = 3

Bus trace: P3 PrWr A; P3 BusUpd; P1 snoops BusUpd

• Processor P3 writes to its cache.
• Processor P3 issues a BusUpd request. (Note: BusUpd instead of BusUpgr; no invalidation is performed.)
• Processor P1 snoops the BusUpd and updates its cache.
• Write operation completes.

State afterward: P1 cache A = 3 (Sc); P2 cache empty; P3 cache A = 3 (Sm); main memory A = 1.

Processor P1 Reads A

Bus trace: P1 PrRd A

• Processor P1 reads from its cache.
• Read operation completes.

Note: This is a miss in the MESI and MSI protocols.

State afterward: unchanged. P1 cache A = 3 (Sc); P2 cache empty; P3 cache A = 3 (Sm); main memory A = 1.

Processor P3 Reads A

Bus trace: P3 PrRd A

• Processor P3 reads from its cache.
• Read operation completes.

State afterward: unchanged. P1 cache A = 3 (Sc); P2 cache empty; P3 cache A = 3 (Sm); main memory A = 1.

Processor P2 Reads A

Bus trace: P2 PrRd A; P2 BusRd A; P3 Flush

• Processor P2 attempts to read A from its cache and misses.
• Processor P2 issues a BusRd request.
• Main memory controller observes the BusRd, but P2 receives the block (A = 3, Sc) from P3. Note: Only the cache in state Sm is responsible for cache-to-cache transfer.
• Operation completes.

State afterward: P1 cache A = 3 (Sc); P2 cache A = 3 (Sc); P3 cache A = 3 (Sm); main memory A = 1.

P1 Replaces A

• A is evicted from P1. P1's copy is in state Sc, so it is dropped silently.

State afterward: P1 cache empty; P2 cache A = 3 (Sc); P3 cache A = 3 (Sm); main memory A = 1.

P3 Replaces A

• A is evicted from P3.
• The owner is responsible for writing back to memory, vs. MSI or MESI, where write-back occurs only when the line is in M state.

State afterward: P2 cache A = 3 (Sc); main memory A = 3.
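The replacement rule can be stated as a small lookup. This is an illustrative sketch; the function name `needs_writeback` and the protocol labels are hypothetical.

```python
# States whose eviction requires a write-back to memory, per protocol.
DIRTY_STATES = {
    "MSI": {"M"},
    "MESI": {"M"},
    "Dragon": {"M", "Sm"},  # Sm is the owner of a shared, modified block
}


def needs_writeback(protocol, state):
    """Return True if evicting a block in `state` must write it back to memory."""
    return state in DIRTY_STATES[protocol]


print(needs_writeback("Dragon", "Sc"))  # P1's eviction in the example
print(needs_writeback("Dragon", "Sm"))  # P3's eviction in the example
```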

Dragon Example

Proc Action   State P1   State P2   State P3   Bus Action      Data From
R1            E          –          –          BusRd           Mem
W1            M          –          –          –               Own cache
R3            Sm         –          Sc         BusRd/Flush     P1 cache
W3            Sc         –          Sm         BusUpd/Upd      Own cache
R1            Sc         –          Sm         –               Own cache
R3            Sc         –          Sm         –               Own cache
R2            Sc         Sc         Sm         BusRd/Flush     P3 cache
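The Dragon example can be replayed with a short simulator as well. As with any such sketch, the details are assumptions: the function name `run_dragon` is hypothetical, only a single block is tracked, a block that is not present is shown as "-", and the data transfers (Flush, Update) are not modeled.

```python
def run_dragon(trace, n_procs=3):
    """Replay (op, proc) pairs for one block; op is "R" or "W", proc counts from 1.

    Returns one (action, states, bus) tuple per access.
    """
    state = [None] * n_procs                  # None means the block is not present
    rows = []
    for op, proc in trace:
        if op not in ("R", "W"):
            raise ValueError("unknown operation: " + op)
        me = proc - 1
        others = [q for q in range(n_procs) if q != me]
        shared = any(state[q] is not None for q in others)
        bus = []
        if state[me] is None:                 # PrRdMiss or PrWrMiss
            bus.append("BusRd")
            for q in others:                  # other caches snoop the BusRd
                if state[q] == "E":
                    state[q] = "Sc"
                elif state[q] == "M":
                    state[q] = "Sm"           # flushes and remains the owner
            state[me] = "Sc" if shared else "E"
        if op == "W":
            if state[me] in ("Sc", "Sm"):
                bus.append("BusUpd")          # the shared line shows if sharers remain
                for q in others:
                    if state[q] == "Sm":
                        state[q] = "Sc"       # ownership moves to the writer
                state[me] = "Sm" if shared else "M"
            else:
                state[me] = "M"               # E or M: the write stays local
        shown = tuple(s if s is not None else "-" for s in state)
        rows.append((f"{op}{proc}", shown, ";".join(bus) or "-"))
    return rows


TRACE = [("R", 1), ("W", 1), ("R", 3), ("W", 3), ("R", 1), ("R", 3), ("R", 2)]
for action, states, bus in run_dragon(TRACE):
    print(action, *states, bus)
```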

Lower-Level Protocol Choices

• Can shared-modified state be eliminated?
  • Yes, if memory is updated as well on BusUpd transactions (DEC Firefly)
  • Dragon protocol doesn’t (assumes DRAM memory slow to update)
• Should replacement of an Sc block be broadcast?
  • Would allow last copy to go to Exclusive state and not generate updates
  • Replacement bus transaction is not in critical path, but later update may be
• Shouldn’t update local copy on write hit before controller gets bus
  • Can mess up serialization
• Coherence, consistency considerations much like write-through case
• In general, many subtle race conditions in protocols
• But first, let’s illustrate quantitative assessment at logical level


Assessing Protocol Tradeoffs

• Methodology:
  • Use simulator; default 1MB, 4-way cache, 64-byte block, 16 processors. Some runs use 64K cache.
  • Focus on frequencies, not end performance for now
    • transcends architectural details, but not what we’re really after
  • Use idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters
    • Cheap simulation: no need to model contention

Impact of Protocol Optimizations

• MSI = MESI
• Upgrades instead of read-exclusive help
• Same story when working sets don’t fit for Ocean, Radix, Raytrace

[Figure: MESI vs. MSI (w/ BusUpgr) vs. MSI (w/ BusRdX). Bar charts of address-bus and data-bus traffic (MB/s) for Barnes, LU, Ocean, Radiosity, Radix, and Raytrace, and for Appl-Code, Appl-Data, OS-Code, and OS-Data, each under Ill (Illinois MESI), 3St (MSI with BusUpgr), and 3St-RdEx (MSI with BusRdX).]

Impact of Cache-Line Size

• Multiprocessors add new kind of miss to cold, capacity, conflict
  • Coherence misses: due to invalidations
    • True sharing: write to same word
    • False sharing: write to different words of the same block
• How to reduce each kind of miss with an invalidation protocol
  • Capacity: enlarge cache; increase block size (if spatial locality)
  • Conflict: increase associativity
  • Cold and coherence: only block size
• Increasing block size has advantages and disadvantages
  • Can reduce misses if spatial locality is good
  • Can hurt too
    • increase misses due to false sharing if spatial locality not good
    • increase misses due to conflicts in fixed-size cache
    • increase traffic due to fetching unnecessary data and due to false sharing
    • can increase miss penalty and perhaps hit cost
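The distinction between true and false sharing, and why larger blocks create more false sharing, can be sketched as follows. The function name `classify_sharing`, the 4-byte word size, and the example addresses are assumptions made for illustration.

```python
def classify_sharing(addr_a, addr_b, block_size, word_size=4):
    """Classify two byte addresses accessed by different processors.

    addr_a is written by one processor and addr_b is used by another.
    """
    if addr_a // block_size != addr_b // block_size:
        return "no sharing"        # different blocks: no coherence interaction
    if addr_a // word_size == addr_b // word_size:
        return "true sharing"      # same word: the communication is real
    return "false sharing"         # same block, different words


# Two processors use words 40 bytes apart; only the block size changes.
for block_size in (16, 32, 64):
    print(block_size, classify_sharing(0, 40, block_size))
```

With 16- or 32-byte blocks the two words fall in different blocks, so writes to one never invalidate the other; with 64-byte blocks they share a block and every write causes a coherence miss for the other processor.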

Impact of Block Size on Miss Rate

• For default problem size: vary block/line size from 8 to 256 bytes
• Decreases with larger lines: cold, capacity (due to spatial locality), true sharing (due to spatial locality)
• Increases with larger lines: false sharing
• Working set doesn’t fit: impact of capacity misses large in Ocean and Radix

[Figure: two stacked-bar charts of miss rate (%) by block size (8, 16, 32, 64, 128, 256 bytes), broken down into cold, capacity, true-sharing, false-sharing, and upgrade misses. One panel covers Barnes, LU, and Radiosity; the other covers Ocean, Radix, and Raytrace.]

Page 80: Lecture 9 Outline

CSC/ECE 506: Architecture of Parallel Computers

Impact of Block Size on Traffic

80

• Results differ from those for miss rate: traffic almost always increases with block size
• When the working set fits, overall traffic is still small, except for Radix
• Fixed overhead is a significant component
  • So total traffic is often minimized with a 16–32 byte block, not a smaller one
• When the working set doesn’t fit, even a 128-byte block is good for Ocean, because of capacity misses
• Address-bus traffic behaves in the opposite way from data-bus traffic
• Traffic (bytes/instruction) affects performance indirectly, through contention
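The interplay between fixed per-transaction overhead and block size can be sketched with simple arithmetic. The Python below assumes a hypothetical 6 bytes of address/command overhead per miss and uses made-up miss counts; neither comes from the lecture's measurements. They are chosen only so that doubling the block size stops halving the misses once spatial locality runs out.

```python
OVERHEAD_BYTES = 6  # assumed fixed address/command cost per bus transaction


def bus_traffic(misses, block_size, overhead=OVERHEAD_BYTES):
    """Return (data_bytes, address_bytes) moved for `misses` block fetches."""
    return misses * block_size, misses * overhead


# Hypothetical miss counts per block size (illustrative, not measured).
hypothetical_misses = {8: 1000, 16: 520, 32: 300, 64: 220, 128: 190}

totals = {}
for size, misses in hypothetical_misses.items():
    data, addr = bus_traffic(misses, size)
    totals[size] = data + addr
    print(f"{size:>3} B: data={data:>6} addr={addr:>5} total={totals[size]}")

best = min(totals, key=totals.get)
print("block size with least total traffic:", best)
```

With these made-up counts, address-bus bytes fall and data-bus bytes rise as blocks grow, and the total is lowest at 16 to 32 bytes, which is the shape of the behavior the bullets describe.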

[Figure: bus traffic versus block size (8 to 256 bytes), split into data-bus and address-bus traffic. Panels: Radix, in bytes/instruction; LU and Ocean, in bytes/FLOP; Barnes, Radiosity, and Raytrace, in bytes/instruction.]

Firefly Visualization

Start state. All caches are empty and main memory has A = 1.

[Diagram: processors P1, P2, and P3, each with a cache and a snooper, and main memory with its controller, all attached to a shared bus.]

P1 cache: (empty)
P2 cache: (empty)
P3 cache: (empty)
Main memory: A = 1

Trace: P1 Read A; P1 Write A = 2; P3 Read A; P3 Write A = 3; P1 Read A; P3 Read A; P2 Read A

Processor P1 Reads A

Processor P1 attempts to read A from its cache.

P1 cache: (empty)
P2 cache: (empty)
P3 cache: (empty)
Main memory: A = 1
Events: P1 PrRdMiss(~S); P1 BusRd A; Mem returns data

Processor P1 Reads A

Processor P1 issues a BusRd.

P1 cache: (empty)
P2 cache: (empty)
P3 cache: (empty)
Main memory: A = 1
Events: P1 PrRdMiss(~S); P1 BusRd A; Mem returns data

Processor P1 Reads A

Main memory returns data to processor P1, which updates its cache.

P1 cache: A = 1 (V)
P2 cache: (empty)
P3 cache: (empty)
Main memory: A = 1
Events: P1 PrRdMiss(~S); P1 BusRd A; Mem returns data

Processor P1 Reads A

Read operation completes.

P1 cache: A = 1 (V)
P2 cache: (empty)
P3 cache: (empty)
Main memory: A = 1

Processor P1 Writes A = 2

Processor P1 writes to its cache.

P1 cache: A = 2 (D)
P2 cache: (empty)
P3 cache: (empty)
Main memory: A = 1
Events: P1 PrWr A

Processor P1 Writes A = 2

Write operation completes.

P1 cache: A = 2 (D)
P2 cache: (empty)
P3 cache: (empty)
Main memory: A = 1

Processor P3 Reads A

Processor P3 attempts to read A from its cache.

P1 cache: A = 2 (D)
P2 cache: (empty)
P3 cache: (empty)
Main memory: A = 1
Events: P3 PrRdMiss(S); P3 BusRd A; P1 snoops BusRd; P1 Flush

Processor P3 Reads A

Processor P3 issues a BusRd.

P1 cache: A = 2 (D)
P2 cache: (empty)
P3 cache: (empty)
Main memory: A = 1
Events: P3 PrRdMiss(S); P3 BusRd A; P1 snoops BusRd; P1 Flush

Processor P3 Reads A

Processor P1 snoops the BusRd from processor P3.

P1 cache: A = 2 (S)
P2 cache: (empty)
P3 cache: (empty)
Main memory: A = 1
Events: P3 PrRdMiss(S); P3 BusRd A; P1 snoops BusRd; P1 Flush

Processor P3 Reads A

Processor P1 flushes, sending the updated data to P3 and main memory.

P1 cache: A = 2 (S)
P2 cache: (empty)
P3 cache: A = 2 (S)
Main memory: A = 2
Events: P3 PrRdMiss(S); P3 BusRd A; P1 snoops BusRd; P1 Flush

Processor P3 Reads A

Read operation completes.

P1 cache: A = 2 (S)
P2 cache: (empty)
P3 cache: A = 2 (S)
Main memory: A = 2

Processor P3 Writes A = 3

Processor P3 writes to its cache.

P1 cache: A = 2 (S)
P2 cache: (empty)
P3 cache: A = 2 (S)
Main memory: A = 2
Events: P3 PrWr A; P3 BusUpd; P1 snoops BusUpd

Processor P3 Writes A = 3

Processor P3 issues a BusUpd request.

P1 cache: A = 2 (S)
P2 cache: (empty)
P3 cache: A = 2 (S)
Main memory: A = 2
Events: P3 PrWr A; P3 BusUpd; P1 snoops BusUpd

Processor P3 Writes A = 3

Processor P1 snoops the BusUpd and updates its cached copy.

P1 cache: A = 3 (S)
P2 cache: (empty)
P3 cache: A = 3 (S)
Main memory: A = 3
Events: P3 PrWr A; P3 BusUpd; P1 snoops BusUpd

Processor P3 Writes A = 3

Write operation completes.

P1 cache: A = 3 (S)
P2 cache: (empty)
P3 cache: A = 3 (S)
Main memory: A = 3

Processor P1 Reads A

Processor P1 reads from its cache.

P1 cache: A = 3 (S)
P2 cache: (empty)
P3 cache: A = 3 (S)
Main memory: A = 3
Events: P1 PrRdHit; P1 returns data


Processor P3 Reads A

Processor P3 reads from its cache.

P1 cache: A = 3 (S)
P2 cache: (empty)
P3 cache: A = 3 (S)
Main memory: A = 3
Events: P3 PrRdHit; P3 returns data

Processor P3 Reads A

Processor P3 returns valid data from its cache.

P1 cache: A = 3 (S)
P2 cache: (empty)
P3 cache: A = 3 (S)
Main memory: A = 3
Events: P3 PrRdHit; P3 returns data

Processor P2 Reads A

Processor P2 attempts to read A from its cache.

P1 cache: A = 3 (S)
P2 cache: (empty)
P3 cache: A = 3 (S)
Main memory: A = 3
Events: P2 PrRdMiss(S); P2 BusRd A; P1 Flush

Processor P2 Reads A

Processor P2 issues a BusRd request.

P1 cache: A = 3 (S)
P2 cache: (empty)
P3 cache: A = 3 (S)
Main memory: A = 3
Events: P2 PrRdMiss(S); P2 BusRd A; P1 Flush

Processor P2 Reads A

Main memory controller observes the BusRd.

P1 cache: A = 3 (S)
P2 cache: (empty), with A = 3 (S) being delivered
P3 cache: A = 3 (S)
Main memory: A = 3
Events: P2 PrRdMiss(S); P2 BusRd A; P1 Flush

Processor P2 Reads A

Operation completes.

P1 cache: A = 3 (S)
P2 cache: A = 3 (S)
P3 cache: A = 3 (S)
Main memory: A = 3

Firefly Example

Proc Action | State P1 | State P2 | State P3 | Bus Action  | Data From
R1          | V        | –        | –        | BusRd       | Mem
W1          | D        | –        | –        | –           | Own cache
R3          | S        | –        | S        | BusRd/Flush | P1 cache
W3          | S        | –        | S        | BusUpd      | P3 cache
R1          | S        | –        | S        | –           | Own cache
R3          | S        | –        | S        | –           | Own cache
R2          | S        | S        | S        | BusRd/Flush | P1 cache
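The walkthrough summarized in this table can be replayed with a small Python sketch. It is a simplified model of the Firefly states V (valid-exclusive), S (shared), and D (dirty) for a single block A, not a full protocol implementation. The function name and the tuple encoding of the trace are made up for illustration, `'-'` stands for "not present", and the rule that the lowest-numbered holder supplies the data on a miss is an assumption chosen to match the table's "P1 cache" entries.

```python
def firefly(trace, n_procs=3, mem_value=1):
    """Replay (op, proc, value) accesses to block A; return one row per access.

    Each row is (action, cache states, bus action, data source).
    """
    state = ['-'] * n_procs   # per-cache state of block A
    data = [None] * n_procs   # per-cache copy of A
    rows = []
    for op, p, value in trace:
        i = p - 1  # processors are numbered from 1
        others = [j for j in range(n_procs) if j != i and state[j] != '-']
        bus, source = '-', 'Own cache'
        if state[i] == '-':                      # miss: fetch the block
            if others:                           # shared line asserted
                supplier = others[0]             # assumption: lowest-numbered holder
                mem_value = data[supplier]       # the flush also updates memory
                for j in others:
                    state[j] = 'S'
                state[i], data[i] = 'S', data[supplier]
                bus, source = 'BusRd/Flush', f'P{supplier + 1} cache'
            else:                                # no other copy: memory supplies
                state[i], data[i] = 'V', mem_value
                bus, source = 'BusRd', 'Mem'
        if op == 'W':
            data[i] = value
            if state[i] == 'S':                  # write-update other copies and memory
                for j in others:
                    data[j] = value
                mem_value = value
                state[i] = 'S' if others else 'V'
                bus = 'BusUpd' if bus == '-' else bus + '+BusUpd'
                source = f'P{p} cache'
            else:                                # V or D: write locally, no bus traffic
                state[i] = 'D'
        rows.append((f'{op}{p}', ''.join(state), bus, source))
    return rows


trace = [('R', 1, None), ('W', 1, 2), ('R', 3, None), ('W', 3, 3),
         ('R', 1, None), ('R', 3, None), ('R', 2, None)]
rows = firefly(trace)
for row in rows:
    print(row)
```

Each printed row lists the action, the states of P1 to P3 as a three-character string, the bus action, and the data source, so the output can be compared line by line against the table.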