(C) 2005 Multifacet Project
Token Coherence: A Framework for Implementing
Multiple-CMP Systems
Mike Marty (1), Jesse Bingham (2), Mark Hill (1), Alan Hu (2), Milo Martin (3), and David Wood (1)
(1) University of Wisconsin-Madison
(2) University of British Columbia
(3) University of Pennsylvania
February 17th, 2005
Slide 2 Improving Multiple-CMP Systems using Token Coherence
Summary
• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs
• Problem: Coherence with Multiple CMPs
• Old Solution: Hierarchical Directory → Complex & Slow
• New Solution: Apply Token Coherence
  – Developed for glueless multiprocessor [2003]
  – Keep: Flat for Correctness
  – Exploit: Hierarchical for performance
• Less Complex & Faster than Hierarchical Directory
Slide 3 Improving Multiple-CMP Systems using Token Coherence
Outline
• Motivation and Background
  – Coherence in Multiple-CMP Systems
  – Example: DirectoryCMP
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation
Slide 4 Improving Multiple-CMP Systems using Token Coherence
Coherence in Multiple-CMP Systems
[Figure: four CMPs (CMP 1 to CMP 4) joined by an interconnect; within a CMP, processors (P) with instruction and data caches (I, D) and L2 caches share an on-chip interconnect]
• Chip Multiprocessors (CMPs) emerging
• Larger systems will be built with Multiple CMPs
Slide 5 Improving Multiple-CMP Systems using Token Coherence
Problem: Hierarchical Coherence
[Figure: CMP 1 to CMP 4 joined by an interconnect; labels: Inter-CMP Coherence, Intra-CMP Coherence]
• Intra-CMP protocol for coherence within CMP
• Inter-CMP protocol for coherence between CMPs
• Interactions between protocols increase complexity
  – explodes state space
Slide 6 Improving Multiple-CMP Systems using Token Coherence
Improving Multiple CMP Systems with Token Coherence
• Token Coherence allows Multiple-CMP systems to be...
  – Flat for correctness, but
  – Hierarchical for performance
[Figure: CMP 1 to CMP 4 joined by an interconnect; labels: Correctness Substrate, Performance Protocol, Low Complexity, Fast]
Slide 7 Improving Multiple-CMP Systems using Token Coherence
Example: DirectoryCMP
[Figure: 2-level MOESI Directory. CMP 0 and CMP 1 each contain processors with private L1 I&D caches (P0 to P7 across both chips) and a Shared L2 / directory, backed by a Memory/Directory. Request label: Store B. Message labels: getx, fwd, inv, ack, data/ack, WB. Block B state labels: [S O], [M I]. Callout: RACE CONDITIONS!]
Slide 8 Improving Multiple-CMP Systems using Token Coherence
Token Coherence Summary
• Token Coherence separates performance from correctness
• Correctness Substrate: Enforces coherence invariant and prevents starvation
  1. Safety with Token Counting
  2. Starvation Avoidance with Persistent Requests
• Performance Policy: Makes the common case fast
  – Transient requests to seek tokens
    • Unordered, untracked, unacknowledged
  – Possible prediction, multicast, filters, etc.
Slide 9 Improving Multiple-CMP Systems using Token Coherence
Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
  – Safety
  – Starvation Avoidance
• Token Coherence: Hierarchical for Performance
• Evaluation
Slide 10 Improving Multiple-CMP Systems using Token Coherence
Example: Token Coherence [ISCA 2003]
• Each memory block initialized with T tokens
• Tokens stored in memory, caches, & messages
• At least one token to read a block
• All tokens to write a block
[Figure: processors P0 to P3, each with L1 I&D and L2 caches, joined by an interconnect with memories (mem 0, mem 3); animation labels: Load B, Load B, Store B]
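The four token rules above can be sketched in a few lines. This is an illustrative model only, not the authors' implementation: the Holder class, the transfer helper, and the choice of T = 4 are assumptions made for the example.

```python
from dataclasses import dataclass

T = 4  # hypothetical token count per block, chosen only for this example


@dataclass
class Holder:
    """Anything that can hold tokens for one block: memory or a cache."""
    name: str
    tokens: int = 0

    def can_read(self) -> bool:
        # At least one token is needed to read the block.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # All T tokens are needed to write the block.
        return self.tokens == T


def transfer(src: Holder, dst: Holder, count: int) -> None:
    """Move tokens between holders; tokens are never created or destroyed."""
    if count < 0 or count > src.tokens:
        raise ValueError("cannot send tokens the holder does not have")
    src.tokens -= count
    dst.tokens += count


memory = Holder("mem", T)          # the block starts with all T tokens in memory
p0, p1 = Holder("P0"), Holder("P1")

transfer(memory, p0, 1)            # Load B at P0
transfer(memory, p1, 1)            # Load B at P1
both_read = p0.can_read() and p1.can_read()
p0_write_before = p0.can_write()   # False: P0 holds only one token

transfer(memory, p0, memory.tokens)  # Store B at P0: gather every token
transfer(p1, p0, p1.tokens)
p0_write_after = p0.can_write()
total = memory.tokens + p0.tokens + p1.tokens  # conserved, still T
```

Because a writer must hold all T tokens, no other holder can have even one token left to read with, which is the single-writer or multiple-readers invariant.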
Slide 11 Improving Multiple-CMP Systems using Token Coherence
Extending to Multiple-CMP System
[Figure: processors P0 to P3, each with L1 I&D and L2 caches, and memories mem 0 and mem 1, regrouped into CMP 0 and CMP 1; each CMP has an on-chip interconnect and a Shared L2]
Slide 12 Improving Multiple-CMP Systems using Token Coherence
Extending to Multiple-CMP System
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3), each with private L1 I&D caches, an on-chip interconnect, and a Shared L2; memories mem 0 and mem 1; animation labels: Store B, Store B]
• Token counting remains flat
• Tokens to caches
  – Handles shared caches and other complex hierarchies
Slide 13 Improving Multiple-CMP Systems using Token Coherence
Safety Recap
• Safety: Maintain coherence invariant
  – Only one writer, or multiple readers
• Tokens for Safety
  – T Tokens associated with each memory block
  – # tokens encoded in 1 + log2(T) bits
  – Processor acquires all tokens to write, a single token to read
• Tokens passed to nodes in glueless multiprocessor scheme
  – But CMPs have private and shared caches
• Tokens passed to caches in Multiple-CMP system
  – Arbitrary cache hierarchy easily handled
  – Flat for correctness
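The 1 + log2 T figure follows from the fact that a holder's token count can be any value from 0 to T, which is T + 1 distinct values. A small sketch of that arithmetic, with the helper name token_count_bits being an assumption for illustration:

```python
import math


def token_count_bits(total_tokens: int) -> int:
    """Bits needed to store any token count in the range 0..total_tokens."""
    if total_tokens < 1:
        raise ValueError("a block needs at least one token")
    # int.bit_length() of T equals ceil(log2(T + 1)) for T >= 1.
    return total_tokens.bit_length()


bits_for_64 = token_count_bits(64)
formula_for_64 = 1 + int(math.log2(64))  # the slide's formula, exact when T is a power of two
bits_for_16 = token_count_bits(16)
bits_for_5 = token_count_bits(5)         # non-power-of-two T still fits in 3 bits
```

For T = 64 both expressions give 7 bits per block, small enough to motivate the "extra ECC bits" storage option mentioned later in the deck.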
Slide 14 Improving Multiple-CMP Systems using Token Coherence
Some Token Counting Implications
• Memory must store tokens
  – Separate RAM
  – Use extra ECC bits
  – Token cache
• T sized to # caches to allow read-only copies in all caches
• Replacements cannot be silent
  – Tokens must not be lost or dropped
• Targeted for invalidate-based protocols
  – Not a solution for write-through or update protocols
• Tokens must be identified by block address
  – Address must be in all token-carrying messages
Slide 15 Improving Multiple-CMP Systems using Token Coherence
Starvation Avoidance
• Request messages can miss tokens
  – In-flight tokens
    • Transient Requests are not tracked throughout system
  – Incorrect filtering, multicast, destination-set prediction, etc.
• Possible Solution: Retries
  – Retry w/ optional randomized backoff is effective for races
• Guaranteed Solution: Persistent Requests
  – Heavyweight request guaranteed to succeed
  – Should be rare (uses more bandwidth)
  – Locates all tokens in the system
  – Orders competing requests
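The two solutions above compose naturally: try cheap transient requests first, and escalate to a persistent request only when they keep missing tokens. The sketch below is a hedged illustration of that control flow; acquire_tokens, its callback parameters, and the backoff constants are all hypothetical names and values, not part of the original protocol description.

```python
import random
from typing import Callable, Optional, Tuple


def acquire_tokens(send_transient: Callable[[], bool],
                   send_persistent: Callable[[], None],
                   max_retries: int = 2,
                   base_backoff: int = 8,
                   rng: Optional[random.Random] = None) -> Tuple[str, int]:
    """Try transient requests, then fall back to a persistent request.

    send_transient returns True if the request gathered the needed tokens.
    Returns which mechanism succeeded and the total simulated backoff cycles.
    """
    rng = rng or random.Random(0)
    waited = 0
    for attempt in range(1 + max_retries):
        if send_transient():
            return "transient", waited
        # Wait a randomized, growing interval after each miss.
        waited += rng.randint(0, base_backoff * (2 ** attempt))
    send_persistent()  # heavyweight path, guaranteed to succeed
    return "persistent", waited


# Two misses (for example, in-flight tokens), then a hit.
outcomes = iter([False, False, True])
how, waited = acquire_tokens(lambda: next(outcomes), lambda: None)

# With no retries allowed, a single miss escalates immediately.
persistent_log = []
fallback = acquire_tokens(lambda: False,
                          lambda: persistent_log.append("sent"),
                          max_retries=0)
```

Setting max_retries to 0 corresponds to a policy that times out once and goes straight to the persistent request.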
Slide 16 Improving Multiple-CMP Systems using Token Coherence
Starvation Avoidance
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3) with L1 I&D caches, Shared L2s, and memories mem 0 and mem 1; Store B requests shown at P0 and P2; GETX messages in flight]
• Tokens move freely in the system
  – Transient requests can miss in-flight tokens
  – Incorrect speculation, filters, prediction, etc.
Slide 17 Improving Multiple-CMP Systems using Token Coherence
Starvation Avoidance
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3) with L1 I&D caches, Shared L2s, and memories mem 0 and mem 1; three competing Store B requests]
• Solution: issue Persistent Request
  – Heavyweight request guaranteed to succeed
  – Methods: Centralized [2003] and Distributed (New)
Slide 18 Improving Multiple-CMP Systems using Token Coherence
Old Scheme: Central Arbiter [2003]
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3) with L1 I&D caches, Shared L2s, and memories mem 0 and mem 1; Store B requests shown at P0 and P2; labels: timeout (three times); arbiter 0 queue: B: P0, B: P2, B: P1]
– Processors issue persistent requests
Slide 19 Improving Multiple-CMP Systems using Token Coherence
Old Scheme: Central Arbiter [2003]
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3) with L1 I&D caches, Shared L2s, and memories mem 0 and mem 1; arbiter 0 queue: B: P0, B: P2, B: P1; the activation B: P0 appears at every component]
– Processors issue persistent requests
– Arbiter orders and broadcasts activate
Slide 20 Improving Multiple-CMP Systems using Token Coherence
Old Scheme: Central Arbiter [2003]
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3) with L1 I&D caches, Shared L2s, and memories mem 0 and mem 1; arbiter 0 queue: B: P2, B: P1; the activation B: P0 is replaced by B: P2 at every component; message steps numbered 1, 2, 3]
– Processor sends deactivate to arbiter
– Arbiter broadcasts deactivate (and next activate)
– Bottom Line: handoff is 3 message latencies
Slide 21 Improving Multiple-CMP Systems using Token Coherence
Improved Scheme: Distributed Arbitration [NEW]
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3) with L1 I&D caches, Shared L2s, and memories mem 0 and mem 1; Store B requests shown at P0 and P2; every component holds a table with entries P0: B, P1: B, P2: B]
– Processors broadcast persistent requests
Slide 22 Improving Multiple-CMP Systems using Token Coherence
Improved Scheme: Distributed Arbitration [NEW]
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3) with L1 I&D caches, Shared L2s, and memories mem 0 and mem 1; every component holds a table with entries P0: B, P1: B, P2: B; additional P0: B labels mark the request currently being served]
– Processors broadcast persistent requests
– Fixed priority (processor number)
Slide 23 Improving Multiple-CMP Systems using Token Coherence
Improved Scheme: Distributed Arbitration [NEW]
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3) with L1 I&D caches, Shared L2s, and memories mem 0 and mem 1; every component holds a table with entries P0: B, P1: B, P2: B; additional P1: B labels mark the next request being served; message step numbered 1]
– Processors broadcast persistent requests
– Fixed priority (processor number)
– Processors broadcast deactivate
Slide 24 Improving Multiple-CMP Systems using Token Coherence
Improved Scheme: Distributed Arbitration [NEW]
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3) with L1 I&D caches, Shared L2s, and memories mem 0 and mem 1; every component's table now lists only P1: B and P2: B; additional P1: B labels mark the request being served]
– Bottom line: Handoff is a single message latency
  • Subtle point: P0 and P1 must wait until next “wave”
Slide 25 Improving Multiple-CMP Systems using Token Coherence
Implementing Distributed Persistent Requests
• Table at each cache
  – Sized to N entries for each processor (we use N=1)
  – Indexed by processor ID
  – Content-addressable by Address
• Each incoming message must access table
  – Not on the critical path, so it can be a slow CAM
• Activate/deactivate reordering cannot be allowed
  – Persistent request virtual channel must be point-to-point ordered
  – Or, other solution such as sequence numbers or acks
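A minimal sketch of the per-cache table described above, assuming N = 1 and that "fixed priority (processor number)" means the lowest-numbered processor wins, which matches the P0-then-P1 order in the preceding diagrams. The class and method names are hypothetical, and the linear scan stands in for the content-addressable lookup.

```python
from typing import List, Optional


class PersistentRequestTable:
    """Per-cache table with one entry per processor (N = 1), indexed by processor ID."""

    def __init__(self, num_processors: int) -> None:
        # entries[p] is the block address processor p is persistently requesting, or None.
        self.entries: List[Optional[int]] = [None] * num_processors

    def activate(self, proc: int, addr: int) -> None:
        self.entries[proc] = addr

    def deactivate(self, proc: int) -> None:
        self.entries[proc] = None

    def active_requester(self, addr: int) -> Optional[int]:
        """Lowest-numbered processor with a persistent request for addr, if any.

        Assumption: lower processor number means higher priority.
        """
        for proc, entry in enumerate(self.entries):
            if entry == addr:
                return proc
        return None


BLOCK_B = 0x40  # hypothetical block address
table = PersistentRequestTable(num_processors=4)

# Broadcasts may arrive in any order; the table contents decide who is served.
for proc in (1, 2, 0):
    table.activate(proc, BLOCK_B)

served = []
while (winner := table.active_requester(BLOCK_B)) is not None:
    served.append(winner)
    table.deactivate(winner)   # the winner's deactivate broadcast arrives
```

Every cache holds the same table, so after a deactivate each one already knows the next requester and can forward tokens directly, which is why the handoff takes a single message latency instead of a round trip through an arbiter. The sketch omits the "wave" rule and the ordering requirement on activate/deactivate messages.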
Slide 26 Improving Multiple-CMP Systems using Token Coherence
Implementing Distributed Persistent Requests
• Should reads be distinguished from writes?
  – Not necessary, but
  – Persistent Read request is helpful
• Implications of flat distributed arbitration
  – Simple flat for correctness
  – Global broadcast when used
    • Fortunately they are rare in typical workloads (0.3%)
    • Bad workload (very high contention) would burn bandwidth
  – Maximum # processors must be architected
• What about a hierarchical persistent request scheme?
  – Possible, but correctness is no longer flat
  – Make the common case fast
Slide 27 Improving Multiple-CMP Systems using Token Coherence
Reducing Unnecessary Traffic
• Problem: Which token-holding cache responds with data?
• Solution: Distinguish one token as the owner token
– The owner includes data with token response
– Clean vs. dirty owner distinction also useful for writebacks
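A small sketch of why a distinguished owner token removes redundant data traffic. It models only the response to a write request, where every holder gives up all of its tokens; the CacheLine class and respond_to_write_request function are hypothetical names, and the rule that only the owner attaches data is taken from the bullet above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CacheLine:
    name: str
    tokens: int          # tokens held, including the owner token when owner is True
    owner: bool = False  # True if this line holds the distinguished owner token


def respond_to_write_request(line: CacheLine) -> Tuple[int, bool]:
    """Give up all tokens; return (tokens_sent, data_included)."""
    sent = line.tokens
    with_data = line.owner and line.tokens > 0  # only the owner attaches data
    line.tokens, line.owner = 0, False
    return sent, with_data


holders = [
    CacheLine("L1-P0", 1),
    CacheLine("L1-P1", 2, owner=True),
    CacheLine("L1-P2", 1),
]
responses = [respond_to_write_request(h) for h in holders]
tokens_collected = sum(tokens for tokens, _ in responses)
data_copies = sum(1 for _, with_data in responses if with_data)
```

Three caches respond, but only one data copy crosses the interconnect; the other two responses carry tokens alone.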
Slide 28 Improving Multiple-CMP Systems using Token Coherence
Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
  – TokenCMP
  – Another look at performance policies
• Evaluation
Slide 29 Improving Multiple-CMP Systems using Token Coherence
Hierarchical for Performance: TokenCMP
• Target System:
  – 2-8 CMPs
  – Private L1s, shared L2 per CMP
  – Any interconnect, but high-bandwidth
• Performance Policy Goals:
  – Aggressively acquire tokens
  – Exploit on-chip locality and bandwidth
  – Respect cache hierarchy
  – Detect and handle missed tokens
Slide 30 Improving Multiple-CMP Systems using Token Coherence
Hierarchical for Performance: TokenCMP
• Approach:
  – On L1 miss, broadcast within own CMP
    • Local cache responds if possible
  – On L2 miss, broadcast to other CMPs
  – Appropriate L2 bank responds or broadcasts within its CMP
    • Optionally filter
  – Responses between CMPs carry extra tokens for future locality
• Handling missed tokens:
  – Timeout after average memory latency
  – Invoke persistent request (no retries)
• Larger systems can use filters, multicast, soft-state directories
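The escalation order described above can be summarized as a short decision sequence. This is a simplified sketch, not the TokenCMP controller: the function name and its two boolean inputs are assumptions that stand in for whether each broadcast actually gathers the needed tokens.

```python
from typing import List


def tokencmp_request_steps(local_cmp_responds: bool,
                           remote_cmp_responds: bool) -> List[str]:
    """List the steps a request takes under the hierarchical performance policy."""
    steps = ["broadcast within own CMP"]          # issued on an L1 miss
    if local_cmp_responds:
        return steps
    steps.append("broadcast to other CMPs")       # issued on an L2 miss
    if remote_cmp_responds:
        return steps
    # Tokens were missed: wait roughly one average memory latency, then escalate.
    steps.append("timeout")
    steps.append("persistent request")            # no transient retries
    return steps


on_chip_hit = tokencmp_request_steps(True, False)
off_chip_hit = tokencmp_request_steps(False, True)
missed_tokens = tokencmp_request_steps(False, False)
```

The common case stays on chip with one broadcast, while the flat persistent request remains the only mechanism needed for correctness.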
Slide 31 Improving Multiple-CMP Systems using Token Coherence
Other Optimizations in TokenCMP
• Implementing E-state
  – Memory responds with all tokens on read request
  – Use clean/dirty owner distinction to eliminate writing back unwritten data
• Implementing Migratory Sharing
  – What is it?
    • A processor’s read request results in exclusive permission if responder has exclusive permission and wrote the block
  – In TokenCMP, simply return all tokens
• Non-speculative delay
  – Hold block for some # cycles so permission isn’t stolen prematurely
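The migratory-sharing rule reduces to a one-line test on the responder's token count. The sketch below is illustrative: the function name is hypothetical, and returning a single token in the non-migratory case is an assumption (elsewhere the deck notes that inter-CMP responses may carry extra tokens).

```python
def tokens_returned_on_read(held: int, total: int, wrote_block: bool) -> int:
    """Tokens a responder hands over for a read request (illustrative policy)."""
    if held == 0:
        return 0          # nothing to give
    if held == total and wrote_block:
        return total      # migratory sharing: pass exclusive permission along
    return 1              # assumption: a single token is enough for a read


TOTAL = 8  # hypothetical token count per block
migratory = tokens_returned_on_read(TOTAL, TOTAL, wrote_block=True)
clean_exclusive = tokens_returned_on_read(TOTAL, TOTAL, wrote_block=False)
shared = tokens_returned_on_read(3, TOTAL, wrote_block=False)
```

A reader that receives all tokens can perform its expected follow-up write without issuing a second request.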
Slide 32 Improving Multiple-CMP Systems using Token Coherence
Another Look at Performance Policies
• How to find tokens?
  – Broadcast
  – Broadcast w/ filters
  – Multicast (destination-set prediction)
  – Directories (soft or hard)
• Who responds with data?
  – Owner token
    • TokenCMP uses Owner token for Inter-CMP responses
  – Other heuristics
    • For TokenCMP intra-CMP responses, cache responds if it has extra tokens
Slide 33 Improving Multiple-CMP Systems using Token Coherence
Transient Requests May Reduce Complexity
• Processor holds the only required state about request
• L2 controller in TokenCMP very simple:
  – Re-broadcasts L1 request message on a miss
  – Re-broadcasts or filters external request messages
  – Possible states:
    • no tokens (I)
    • all tokens (M)
    • some tokens (S)
  – Bounce unexpected tokens to memory
• DirectoryCMP’s L2 controller is complex
  – Allocates MSHR on miss and forward
  – Issues invalidates and receives acks
  – Orders all intra-CMP requests and writebacks
  – 57 states in our L2 implementation!
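The three TokenCMP L2 states listed above are a pure function of the token count, with no transient states to track. A minimal sketch, with the function name assumed for illustration:

```python
def l2_state(tokens_held: int, total_tokens: int) -> str:
    """Map a token count to the L2 state named on the slide."""
    if not 0 <= tokens_held <= total_tokens:
        raise ValueError("token count out of range")
    if tokens_held == 0:
        return "I"   # no tokens
    if tokens_held == total_tokens:
        return "M"   # all tokens
    return "S"       # some tokens


TOTAL = 16  # hypothetical token count per block
states = [l2_state(n, TOTAL) for n in (0, 1, 15, 16)]
```

Contrast this with the 57 states reported for the DirectoryCMP L2 controller, most of which exist to track in-flight requests, forwards, and acknowledgments.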
Slide 34 Improving Multiple-CMP Systems using Token Coherence
Writebacks
• DirectoryCMP uses “3-phase writebacks”
  – L1 issues writeback request
  – L2 enters transient state or blocks request
  – L2 responds with writeback ack
  – L1 sends data
• TokenCMP uses “fire-and-forget” writebacks
  – Immediately send tokens and data
  – Heuristic: Only send data if # tokens > 1
Slide 35 Improving Multiple-CMP Systems using Token Coherence
Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation
  – Model checking
  – Performance w/ commercial workloads
  – Robustness
Slide 36 Improving Multiple-CMP Systems using Token Coherence
TokenCMP Evaluation
• Simple?
  – Some anecdotal examples and comparisons
  – Model checking
• Fast?
  – Full-system simulation w/ commercial workloads
• Robust?
  – Micro-benchmarks to simulate high contention
Slide 37 Improving Multiple-CMP Systems using Token Coherence
Complexity Evaluation with Model Checking
This work performed by Jesse Bingham and Alan Hu of the University of British Columbia
• Methods:
  – TLA+ and TLC
  – DirectoryCMP omits all intra-CMP details
  – TokenCMP’s correctness substrate modeled
• Result:
  – Complexity similar between TokenCMP and non-hierarchical DirectoryCMP
  – Correctness Substrate verified to be correct and deadlock-free
  – All possible performance protocols correct
Slide 38 Improving Multiple-CMP Systems using Token Coherence
Performance Evaluation
• Target System:
  – 4 CMPs, 4 procs/CMP
  – 2GHz OoO SPARC, 8MB shared L2 per chip
  – Directly connected interconnect
• Methods: Multifacet GEMS simulator
  – Simics augmented with timing models
  – Released soon: http://www.cs.wisc.edu/gems
• Benchmarks:
  – Performance: Apache, Spec, OLTP
  – Robustness: Locking uBenchmark
Slide 39 Improving Multiple-CMP Systems using Token Coherence
Full-system Simulation: Runtime
– TokenCMP performs 9-50% faster than DirectoryCMP
Slide 40 Improving Multiple-CMP Systems using Token Coherence
Full-system Simulation: Runtime
– TokenCMP performs 9-50% faster than DirectoryCMP
[Chart annotations: DRAM Directory, Perfect L2]
Slide 41 Improving Multiple-CMP Systems using Token Coherence
Full-system Simulation: Inter-CMP Traffic
– TokenCMP traffic is reasonable (or better)
• DirectoryCMP control overhead greater than broadcast for small system
Slide 42 Improving Multiple-CMP Systems using Token Coherence
Full-system Simulation: Intra-CMP Traffic
Slide 43 Improving Multiple-CMP Systems using Token Coherence
Performance Robustness
[Chart: Locking micro-benchmark; axis labels: less contention, more contention; annotation: (correctness substrate only)]
Slide 45 Improving Multiple-CMP Systems using Token Coherence
Performance Robustness
[Chart: Locking micro-benchmark; axis labels: less contention, more contention]
Slide 46 Improving Multiple-CMP Systems using Token Coherence
Summary
• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs
• Problem: Coherence with Multiple CMPs
• Old Solution: Hierarchical Directory → Complex & Slow
• New Solution: Apply Token Coherence
  – Developed for glueless multiprocessor [2003]
  – Keep: Flat for Correctness
  – Exploit: Hierarchical for performance
• Less Complex & Faster than Hierarchical Directory