(C) 2005 Multifacet Project Token Coherence: A Framework for Implementing Multiple- CMP Systems Mike Marty 1 , Jesse Bingham 2 , Mark Hill 1 , Alan Hu 2 , Milo Martin 3 , and David Wood 1 1 University of Wisconsin-Madison 2 University of British Columbia 3 University of Pennsylvania February 17 th , 2005


(C) 2005 Multifacet Project

Token Coherence: A Framework for Implementing

Multiple-CMP Systems

Mike Marty¹, Jesse Bingham², Mark Hill¹, Alan Hu², Milo Martin³, and David Wood¹

¹ University of Wisconsin-Madison
² University of British Columbia
³ University of Pennsylvania

February 17th, 2005


Slide 2 Improving Multiple-CMP Systems using Token Coherence

Summary

• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs

• Problem: Coherence with Multiple CMPs

• Old Solution: Hierarchical Directory → Complex & Slow

• New Solution: Apply Token Coherence
  – Developed for glueless multiprocessor [2003]
  – Keep: Flat for Correctness
  – Exploit: Hierarchical for performance

• Less Complex & Faster than Hierarchical Directory

Slide 3

Outline

• Motivation and Background
  – Coherence in Multiple-CMP Systems
  – Example: DirectoryCMP

• Token Coherence: Flat for Correctness

• Token Coherence: Hierarchical for Performance

• Evaluation

Slide 4

Coherence in Multiple-CMP Systems

[Figure: four CMPs (CMP 1 to CMP 4) connected by an interconnect; each CMP shows processors (P) with I and D caches, L2 banks, and an on-chip interconnect]

• Chip Multiprocessors (CMPs) emerging
• Larger systems will be built with Multiple CMPs

Slide 5

Problem: Hierarchical Coherence

[Figure: CMP 1 to CMP 4 connected by an interconnect, labeled "Inter-CMP Coherence" and "Intra-CMP Coherence"]

• Intra-CMP protocol for coherence within CMP
• Inter-CMP protocol for coherence between CMPs
• Interactions between protocols increase complexity
  – explodes state space

Slide 6

Improving Multiple CMP Systems with Token Coherence

• Token Coherence allows Multiple-CMP systems to be...
  – Flat for correctness, but
  – Hierarchical for performance

[Figure: CMP 1 to CMP 4 connected by an interconnect; labels: "Correctness Substrate", "Performance Protocol", "Low Complexity", "Fast"]

Slide 7

Example: DirectoryCMP

[Figure: animated example of a 2-level MOESI directory. Labels: CMP 0 and CMP 1, processors P0 to P7 each with L1 I&D, a "Shared L2 / directory" per CMP, and "Memory/Directory". Messages shown for Store B: getx, fwd, inv, ack, data/ack, WB. Directory state for block B shown as [S O] and [M I]. Callout: RACE CONDITIONS!]

Slide 8

Token Coherence Summary

• Token Coherence separates performance from correctness

• Correctness Substrate: Enforces coherence invariant and prevents starvation
  1. Safety with Token Counting
  2. Starvation Avoidance with Persistent Requests

• Performance Policy: Makes the common case fast
  – Transient requests to seek tokens
    • Unordered, untracked, unacknowledged
  – Possible prediction, multicast, filters, etc.

Slide 9

Outline

• Motivation and Background

• Token Coherence: Flat for Correctness
  – Safety
  – Starvation Avoidance

• Token Coherence: Hierarchical for Performance

• Evaluation

Slide 10

Example: Token Coherence [ISCA 2003]

• Each memory block initialized with T tokens
• Tokens stored in memory, caches, & messages
• At least one token to read a block
• All tokens to write a block

[Figure: processors P0 to P3, each with L1 I&D and its own L2, on an interconnect with memory modules (labels: mem 0, mem 3); animated Load B and Store B requests]

Slide 11

Extending to Multiple-CMP System

[Figure: processors P0 to P3, each with L1 I&D and L2, regrouped into CMP 0 and CMP 1; each CMP has an on-chip interconnect and a Shared L2, and the CMPs connect through an interconnect to mem 0 and mem 1]

Slide 12

Extending to Multiple-CMP System

[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3), each with L1 I&D caches, an on-chip interconnect, and a Shared L2, connected to mem 0 and mem 1; animated Store B]

• Token counting remains flat
• Tokens to caches
  – Handles shared caches and other complex hierarchies

Slide 13

Safety Recap

• Safety: Maintain coherence invariant
  – Only one writer, or multiple readers

• Tokens for Safety
  – T Tokens associated with each memory block
  – # tokens encoded in 1 + log2(T) bits
  – Processor acquires all tokens to write, a single token to read

• Tokens passed to nodes in glueless multiprocessor scheme
  – But CMPs have private and shared caches

• Tokens passed to caches in Multiple-CMP system
  – Arbitrary cache hierarchy easily handled
  – Flat for correctness
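
The token-counting rules recapped here can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Block` class, the holder names such as `"P0.L1"`, and the choice of T = 4 are all assumptions made for the example.

```python
class Block:
    """Token bookkeeping for one memory block (illustrative only)."""

    def __init__(self, total_tokens):
        self.total = total_tokens
        # Each block starts with all T tokens at memory.
        self.held = {"mem": total_tokens}

    def move(self, src, dst, count):
        """Transfer tokens between holders without creating or losing any."""
        if count <= 0 or self.held.get(src, 0) < count:
            raise ValueError("source does not hold that many tokens")
        self.held[src] -= count
        self.held[dst] = self.held.get(dst, 0) + count

    def can_read(self, holder):
        # At least one token to read.
        return self.held.get(holder, 0) >= 1

    def can_write(self, holder):
        # All tokens to write.
        return self.held.get(holder, 0) == self.total


block = Block(total_tokens=4)          # T = 4 is a hypothetical value
block.move("mem", "P0.L1", 1)
block.move("mem", "P1.L1", 1)
readers = [h for h in ("P0.L1", "P1.L1") if block.can_read(h)]
p1_writes_early = block.can_write("P1.L1")

# P1 gathers the remaining tokens, including P0's, before writing.
block.move("P0.L1", "P1.L1", 1)
block.move("mem", "P1.L1", 2)
```

Because holders are just names, the same bookkeeping applies whether a token sits in an L1, a shared L2, or memory, which is the sense in which counting stays flat.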

Slide 14

Some Token Counting Implications

• Memory must store tokens
  – Separate RAM
  – Use extra ECC bits
  – Token cache

• T sized to # caches to allow read-only copies in all caches

• Replacements cannot be silent
  – Tokens must not be lost or dropped

• Targeted for invalidate-based protocols
  – Not a solution for write-through or update protocols

• Tokens must be identified by block address
  – Address must be in all token-carrying messages
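
As a small worked example of the storage cost, the sketch below evaluates the 1 + log2(T) figure from the Safety Recap slide for a given T. The function name is hypothetical, and rounding log2(T) up when T is not a power of two is an assumption of this sketch rather than something the slides state.

```python
def token_field_bits(total_tokens):
    """Bits needed per block for the token field, using 1 + log2(T)."""
    if total_tokens < 1:
        raise ValueError("T must be at least 1")
    # (T - 1).bit_length() equals ceil(log2(T)) for positive integers,
    # which avoids floating-point rounding. Rounding up is an assumption.
    return 1 + (total_tokens - 1).bit_length()


# Hypothetical system sizes: T is sized to the number of caches.
bits_for_16_caches = token_field_bits(16)
bits_for_64_caches = token_field_bits(64)
```

With T = 16 this gives 5 bits per block, and with T = 64 it gives 7, so the field grows slowly as caches are added.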

Slide 15

Starvation Avoidance

• Request messages can miss tokens
  – In-flight tokens
    • Transient Requests are not tracked throughout system
  – Incorrect filtering, multicast, destination-set prediction, etc.

• Possible Solution: Retries
  – Retry w/ optional randomized backoff is effective for races

• Guaranteed Solution: Persistent Requests
  – Heavyweight request guaranteed to succeed
  – Should be rare (uses more bandwidth)
  – Locates all tokens in the system
  – Orders competing requests

Slide 16

Starvation Avoidance

[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3), each with L1 I&D caches and a Shared L2, connected to mem 0 and mem 1; competing Store B operations (shown at P0 and P2) send GETX requests]

• Tokens move freely in the system
  – Transient requests can miss in-flight tokens
  – Incorrect speculation, filters, prediction, etc.

Slide 17

Starvation Avoidance

[Figure: same two-CMP system (P0 to P3, Shared L2s, mem 0, mem 1) with three outstanding Store B operations]

• Solution: issue Persistent Request
  – Heavyweight request guaranteed to succeed
  – Methods: Centralized [2003] and Distributed (New)

Slide 18

Old Scheme: Central Arbiter [2003]

[Figure: two-CMP system (P0 to P3, Shared L2s, mem 0, mem 1) with arbiter 0; after three timeouts, persistent requests for block B are listed at arbiter 0 as B: P0, B: P2, B: P1]

– Processors issue persistent requests

Slide 19

Old Scheme: Central Arbiter [2003]

[Figure: arbiter 0 holds B: P0, B: P2, B: P1 and sends "B: P0" to every cache and memory in the two-CMP system]

– Processors issue persistent requests
– Arbiter orders and broadcasts activate

Slide 20

Old Scheme: Central Arbiter [2003]

[Figure: arbiter 0 now holds B: P2, B: P1; "B: P0" entries at the nodes are replaced by "B: P2"; arrows numbered 1, 2, 3 mark the messages in the handoff]

– Processor sends deactivate to arbiter
– Arbiter broadcasts deactivate (and next activate)
– Bottom Line: handoff is 3 message latencies
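
The arbiter's ordering role can be sketched as a simple queue. This is a rough illustration under stated assumptions, not the 2003 design itself: the `CentralArbiter` class and its method names are hypothetical, the sketch keeps a single active request per arbiter, and it models only the ordering, not message latencies or token movement.

```python
from collections import deque


class CentralArbiter:
    """Orders persistent requests and activates one at a time (sketch)."""

    def __init__(self):
        self.queue = deque()   # pending (proc_id, address), in arrival order
        self.active = None     # currently activated request, if any

    def request(self, proc_id, address):
        """A processor issues a persistent request to the arbiter."""
        self.queue.append((proc_id, address))
        return self._maybe_activate()

    def deactivate(self, proc_id):
        """The satisfied processor tells the arbiter it is done."""
        if self.active is None or self.active[0] != proc_id:
            raise ValueError("only the active requester may deactivate")
        self.active = None
        return self._maybe_activate()

    def _maybe_activate(self):
        """Return the activate message to broadcast, or None."""
        if self.active is None and self.queue:
            self.active = self.queue.popleft()
            return ("activate",) + self.active
        return None


arbiter = CentralArbiter()
msg_p0 = arbiter.request(0, "B")      # activated immediately
msg_p2 = arbiter.request(2, "B")      # queued behind P0
msg_p1 = arbiter.request(1, "B")      # queued behind P2
handoff = arbiter.deactivate(0)       # P0 done; next activate goes out
```

In this sketch the next requester cannot be activated until the deactivate has reached the arbiter and the arbiter has broadcast again, which is where the extra handoff latency in the centralized scheme comes from.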

Slide 21

Improved Scheme: Distributed Arbitration [NEW]

[Figure: two-CMP system (CMP 0 with P0, P1; CMP 1 with P2, P3; L1 I&D caches, Shared L2s, mem 0, mem 1); Store B shown at P0 and P2; every node's table lists persistent requests P0: B, P1: B, P2: B]

– Processors broadcast persistent requests

Slide 22

Improved Scheme: Distributed Arbitration [NEW]

[Figure: same system; every node's table lists P0: B, P1: B, P2: B, with "P0: B" highlighted at each node]

– Processors broadcast persistent requests
– Fixed priority (processor number)

Slide 23

Improved Scheme: Distributed Arbitration [NEW]

[Figure: same system; every node's table lists P0: B, P1: B, P2: B, with "P1: B" highlighted at each node]

– Processors broadcast persistent requests
– Fixed priority (processor number)
– Processors broadcast deactivate

Slide 24

Improved Scheme: Distributed Arbitration [NEW]

[Figure: same system; every node's table now lists only P1: B and P2: B, with "P1: B" highlighted at each node]

– Bottom line: Handoff is a single message latency
  • Subtle point: P0 and P1 must wait until next “wave”

Slide 25

Implementing Distributed Persistent Requests

• Table at each cache
  – Sized to N entries for each processor (we use N=1)
  – Indexed by processor ID
  – Content-addressable by Address

• Each incoming message must access table
  – Not on the critical path, so it can be a slow CAM

• Activate/deactivate reordering cannot be allowed
  – Persistent request virtual channel must be point-to-point ordered
  – Or, other solution such as sequence numbers or acks
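
The per-cache table described on this slide might look like the following sketch. The class and method names are hypothetical, N is fixed at 1, a linear scan stands in for the content-addressable lookup, and "lowest processor number wins" is an assumption chosen to match the order shown in the figures (P0, then P1). The "wave" rule from the previous slide is not modeled.

```python
class PersistentRequestTable:
    """Per-cache table of active persistent requests, one entry per processor."""

    def __init__(self, num_processors):
        # Indexed by processor ID; None means no active request.
        self.entries = [None] * num_processors

    def activate(self, proc_id, address):
        self.entries[proc_id] = address

    def deactivate(self, proc_id):
        self.entries[proc_id] = None

    def active_requester(self, address):
        """Lowest-numbered processor with a request for this address, or None.

        The scan plays the role of the CAM lookup by address.
        """
        for proc_id, entry in enumerate(self.entries):
            if entry == address:
                return proc_id
        return None


# Six tables stand in for the caches and memories of a small system.
tables = [PersistentRequestTable(num_processors=4) for _ in range(6)]


def broadcast(operation, *args):
    """Apply the same activate/deactivate at every node, in the same order."""
    for table in tables:
        getattr(table, operation)(*args)


broadcast("activate", 2, "B")
broadcast("activate", 0, "B")
broadcast("activate", 1, "B")
first = tables[0].active_requester("B")
broadcast("deactivate", 0)            # a single broadcast hands off to P1
second = tables[0].active_requester("B")
```

Each node decides locally who should receive tokens, so no arbiter round trip is needed; the sketch applies broadcasts in one global order, which is stronger than the point-to-point ordering the slide requires.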

Slide 26

Implementing Distributed Persistent Requests

• Should reads be distinguished from writes?
  – Not necessary, but
  – Persistent Read request is helpful

• Implications of flat distributed arbitration
  – Simple flat for correctness
  – Global broadcast when used
    • Fortunately they are rare in typical workloads (0.3%)
    • Bad workload (very high contention) would burn bandwidth
  – Maximum # processors must be architected

• What about a hierarchical persistent request scheme?
  – Possible, but correctness is no longer flat
  – Make the common case fast

Slide 27

Reducing Unnecessary Traffic

• Problem: Which token-holding cache responds with data?

• Solution: Distinguish one token as the owner token

– The owner includes data with token response

– Clean vs. dirty owner distinction also useful for writebacks

Slide 28

Outline

• Motivation and Background

• Token Coherence: Flat for Correctness

• Token Coherence: Hierarchical for Performance
  – TokenCMP
  – Another look at performance policies

• Evaluation

Slide 29

Hierarchical for Performance: TokenCMP

• Target System:
  – 2-8 CMPs
  – Private L1s, shared L2 per CMP
  – Any interconnect, but high-bandwidth

• Performance Policy Goals:
  – Aggressively acquire tokens
  – Exploit on-chip locality and bandwidth
  – Respect cache hierarchy
  – Detecting and handling missed tokens

Slide 30

Hierarchical for Performance: TokenCMP

• Approach:
  – On L1 miss, broadcast within own CMP
    • Local cache responds if possible
  – On L2 miss, broadcast to other CMPs
  – Appropriate L2 bank responds or broadcasts within its CMP
    • Optionally filter
  – Responses between CMPs carry extra tokens for future locality

• Handling missed tokens:
  – Timeout after average memory latency
  – Invoke persistent request (no retries)

• Larger systems can use filters, multicast, soft-state directories
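
The missed-token handling on this slide (wait for a timeout, then go straight to a persistent request) can be sketched as below. The function name, the `(arrival_cycle, token_count)` response format, and the 200-cycle timeout are assumptions for illustration; the slide says only that the timeout is set to the average memory latency.

```python
def resolve_miss(tokens_needed, responses, timeout_cycles):
    """Decide how a miss completes.

    responses: list of (arrival_cycle, token_count) pairs produced by the
    transient request. Returns ("transient", cycle) if enough tokens arrive
    by the timeout, otherwise ("persistent", timeout_cycles) to indicate
    that a persistent request is invoked at the timeout, with no retries.
    """
    collected = 0
    for cycle, tokens in sorted(responses):
        if cycle > timeout_cycles:
            break
        collected += tokens
        if collected >= tokens_needed:
            return ("transient", cycle)
    return ("persistent", timeout_cycles)


# A write needs all T = 4 tokens (hypothetical values throughout).
fast_path = resolve_miss(4, [(50, 1), (120, 3)], timeout_cycles=200)
slow_path = resolve_miss(4, [(50, 1), (300, 3)], timeout_cycles=200)
```

In the second call the remaining tokens are still in flight at the timeout, so the requester escalates instead of reissuing the transient request.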

Slide 31

Other Optimizations in TokenCMP

• Implementing E-state
  – Memory responds with all tokens on read request
  – Use clean/dirty owner distinction to eliminate writing back unwritten data

• Implementing Migratory Sharing
  – What is it?
    • A processor’s read request results in exclusive permission if responder has exclusive permission and wrote the block
  – In TokenCMP, simply return all tokens

• Non-speculative delay
  – Hold block for some # cycles so permission isn’t stolen prematurely
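
The migratory-sharing rule above reduces to a choice of how many tokens a responder hands over on a read. The sketch below assumes the cache has already been chosen to respond; the function name is hypothetical, and returning a single token in the non-migratory case is an assumption of this sketch.

```python
def tokens_to_send_on_read(tokens_held, total_tokens, wrote_block):
    """Number of tokens a responding cache sends for a read request."""
    if tokens_held == 0:
        return 0                      # nothing to give
    if tokens_held == total_tokens and wrote_block:
        # Migratory sharing: responder had exclusive permission and wrote
        # the block, so it returns all tokens and the reader can write.
        return total_tokens
    return 1                          # assumed default: one token to read


migratory = tokens_to_send_on_read(4, 4, wrote_block=True)
plain_read = tokens_to_send_on_read(4, 4, wrote_block=False)
```

No new protocol state is introduced here; the optimization is expressed entirely through the token count in the response.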

Slide 32

Another Look at Performance Policies

• How to find tokens?
  – Broadcast
  – Broadcast w/ filters
  – Multicast (destination-set prediction)
  – Directories (soft or hard)

• Who responds with data?
  – Owner token
    • TokenCMP uses Owner token for Inter-CMP responses
  – Other heuristics
    • For TokenCMP intra-CMP responses, cache responds if it has extra tokens
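
The two response heuristics named on this slide can be written as a single predicate for read requests. This is a sketch with hypothetical names, and it interprets "extra tokens" as holding more than one token, which is an assumption rather than something the slides define.

```python
def should_respond_to_read(tokens_held, has_owner_token, is_inter_cmp):
    """Decide whether this cache answers a read request with data."""
    if tokens_held == 0:
        return False
    if is_inter_cmp:
        # Inter-CMP: only the holder of the owner token responds.
        return has_owner_token
    # Intra-CMP: respond if there are tokens to spare (assumed: more than one).
    return tokens_held > 1


owner_answers_remote = should_respond_to_read(1, True, is_inter_cmp=True)
sharer_ignores_remote = should_respond_to_read(3, False, is_inter_cmp=True)
local_with_spare = should_respond_to_read(2, False, is_inter_cmp=False)
local_single_token = should_respond_to_read(1, False, is_inter_cmp=False)
```

Since these are performance-policy choices, a wrong guess costs time (a timeout and persistent request) but not correctness.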

Slide 33

Transient Requests May Reduce Complexity

• Processor holds the only required state about request

• L2 controller in TokenCMP very simple:
  – Re-broadcasts L1 request message on a miss
  – Re-broadcasts or filters external request messages
  – Possible states:
    • no tokens (I)
    • all tokens (M)
    • some tokens (S)
  – Bounce unexpected tokens to memory

• DirectoryCMP’s L2 controller is complex
  – Allocates MSHR on miss and forward
  – Issues invalidates and receives acks
  – Orders all intra-CMP requests and writebacks
  – 57 states in our L2 implementation!
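
The three TokenCMP L2 states listed above follow directly from the token count, as this small sketch shows. The function name and the T = 4 example are hypothetical.

```python
def l2_state(tokens_held, total_tokens):
    """Map a token count to the L2 state names used on the slide."""
    if not 0 <= tokens_held <= total_tokens:
        raise ValueError("token count out of range")
    if tokens_held == 0:
        return "I"    # no tokens
    if tokens_held == total_tokens:
        return "M"    # all tokens
    return "S"        # some tokens


states = [l2_state(n, 4) for n in range(5)]
```

The state is a pure function of the token count, so the sketch needs no transient states of the kind that inflate the DirectoryCMP controller.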

Slide 34

Writebacks

• DirectoryCMP uses “3-phase writebacks”
  – L1 issues writeback request
  – L2 enters transient state or blocks request
  – L2 responds with writeback ack
  – L1 sends data

• TokenCMP uses “fire-and-forget” writebacks
  – Immediately send tokens and data
  – Heuristic: Only send data if # tokens > 1

Slide 35

Outline

• Motivation and Background

• Token Coherence: Flat for Correctness

• Token Coherence: Hierarchical for Performance

• Evaluation
  – Model checking
  – Performance w/ commercial workloads
  – Robustness

Slide 36

TokenCMP Evaluation

• Simple?
  – Some anecdotal examples and comparisons
  – Model checking

• Fast?
  – Full-system simulation w/ commercial workloads

• Robust?
  – Micro-benchmarks to simulate high contention

Slide 37

Complexity Evaluation with Model Checking

This work was performed by Jesse Bingham and Alan Hu of the University of British Columbia

• Methods:
  – TLA+ and TLC
  – DirectoryCMP omits all intra-CMP details
  – TokenCMP’s correctness substrate modeled

• Result:
  – Complexity similar between TokenCMP and non-hierarchical DirectoryCMP
  – Correctness Substrate verified to be correct and deadlock-free
  – All possible performance protocols correct

Slide 38

Performance Evaluation

• Target System:
  – 4 CMPs, 4 procs/CMP
  – 2GHz OoO SPARC, 8MB shared L2 per chip
  – Directly connected interconnect

• Methods: Multifacet GEMS simulator
  – Simics augmented with timing models
  – Released soon: http://www.cs.wisc.edu/gems

• Benchmarks:
  – Performance: Apache, Spec, OLTP
  – Robustness: Locking micro-benchmark

Slide 39

Full-system Simulation: Runtime

– TokenCMP performs 9-50% faster than DirectoryCMP

Slide 40

Full-system Simulation: Runtime

– TokenCMP performs 9-50% faster than DirectoryCMP

[Figure: runtime chart with annotations "DRAM Directory" and "Perfect L2"]

Slide 41

Full-system Simulation: Inter-CMP Traffic

– TokenCMP traffic is reasonable (or better)

• DirectoryCMP control overhead greater than broadcast for small system

Slide 42

Full-system Simulation: Intra-CMP Traffic

Slide 43

Performance Robustness

[Figure: locking micro-benchmark results; axis labels "less contention" and "more contention"; annotation "(correctness substrate only)"]


Slide 45

Performance Robustness

[Figure: locking micro-benchmark results; axis labels "less contention" and "more contention"]

Slide 46

Summary

• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs

• Problem: Coherence with Multiple CMPs

• Old Solution: Hierarchical Directory → Complex & Slow

• New Solution: Apply Token Coherence
  – Developed for glueless multiprocessor [2003]
  – Keep: Flat for Correctness
  – Exploit: Hierarchical for performance

• Less Complex & Faster than Hierarchical Directory