
Interconnect-Aware Coherence Protocols for Chip Multiprocessors


Page 1: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Liqun Cheng, Naveen Muralimanohar, Karthik Ramani, Rajeev Balasubramonian, John Carter

University of Utah

Page 2: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Motivation: Coherence Traffic

- CMPs are ubiquitous and require coherence among multiple cores
- Coherence operations entail frequent communication
- Messages have different latency and bandwidth needs
- Heterogeneous wires yield 11% better performance and 22.5% lower wire power

[Figure: three cores C1-C3 with private L1 caches and a shared L2; read-miss messages (Read Req, Fwd to owner, Data) and write-miss messages (Ex Req, Inval, Inv Ack)]

Page 3: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Exclusive Request for a Shared Copy

1. Rd-Ex request from processor 1
2. Directory sends a clean copy to processor 1
3. Directory sends an invalidate message to processor 2
4. Cache 2 sends an acknowledgement back to processor 1

[Figure: message flow among Processor 1/Cache 1, the L2 & directory, and Processor 2/Cache 2, with each message marked critical or non-critical]

Hop imbalance: the requester cannot proceed until the two-hop invalidate/acknowledgement path (messages 3 and 4) completes, so the one-hop data reply (message 2) is less critical and can tolerate slower wires.
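A back-of-the-envelope sketch of this hop-imbalance argument; the per-hop latencies are invented placeholders (in arbitrary cycles), not figures from the paper.

```python
# Hop-imbalance sketch: the requester resumes only after BOTH the data
# reply (one hop) and the invalidate + ack chain (two hops) arrive.
# All latencies below are made-up placeholder values.
def stall_cycles(reply_latency, inval_hop, ack_hop):
    return max(reply_latency, inval_hop + ack_hop)

baseline = stall_cycles(reply_latency=10, inval_hop=10, ack_hop=10)  # 20
slowed   = stall_cycles(reply_latency=18, inval_hop=10, ack_hop=10)  # still 20

# Slowing the reply (e.g. moving it to cheaper wires) does not lengthen the
# stall as long as it stays faster than the two-hop invalidate/ack path.
print(baseline, slowed)
```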

Page 4: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Wire Characteristics

Wire resistance and capacitance per unit length:

$$R_{wire} = \frac{\rho}{(thickness - barrier)\,(width - 2\,barrier)}$$

$$C_{wire} = \epsilon_0 \left( 2K\,\frac{thickness}{spacing} + 2\,\frac{width}{layerspacing} \right) + fringe(\epsilon_{horiz}, \epsilon_{vert})$$

Increasing the wire width lowers resistance (and delay); increasing the spacing lowers coupling capacitance (and delay); both reduce the bandwidth available per unit metal area.
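To make the trade-off concrete, here is a minimal sketch that evaluates the two expressions above and an Elmore-style RC delay; every numeric constant is an illustrative placeholder, not a parameter from the paper.

```python
# Minimal sketch of the wire R/C model above; all numeric values are
# illustrative placeholders, not the parameters used in the paper.
EPS0 = 8.85e-12          # vacuum permittivity (F/m)
RHO = 2.2e-8             # copper resistivity (ohm*m), illustrative
K = 2.7                  # horizontal dielectric constant, illustrative
BARRIER = 10e-9          # barrier thickness (m), illustrative
LAYER_SPACING = 200e-9   # vertical distance to adjacent layers (m), illustrative
FRINGE = 40e-12          # lumped fringe capacitance (F/m), illustrative

def wire_rc(width, spacing, thickness):
    """Per-unit-length resistance (ohm/m) and capacitance (F/m)."""
    r = RHO / ((thickness - BARRIER) * (width - 2 * BARRIER))
    c = EPS0 * (2 * K * thickness / spacing + 2 * width / LAYER_SPACING) + FRINGE
    return r, c

def rc_delay(width, spacing, thickness, length):
    """Distributed (Elmore) RC delay of an unrepeated segment: 0.5 * R * C * L^2."""
    r, c = wire_rc(width, spacing, thickness)
    return 0.5 * r * c * length ** 2

# Doubling width and spacing (an L-wire-like design) cuts RC delay but
# halves the number of wires that fit in the same metal area.
base = rc_delay(width=100e-9, spacing=100e-9, thickness=200e-9, length=1e-3)
fat  = rc_delay(width=200e-9, spacing=200e-9, thickness=200e-9, length=1e-3)
print(f"base delay: {base:.3e} s, wide/spaced delay: {fat:.3e} s")
```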

Page 5: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Design Space Exploration

Tuning wire width and spacing: starting from the base-case B wires, increasing both width and spacing yields L wires that are fast but low bandwidth (delay drops, but fewer wires fit in the same metal area).

[Figure: delay vs. bandwidth as width and spacing increase]

Page 6: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Design Space Exploration

Tuning repeater size and spacing: traditional wires use large repeaters at the delay-optimal spacing; power-optimal wires use smaller repeaters with increased spacing, trading some delay for lower power.

[Figure: delay vs. power as repeater size and spacing are tuned]

Page 7: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Design Space Exploration

Base case B wires (8x plane):            Latency 1x,   Power 1x,   Area 1x
Base case W wires (4x plane):            Latency 1.6x, Power 0.9x, Area 0.5x
Power-optimized PW wires (4x plane):     Latency 3.2x, Power 0.3x, Area 0.5x
Fast, low-bandwidth L wires (8x plane):  Latency 0.5x, Power 0.5x, Area 4x
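The relative characteristics above can be captured in a small lookup table; the sketch below shows how a mapping heuristic might consult it. The dictionary values come directly from this slide, while the heuristic itself is only illustrative.

```python
# Relative latency/power/area of each wire type, as listed on this slide.
WIRE_TYPES = {
    "B-8x":  {"latency": 1.0, "power": 1.0, "area": 1.0},
    "W-4x":  {"latency": 1.6, "power": 0.9, "area": 0.5},
    "PW-4x": {"latency": 3.2, "power": 0.3, "area": 0.5},
    "L-8x":  {"latency": 0.5, "power": 0.5, "area": 4.0},
}

def cheapest_wire(max_latency):
    """Illustrative heuristic: lowest-power wire whose relative latency
    does not exceed the budget (None if nothing qualifies)."""
    candidates = [(w["power"], name) for name, w in WIRE_TYPES.items()
                  if w["latency"] <= max_latency]
    return min(candidates)[1] if candidates else None

print(cheapest_wire(1.0))   # tight latency budget  -> 'L-8x'
print(cheapest_wire(4.0))   # relaxed budget        -> 'PW-4x'
```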

Page 8: Interconnect-Aware Coherence Protocols for Chip Multiprocessors


Outline

Overview

Wire Design Space Exploration

Protocol-dependent Techniques

Protocol-independent Techniques

Results

Conclusions

Page 9: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Directory-Based Protocol (Write-Invalidate)

Map critical/small messages onto L wires and non-critical messages onto PW wires:

- Read-exclusive request for a block in shared state
- Read request for a block in exclusive state
- Negative acknowledgement (NACK) messages

The first two cases exploit hop imbalance (see the sketch after this list).
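A minimal sketch of the kind of mapping logic this slide describes; the message-type names, state names, and the 24-bit narrow-payload cutoff are assumptions for illustration, not the protocol's actual interface.

```python
# Illustrative message-to-wire mapping for a write-invalidate directory
# protocol; message kinds and state names are made up for this sketch.
def choose_wires(msg_kind, block_state=None, payload_bits=None):
    # Small latency-critical control messages go on L-wires.
    if msg_kind in ("nack", "unblock", "write_control"):
        return "L"
    # Hop imbalance: the directory's data reply is less critical when the
    # requester must also wait on a two-hop invalidate or forward path.
    if msg_kind == "data_reply" and block_state in ("shared", "exclusive"):
        return "PW"
    # Writebacks are rarely on the critical path.
    if msg_kind == "writeback":
        return "PW"
    # Narrow payloads that fit the thin L-wire channel can use it too.
    if payload_bits is not None and payload_bits <= 24:
        return "L"
    # Everything else (address-carrying requests, critical replies).
    return "B"

print(choose_wires("data_reply", block_state="shared"))  # -> PW
print(choose_wires("request", payload_bits=64))          # -> B
```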

Page 10: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Read to an Exclusive Block

- Proc 1 (L1) sends a Read Req to the L2 & directory
- The directory sends a speculative reply to Proc 1 (non-critical) and forwards the request to the owner, Proc 2 (L1)
- Proc 2 forwards its dirty copy to Proc 1 (critical) and writes the data back to the L2 & directory (WB data, non-critical)
- Proc 1 acknowledges (ACK)

Because the owner's forwarded copy determines when Proc 1 can proceed, the speculative reply and the writeback can use slower PW wires.

Page 11: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

NACK Messages

A NACK (negative acknowledgement) is generated when the directory state is busy. It can carry the MSHR id of the request instead of the full address.

When directory load is low:
- Requests can be served on the next try
- Sending NACKs on L-wires can improve performance

When directory load is high:
- Frequent back-off and retry cycles
- Sending NACKs on PW-wires can reduce power consumption

A sketch of this dynamic mapping follows.
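A minimal sketch of the dynamic NACK mapping described above; the load metric, threshold, and class interface are assumptions chosen only to illustrate the decision.

```python
# Illustrative dynamic NACK mapping: pick wires based on an estimate of
# directory load (here, the occupancy of pending-request buffers).
LOAD_THRESHOLD = 0.5   # assumed cutoff between "low" and "high" load

class Directory:
    def __init__(self, buffer_slots):
        self.buffer_slots = buffer_slots
        self.pending = 0

    def load(self):
        return self.pending / self.buffer_slots

    def nack_wires(self):
        # Low load: the retry will likely succeed, so a fast NACK helps
        # performance. High load: retries will churn anyway, so save power.
        return "L" if self.load() < LOAD_THRESHOLD else "PW"

d = Directory(buffer_slots=16)
d.pending = 3
print(d.nack_wires())   # low load  -> 'L'
d.pending = 14
print(d.nack_wires())   # high load -> 'PW'
```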

Page 12: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Snoop-Bus-Based Protocol

Similar to a bus-based SMP system; uses signal wires and voting wires:
- Signal wires: to find the state of the block
- Voting wires: to vote for the owner of the shared data

Page 13: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Protocol-Independent Techniques

- Narrow bit-width operands for synchronization variables: locks and barriers use small integers
- Writeback data on PW-wires: writeback messages are rarely on the critical path
- Narrow messages on L-wires: messages that only contain src, dst, operand and MSHR_id (for example, the reply to an upgrade message)

A sketch of the narrow-message check follows.
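A minimal sketch of the narrow-message test implied here. The 24-bit figure matches the L sub-channel width quoted on the router slide later in the deck; the individual field widths and the helper itself are assumptions.

```python
# Illustrative check: can a reply be sent as a "narrow" message on L-wires?
# Field widths are assumed; 24 bits is the L sub-channel width from the
# router slide.
L_CHANNEL_BITS = 24
SRC_BITS = 4        # 16-core CMP -> 4 bits per core id (assumed encoding)
DST_BITS = 4
MSHR_ID_BITS = 4    # assumed number of MSHR entries per core

def fits_on_l_wires(operand_value):
    """True if src, dst, MSHR id and the operand fit in one L-wire flit."""
    operand_bits = max(1, operand_value.bit_length())
    total = SRC_BITS + DST_BITS + MSHR_ID_BITS + operand_bits
    return total <= L_CHANNEL_BITS

print(fits_on_l_wires(1))        # small lock/barrier value -> True
print(fits_on_l_wires(1 << 40))  # wide operand             -> False
```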

Page 14: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Implementation Complexity

A heterogeneous interconnect incurs additional complexity in:
- Cache coherence protocols: must be robust enough to handle message re-ordering
- The decision process
- The interconnect implementation

Page 15: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Complexity in the Decision Process

In the directory-based system:
- Optimizations that exploit hop imbalance: check the directory state
- Dynamic mapping of NACK messages: track the directory load
- Narrow messages: compute the width of an operand

Page 16: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Overhead in Interconnect Implementation

- Additional multiplexing/de-multiplexing at the sender and receiver sides
- Additional latches required for power-optimized wires: power savings in PW-wires go down by 5%
- Wire area overhead: zero – equal metal area in the base and heterogeneous cases

Page 17: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Router Complexity

Base model: one physical channel per port; virtual channels VC 1 and VC 2 feed a crossbar with outputs Out 1 and Out 2.

[Figure: base router model]

Page 18: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Router Complexity

Each physical channel is split into three sub-channels: B (64 bytes), PW (32 bytes) and L (24 bits), each feeding the crossbar toward Out 1 and Out 2.

[Figure: heterogeneous router with L, PW and B sub-channels]
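A minimal sketch of how a message might be broken into flits on each sub-channel, using the widths quoted above; the helper and the byte sizes of the example messages are illustrative.

```python
import math

# Sub-channel widths from this slide (L is 24 bits = 3 bytes).
CHANNEL_BYTES = {"B": 64, "PW": 32, "L": 3}

def flits_needed(channel, message_bytes):
    """Number of flits to carry a message on the chosen sub-channel."""
    return math.ceil(message_bytes / CHANNEL_BYTES[channel])

# Illustrative messages: a 72-byte data response (8B header + 64B line),
# and a 3-byte narrow reply (src, dst, MSHR id, small operand).
print(flits_needed("B", 72))    # -> 2 flits on B-wires
print(flits_needed("PW", 72))   # -> 3 flits on PW-wires
print(flits_needed("L", 3))     # -> 1 flit on L-wires
```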

Page 19: Interconnect-Aware Coherence Protocols for Chip Multiprocessors


Outline

Overview

Wire Design Space Exploration

Protocol-dependent Techniques

Protocol-independent Techniques

Results

Conclusions

Page 20: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Evaluation Platform & Simulation Methodology

- Virtutech Simics simulator
- Sixteen-core CMP
- Ruby timing model (GEMS): NUCA cache architecture, MOESI directory protocol
- Opal timing model (GEMS): out-of-order processor, multiple outstanding requests
- Benchmarks: SPLASH-2

[Figure: 16-core CMP floorplan with processors and L2 banks]

Page 21: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Wire Model

[Figure: distributed RC wire model with side-wall capacitance (C_side-wall) and coupling capacitance to adjacent wires (C_adj)]

Wire Type      Relative Latency   Relative Area   Dynamic Power   Static Power
B-Wire (8x)    1x                 1x              2.65            1x
B-Wire (4x)    1.6x               0.5x            2.9             1.13x
L-Wire (8x)    0.5x               4x              1.46            0.55x
PW-Wire (4x)   3.2x               0.5x            0.87            0.3x

Ref: Banerjee et al.
65 nm process, 10 metal layers: 4 in the 1X plane and 2 in each of the 2X, 4X and 8X planes.

Page 22: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Heterogeneous Interconnects

- B-Wires: requests carrying addresses; responses on the critical path
- L-Wires (latency optimized): narrow messages; unblock and write-control messages; NACKs
- PW-Wires (power optimized): writeback data; responses to read requests for an exclusive block

Page 23: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Performance Improvements

[Figure: speedup of the heterogeneous interconnect over the base model for the SPLASH-2 benchmarks (Barnes, Cholesky, FFT, FMM, LU-Cont, LU-NonCont, Ocean-Cont, Ocean-NonCont, Radix, Raytrace, Volrend, Water-Nsq, Water-Spa)]

Average improvement: 11%

Page 24: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Percentage of Critical/Non-Critical Messages

[Figure: per-benchmark fraction of traffic carried on L-wires and PW-wires for the SPLASH-2 benchmarks; callouts mark 13% and 40%]

Performance improvement: 11%
Power savings in wires: 22.5%

Page 25: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Power Savings in Wires

[Figure: percentage power savings in wires (0-30%) for each SPLASH-2 benchmark and the average]

Page 26: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

L-Message Distribution

[Figure: per-benchmark breakdown of L-wire messages into hop-imbalance, unblock & control, and narrow-message categories]

Page 27: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Sensitivity Analysis

Impact of an out-of-order core:
- Average speedup of 9.3% (partial simulation, only 100M instructions)
- An OOO core is more tolerant of long-latency operations

Link bandwidth and routing algorithm:
- Benchmarks with high link utilization are very sensitive to bandwidth changes
- Deterministic routing incurs a 3% performance loss compared to adaptive routing

Page 28: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Conclusions

- Coherence messages have diverse needs
- Intelligent mapping of messages onto heterogeneous wires can improve performance and reduce power
- Low-bandwidth, high-speed links improve performance by 11% for the SPLASH-2 benchmarks
- Non-critical traffic on the power-optimized wires decreases wire power by 22.5%