
Interconnect-Aware Coherence Protocols for Chip Multiprocessors


Page 1: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Liqun Cheng, Naveen Muralimanohar, Karthik Ramani, Rajeev Balasubramonian, John Carter

University of Utah

Page 2: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Motivation: Coherence Traffic

- CMPs are ubiquitous and require coherence among multiple cores
- Coherence operations entail frequent communication
- Messages have different latency and bandwidth needs
- Heterogeneous wires yield 11% better performance and 22.5% lower wire power

[Figure: three cores C1-C3 with private L1 caches and a shared L2; read-miss messages (Read Req, Fwd to owner, Data) and write-miss messages (Ex Req, Inval, Inv Ack)]

Page 3: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Exclusive Request for a Shared Copy

1. Rd-Ex request from processor 1
2. Directory sends a clean copy to processor 1
3. Directory sends an invalidate message to processor 2
4. Cache 2 sends an acknowledgement back to processor 1

[Figure: message flow among Processor 1/Cache 1, the L2 & directory, and Processor 2/Cache 2, with each message marked critical or non-critical]

Hop imbalance: the requester cannot proceed until the two-hop invalidate/acknowledgement path (messages 3 and 4) completes, so the one-hop data reply (message 2) is less critical and can tolerate slower wires.
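A back-of-the-envelope sketch of this hop-imbalance argument; the per-hop latencies are invented placeholders (in arbitrary cycles), not figures from the paper.

```python
# Hop-imbalance sketch: the requester resumes only after BOTH the data
# reply (one hop) and the invalidate + ack chain (two hops) arrive.
# All latencies below are made-up placeholder values.
def stall_cycles(reply_latency, inval_hop, ack_hop):
    return max(reply_latency, inval_hop + ack_hop)

baseline = stall_cycles(reply_latency=10, inval_hop=10, ack_hop=10)  # 20
slowed   = stall_cycles(reply_latency=18, inval_hop=10, ack_hop=10)  # still 20

# Slowing the reply (e.g. moving it to cheaper wires) does not lengthen the
# stall as long as it stays faster than the two-hop invalidate/ack path.
print(baseline, slowed)
```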

Page 4: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Wire Characteristics

Wire resistance and capacitance per unit length:

$$R_{wire} = \frac{\rho}{(thickness - barrier)\,(width - 2\,barrier)}$$

$$C_{wire} = \epsilon_0 \left( 2K\,\frac{thickness}{spacing} + 2\,\frac{width}{layerspacing} \right) + fringe(\epsilon_{horiz}, \epsilon_{vert})$$

Increasing the wire width lowers resistance (and delay); increasing the spacing lowers coupling capacitance (and delay); both reduce the bandwidth available per unit metal area.
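To make the trade-off concrete, here is a minimal sketch that evaluates the two expressions above and an Elmore-style RC delay; every numeric constant is an illustrative placeholder, not a parameter from the paper.

```python
# Minimal sketch of the wire R/C model above; all numeric values are
# illustrative placeholders, not the parameters used in the paper.
EPS0 = 8.85e-12          # vacuum permittivity (F/m)
RHO = 2.2e-8             # copper resistivity (ohm*m), illustrative
K = 2.7                  # horizontal dielectric constant, illustrative
BARRIER = 10e-9          # barrier thickness (m), illustrative
LAYER_SPACING = 200e-9   # vertical distance to adjacent layers (m), illustrative
FRINGE = 40e-12          # lumped fringe capacitance (F/m), illustrative

def wire_rc(width, spacing, thickness):
    """Per-unit-length resistance (ohm/m) and capacitance (F/m)."""
    r = RHO / ((thickness - BARRIER) * (width - 2 * BARRIER))
    c = EPS0 * (2 * K * thickness / spacing + 2 * width / LAYER_SPACING) + FRINGE
    return r, c

def rc_delay(width, spacing, thickness, length):
    """Distributed (Elmore) RC delay of an unrepeated segment: 0.5 * R * C * L^2."""
    r, c = wire_rc(width, spacing, thickness)
    return 0.5 * r * c * length ** 2

# Doubling width and spacing (an L-wire-like design) cuts RC delay but
# halves the number of wires that fit in the same metal area.
base = rc_delay(width=100e-9, spacing=100e-9, thickness=200e-9, length=1e-3)
fat  = rc_delay(width=200e-9, spacing=200e-9, thickness=200e-9, length=1e-3)
print(f"base delay: {base:.3e} s, wide/spaced delay: {fat:.3e} s")
```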

Page 5: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Design Space Exploration

Tuning wire width and spacing: starting from the base-case B wires, increasing both width and spacing yields L wires that are fast but low bandwidth (delay drops, but fewer wires fit in the same metal area).

[Figure: delay vs. bandwidth as width and spacing increase]

Page 6: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Design Space Exploration

Tuning repeater size and spacing: traditional wires use large repeaters at the delay-optimal spacing; power-optimal wires use smaller repeaters with increased spacing, trading some delay for lower power.

[Figure: delay vs. power as repeater size and spacing are tuned]

Page 7: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Design Space Exploration

Base case B wires (8x plane):            Latency 1x,   Power 1x,   Area 1x
Base case W wires (4x plane):            Latency 1.6x, Power 0.9x, Area 0.5x
Power-optimized PW wires (4x plane):     Latency 3.2x, Power 0.3x, Area 0.5x
Fast, low-bandwidth L wires (8x plane):  Latency 0.5x, Power 0.5x, Area 4x
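The relative characteristics above can be captured in a small lookup table; the sketch below shows how a mapping heuristic might consult it. The dictionary values come directly from this slide, while the heuristic itself is only illustrative.

```python
# Relative latency/power/area of each wire type, as listed on this slide.
WIRE_TYPES = {
    "B-8x":  {"latency": 1.0, "power": 1.0, "area": 1.0},
    "W-4x":  {"latency": 1.6, "power": 0.9, "area": 0.5},
    "PW-4x": {"latency": 3.2, "power": 0.3, "area": 0.5},
    "L-8x":  {"latency": 0.5, "power": 0.5, "area": 4.0},
}

def cheapest_wire(max_latency):
    """Illustrative heuristic: lowest-power wire whose relative latency
    does not exceed the budget (None if nothing qualifies)."""
    candidates = [(w["power"], name) for name, w in WIRE_TYPES.items()
                  if w["latency"] <= max_latency]
    return min(candidates)[1] if candidates else None

print(cheapest_wire(1.0))   # tight latency budget  -> 'L-8x'
print(cheapest_wire(4.0))   # relaxed budget        -> 'PW-4x'
```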

Page 8: Interconnect-Aware Coherence Protocols for Chip Multiprocessors


Outline

Overview

Wire Design Space Exploration

Protocol-dependent Techniques

Protocol-independent Techniques

Results

Conclusions

Page 9: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Directory-Based Protocol (Write-Invalidate)

Map critical/small messages onto L wires and non-critical messages onto PW wires:

- Read-exclusive request for a block in shared state
- Read request for a block in exclusive state
- Negative acknowledgement (NACK) messages

The first two cases exploit hop imbalance (see the sketch after this list).
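A minimal sketch of the kind of mapping logic this slide describes; the message-type names, state names, and the 24-bit narrow-payload cutoff are assumptions for illustration, not the protocol's actual interface.

```python
# Illustrative message-to-wire mapping for a write-invalidate directory
# protocol; message kinds and state names are made up for this sketch.
def choose_wires(msg_kind, block_state=None, payload_bits=None):
    # Small latency-critical control messages go on L-wires.
    if msg_kind in ("nack", "unblock", "write_control"):
        return "L"
    # Hop imbalance: the directory's data reply is less critical when the
    # requester must also wait on a two-hop invalidate or forward path.
    if msg_kind == "data_reply" and block_state in ("shared", "exclusive"):
        return "PW"
    # Writebacks are rarely on the critical path.
    if msg_kind == "writeback":
        return "PW"
    # Narrow payloads that fit the thin L-wire channel can use it too.
    if payload_bits is not None and payload_bits <= 24:
        return "L"
    # Everything else (address-carrying requests, critical replies).
    return "B"

print(choose_wires("data_reply", block_state="shared"))  # -> PW
print(choose_wires("request", payload_bits=64))          # -> B
```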

Page 10: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Read to an Exclusive Block

- Proc 1 (L1) sends a Read Req to the L2 & directory
- The directory sends a speculative reply to Proc 1 (non-critical) and forwards the request to the owner, Proc 2 (L1)
- Proc 2 forwards its dirty copy to Proc 1 (critical) and writes the data back to the L2 & directory (WB data, non-critical)
- Proc 1 acknowledges (ACK)

Because the owner's forwarded copy determines when Proc 1 can proceed, the speculative reply and the writeback can use slower PW wires.

Page 11: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

NACK Messages

A NACK (negative acknowledgement) is generated when the directory state is busy. It can carry the MSHR id of the request instead of the full address.

When directory load is low:
- Requests can be served on the next try
- Sending NACKs on L-wires can improve performance

When directory load is high:
- Frequent back-off and retry cycles
- Sending NACKs on PW-wires can reduce power consumption

A sketch of this dynamic mapping follows.
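A minimal sketch of the dynamic NACK mapping described above; the load metric, threshold, and class interface are assumptions chosen only to illustrate the decision.

```python
# Illustrative dynamic NACK mapping: pick wires based on an estimate of
# directory load (here, the occupancy of pending-request buffers).
LOAD_THRESHOLD = 0.5   # assumed cutoff between "low" and "high" load

class Directory:
    def __init__(self, buffer_slots):
        self.buffer_slots = buffer_slots
        self.pending = 0

    def load(self):
        return self.pending / self.buffer_slots

    def nack_wires(self):
        # Low load: the retry will likely succeed, so a fast NACK helps
        # performance. High load: retries will churn anyway, so save power.
        return "L" if self.load() < LOAD_THRESHOLD else "PW"

d = Directory(buffer_slots=16)
d.pending = 3
print(d.nack_wires())   # low load  -> 'L'
d.pending = 14
print(d.nack_wires())   # high load -> 'PW'
```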

Page 12: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Snoop-Bus-Based Protocol

Similar to a bus-based SMP system; uses signal wires and voting wires:
- Signal wires: to find the state of the block
- Voting wires: to vote for the owner of the shared data

Page 13: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Protocol-Independent Techniques

- Narrow bit-width operands for synchronization variables: locks and barriers use small integers
- Writeback data on PW-wires: writeback messages are rarely on the critical path
- Narrow messages on L-wires: messages that only contain src, dst, operand and MSHR_id (for example, the reply to an upgrade message)

A sketch of the narrow-message check follows.
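A minimal sketch of the narrow-message test implied here. The 24-bit figure matches the L sub-channel width quoted on the router slide later in the deck; the individual field widths and the helper itself are assumptions.

```python
# Illustrative check: can a reply be sent as a "narrow" message on L-wires?
# Field widths are assumed; 24 bits is the L sub-channel width from the
# router slide.
L_CHANNEL_BITS = 24
SRC_BITS = 4        # 16-core CMP -> 4 bits per core id (assumed encoding)
DST_BITS = 4
MSHR_ID_BITS = 4    # assumed number of MSHR entries per core

def fits_on_l_wires(operand_value):
    """True if src, dst, MSHR id and the operand fit in one L-wire flit."""
    operand_bits = max(1, operand_value.bit_length())
    total = SRC_BITS + DST_BITS + MSHR_ID_BITS + operand_bits
    return total <= L_CHANNEL_BITS

print(fits_on_l_wires(1))        # small lock/barrier value -> True
print(fits_on_l_wires(1 << 40))  # wide operand             -> False
```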

Page 14: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Implementation Complexity

A heterogeneous interconnect incurs additional complexity in:
- Cache coherence protocols: must be robust enough to handle message re-ordering
- The decision process
- The interconnect implementation

Page 15: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Complexity in the Decision Process

In the directory-based system:
- Optimizations that exploit hop imbalance: check the directory state
- Dynamic mapping of NACK messages: track the directory load
- Narrow messages: compute the width of an operand

Page 16: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Overhead in Interconnect Implementation

- Additional multiplexing/de-multiplexing at the sender and receiver sides
- Additional latches required for power-optimized wires: power savings in PW-wires go down by 5%
- Wire area overhead: zero – equal metal area in the base and heterogeneous cases

Page 17: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Router Complexity

Base model: one physical channel per port; virtual channels VC 1 and VC 2 feed a crossbar with outputs Out 1 and Out 2.

[Figure: base router model]

Page 18: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Router Complexity

Each physical channel is split into three sub-channels: B (64 bytes), PW (32 bytes) and L (24 bits), each feeding the crossbar toward Out 1 and Out 2.

[Figure: heterogeneous router with L, PW and B sub-channels]
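A minimal sketch of how a message might be broken into flits on each sub-channel, using the widths quoted above; the helper and the byte sizes of the example messages are illustrative.

```python
import math

# Sub-channel widths from this slide (L is 24 bits = 3 bytes).
CHANNEL_BYTES = {"B": 64, "PW": 32, "L": 3}

def flits_needed(channel, message_bytes):
    """Number of flits to carry a message on the chosen sub-channel."""
    return math.ceil(message_bytes / CHANNEL_BYTES[channel])

# Illustrative messages: a 72-byte data response (8B header + 64B line),
# and a 3-byte narrow reply (src, dst, MSHR id, small operand).
print(flits_needed("B", 72))    # -> 2 flits on B-wires
print(flits_needed("PW", 72))   # -> 3 flits on PW-wires
print(flits_needed("L", 3))     # -> 1 flit on L-wires
```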

Page 19: Interconnect-Aware Coherence Protocols for Chip Multiprocessors


Outline

Overview

Wire Design Space Exploration

Protocol-dependent Techniques

Protocol-independent Techniques

Results

Conclusions

Page 20: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Evaluation Platform & Simulation Methodology

- Virtutech Simics simulator
- Sixteen-core CMP
- Ruby timing model (GEMS): NUCA cache architecture, MOESI directory protocol
- Opal timing model (GEMS): out-of-order processor, multiple outstanding requests
- Benchmarks: SPLASH-2

[Figure: 16-core CMP floorplan with processors and L2 banks]

Page 21: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Wire Model

[Figure: distributed RC wire model with side-wall capacitance (C_side-wall) and coupling capacitance to adjacent wires (C_adj)]

Wire Type      Relative Latency   Relative Area   Dynamic Power   Static Power
B-Wire (8x)    1x                 1x              2.65            1x
B-Wire (4x)    1.6x               0.5x            2.9             1.13x
L-Wire (8x)    0.5x               4x              1.46            0.55x
PW-Wire (4x)   3.2x               0.5x            0.87            0.3x

Ref: Banerjee et al.
65 nm process, 10 metal layers: 4 in the 1X plane and 2 in each of the 2X, 4X and 8X planes.

Page 22: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Heterogeneous Interconnects

- B-Wires: requests carrying addresses; responses on the critical path
- L-Wires (latency optimized): narrow messages; unblock and write-control messages; NACKs
- PW-Wires (power optimized): writeback data; responses to read requests for an exclusive block

Page 23: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Performance Improvements

[Figure: speedup of the heterogeneous interconnect over the base model for the SPLASH-2 benchmarks (Barnes, Cholesky, FFT, FMM, LU-Cont, LU-NonCont, Ocean-Cont, Ocean-NonCont, Radix, Raytrace, Volrend, Water-Nsq, Water-Spa)]

Average improvement: 11%

Page 24: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Percentage of Critical/Non-Critical Messages

[Figure: per-benchmark fraction of traffic carried on L-wires and PW-wires for the SPLASH-2 benchmarks; callouts mark 13% and 40%]

Performance improvement: 11%
Power savings in wires: 22.5%

Page 25: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Power Savings in Wires

[Figure: percentage power savings in wires (0-30%) for each SPLASH-2 benchmark and the average]

Page 26: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

L-Message Distribution

[Figure: per-benchmark breakdown of L-wire messages into hop-imbalance, unblock & control, and narrow-message categories]

Page 27: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Sensitivity Analysis

Impact of an out-of-order core:
- Average speedup of 9.3% (partial simulation, only 100M instructions)
- An OOO core is more tolerant of long-latency operations

Link bandwidth and routing algorithm:
- Benchmarks with high link utilization are very sensitive to bandwidth changes
- Deterministic routing incurs a 3% performance loss compared to adaptive routing

Page 28: Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Conclusions

- Coherence messages have diverse needs
- Intelligent mapping of messages onto heterogeneous wires can improve performance and reduce power
- Low-bandwidth, high-speed links improve performance by 11% for the SPLASH-2 benchmarks
- Non-critical traffic on the power-optimized wires decreases wire power by 22.5%