Power Efficient Cache Coherence

Preview:

DESCRIPTION

Power Efficient Cache Coherence. Craig Saldanha Mikko Lipasti. Motivation. Power consumption becoming a serious design constraint. Market demand for faster and more complex servers. Complex coherence protocol and interconnect. f clock + Complexity = P interconnect. Motivation. - PowerPoint PPT Presentation

Citation preview

Power Efficient Cache Coherence

Craig SaldanhaMikko Lipasti

Motivation Power consumption becoming a

serious design constraint. Market demand for faster and

more complex servers. Complex coherence protocol and

interconnect. fclock + Complexity = Pinterconnect

Motivation Traditional power saving

methodologies ineffective. Minimize number of transaction

packets. At the end-points: Jetty. At the source: Serial Snoop.

OUTLINE

Overview of Snoop based protocols and opportunity for power savings

Latency and Power consumption of parallel snooping techniques

Serial Snooping Results Conclusions Future Work

Snoop Based Coherence

Memory

Local Node Tag Lookup Bus Arbitration Snoop Transmission

Remote Node Tag Lookup Snoop Response Combination of Responses Data Fetch Data Transmit

Tag Array

Data Array

P1 P2 P3 P4

Degrees of Speculation 3 Degrees of Freedom to SpeculateSnooping Data Fetch Data Transmit

Snoop

DFetch

DFetch

D-Xmit

D-Xmit

D-Xmit

D-Xmit

Parallel Snoop,Spec Dfetch Spec DXmit

Parallel Snoop,Spec Dfetch Non-Spec DXmit

Parallel Snoop,Non-Spec Dfetch Non-Spec DXmit

Serial Snoop,Spec Dfetch Spec DXmit

Serial Snoop,Spec Dfetch Non-Spec DXmit

Serial Snoop,Non-Spec Dfetch Non-Spec DXmitX

X

Latency and Power assumptions

MEMORY

Consider only load misses

Tree of point-point connections.

Latency to traverse a link: 1Bus cycle (7ns)

Tag Look up :1 bus cycle (7ns)

D-Fetch: 2 bus cycles (14ns)

DRAM access: 10 bus cycles (70ns)

P1 P2 P3 P4

Root Node

Switch1 Switch2

Board1

Backplane

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

MEMORY

Snoop Broadcast

0 7

Latency

Power

Plink

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

MEMORY

Snoop Broadcast

0 7

Latency

Power

Plink+Pswitch

14

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

MEMORY

Snoop BroadcastLatency

Power

2Plink+Pswitch

0 7 14 21

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

MEMORY

Snoop BroadcastLatency

Power

2Plink+2Pswitch

0 7 14 21 28

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

MEMORYSnoop BroadcastLatency

Power

5Plink+2Pswitch

0 7 14 21 28 35

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

MEMORY

Snoop BroadcastLatency

Power

5Plink+4Pswitch

0 7 14 21 28 35 42

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

MEMORY

Snoop BroadcastLatency

Power

8Plink+4Pswitch

0 7 14 21 28 35 42 49

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

Memory Access :Data FetchLatency

Power

Pmem

35 91 105

MEMORY

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

Memory Access :Data TransmitLatency

Power

Pmem+3Plink+2Pswitch

35 91 105 140

MEMORY

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

Remote Node: Tag LookupLatency

Power

3Ptag

49 56

MEMORY

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

MEMORYRemote Node :Snoop Response Latency

Power

3Ptag+3*(Pswitch+2Plink)+Plink+2Pswitch+2Plink

49 56 105

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

Remote Node: Data FetchLatency

Power

3Pcache

49 63

MEMORY

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

MEMORYRemote Node :Data Transmit Latency

Power

3Ptag+3*(4Plink+3Pswitch)

49 63 112

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

Local Node

0TL Snoop BRDCSTTL49

105 14091Data XmitMemory Access

35Memory

105 14091Data XmitMemory Access

35

63DF

49Remote Node11263

DF Data Xmit49

10556 RSP + CMB

49Remote Node

10556 RSP + CMB

49TL

Latency

Power29Plink+18Pswitch+3Ptag+3Pcache+PmemRemote Node supplies the data

Memory supplies the data20Plink+11Pswitch+3Ptag+3Pcache+Pmem

Parallel Snoop Speculative Fetch Non-Speculative Transmit (PS/SF/NT)

0 49

63

105

105 14091

56

Snoop BRDCST

RSP

+ CMB

DF

Data XmitMemory Access35

TL

49

49

Local Node

Remote Node

Memory

Remote Node

105 154

TL

Data Xmit

Power21Plink+12Pswitch+3Ptag+Pcache+PmemRemote Node supplies the data

Memory supplies the data20Plink+11Pswitch+3Ptag+3Pcache+Pmem

Latency

Parallel Snoop Non-Speculative Fetch Non-Speculative Transmit (PS/NF/NT)

49

105

196161

Snoop BRDCST

RSP +

CMB

Data XmitMemory Access91

TL

49

Local Node

Remote Node

Memory

105

TL

119 168

Remote Node

DF Data Xmit

0

Latency

Power21Plink+12Pswitch+3Ptag+PcacheRemote Node supplies the data

Memory supplies the data 20Plink+11Pswitch+3Ptag+Pmem

MEMORY

Serial Snooping

Avoids Speculative transmission of Snoop packets.

MEMORY

Serial Snooping

Avoids Speculative transmission of Snoop packets.

Check the nearest neighbor

Data supplied with minimum latency and power

MEMORY

Serial Snooping

Forward snoop to next level

MEMORY

Serial Snooping

Forward snoop to next level

MEMORY

Serial Snooping

Search other half of tree

MEMORY

Serial Snooping

Search other half of tree

Search leaf nodes serially

MEMORY

Serial Snooping

Search other half of tree

Search leaf nodes serially

Serial Snooping : Features Latency to satisfy a request dependent on

distance from requestor. Data resident at the nearest neighbor

supplied with the lowest latency and power. Requests visible to memory controller only

at root node. Latency is adversely affected when

requested data present at the farthest node Worst case power consumption is still less

than the parallel snooping.

MEMORY

Serial Snooping :

Request satisfied by Nearest NodeLatency

Power

0 21 SNPTL

35 56DF Data Xmit

21

28 49 RSPTL

21

P1

P2

P2

Xmit Snoop: 2Plink + Pswitch

P2 Tag access and snoop response: Ptag + 2Plink + Pswitch

P2 Data Fetch and Xmit: Pcache +2Plink + Pswitch

Ptotal= 6Plink+3Pswitch+Ptag+Pcache

MEMORY

Serial Snooping :

Request satisfied by Next-Nearest NeighborLatency

Power

0 21 SNPTL

28 49

RSPTL21 63 77

77 84 133

77 91 140DF

TL Xmit

Xmit

XmitMemory Access63 133 168

P1

P2

P3

P3

Memory

16Plink+10Pswitch+2Ptag+Pcache

MEMORY

Serial Snooping :

Request satisfied by farthest nodeLatency

Power

0 21

XmitTL

28 49 RSPTL

21 63 77

77 84 10TL Xm

P1

P2

P3

Spec-Memory

P4

P4

Non-Spec-Memory

XmitMemory Access 133 168

RSPTL

105 161

DF Xmit

147

112

105 168119

XmitMemory Access217 252

147

XmitMemory Access133 182147

18Plink+11Pswitch+3Ptag+PcacheRemote Node supplies the data

If Memory supplies the data 17Plink+10Pswitch+3Ptag+Pmem

RESULTS : Load Miss Distributions

Load Miss Distribution

0

5

10

15

20

25

30

35

40

45

% o

f lo

ad

mis

se

s s

ati

sfi

ed

NearestNeighbourNext NearestNodeFarthestNodeMemory

raytrace specwebspecjbb tpc-w

RESULTS: Average Latencies to satisfy load misses

Average Latencies to satisfy load misses

0 20 40 60 80 100 120 140 160 180 200

Avg. Latencies (ns)

SSNFNT

SSSFNT

SSSFST

PSNFNT

PSSFNT

PSSFST

raytrace

specjbb

spe

cwe

btpc-w

RESULTS: Relative Power Savings

Factors contributing to power consumption

0

0.2

0.4

0.6

0.8

1

1.2

PS/SF/ST

SS/SF/ST

PS/SF/NT

SS/SF/NT

PS/NF/NT

SS/NF/NT

Plink Psw Ptag Pcache Pmem

CONCLUSIONS Reducing degree of speculation has

potential for significant power savings

Performance degradation is minimal for the set of benchmarks studied.

Serial Snooping with speculative memory fetch provides optimal latency and power consumption.

Future Work Develop detailed execution-driven

Power Model Explore different interconnect

topologies. Examine the viability of adaptive

mechanisms for protocol policy.

MEMORY

Serial Snooping

Nearest NeighborLatency

Power

2Plink+Pswitch

0 7 14 21

Questions

Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST)

Local Node

0TL Snoop BRDCSTTL49

MEMORY

Memory Access35

Memory Access35 91

Recommended