Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors Chinnakrishnan S. Ballapuram Ahmad Sharif Hsien-Hsin S. Lee

Page 1: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Chinnakrishnan S. Ballapuram, Ahmad Sharif, Hsien-Hsin S. Lee

Page 2: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Concurrent Execution in CMP

[Figure: a single-threaded program (Thread 0 with code, data, registers, and a local stack) beside a multi-threaded program (Threads 0, 1, and 2 with common code and data; each thread has its own registers and local stack), above a shared last level cache.]

Page 3: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Self-Modifying Code (SMC) Snoop

[Figure: four cores (Core 0 to Core 3), each with an IL1, a DL1, and an SMC snoop path.]

Page 4: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Snoop for Core 0 DL1 Miss

[Figure: four cores (Core 0 to Core 3), each with an IL1, a DL1, and an SMC snoop path, attached to the CMP core interconnect. Also shown: L2 queue (FIFO), L2 cache, snoop queue (FIFO), other logic and buffers, and the external interconnect.]

Page 5: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

External Snoop Request

[Figure: four cores (Core 0 to Core 3), each with an IL1, a DL1, and an SMC snoop path, attached to the CMP core interconnect. Also shown: L2 queue (FIFO), L2 cache, snoop queue (FIFO), other logic and buffers, and the external interconnect.]

Page 6: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Modified L2 Eviction, External Request, etc.

[Figure: four cores (Core 0 to Core 3), each with an IL1, a DL1, and an SMC snoop path, attached to the CMP core interconnect. Also shown: L2 queue (FIFO), L2 cache, snoop queue (FIFO), other logic and buffers, and the external interconnect.]

Page 7: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Modified L2 Eviction, External Request, etc.

[Figure: four cores (Core 0 to Core 3), each with an IL1, a DL1, and an SMC snoop path, attached to the CMP core interconnect. Also shown: L2 queue (FIFO), L2 cache, snoop queue (FIFO), other logic and buffers, and the external interconnect. Annotation: "As # of cores increases", with Power and Performance labels.]

Page 8: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Number of Snoop Probes

• SMC Snoops to I-Cache > Snoops to D-Cache > Snoops to LSB.

[Figure: bar chart. Y-axis: number of snoop probes in millions (0 to 12). X-axis: to_lsb, to_dcache, and to_icache for each of SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded apps. Series: 2C, 4C, 2 x 4C, 8C. One off-scale bar is labeled 16.4M.]

Page 9: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Snoop Probe and Snoop Rate

• % of data snoop > % of instruction cache snoop

[Figure: combined chart. Left axis: number of snoops in millions (0 to 30). Right axis: % of snoop increase (0% to 2400%). X-axis: processor configuration (2C, 4C, 2Px4C, 8C, 8C-MT, 2Px4C-MT). Series: to_lsb, to_dcache, to_icache, total snoops, % of data snoop increase, % of SMC snoop increase, % of total snoop increase. Annotations: ~22x increase, ~12x increase.]

Page 10: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

We propose two techniques to reduce the power consumed by snoop probes:

1. Selective Snoop Probe (SSP)
2. Essential Snoop Probe (ESP)

Page 11: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Selective Snoop Probe (SSP)
- SSP for SMC
- SSP for Non-Stack Accesses
- SSP for Stack Accesses

Page 12: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Selective Snoop Probe (SSP)
- SSP for SMC

Page 13: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Normal Operation: To Support SMC

[Figure: Core 0. Elements: dispatch from RS or LSB, L1 D-cache, MSHR, and an SMC snoop probe to the L1 I-cache.]

Page 14: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

SSP (SMC) – No SMC Snoop if BF1 Miss

[Figure: Core 0. Elements: dispatch from RS or LSB, all store addresses, HASH, cntr, BF1, MSHR, SMC snoop probe, L1 I-cache, L1 D-cache. BF1 is used to filter SMC/XMC snoops.]

r1 – read Bloom filter
u1 – update Bloom filter
cntr – counting Bloom filter

Page 15: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

SSP (SMC) – SMC Snoop if BF1 Hit

[Figure: Core 0. Elements: dispatch from RS or LSB, all store addresses, HASH, cntr, BF1, MSHR, SMC snoop probe, L1 I-cache, L1 D-cache.]

r1 – read Bloom filter
u1 – update Bloom filter
cntr – counting Bloom filter
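The BF1 slides describe a counting Bloom filter that is read (r1) on every store address and updated (u1) elsewhere, so that an SMC snoop probe reaches the L1 I-cache only on a filter hit. The following is a minimal Python sketch of that idea, not the authors' hardware design. It assumes that BF1 tracks the lines resident in the I-cache and is updated on I-cache fills and evictions; the class name, the counter count, and the folding hash are all hypothetical stand-ins.

```python
class CountingBloomFilter:
    """Counting Bloom filter over cache-line addresses (one hash, for brevity)."""

    def __init__(self, num_counters=512, line_bytes=64):
        self.counters = [0] * num_counters          # the "cntr" array
        self.line_shift = line_bytes.bit_length() - 1  # 64-byte lines -> shift by 6

    def _index(self, addr):
        line = addr >> self.line_shift
        # Placeholder folding hash; an assumption, not the deck's hash function.
        return (line ^ (line >> 9) ^ (line >> 18)) % len(self.counters)

    def insert(self, addr):
        """u1 (assumed): a line is filled into the I-cache."""
        self.counters[self._index(addr)] += 1

    def remove(self, addr):
        """u1 (assumed): a line is evicted or invalidated from the I-cache."""
        i = self._index(addr)
        if self.counters[i] > 0:
            self.counters[i] -= 1

    def may_contain(self, addr):
        """r1: False means the line is definitely absent; True may be a false positive."""
        return self.counters[self._index(addr)] > 0


def smc_snoop_needed(bf1, store_addr):
    """Send the SMC snoop probe only when BF1 hits for the store address."""
    return bf1.may_contain(store_addr)


bf1 = CountingBloomFilter()
bf1.insert(0x4000)                      # code line brought into the I-cache
print(smc_snoop_needed(bf1, 0x4010))    # store to the same 64-byte line
print(smc_snoop_needed(bf1, 0x8000))    # store to a line the filter has not seen
```

Counters, rather than single bits, let an eviction undo an earlier insertion without clearing entries that other resident lines still map to.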

Page 16: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Selective Snoop Probe (SSP)
- SSP for Stack Accesses

Page 17: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Normal Operation: Always Snoop for All Accesses

[Figure: Core 0. Elements: dispatch from RS or LSB, L1 D-cache, MSHR, dL1 miss, L2 queue, last level cache, snoop controller, snoop queue, and outgoing snoop probes.]

Page 18: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

SSP – Stack Accesses

[Figure: Core 0. All addresses carry an S-bit annotation, set by the front-end. Elements: dispatch from RS or LSB, L1 D-cache, MSHR, dL1 miss, L2 queue, last level cache, snoop controller (bit vector 0100), and snoop queue.]

Page 19: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Selective Snoop Probe (SSP)
- SSP for Non-Stack Accesses

Page 20: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

SSP – Non-stack Accesses Update BF2

[Figure: Core 0. All non-stack addresses dispatched from the RS or LSB update (u2) BF2 through HASH and cntr, with ME and SI state labels. BF2 filters snoops to the non-stack region. Also shown: L1 D-cache, MSHR, L2 queue, last level cache, snoop controller (bit vector 1000), and snoop queue.]

r2 – read Bloom filter
u2 – update Bloom filter
cntr – counting Bloom filter

Page 21: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

SSP – Non-stack Accesses Read BF2

[Figure: Core 0. On a dL1 miss, all addresses (carrying the S-bit annotation) read (r2) BF2, which is updated (u2) by all non-stack addresses through HASH and cntr, with ME and SI state labels. BF2 filters snoops to the non-stack region. Also shown: dispatch from RS or LSB, L1 D-cache, MSHR, L2 queue, last level cache, snoop controller (bit vector 1000), and snoop queue.]

r2 – read Bloom filter
u2 – update Bloom filter
cntr – counting Bloom filter

Page 22: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

SSP - Selectively Send Snoop Probes

[Figure: Core 0. Snoops are sent selectively. All addresses carry the S-bit annotation; all non-stack addresses update (u2) BF2 through HASH and cntr, with ME and SI state labels. BF2 filters snoops to the non-stack region. Also shown: dispatch from RS or LSB, L1 D-cache, MSHR, dL1 miss, L2 queue, last level cache, snoop controller (bit vector 1000), and snoop queue.]

r2 – read Bloom filter
u2 – update Bloom filter
cntr – counting Bloom filter
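The SSP slides combine two signals at the snoop controller: the S-bit that marks stack accesses and the BF2 lookup for non-stack addresses. A rough Python sketch of one possible decision rule follows. It rests on assumptions the slides do not spell out: that stack lines are private to a thread and need no probes, that every core keeps its own BF2, and that a probe goes only to cores whose BF2 reports a possible hit. `LineFilter` and `cores_to_snoop` are hypothetical names, and the exact set stands in for a real counting Bloom filter.

```python
LINE_SHIFT = 6  # 64-byte cache lines, matching the simulated cache line size


class LineFilter:
    """Stand-in for BF2: an exact set of line addresses.

    A real Bloom filter may also report false positives, never false negatives.
    """

    def __init__(self):
        self.lines = set()

    def update(self, addr):
        """u2: record a non-stack line held by this core."""
        self.lines.add(addr >> LINE_SHIFT)

    def may_contain(self, addr):
        """r2: query whether this core may hold the line."""
        return (addr >> LINE_SHIFT) in self.lines


def cores_to_snoop(requester, addr, is_stack, filters):
    """Return the core ids that should receive a snoop probe for a dL1 miss."""
    if is_stack:
        # Assumption: stack data is thread-local, so no other core is probed.
        return []
    return [core for core, bf2 in enumerate(filters)
            if core != requester and bf2.may_contain(addr)]


filters = [LineFilter() for _ in range(4)]   # one filter per core, 4-core CMP
filters[2].update(0x1000)                    # core 2 touched a non-stack line
print(cores_to_snoop(0, 0x1008, False, filters))  # non-stack miss from core 0
print(cores_to_snoop(0, 0x1008, True, filters))   # same address flagged as stack
```

Because a Bloom filter has no false negatives, skipping cores whose filter misses never drops a probe that was needed; false positives only cost an extra probe.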

Page 23: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Essential Snoop Probe (ESP)
- ESP for SMC
- ESP for all variables

Page 24: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Essential Snoop Probe (ESP)
- ESP for SMC

Page 25: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

SMC – Normal Operation

[Figure: Core 0. Every store snoops the I-cache. Elements: dispatch from RS or LSB, other pipe stages, L1 I-$, L1 D-$.]

Page 26: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

ESP Essential Snoop Probe

• OS sets a control register bit (SMC-CR)
• SMC-CR=1: Non Self-Modifying Code
• SMC-CR=0: Self-Modifying Code

[Figure: Core 0 with SMC-CR=1. Elements: dispatch from RS or LSB, other pipe stages, L1 I-$, L1 D-$.]
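The SMC-CR encoding above can be read as a simple gate on the store path. Here is a small Python sketch of that gate, assuming that a store probes the L1 I-cache only when SMC-CR is 0; the function names are hypothetical and the probe count is illustrative only.

```python
def store_needs_smc_snoop(smc_cr):
    """Return True if a store must probe the L1 I-cache.

    smc_cr follows the slide: 1 = non self-modifying code, 0 = self-modifying code.
    """
    if smc_cr not in (0, 1):
        raise ValueError("SMC-CR is a single bit")
    return smc_cr == 0


def count_smc_snoops(num_stores, smc_cr):
    """Number of I-cache probes issued for num_stores dispatched stores."""
    return num_stores if store_needs_smc_snoop(smc_cr) else 0


print(count_smc_snoops(1000, smc_cr=1))  # OS marked the program as non-SMC
print(count_smc_snoops(1000, smc_cr=0))  # program may modify its own code
```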

Page 27: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Essential Snoop Probe (ESP)
- ESP for all variables

Page 28: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Normal Operation – Snoop for All Variables

[Figure: Core 0 in the CMP interconnect domain. Elements: dispatch from RS or LSB, other pipe stages, L1 I-$, L1 D-$, dL1 miss, L2 queue, last level cache, snoop controller, snoop queue, and outgoing snoop probes.]

Page 29: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Essential Snoop Probe (ESP) – SMN bit 1

[Figure: Core 0 in the CMP interconnect domain. A dL1 miss carries an SMN bit annotation into the L2 queue. Also shown: dispatch from RS or LSB, other pipe stages, L1 I-$, L1 D-$, snoop controller (bit vector 1100), snoop queue, and last level cache.]

SMN bit – Snoop-Me-Not bit is 0/1

Page 30: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Essential Snoop Probe (ESP) – SMN bit 0

[Figure: Core 0 in the CMP interconnect domain, with ESP labels. A dL1 miss carries an SMN bit annotation into the L2 queue. Also shown: dispatch from RS or LSB, other pipe stages, L1 I-$, L1 D-$, snoop controller (bit vector 0100), snoop queue, and last level cache.]

SMN bit – Snoop-Me-Not bit is 0/1
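The two SMN slides suggest that each dL1 miss carries a one-bit hint to the snoop controller. The sketch below, in Python, assumes from the name alone that SMN = 1 means "do not snoop" and SMN = 0 means the probe is essential; the slides do not state the polarity, and `split_misses` is a hypothetical helper.

```python
def split_misses(misses):
    """Partition dL1 misses by their Snoop-Me-Not (SMN) annotation.

    misses: iterable of (address, smn_bit) pairs.
    Returns (probed, skipped): addresses that trigger snoop probes and
    addresses whose probes are suppressed.
    Assumption: smn_bit == 1 suppresses the probe.
    """
    probed, skipped = [], []
    for addr, smn_bit in misses:
        if smn_bit == 1:
            skipped.append(addr)
        else:
            probed.append(addr)
    return probed, skipped


trace = [(0x1000, 0), (0x2000, 1), (0x3000, 1)]
probed, skipped = split_misses(trace)
print(len(probed), len(skipped))
```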

Page 31: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Energy Savings in D-Cache Using SSP

• In the 2C configuration, SSP saves 5% to 10% of data cache energy; in the 8C configuration, it saves 30% to 65%.

• Data cache energy savings increase with the number of cores on the die, because the number of snoops to all the cores increases.

[Figure: bar chart. Y-axis: % of data cache energy savings per core (0% to 70%). X-axis: processor configuration (2C, 4C, 2Px4C, 8C). Series: SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application.]

Page 32: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Energy Savings in I-Cache Using SSP

• SSP saves 50% to 70% of instruction cache tag energy across all processor configurations.

[Figure: bar chart. Y-axis: % of icache tag energy savings per core (0% to 100%). X-axis: processor configuration (2C, 4C, 2Px4C, 8C). Series: SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application.]

Page 33: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Performance Impact with SSP

• On average, performance improves by 1% to 2% across the benchmark categories and processor configurations.

[Figure: bar chart. Y-axis: 0% to 120%. X-axis: SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application, harmonic mean across benchmarks, min performance observed, max performance observed. Series: 2C, 4C, 2Px4C, 8C.]

Page 34: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Energy Savings with ESP

• From 5% to a maximum of 82% of data cache energy is spent on non-essential snoop probes, which the ESP technique can eliminate.

• Also, 85% of instruction cache tag energy is spent on non-essential snoops that ESP can eliminate.

[Figure: bar chart. Y-axis: % of cache energy spent on non-essential snoops per core (0% to 100%). X-axis: dcache and icache for each of SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application, and the harmonic mean across benchmarks. Series: 2C, 4C, 2Px4C, 8C.]

Page 35: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Conclusion

• Semantics and program behavior are useful indicators

• They are exploited to reduce power due to snoops

• We proposed
  – Selective Snoop Probe (SSP)
  – Essential Snoop Probe (ESP)

• Energy Reduction Results
  – 5% to 65% in D-cache per core
  – 50% to 70% in I-cache per core

• 1% - 2% performance improvement

• Extensible to optimize integrated platforms with graphics processor

Page 36: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Georgia Tech
Electrical and Computer Engineering
MARS Labs
http://arch.ece.gatech.edu

Thank You!

Page 37: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

BACKUP

Page 38: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Simulation Infrastructure

Execution Engine: 4-wide, Out-of-Order
Load buf / Store buf / RS / ROB: 96 / 64 / 128 / 256 entries
L1 / L2 latency: 4 / 8 cycles
L1 I, L1 D cache size: 32KB, 8 way, 64B
L2 Cache: 4MB, 16 way, 64B
L1 TLB entries: 128, 4 way
Memory: 2GB, DDR 2 timings
CACTI 4.2: 70nm power model

Benchmark class: Example applications
Server: specJBB, TPCC
SPEC FP 2006: wrf, namd, lbm, soplex
SPEC INT 2006: hmmer, gobmk, omnetpp, gcc
Games and multi-media: shooters, realtime strategy, raytracer
Multi-threaded applications: ray tracer, cinebench

Page 39: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Number of Modified Lines

• It shows the number of modified lines that need to be evicted to the last level cache.

[Figure: bar chart. Y-axis: number of modified lines at completion (0 to 220). X-axis: SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application, average across benchmarks. Series: 2C, 4C, 2Px4C, 8C.]

Page 40: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Cache access vs. Snoop access

• Cache access – Read one sub-bank (8 bytes)
• Snoop access – Need to read all sub-banks (all 64 bytes, the cache line size) to ship the data to other cores or to another processor in an MP system.
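To make the 8-byte versus 64-byte comparison concrete, the Python sketch below counts sub-bank reads for a mix of ordinary cache accesses and snoop accesses that ship a full line. It assumes, purely for illustration, that data-array energy scales linearly with the number of sub-banks read; that model and the function names are not from the slides.

```python
LINE_BYTES = 64      # cache line size, from the slide
SUB_BANK_BYTES = 8   # bytes read by an ordinary cache access, from the slide


def sub_banks_read(is_snoop):
    """Number of data sub-banks read by one access."""
    return LINE_BYTES // SUB_BANK_BYTES if is_snoop else 1


def relative_data_energy(num_cache_accesses, num_snoop_reads):
    """Data-array energy in units of one sub-bank read (assumed linear model)."""
    return (num_cache_accesses * sub_banks_read(False)
            + num_snoop_reads * sub_banks_read(True))


print(sub_banks_read(True))            # sub-banks touched by one snoop read
print(relative_data_energy(100, 10))   # 100 normal accesses plus 10 snoop reads
```

Under this model a single snoop read costs as much data-array energy as eight ordinary accesses, which is why removing unnecessary snoops matters more than their raw count suggests.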

Page 41: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Hash functions

[Figure: a cache line (48-bit physical address) with its MESI state, tag + index bits, and data. Tag + index bits [6-32] are divided into fields A, B, and C; the other address bits shown are labeled unused. Bit positions 6, 15, 33, and 47 are marked on the address. HASH 3 feeds the counters (cntr), with separate paths for lines in M/E state and lines in S state.]

If bit-10 is 0, HASH3 = A ^ B ^ C
If bit-10 is 1, HASH3 = (A ^ 0x22) ^ B ^ C
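The HASH3 rule can be written out directly. The Python sketch below assumes that A, B, and C are three consecutive 9-bit fields of address bits [6-32], with A the lowest, and that "bit-10" is bit 10 of the physical address. The slide gives the XOR rule and the constant 0x22 but not the field widths or their order, so those are guesses.

```python
FIELD_BITS = 9                     # assumed width of each of A, B, C (27 bits / 3)
FIELD_MASK = (1 << FIELD_BITS) - 1


def hash3(phys_addr):
    """HASH3 over tag + index bits [6-32] of a physical address."""
    bits = (phys_addr >> 6) & ((1 << 27) - 1)   # keep address bits 6..32
    a = bits & FIELD_MASK                       # assumed field A: bits 6..14
    b = (bits >> FIELD_BITS) & FIELD_MASK       # assumed field B: bits 15..23
    c = (bits >> (2 * FIELD_BITS)) & FIELD_MASK  # assumed field C: bits 24..32
    if (phys_addr >> 10) & 1:                   # bit 10 selects the variant
        a ^= 0x22
    return a ^ b ^ c


print(hex(hash3(0x0000_0040)))   # bit 10 clear: plain A ^ B ^ C
print(hex(hash3(0x0000_0400)))   # bit 10 set: A is XORed with 0x22 first
```

With 9-bit fields the result indexes a 512-entry counter array; a different field split would change the table size but not the structure of the rule.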

Page 42: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Incoming Events to LLC

Incoming events to the last level cache:
RFO
Data Read
Code fetch
Shared L2 evict

Page 43: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Incoming Events to LLC and Sources of Snoop Triggers

Incoming events to the last level cache | iL1 of this core | dL1 of this core
RFO | - | Event trigger
Data Read | - | Event trigger
Code fetch | Event trigger |
Shared L2 evict | |

Page 44: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Snooped Units in the Triggered Core

Incoming events to the last level cache | iL1 of this core | dL1 of this core | LSB of this core | MSHR, WBB of this core
RFO | - | Event trigger | - | -
Data Read | - | Event trigger | - | -
Code fetch | Event trigger | SMC snoop | Snoop store buffer only (updated writes) | Snoop (update writes)
Shared L2 evict | - | Snoop | - | Snoop

Page 45: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Snoop Probes for Incoming Data Read

Incoming events to the last level cache | iL1 of this core | dL1 of this core | LSB of this core | MSHR, WBB of this core | iL1 of other 3 cores | dL1 of other 3 cores | LSB of other 3 cores | MSHR, WBB of other 3 cores | Shared L2 queue
RFO | - | Event trigger | - | - | XMC snoop to invalidate line | Snoop | Snoop load buffer only to invalidate | Snoop to invalidate pending requests | Snoop to invalidate
Data Read | - | Event trigger | - | - | XMC snoop to invalidate line | Snoop | - | Snoop | Snoop
Code fetch | Event trigger | SMC snoop | Snoop store buffer only (updated writes) | Snoop (update writes) | - | XMC snoop | Snoop store buffer only (update writes) | Snoop | SMC Snoop
Shared L2 evict | - | Snoop | - | Snoop | - | Snoop | - | Snoop | Snoop

Page 46: Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Snoop Triggers and Snoop Units

Incoming events to the last level cache | iL1 of this core | dL1 of this core | LSB of this core | MSHR, WBB of this core | iL1 of other 3 cores | dL1 of other 3 cores | LSB of other 3 cores | MSHR, WBB of other 3 cores | Shared L2 queue
RFO | - | Event trigger | - | - | XMC snoop to invalidate line | Snoop | Snoop load buffer only to invalidate | Snoop to invalidate pending requests | Snoop to invalidate
Data Read | - | Event trigger | - | - | XMC snoop to invalidate line | Snoop | - | Snoop | Snoop
Code fetch | Event trigger | SMC snoop | Snoop store buffer only (updated writes) | Snoop (update writes) | - | XMC snoop | Snoop store buffer only (update writes) | Snoop | SMC Snoop
Shared L2 evict | - | Snoop | - | Snoop | - | Snoop | - | Snoop | Snoop
(unlabeled row) | SMC snoop to iL1 | On all store addr disp | - | - | SMC snoop to iL1 | On all store addr disp | - | - | -