Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors. Chinnakrishnan S. Ballapuram Ahmad Sharif Hsien-Hsin S. Lee. Shared Last Level Cache. Concurrent Execution in CMP. Single-threaded program. Multi-threaded program. Code, Data. - PowerPoint PPT Presentation
Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors
Chinnakrishnan S. Ballapuram, Ahmad Sharif, Hsien-Hsin S. Lee
Concurrent Execution in CMP
[Figure: a single-threaded program has one thread (Thread 0) with its own registers and stack (local) plus code and data. A multi-threaded program has Threads 0, 1, and 2, each with private registers and stack (local), sharing code and data. Both run on top of a shared last level cache.]
Self-Modifying Code (SMC) Snoop
[Figure: Cores 0-3, each with an IL1 and a DL1; an SMC snoop is marked in every core.]
Snoop for Core 0 DL1 Miss
[Figure: Cores 0-3, each with IL1, DL1, and an SMC snoop, attached to the CMP core interconnect. The interconnect leads to the L2 queue (FIFO), the L2 cache, the snoop queue (FIFO), and other logic and buffers, which connect to the external interconnect.]
External Snoop Request
[Figure: Cores 0-3, each with IL1, DL1, and an SMC snoop, attached to the CMP core interconnect. The interconnect leads to the L2 queue (FIFO), the L2 cache, the snoop queue (FIFO), and other logic and buffers, which connect to the external interconnect.]
Modified L2 Eviction, External Request, etc
[Figure: Cores 0-3, each with IL1, DL1, and an SMC snoop, attached to the CMP core interconnect. The interconnect leads to the L2 queue (FIFO), the L2 cache, the snoop queue (FIFO), and other logic and buffers, which connect to the external interconnect. Annotation: "As # of cores increases", with the labels "Power" and "Performance".]
Number of Snoop Probes
• SMC Snoops to I-Cache > Snoops to D-Cache > Snoops to LSB.
[Figure: bar chart. y-axis: number of snoop probes in millions (0 to 12). x-axis: to_lsb, to_dcache, to_icache for each benchmark class (SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded apps). Series: 2C, 4C, 2 x 4C, 8C. One off-scale bar is labeled 16.4M.]
Snoop Probe and Snoop Rate
• The percentage increase in data snoops is larger than the percentage increase in instruction cache snoops.
[Figure: combined chart over processor configurations 2C, 4C, 2Px4C, 8C, 8C-MT, 2Px4C-MT. Left y-axis: number of snoops in millions (0 to 30). Right y-axis: % of snoop increase (0% to 2400%). Series: to_lsb, to_dcache, to_icache, total snoops, % of data snoop increase, % of SMC snoop increase, % of total snoop increase. Callouts: ~22x increase, ~12x increase.]
We propose two techniques to reduce the power consumed by snoop probes:
1. Selective Snoop Probe (SSP)
2. Essential Snoop Probe (ESP)
Selective Snoop Probe (SSP)
- SSP for SMC
- SSP for Non-Stack Accesses
- SSP for Stack Accesses
Selective Snoop Probe (SSP)
- SSP for SMC
Normal Operation: To Support SMC
[Figure: Core 0. Addresses dispatched from the RS or LSB go to the L1 D-cache and the MSHR, and an SMC snoop probe is sent to the L1 I-cache.]
SSP (SMC) – No SMC Snoop if BF1 Miss
[Figure: Core 0. All store addresses dispatched from the RS or LSB are hashed (HASH) into BF1, a counting Bloom filter (cntr), which sits in front of the SMC snoop probe to the L1 I-cache. The L1 D-cache and MSHR are accessed as usual. BF1 is there to filter SMC/XMC snoops.]
r1 – read Bloom filter
u1 – update Bloom filter
cntr – counting Bloom filter
SSP (SMC) – SMC Snoop if BF1 Hit
[Figure: Core 0. All store addresses dispatched from the RS or LSB are hashed (HASH) into BF1, a counting Bloom filter (cntr). On a BF1 hit the SMC snoop probe is sent to the L1 I-cache. The L1 D-cache and MSHR are accessed as usual.]
r1 – read Bloom filter
u1 – update Bloom filter
cntr – counting Bloom filter
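The BF1 behavior on the two slides above can be sketched in software. This is a minimal sketch, not the authors' hardware: it assumes BF1 is updated (u1) when a line is filled into or evicted from the L1 I-cache and read (r1) on every store address. The class name, the counter count, and the folding hash are hypothetical choices made only for illustration.

```python
class CountingBloomFilter:
    """Counting Bloom filter with one hash function (illustrative sizes)."""

    def __init__(self, num_counters=1024, line_bytes=64):
        self.counters = [0] * num_counters
        # 64-byte lines -> drop the low 6 offset bits before hashing.
        self.line_shift = line_bytes.bit_length() - 1

    def _index(self, addr):
        line = addr >> self.line_shift
        # Hypothetical hash: fold the line address onto the counter array.
        return (line ^ (line >> 10) ^ (line >> 20)) % len(self.counters)

    def insert(self, addr):
        """u1: a line was filled into the L1 I-cache."""
        self.counters[self._index(addr)] += 1

    def remove(self, addr):
        """u1: a line was evicted from the L1 I-cache."""
        i = self._index(addr)
        if self.counters[i] > 0:
            self.counters[i] -= 1

    def may_contain(self, addr):
        """r1: False means the line is definitely not in the L1 I-cache."""
        return self.counters[self._index(addr)] > 0


def smc_snoop_needed(bf1, store_addr):
    """Send the SMC snoop probe only when BF1 reports a (possible) hit."""
    return bf1.may_contain(store_addr)
```

A Bloom filter can report false hits but never false misses, so skipping the probe on a miss is safe, while a hit only means the probe must still be sent.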
Selective Snoop Probe (SSP)
- SSP for Stack Accesses
Normal Operation: Always Snoop for All Accesses
[Figure: Core 0. Accesses dispatched from the RS or LSB go to the L1 D-cache and the MSHR. A dL1 miss enters the L2 queue of the last level cache, and the snoop controller and snoop queue send snoop probes for every access.]
SSP – Stack Accesses
[Figure: Core 0. All addresses dispatched from the RS or LSB carry an S-bit annotation set by the front-end. They access the L1 D-cache and the MSHR; a dL1 miss enters the L2 queue of the last level cache. The snoop controller is shown with the bit vector 0100 next to the snoop queue.]
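The S-bit flow on this slide can be illustrated with a short sketch. The slides do not say how the front-end recognizes a stack access, so the rule below (the access uses a stack or frame pointer as its base register) is an assumption, as are the names `MemAccess`, `annotate`, and `cores_to_probe`.

```python
from dataclasses import dataclass

# Assumption: an access is treated as a stack access when its base
# register is the stack pointer or the frame pointer.
STACK_BASE_REGS = {"rsp", "rbp"}


@dataclass
class MemAccess:
    addr: int
    base_reg: str
    s_bit: bool = False


def annotate(access):
    """Front-end step: set the S-bit that travels with the address."""
    access.s_bit = access.base_reg in STACK_BASE_REGS
    return access


def cores_to_probe(access, requester, num_cores):
    """Snoop-controller step on a dL1 miss: skip probes for stack accesses."""
    if access.s_bit:
        return []  # stack data is thread-local, so no other core is probed
    return [core for core in range(num_cores) if core != requester]
```

For example, a miss on an `rsp`-relative address from core 0 of a 4-core part probes no one, while a miss on a heap address probes cores 1, 2, and 3.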
Selective Snoop Probe (SSP)
- SSP for Non-Stack Accesses
SSP – Non-stack Accesses Update BF2
[Figure: Core 0. All non-stack addresses dispatched from the RS or LSB access the L1 D-cache (lines tagged with MESI states) and the MSHR, and update (u2) BF2 through HASH and cntr. BF2 filters snoops to the non-stack region. The last level cache side shows the L2 queue, the snoop controller (bit vector 1000), and the snoop queue.]
r2 – read Bloom filter
u2 – update Bloom filter
cntr – counting Bloom filter
SSP – Non-stack Accesses Read BF2
[Figure: Core 0. All non-stack addresses dispatched from the RS or LSB access the L1 D-cache (lines tagged with MESI states) and the MSHR, and update (u2) BF2 through HASH and cntr. On a dL1 miss, all addresses (carrying the S-bit annotation) enter the L2 queue, and BF2 is read (r2) to filter snoops to the non-stack region. The last level cache side shows the snoop controller (bit vector 1000) and the snoop queue.]
r2 – read Bloom filter
u2 – update Bloom filter
cntr – counting Bloom filter
SSP – Selectively Send Snoop Probes
[Figure: Core 0. All non-stack addresses dispatched from the RS or LSB access the L1 D-cache (lines tagged with MESI states) and the MSHR, and update (u2) BF2 through HASH and cntr. On a dL1 miss, all addresses (carrying the S-bit annotation) enter the L2 queue of the last level cache. Using BF2 to filter snoops to the non-stack region, the snoop controller (bit vector 1000) and snoop queue selectively send snoops.]
r2 – read Bloom filter
u2 – update Bloom filter
cntr – counting Bloom filter
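Putting the S-bit and BF2 together, selective sending might look like the sketch below. It assumes one BF2 per core that the snoop controller can read, updated on D-cache fills and evictions of non-stack lines; where BF2 physically sits, its size, and the hash are not recoverable from the slides and are hypothetical here.

```python
LINE_SHIFT = 6        # 64-byte cache lines
NUM_COUNTERS = 256    # illustrative filter size


def bf_index(addr):
    """Hypothetical hash over the line address."""
    line = addr >> LINE_SHIFT
    return (line ^ (line >> 8)) % NUM_COUNTERS


class SnoopController:
    def __init__(self, num_cores):
        # One counting Bloom filter (BF2) per core.
        self.bf2 = [[0] * NUM_COUNTERS for _ in range(num_cores)]

    def on_fill(self, core, addr):
        """u2: a non-stack line entered this core's L1 D-cache."""
        self.bf2[core][bf_index(addr)] += 1

    def on_evict(self, core, addr):
        """u2: a non-stack line left this core's L1 D-cache."""
        i = bf_index(addr)
        self.bf2[core][i] = max(0, self.bf2[core][i] - 1)

    def targets(self, requester, addr, s_bit):
        """r2: cores that must be probed for a dL1 miss from `requester`."""
        if s_bit:
            return []  # stack access: no probes at all
        i = bf_index(addr)
        return [core for core in range(len(self.bf2))
                if core != requester and self.bf2[core][i] > 0]
```

With four cores and a line cached only by core 2, a miss from core 0 on that line yields `[2]` instead of probing all three other cores.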
Essential Snoop Probe (ESP)
- ESP for SMC
- ESP for all variables
Essential Snoop Probe (ESP)
- ESP for SMC
SMC – Normal Operation
[Figure: Core 0. Stores dispatched from the RS or LSB go to the L1 D-$ alongside the other pipe stages, and every store snoops the L1 I-$.]
ESP – Essential Snoop Probe
[Figure: Core 0 with SMC-CR=1. Stores dispatched from the RS or LSB go to the L1 D-$ alongside the other pipe stages; the L1 I-$ is shown next to it.]
• OS sets a control register bit (SMC-CR)
• SMC-CR=1: Non Self-Modifying Code
• SMC-CR=0: Self-Modifying Code
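As a rough illustration of the SMC-CR bit, the sketch below counts I-cache snoops for a stream of stores under both settings. The `Core` class and its fields are hypothetical; only the meaning of SMC-CR (1 for non-self-modifying code, 0 for self-modifying code) comes from the slide.

```python
class Core:
    def __init__(self, smc_cr):
        # SMC-CR is set by the OS: 1 = non-self-modifying code,
        # 0 = self-modifying code.
        self.smc_cr = smc_cr
        self.icache_snoops = 0

    def dispatch_store(self, addr):
        """Stores snoop the L1 I-cache only when the code may modify itself."""
        if self.smc_cr == 0:
            self.icache_snoops += 1


def run_stores(smc_cr, num_stores):
    core = Core(smc_cr)
    for n in range(num_stores):
        core.dispatch_store(n * 8)
    return core.icache_snoops
```

Under these assumptions, 100 stores cost 100 I-cache snoops with SMC-CR=0 and none with SMC-CR=1.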
Essential Snoop Probe (ESP)
- ESP for all variables
Normal Operation – Snoop for All Variables
[Figure: Core 0 inside the CMP interconnect domain. Accesses dispatched from the RS or LSB go to the L1 D-$ alongside the other pipe stages; the L1 I-$ is shown next to it. A dL1 miss enters the L2 queue of the last level cache, and the snoop controller and snoop queue send snoop probes.]
Essential Snoop Probe (ESP) – SMN bit 1
[Figure: Core 0 inside the CMP interconnect domain. Accesses dispatched from the RS or LSB go to the L1 D-$ alongside the other pipe stages; the L1 I-$ is shown next to it. A dL1 miss with SMN bit annotation enters the L2 queue of the last level cache. The snoop controller is shown with the bit vector 1100 next to the snoop queue.]
SMN bit – Snoop-Me-Not bit is 0/1
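Essential Snoop Probe (ESP) – SMN bit 0
[Figure: Core 0 inside the CMP interconnect domain. Accesses dispatched from the RS or LSB go to the L1 D-$ alongside the other pipe stages; the L1 I-$ is shown next to it. A dL1 miss with SMN bit annotation enters the L2 queue of the last level cache. The snoop controller is shown with the bit vector 0100 next to the snoop queue, and the probes that are sent are labeled ESP.]
SMN bit – Snoop-Me-Not bit is 0/1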
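Reading the two SMN slides as "SMN bit 1 suppresses the probes, SMN bit 0 sends the essential ones", the effect on probe counts can be sketched as follows. That reading, the trace format, and the function name are assumptions; the slides do not show how the SMN bit is assigned to a variable.

```python
def count_probes(misses, num_cores):
    """Count snoop probes for a trace of dL1 misses.

    misses: iterable of (address, smn_bit) pairs.
    Returns (probes_sent, probes_eliminated).
    """
    other_cores = num_cores - 1
    sent = 0
    eliminated = 0
    for _addr, smn_bit in misses:
        if smn_bit == 1:
            eliminated += other_cores   # Snoop-Me-Not: no probe leaves the core
        else:
            sent += other_cores         # essential snoop probes to every other core
    return sent, eliminated
```

For a 4-core part and a trace with two SMN=1 misses and one SMN=0 miss, this gives 3 probes sent and 6 eliminated.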
Energy Savings in D-Cache Using SSP
• The 2C configuration achieves 5% - 10% data cache energy savings, and the 8C configuration achieves 30% - 65%.
• Data cache energy savings grow with the number of cores on the die, because the number of snoops to all the cores increases.
[Figure: bar chart. y-axis: % of data cache energy savings per core (0% to 70%). x-axis: processor configuration (2C, 4C, 2Px4C, 8C). Series: SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application.]
Energy Savings in I-Cache Using SSP
• Instruction cache tag energy savings of 50% - 70% are achieved across all processor configurations.
[Figure: bar chart. y-axis: % of icache tag energy savings per core (0% to 100%). x-axis: processor configuration (2C, 4C, 2Px4C, 8C). Series: SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application.]
Performance Impact with SSP
• On average, performance improves by 1% - 2% across the benchmark categories and processor configurations.
[Figure: bar chart. y-axis: 0% to 120%. x-axis: SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application, harmonic mean across benchmarks, min performance observed, max performance observed. Series: 2C, 4C, 2Px4C, 8C.]
Energy Savings with ESP
• Between 5% and a maximum of 82% of data cache energy is spent on non-essential snoop probes, which the ESP technique can eliminate.
• Also, 85% of the instruction cache tag energy spent on snoops can be eliminated using ESP.
[Figure: bar chart. y-axis: % of cache energy spent on non-essential snoops per core (0% to 100%). x-axis: dcache and icache for each of SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application, and the harmonic mean across benchmarks. Series: 2C, 4C, 2Px4C, 8C.]
Conclusion
• Semantics and program behavior are useful indicators
• They are exploited to reduce power due to snoops
• We proposed
– Selective Snoop Probe (SSP)
– Essential Snoop Probe (ESP)
• Energy Reduction Results
– 5% to 65% in D-cache per core
– 50% to 70% in I-cache per core
• 1% - 2% performance improvement
• Extensible to optimize integrated platforms with graphics processor
Georgia Tech, Electrical and Computer Engineering, MARS Labs
http://arch.ece.gatech.edu
Thank You!
BACKUP
Simulation Infrastructure
Execution Engine: 4-wide, Out-of-Order
Load buf / Store buf / RS / ROB: 96 / 64 / 128 / 256 entries
L1 / L2 latency: 4 / 8 cycles
L1 I, L1 D cache size: 32KB, 8 way, 64B
L2 Cache: 4MB, 16 way, 64B
L1 TLB entries: 128, 4 way
Memory: 2GB, DDR 2 timings
CACTI 4.2: 70nm power model

Benchmark class: Example applications
Server: specJBB, TPCC
SPEC FP 2006: wrf, namd, lbm, soplex
SPEC INT 2006: hmmer, gobmk, omnetpp, gcc
Games and multi-media: shooters, realtime strategy, raytracer
Multi-threaded applications: ray tracer, cinebench
Number of Modified Lines
• The chart shows the number of modified lines that need to be evicted to the last level cache.
[Figure: bar chart. y-axis: number of modified lines at completion (0 to 220). x-axis: SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application, average across benchmarks. Series: 2C, 4C, 2Px4C, 8C.]
Cache Access vs. Snoop Access
• Cache access – reads one sub-bank (8 bytes)
• Snoop access – needs to read all sub-banks (all 64 bytes, the cache line size) to ship the data to other cores or to another processor in an MP system
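The sub-bank arithmetic behind this comparison can be written out directly. The 8-byte sub-bank and 64-byte line come from the slide; treating read energy as proportional to the number of sub-banks read is a simplifying assumption made here, not a result from the talk.

```python
LINE_BYTES = 64       # cache line size (from the slide)
SUB_BANK_BYTES = 8    # one sub-bank (from the slide)
SUB_BANKS_PER_LINE = LINE_BYTES // SUB_BANK_BYTES


def sub_banks_read(access_kind):
    """Number of sub-banks activated by one access."""
    if access_kind == "cache":
        return 1                      # a normal access reads one sub-bank
    if access_kind == "snoop":
        return SUB_BANKS_PER_LINE     # a snoop ships the whole line
    raise ValueError(f"unknown access kind: {access_kind}")


def relative_read_energy(num_cache_accesses, num_snoop_accesses):
    """Energy in units of one sub-bank read (assumed proportional model)."""
    return (num_cache_accesses * sub_banks_read("cache")
            + num_snoop_accesses * sub_banks_read("snoop"))
```

Under this model a snoop that reads data costs as much as 8 ordinary accesses, so 100 cache accesses plus 10 snoops cost 180 units, of which 80 come from the snoops.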
Hash Functions
[Figure: a cache line consists of a physical address (48 bits), the MESI state, and the data. The tag + index bits [6-32] of the address are divided into the fields A, B, and C; the bits above them are unused. HASH3 over these fields selects a counter (cntr), with separate counters shown for lines in M/E state and lines in S state.]
If bit-10 is 0, HASH3 = A ^ B ^ C
If bit-10 is 1, HASH3 = (A ^ 0x22) ^ B ^ C
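The HASH3 rule can be tried out with a few lines of code. The two XOR formulas and the bit-10 test are taken from the slide; the exact field boundaries are not recoverable from it, so the split of bits [6-32] into three 9-bit fields, with A as the lowest, is an assumption.

```python
FIELD_BITS = 9                       # assumption: 27 tag+index bits / 3 fields
FIELD_MASK = (1 << FIELD_BITS) - 1


def hash3(paddr):
    """HASH3 over the tag + index bits [6-32] of a physical address."""
    a = (paddr >> 6) & FIELD_MASK    # assumed bits 6-14
    b = (paddr >> 15) & FIELD_MASK   # assumed bits 15-23
    c = (paddr >> 24) & FIELD_MASK   # assumed bits 24-32
    if (paddr >> 10) & 1:            # bit-10 of the address selects the variant
        a ^= 0x22
    return a ^ b ^ c
```

With this field layout the result always fits in 9 bits, so it could index a table of 512 counters; a different split would change the table size but not the structure of the hash.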
Incoming Events to LLC
Incoming events to the last level cache:
RFO
Data Read
Code fetch
Shared L2 evict
Incoming Events to LLC and Sources of Snoop Triggers
Incoming events to the last level cache, with columns: iL1 of this core / dL1 of this core
RFO: - / Event trigger
Data Read: - / Event trigger
Code fetch: Event trigger / (blank)
Shared L2 evict: (blank) / (blank)
Snooped Units in the Triggered Core
Incoming events to the last level cache, with columns: iL1 of this core / dL1 of this core / LSB of this core / MSHR, WBB of this core
RFO: - / Event trigger / - / -
Data Read: - / Event trigger / - / -
Code fetch: Event trigger / SMC snoop / Snoop store buffer only (updated writes) / Snoop (update writes)
Shared L2 evict: - / Snoop / - / Snoop
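Snoop Probes for Incoming Data Read
Incoming events to the last level cache and the units they snoop:
RFO
- iL1 of this core: -
- dL1 of this core: Event trigger
- LSB of this core: -
- MSHR, WBB of this core: -
- iL1 of other 3 cores: XMC snoop to invalidate line
- dL1 of other 3 cores: Snoop
- LSB of other 3 cores: Snoop load buffer only, to invalidate
- MSHR, WBB of other 3 cores: Snoop to invalidate pending requests
- Shared L2 queue: Snoop to invalidate
Data Read
- iL1 of this core: -
- dL1 of this core: Event trigger
- LSB of this core: -
- MSHR, WBB of this core: -
- iL1 of other 3 cores: XMC snoop to invalidate line
- dL1 of other 3 cores: Snoop
- LSB of other 3 cores: -
- MSHR, WBB of other 3 cores: Snoop
- Shared L2 queue: Snoop
Code fetch
- iL1 of this core: Event trigger
- dL1 of this core: SMC snoop
- LSB of this core: Snoop store buffer only (updated writes)
- MSHR, WBB of this core: Snoop (update writes)
- iL1 of other 3 cores: -
- dL1 of other 3 cores: XMC snoop
- LSB of other 3 cores: Snoop store buffer only (update writes)
- MSHR, WBB of other 3 cores: Snoop
- Shared L2 queue: SMC Snoop
Shared L2 evict
- iL1 of this core: -
- dL1 of this core: Snoop
- LSB of this core: -
- MSHR, WBB of this core: Snoop
- iL1 of other 3 cores: -
- dL1 of other 3 cores: Snoop
- LSB of other 3 cores: -
- MSHR, WBB of other 3 cores: Snoop
- Shared L2 queue: Snoop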
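Snoop Triggers and Snoop Units
Incoming events to the last level cache and the units they snoop:
RFO
- iL1 of this core: -
- dL1 of this core: Event trigger
- LSB of this core: -
- MSHR, WBB of this core: -
- iL1 of other 3 cores: XMC snoop to invalidate line
- dL1 of other 3 cores: Snoop
- LSB of other 3 cores: Snoop load buffer only, to invalidate
- MSHR, WBB of other 3 cores: Snoop to invalidate pending requests
- Shared L2 queue: Snoop to invalidate
Data Read
- iL1 of this core: -
- dL1 of this core: Event trigger
- LSB of this core: -
- MSHR, WBB of this core: -
- iL1 of other 3 cores: XMC snoop to invalidate line
- dL1 of other 3 cores: Snoop
- LSB of other 3 cores: -
- MSHR, WBB of other 3 cores: Snoop
- Shared L2 queue: Snoop
Code fetch
- iL1 of this core: Event trigger
- dL1 of this core: SMC snoop
- LSB of this core: Snoop store buffer only (updated writes)
- MSHR, WBB of this core: Snoop (update writes)
- iL1 of other 3 cores: -
- dL1 of other 3 cores: XMC snoop
- LSB of other 3 cores: Snoop store buffer only (update writes)
- MSHR, WBB of other 3 cores: Snoop
- Shared L2 queue: SMC Snoop
Shared L2 evict
- iL1 of this core: -
- dL1 of this core: Snoop
- LSB of this core: -
- MSHR, WBB of this core: Snoop
- iL1 of other 3 cores: -
- dL1 of other 3 cores: Snoop
- LSB of other 3 cores: -
- MSHR, WBB of other 3 cores: Snoop
- Shared L2 queue: Snoop
(unlabeled row)
- iL1 of this core: SMC snoop to iL1
- dL1 of this core: On all store addr disp
- LSB of this core: -
- MSHR, WBB of this core: -
- iL1 of other 3 cores: SMC snoop to iL1
- dL1 of other 3 cores: On all store addr disp
- LSB of other 3 cores: -
- MSHR, WBB of other 3 cores: -
- Shared L2 queue: -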