Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Naveen Muralimanohar
Rajeev Balasubramonian
Norman P. Jouppi
University of Utah & HP Labs
Large Caches
Cache hierarchies will dominate chip area
3D stacked processors with an entire die for on-chip cache could be common
Montecito has two private 12 MB L3 caches (27 MB including L2)
Long global wires are required to transmit data/address
[Figure: Intel Montecito; labels: Cache, Cache]
Wire Delay/Power
Wire delays are costly for performance and power
- Latencies of 60 cycles to reach ends of a chip at 32nm (@ 5 GHz)
- 50% of dynamic power is in interconnect switching (Magen et al. SLIP 04)
CACTI* access time for 24 MB cache is 90 cycles @ 5 GHz, 65nm Tech
*version 4
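To put the cycle counts quoted on this slide in absolute terms, a quick conversion helps. The sketch below is plain unit arithmetic; the function name `cycles_to_ns` is illustrative and not part of CACTI.

```python
def cycles_to_ns(cycles: float, freq_ghz: float) -> float:
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    # One cycle lasts 1/f seconds; with f in GHz that is 1/f nanoseconds.
    return cycles / freq_ghz


# Figures quoted on the slide, both at 5 GHz.
cross_chip_ns = cycles_to_ns(60, 5.0)    # wire latency to reach the ends of the chip
cache_access_ns = cycles_to_ns(90, 5.0)  # 24 MB cache access time
print(cross_chip_ns, cache_access_ns)
```

At 5 GHz each cycle is 0.2 ns, so the wire latency alone is comparable in scale to the full cache access time.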
Contribution
Support for various interconnect models
Improved design space exploration
Support for modeling Non-Uniform Cache Access (NUCA)
Cache Design Basics
[Figure: cache organization; labels: input address, decoder, wordline, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, output driver, data output, valid output?]
Existing Model - CACTI
[Figure: cache model with 4 sub-arrays and cache model with 16 sub-arrays, each annotated with decoder delay and wordline & bitline delay]
Decoder delay = H-tree delay + logic delay
Power/Delay Overhead of Wires
H-tree delay increases with cache size
H-tree power continues to dominate
Bitlines are the other major contributors to total power
[Figure: H-tree delay percentage and H-tree power percentage (0% to 70%) vs. cache size (2, 4, 8, 16, 32 MB)]
Motivation
The dominant role of interconnect is clear
Lack of a tool to model interconnect in detail can impede progress
Current solutions have limited wire options
Orion, CACTI
- Weak wire model
- No support for modeling multi-megabyte caches
CACTI 6.0 Enhancements
Incorporation of
- Different wire models
- Different router models
- Grid topology for NUCA
- Shared bus for UCA
- Contention values for various cache configurations
Methodology to compute optimal NUCA organization
Improved interface that enables trade-off analysis
Validation analysis
Full-swing Wires
[Figure: full-swing wire diagram; labels: X, Y, Z]
Full-swing Wires II
Three different design points: 10% delay penalty, 20% delay penalty, 30% delay penalty
[Figure: design points plotted against repeater size]
Caveat: Repeater sizing and spacing cannot be controlled precisely all the time
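One way to read the delay-penalty design points is as a constrained selection: among candidate repeater configurations, take the lowest-energy one whose delay stays within a given percentage of the fastest. The sketch below illustrates that selection only. The `WireDesign` type, the `pick_design` helper, and every number in `candidates` are hypothetical placeholders, not CACTI 6.0 code or data.

```python
from typing import NamedTuple, Sequence


class WireDesign(NamedTuple):
    name: str
    delay_ps: float   # placeholder delay, not measured data
    energy_fj: float  # placeholder energy, not measured data


def pick_design(designs: Sequence[WireDesign], max_delay_penalty: float) -> WireDesign:
    """Return the lowest-energy design whose delay is within
    (1 + max_delay_penalty) times the delay of the fastest design."""
    best_delay = min(d.delay_ps for d in designs)
    limit = best_delay * (1.0 + max_delay_penalty)
    feasible = [d for d in designs if d.delay_ps <= limit]
    return min(feasible, key=lambda d: d.energy_fj)


# Made-up candidates: smaller or sparser repeaters are assumed to be
# slower but cheaper in energy.
candidates = [
    WireDesign("delay-optimal", 100.0, 10.0),
    WireDesign("smaller repeaters", 108.0, 8.0),
    WireDesign("fewer repeaters", 118.0, 6.5),
    WireDesign("sparse repeaters", 129.0, 5.5),
]

for penalty in (0.10, 0.20, 0.30):
    print(f"{penalty:.0%} penalty -> {pick_design(candidates, penalty).name}")
```

The fastest design is always feasible, so the selection never comes back empty.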
Full-Swing Wires
Fast and simple
Delay proportional to sqrt(RC) as against RC
High bandwidth
Can be pipelined
- Requires silicon area
- High energy
- Quadratic dependence on voltage
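The quadratic dependence on voltage follows from the standard switching-energy relation for a full-swing wire, E = a * C * V^2, where a is an activity factor (some texts carry an extra factor of 1/2 per transition). The sketch below only illustrates that scaling; the function name and the 1 pF capacitance are assumptions for the example, not values from the talk.

```python
def full_swing_energy(cap_farads: float, vdd: float, activity: float = 1.0) -> float:
    """Energy in joules drawn from the supply to charge a capacitance to vdd,
    scaled by an activity factor."""
    return activity * cap_farads * vdd ** 2


# Same (assumed) 1 pF wire at two supply voltages.
e_high = full_swing_energy(1e-12, 1.0)
e_low = full_swing_energy(1e-12, 0.5)
print(e_high / e_low)  # halving the voltage cuts the energy to a quarter
```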
Low-swing wires
[Figure: differential wires; labels: 400mV, 50mV raise, 50mV drop]
Differential Low-swing
+ Very low-power, can be routed over other modules
- Relatively slow, low bandwidth, high area requirement, requires special transmitter and receiver
Bitlines are a form of low-swing wire
Optimized for speed and area as against power
Driver and pre-charger employ full Vdd voltage
Delay Characteristics
Quadratic increase in delay
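A quadratic increase in delay with length is what a first-order distributed-RC model predicts for a wire without repeaters. The sketch below uses the common approximation delay ≈ 0.38 * r * c * L^2, with r and c the resistance and capacitance per unit length. This is a textbook approximation offered for illustration, not necessarily the model CACTI 6.0 uses, and the r and c values are placeholders.

```python
def unrepeated_wire_delay(r_per_mm: float, c_per_mm: float, length_mm: float) -> float:
    """First-order delay (seconds) of a distributed RC wire with no repeaters.

    Both total resistance and total capacitance grow linearly with length,
    so the delay grows with the square of the length.
    """
    return 0.38 * r_per_mm * c_per_mm * length_mm ** 2


# Placeholder per-mm values, chosen only to show the scaling.
R_PER_MM = 1000.0    # ohms per mm (assumed)
C_PER_MM = 0.2e-12   # farads per mm (assumed)

for length in (1.0, 2.0, 4.0):
    delay = unrepeated_wire_delay(R_PER_MM, C_PER_MM, length)
    print(f"{length} mm -> {delay * 1e12:.1f} ps")
```

Doubling the length quadruples the delay in this model.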
Energy Characteristics
Search Space of CACTI-5
[Figure: design space with global wires optimized for delay]
Search Space of CACTI-6
[Figure: design space with global and low-swing wires; labeled regions: low-swing, 30% delay penalty, least delay]
CACTI – Another Limitation
Access delay is equal to the delay of slowest sub-array
- Very high hit time for large caches
Employs a separate bus for each cache bank for multi-banked caches
- Not scalable
Potential solution – NUCA
- Extend CACTI to model NUCA
- Exploit different wire types and network design choices to improve the search space
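The first limitation can be made concrete with a small sketch: under a uniform-access model every access is charged the delay of the slowest sub-array, even though most sub-arrays respond sooner. The delay values below are placeholders and the function name is illustrative.

```python
from statistics import mean
from typing import Sequence


def uca_access_delay(subarray_delays: Sequence[float]) -> float:
    """Uniform cache access: every access pays for the slowest sub-array."""
    return max(subarray_delays)


# Placeholder per-sub-array delays in cycles; the farthest sub-arrays are slowest.
delays = [10, 14, 18, 22, 30, 38, 46, 54]

print("uniform access delay:", uca_access_delay(delays))
print("mean sub-array delay:", mean(delays))
```

The gap between the two printed numbers is the latency a non-uniform organization can try to recover.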
Non-Uniform Cache Access (NUCA)*
Large cache is broken into a number of small banks
Employs on-chip network for communication
Access delay ∝ (distance between bank and cache controller)
[Figure: CPU & L1 and cache banks]
*(Kim et al. ASPLOS 02)
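The distance-proportional access delay can be sketched for a grid of banks by counting hops between the cache controller and the target bank. This is a minimal illustration that assumes Manhattan routing, a fixed per-hop cost, and a one-way traversal; the function name and the cycle values are hypothetical.

```python
from typing import Tuple


def nuca_access_cycles(
    bank_xy: Tuple[int, int],
    controller_xy: Tuple[int, int],
    hop_cycles: int,
    bank_cycles: int,
) -> int:
    """Cycles to reach a bank over a grid network and access it.

    Assumes one-way Manhattan routing with a fixed cost per hop.
    """
    hops = abs(bank_xy[0] - controller_xy[0]) + abs(bank_xy[1] - controller_xy[1])
    return hops * hop_cycles + bank_cycles


controller = (0, 0)
near = nuca_access_cycles((0, 1), controller, hop_cycles=4, bank_cycles=6)
far = nuca_access_cycles((3, 2), controller, hop_cycles=4, bank_cycles=6)
print(near, far)
```

Banks next to the controller pay little more than the bank access itself, while distant banks are dominated by the network term.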
Extension to CACTI
On-chip network
- Wire model based on ITRS 2005 parameters
- Grid network
- 3-stage speculative router pipeline
Network latency vs. bank access latency tradeoff
- Iterate over different bank sizes
- Calculate the average network delay based on the number of banks and bank sizes
- Consider contention values for different cache configurations
Similarly, we also consider power consumed for each organization
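The iteration described on this slide can be sketched as a sweep over bank counts that sums three latency components and keeps the minimum. The code below is a rough sketch under stated assumptions: `best_bank_count` and the three `toy_*` models are hypothetical stand-ins with made-up shapes, not CACTI 6.0's bank, network, or contention models.

```python
from typing import Callable, Dict, Iterable, Tuple

LatencyModel = Callable[[int], float]


def best_bank_count(
    bank_counts: Iterable[int],
    bank_latency: LatencyModel,
    network_latency: LatencyModel,
    contention: LatencyModel,
) -> Tuple[int, Dict[int, float]]:
    """Evaluate total latency for each bank count and return the best one
    together with the full table of totals."""
    totals = {
        n: bank_latency(n) + network_latency(n) + contention(n)
        for n in bank_counts
    }
    return min(totals, key=totals.get), totals


# Toy models for a fixed-capacity cache (assumed shapes only).
def toy_bank_latency(n: int) -> float:
    return 64 / n ** 0.5   # fewer, larger banks are slower to access


def toy_network_latency(n: int) -> float:
    return 4 * n ** 0.5    # more banks mean more hops on average


def toy_contention(n: int) -> float:
    return 32 / n          # more banks spread the load


best, totals = best_bank_count(
    [2, 4, 8, 16, 32, 64], toy_bank_latency, toy_network_latency, toy_contention
)
for n, cycles in totals.items():
    print(f"{n:>2} banks: {cycles:6.2f} cycles")
print("best bank count:", best)
```

The same loop could rank organizations by energy instead by swapping in energy models for the three callables.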
Trade-off Analysis (32 MB Cache)
[Figure: latency (cycles, 0 to 400) vs. no. of banks (2, 4, 8, 16, 32, 64) for a 16-core CMP; series: total no. of cycles, network latency, bank access latency, network contention cycles]
Effect of Core Count
[Figure: contention cycles (0 to 300) vs. bank count (2, 4, 8, 16, 32, 64); series: 16-core, 8-core, 4-core]
Power Centric Design (32MB Cache)
[Figure: energy (J, 0 to 1.E-08) vs. bank count (2, 4, 8, 16, 32, 64); series: total energy, bank energy, network energy; the power-optimal point is marked]
Validation
HSPICE tool
Predictive Technology Model (65nm tech.)
Analytical model that employs PTM parameters compared against HSPICE
Distributed wordlines, bitlines, low-swing transmitters, wires, receivers
Verified to be within 12%
Case Study: Heterogeneous D-NUCA
Dynamic-NUCA
- Reduces access time by dynamic data movement
- Near-by banks are accessed more frequently
Heterogeneous Banks
- Near-by banks are made smaller and hence faster
- Access to nearby banks consumes less power
- Other banks can be made larger and more power efficient
Access Frequency
[Figure: % request satisfied by x KB of cache; y-axis 0% to 120%, x-axis ticks from 32,768 to 32,800,768]
Few Heterogeneous Organizations Considered by CACTI
[Figure: two heterogeneous organizations, Model 1 and Model 2]
Other Applications
Exposing wire properties
Novel cache pipelining
- Early lookup, Aggressive lookup (ISCA 07)
Flit-reservation flow control (Peh et al., HPCA 00)
Novel topologies
- Hybrid network (ISCA 07)
Conclusion
Network parameters and contention play a critical role in deciding NUCA organization
Wire choices have a significant impact on cache properties
CACTI 6.0 can identify models that reduce power by a factor of three for a delay penalty of 25%
http://www.hpl.hp.com/personal/Norman_Jouppi/cacti6.html
http://www.cs.utah.edu/~rajeev/cacti6/