
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

Naveen Muralimanohar

Rajeev Balasubramonian

Norman P. Jouppi

University of Utah & HP Labs

Large Caches

Cache hierarchies will dominate chip area

3D stacked processors with an entire die for on-chip cache could be common

Montecito has two private 12 MB L3 caches (27 MB including L2)

[Figure: Intel Montecito; labels: Cache, Cache]

Long global wires are required to transmit data/address


Wire Delay/Power

Wire delays are costly for performance and power

− Latencies of 60 cycles to reach ends of a chip at 32nm (@ 5 GHz)

− 50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)

CACTI* access time for a 24 MB cache is 90 cycles @ 5 GHz, 65 nm technology

*version 4
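To put the cycle counts on this slide in absolute terms, the short sketch below converts them to nanoseconds. The helper name `cycles_to_ns` is illustrative and not part of CACTI; the only inputs are the figures quoted above (60 and 90 cycles at 5 GHz).

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds at the given clock rate."""
    # One cycle at f GHz lasts 1/f nanoseconds.
    return cycles / clock_ghz


# Figures quoted on the slide, both at a 5 GHz clock.
cross_chip_ns = cycles_to_ns(60, 5.0)   # wire latency to reach the ends of the chip
cache_24mb_ns = cycles_to_ns(90, 5.0)   # CACTI (version 4) access time, 24 MB cache

print(f"cross-chip wire: {cross_chip_ns} ns, 24 MB cache access: {cache_24mb_ns} ns")
```

At 5 GHz a cycle is 0.2 ns, so the two latencies correspond to 12 ns and 18 ns respectively.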

Contribution

Support for various interconnect models

Improved design space exploration

Support for modeling Non-Uniform Cache Access (NUCA)


Cache Design Basics

[Figure: cache organization. Labels: input address, decoder, wordline, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, output driver, data output, valid output?]

Existing Model - CACTI

[Figure: cache model with 4 sub-arrays and cache model with 16 sub-arrays, each annotated with decoder delay and wordline & bitline delay]

Decoder delay = H-tree delay + logic delay

Power/Delay Overhead of Wires

[Figure: H-tree delay percentage and H-tree power percentage (0% to 70%) vs. cache size (2, 4, 8, 16, 32 MB)]

H-tree delay increases with cache size

H-tree power continues to dominate

Bitlines are other major contributors to total power

Motivation

The dominant role of interconnect is clear

The lack of a tool to model interconnect in detail can impede progress

Current solutions have limited wire options

Orion, CACTI

- Weak wire model

- No support for modeling multi-megabyte caches


CACTI 6.0 Enhancements

Incorporation of
- Different wire models
- Different router models
- Grid topology for NUCA
- Shared bus for UCA
- Contention values for various cache configurations

Methodology to compute optimal NUCA organization

Improved interface that enables trade-off analysis

Validation analysis


Full-swing Wires

[Figure: full-swing wire diagram; labels: X, Y, Z]

Full-swing Wires II

Three different design points: 10%, 20%, and 30% delay penalty

[Figure: the three design points plotted against repeater size]

Caveat: Repeater sizing and spacing cannot be controlled precisely all the time

Full-Swing Wires

+ Fast and simple: delay proportional to sqrt(RC) as against RC
+ High bandwidth: can be pipelined
- Requires silicon area
- High energy: quadratic dependence on voltage
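The quadratic dependence on voltage can be illustrated with the textbook switching-energy relation E = C * V^2 for charging a wire through a full swing. This is a generic approximation, not the CACTI 6.0 wire model, and the capacitance and voltages below are hypothetical values chosen only to show the scaling.

```python
def switching_energy(capacitance_f: float, swing_v: float) -> float:
    """Energy drawn from the supply to charge capacitance C through a full swing V (C * V^2)."""
    return capacitance_f * swing_v ** 2


WIRE_CAP_F = 1e-12  # hypothetical 1 pF wire, for illustration only

e_full = switching_energy(WIRE_CAP_F, 1.0)  # full swing at a hypothetical 1.0 V
e_half = switching_energy(WIRE_CAP_F, 0.5)  # the same wire at half the voltage

# Halving the voltage cuts the energy by a factor of four.
print(f"energy ratio (1.0 V vs 0.5 V): {e_full / e_half:.1f}")
```

The factor-of-four drop for half the voltage is what makes reduced-swing signaling attractive when energy matters more than delay.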


Low-swing wires

[Figure: differential wires; labels: 400 mV, 50 mV raise, 50 mV drop]


Differential Low-swing
+ Very low-power, can be routed over other modules
- Relatively slow, low bandwidth, high area requirement, requires special transmitter and receiver

Bitlines are a form of low-swing wire

Optimized for speed and area as against power

Driver and pre-charger employ full Vdd voltage


Delay Characteristics

Quadratic increase in delay
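One common source of a quadratic delay curve is an unrepeated wire, whose resistance and capacitance both grow with length. The sketch below uses the Elmore delay of a distributed RC line, 0.5 * (r * L) * (c * L), as a stand-in. It assumes the curve on this slide refers to such a wire, and the per-millimeter resistance and capacitance are hypothetical.

```python
def unrepeated_wire_delay(r_per_mm: float, c_per_mm: float, length_mm: float) -> float:
    """Elmore delay (seconds) of a distributed RC line with no repeaters."""
    total_r = r_per_mm * length_mm  # ohms
    total_c = c_per_mm * length_mm  # farads
    return 0.5 * total_r * total_c


R_PER_MM = 1000.0    # hypothetical wire resistance, ohms per mm
C_PER_MM = 0.2e-12   # hypothetical wire capacitance, farads per mm

for length in (1.0, 2.0, 4.0):
    delay_ps = unrepeated_wire_delay(R_PER_MM, C_PER_MM, length) * 1e12
    print(f"{length:.0f} mm -> {delay_ps:.0f} ps")
```

Each doubling of the length quadruples the delay, which is the behavior repeaters are inserted to avoid.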


Energy Characteristics


Search Space of CACTI-5

[Figure: design space with global wires optimized for delay]

Search Space of CACTI-6

[Figure: design space with global and low-swing wires; labeled regions: least delay, 30% delay penalty, low-swing]

CACTI – Another Limitation

Access delay is equal to the delay of the slowest sub-array
- Very high hit time for large caches

Employs a separate bus for each cache bank for multi-banked caches
- Not scalable

Potential solution – NUCA
- Extend CACTI to model NUCA
- Exploit different wire types and network design choices to improve the search space

Non-Uniform Cache Access (NUCA)*

Large cache is broken into a number of small banks

Employs on-chip network for communication

Access delay ∝ (distance between bank and cache controller)

[Figure: CPU & L1 next to a grid of cache banks]

*(Kim et al., ASPLOS 02)
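The distance-dependent access delay can be sketched for a grid of banks as follows. This is a simplified illustration, not the CACTI 6.0 model: the function name, the corner placement of the cache controller, the round-trip hop counting, and the cycle values are all assumptions.

```python
def nuca_latency(bank_row: int, bank_col: int, hop_cycles: int, bank_cycles: int,
                 ctrl_row: int = 0, ctrl_col: int = 0) -> int:
    """Round-trip latency in cycles to a bank in a grid, given the controller position."""
    # Manhattan distance between the cache controller and the bank.
    hops = abs(bank_row - ctrl_row) + abs(bank_col - ctrl_col)
    # The request and the reply each traverse the same number of hops.
    return 2 * hops * hop_cycles + bank_cycles


# Hypothetical 4x4 grid: 2 cycles per hop, 5 cycles per bank access.
latencies = [[nuca_latency(r, c, hop_cycles=2, bank_cycles=5) for c in range(4)]
             for r in range(4)]

for row in latencies:
    print(row)
```

With these placeholder values the bank next to the controller responds in 5 cycles and the far-corner bank in 29, which is the non-uniformity that gives NUCA its name.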

Extension to CACTI

On-chip network
- Wire model based on ITRS 2005 parameters
- Grid network
- 3-stage speculative router pipeline

Network latency vs. bank access latency tradeoff
- Iterate over different bank sizes
- Calculate the average network delay based on the number of banks and bank sizes
- Consider contention values for different cache configurations

Similarly we also consider power consumed for each organization
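The search described on this slide can be sketched as a sweep over bank counts that sums network delay, bank access delay, and contention, then keeps the minimum. Everything numeric here is a placeholder: the bank latency formula, the contention table, and the corner placement of the controller are hypothetical stand-ins for the detailed models and simulation-derived contention values that CACTI 6.0 uses.

```python
import math


def grid_shape(num_banks: int) -> tuple:
    """Split a power-of-two bank count into a near-square (rows, cols) grid."""
    k = num_banks.bit_length() - 1   # log2 of a power of two
    rows = 2 ** (k // 2)
    return rows, num_banks // rows


def avg_network_cycles(num_banks: int, hop_cycles: float) -> float:
    """Average round-trip network delay from a controller at one grid corner."""
    rows, cols = grid_shape(num_banks)
    avg_hops = (rows - 1) / 2 + (cols - 1) / 2
    return 2 * avg_hops * hop_cycles


def bank_cycles(bank_mb: float) -> float:
    """Hypothetical bank access latency that grows with bank capacity."""
    return 3 + 10 * math.sqrt(bank_mb)


# Hypothetical contention cycles per bank count (placeholders, not measured data).
CONTENTION = {2: 60, 4: 30, 8: 12, 16: 8, 32: 6, 64: 5}


def best_bank_count(cache_mb: float, hop_cycles: float) -> tuple:
    """Return (bank count, total cycles) for the lowest-latency organization."""
    totals = {
        n: avg_network_cycles(n, hop_cycles) + bank_cycles(cache_mb / n) + CONTENTION[n]
        for n in CONTENTION
    }
    best = min(totals, key=totals.get)
    return best, totals[best]


banks, cycles = best_bank_count(cache_mb=32, hop_cycles=3)
print(f"best organization: {banks} banks, {cycles:.1f} cycles")
```

Few banks are penalized by slow bank access and contention, many banks by network hops, so the minimum falls in between. The same loop could minimize an energy estimate instead of cycles.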

Trade-off Analysis (32 MB Cache)

[Figure: latency (cycles) vs. number of banks (2, 4, 8, 16, 32, 64) for a 16-core CMP; series: total number of cycles, network latency, bank access latency, network contention cycles]

Effect of Core Count

[Figure: contention cycles vs. bank count (2, 4, 8, 16, 32, 64); series: 16-core, 8-core, 4-core]

Power Centric Design (32MB Cache)

[Figure: energy (J) vs. bank count (2, 4, 8, 16, 32, 64); series: total energy, bank energy, network energy; the power-optimal point is marked]

Validation

HSPICE tool

Predictive Technology Model (65 nm tech.)

Analytical model that employs PTM parameters compared against HSPICE

Distributed wordlines, bitlines, low-swing transmitters, wires, receivers

Verified to be within 12%

Case Study: Heterogeneous D-NUCA

Dynamic-NUCA
- Reduces access time by dynamic data movement
- Near-by banks are accessed more frequently

Heterogeneous Banks
- Near-by banks are made smaller and hence faster
- Access to nearby banks consumes less power
- Other banks can be made larger and more power efficient
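The benefit of heterogeneous banks can be seen by weighting each bank's cost by how often it is accessed. The sketch below is illustrative only: the access fractions and per-bank latencies are hypothetical numbers, not results from the talk, and `expected_cost` is a made-up helper name.

```python
def expected_cost(access_fractions, per_bank_cost):
    """Access-frequency-weighted average of a per-bank cost (latency or energy)."""
    if abs(sum(access_fractions) - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return sum(f * c for f, c in zip(access_fractions, per_bank_cost))


# Hypothetical share of requests served by each bank, nearest bank first.
fractions = [0.60, 0.25, 0.10, 0.05]

# Hypothetical per-bank latencies in cycles.
homogeneous = [10, 12, 14, 16]     # equal-sized banks; only distance differs
heterogeneous = [6, 9, 16, 20]     # small fast near banks, large slow far banks

print(f"homogeneous:   {expected_cost(fractions, homogeneous):.2f} cycles")
print(f"heterogeneous: {expected_cost(fractions, heterogeneous):.2f} cycles")
```

Because D-NUCA concentrates accesses in the near banks, speeding those up outweighs the slowdown of the rarely used far banks in this example. The same weighting applies to per-access energy.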

Access Frequency

[Figure: % of requests satisfied by x KB of cache; y-axis from 0% to 120%]

Few Heterogeneous Organizations Considered by CACTI

[Figure: two heterogeneous organizations, Model 1 and Model 2]

Other Applications

Exposing wire properties
- Novel cache pipelining: early lookup, aggressive lookup (ISCA 07)
- Flit-reservation flow control (Peh et al., HPCA 00)
- Novel topologies: hybrid network (ISCA 07)

Conclusion

Network parameters and contention play a critical role in deciding NUCA organization

Wire choices have significant impact on cache properties

CACTI 6.0 can identify models that reduce power by a factor of three for a delay penalty of 25%

http://www.hpl.hp.com/personal/Norman_Jouppi/cacti6.html

http://www.cs.utah.edu/~rajeev/cacti6/
