Non-Uniform Power Access in Large Caches with Low-Swing Wires Aniruddha N. Udipi with Naveen Muralimanohar*, Rajeev Balasubramonian University of Utah

Non-Uniform Power Access in Large Caches with Low-Swing Wires

Aniruddha N. Udipi

with Naveen Muralimanohar*,

Rajeev Balasubramonian

University of Utah and *HP Labs


Motivation

• Future CMPs likely to be power-limited

• Growing gap between processor and main memory performance – the Bandwidth Wall

– Large caches required to alleviate this problem

– Nehalem already has 8MB of last-level cache

• These large caches contribute significantly to energy consumption

– They are often the cache coherence interface in CMPs

– Cache energy contribution likely to rise as core energy reduces with simpler and more efficient cores


Executive Summary

• H-tree identified as energy bottleneck within large cache banks

• Study various techniques to introduce low-swing wiring to address this bottleneck

• Non-Uniform Power Access to allow access to different regions of cache at different energies

• Architectural mechanisms to increase fraction of accesses hitting in the low-power region

• Significant cache energy reductions at very modest performance penalties


Outline

• Cache design background

• Technique I – Single low-swing bus

• Technique II – Multiple low-swing buses

• Technique III – Fully-pipelined low-swing bus

• Technique IV – Non-Uniform Power Access

• Technique V – Architectural mechanisms

• Evaluation

• Conclusion

NUCA design

• Increasing disparity in access delays to different parts of the cache

• Non-Uniform Cache Access

– Divide large cache into multiple “banks”

– On-chip network connects these banks and transfers address and data

– Bank count and size of each bank determined by relative contribution of banks and network to total energy/delay

– Per CACTI 6.0, even a 64MB NUCA cache likely to have large 2 or 4MB banks


[Figure: NUCA organization with four Core/Cache pairs connected by an Interconnect]

Bank design basics


[Figure: bank datapath. Labels: input address, decoder, wordline, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, output driver, data output]

Bank design considerations

• Naïve implementation would take the form of a single array of memory cells with centralized control logic, but such a design would not scale

– Wordlines (area considerations) and bitlines (differential signaling) cannot be repeated – delay increases with cache size

– Cache bandwidth is a function of cycle time – single array would have small bandwidth

• Performance limited by wordline/bitline length

– Divide into multiple segments called “subarrays”

– Subarrays connected by an internal network


Bank organization

• Bank organization determined by NDWL,NDBL

• Fewer subarrays gives increased area efficiency, but larger delay due to longer wordlines/bitlines
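The trade-off above can be made concrete with a small sketch. It assumes NDWL and NDBL are the number of divisions along the wordline and bitline directions, so that a bank holds NDWL x NDBL subarrays; the class name, field names, and the example dimensions are hypothetical, and a real CACTI model has more parameters than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BankOrganization:
    """Simplified bank geometry (hypothetical model, not CACTI's)."""
    rows: int   # total rows of cells in the bank (assumed value)
    cols: int   # total columns of cells in the bank (assumed value)
    ndwl: int   # divisions along the wordline direction
    ndbl: int   # divisions along the bitline direction

    @property
    def subarrays(self) -> int:
        # Each wordline division crossed with each bitline division
        # yields one subarray.
        return self.ndwl * self.ndbl

    @property
    def cells_per_wordline(self) -> int:
        # Fewer wordline divisions means longer wordlines.
        return self.cols // self.ndwl

    @property
    def cells_per_bitline(self) -> int:
        # Fewer bitline divisions means longer bitlines.
        return self.rows // self.ndbl


coarse = BankOrganization(rows=4096, cols=4096, ndwl=2, ndbl=2)
fine = BankOrganization(rows=4096, cols=4096, ndwl=4, ndbl=4)
```

Here `fine` has 16 subarrays with 1024-cell wordlines and bitlines, while `coarse` has 4 subarrays with 2048-cell lines: better area efficiency, but longer (slower) wordlines and bitlines.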


[Figure: bank organization with NDWL = 4 and NDBL = 4, showing subarrays connected by an H-tree, next to the core/cache/interconnect layout]

Bank Energy Consumption

H-tree is clearly the dominant component of energy consumption


Low-swing wires

• High power dissipation in global wires due to full swing requirement imposed by repeaters

• Use low-voltage swing differential signaling

– Two wires per signal

– Voltage swing as low as 100mV

– Approx. 10X energy savings compared to full swing wires

– Increased delay, cannot be used over long distances

– Non-trivial pipelining costs

• What is the best way to use low-swing wires to build the H-tree?





Single low-swing bus

• Simplest solution: build the entire H-tree with low-swing wires

• Best energy savings

• Significant performance drops

– Cycle time becomes equal to access time

– Increased contention

• Not worth considering unless energy is considerably more important than performance





Multiple low-swing buses

• Spread contention around

• Fast vertical bus, tristate buffers at intersections

• Energy overhead modeled accurately


[Figure labels: low-swing bus, tri-state buffers]




Fully-pipelined low-swing bus

• Pipelining low-swing wires is non-trivial

• Differential transmitter and receiver required at every pipeline stage

• Amortized over 1mm, every transceiver is a 58% energy overhead

• Performance improves compared to non-pipelined low-swing





Non-Uniform Power Access


[Figure labels: low-swing H-tree trunk, default full-swing H-tree, low-power region, high-power region]

Non-Uniform Power Access

• Introduction of the low-swing trunk does not affect basic H-tree design significantly

• Limited low-swing length

– Access time same as that for the default H-tree

– New bus transparent to processor

• Energy savings proportional to fraction of rows accessible via the low-swing bus

– Only two central rows - 1/16th in our case (NDBL = 32)

– Architectural mechanisms required to increase this fraction
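As a rough illustration of why the 1/16th fraction matters, the sketch below computes the relative energy saving when a fraction f of accesses is served at low-power energy L and the rest at high-power energy H. It assumes accesses are spread uniformly over rows, so f equals the fraction of rows on the low-swing trunk; the function names and the energy values are placeholders, not measured numbers.

```python
def average_access_energy(lp_fraction: float, e_low: float, e_high: float) -> float:
    """Mean energy per access when lp_fraction of accesses are low-power."""
    if not 0.0 <= lp_fraction <= 1.0:
        raise ValueError("lp_fraction must be between 0 and 1")
    return lp_fraction * e_low + (1.0 - lp_fraction) * e_high


def relative_savings(lp_fraction: float, e_low: float, e_high: float) -> float:
    """Fractional saving versus a cache where every access costs e_high."""
    baseline = e_high
    return (baseline - average_access_energy(lp_fraction, e_low, e_high)) / baseline


# Two of NDBL = 32 rows sit on the low-swing trunk.
f_rows = 2 / 32
# Placeholder energies in arbitrary units (assumed, not from the slides).
saving_uniform = relative_savings(f_rows, e_low=1.0, e_high=7.0)
saving_if_half = relative_savings(0.5, e_low=1.0, e_high=7.0)
```

With these placeholder energies, serving 1/16th of accesses from the low-power region saves only about 5% of access energy, while serving half of them saves over 40%. The saving scales linearly with the fraction.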





Exploiting Non-Uniform Power Access

• Increase fraction of accesses served by the “low-power region”

• Assign a fraction of the ways of the set to the “low-power region (LP)” and the rest of the ways to the “high-power region (HP)”

• On every access, check all tags in parallel; a hit in the LP region is a low-power access

• If not, bring the line into the low-power region at this point

– the next use will then likely be a low-power access
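The placement policy above can be sketched for a single set as follows. This is a minimal model under stated assumptions: each region keeps its own LRU order, a block promoted into the LP region displaces that region's LRU block into the HP region, and a miss fills directly into the LP region. The class and method names are hypothetical.

```python
class PartitionedSet:
    """One cache set whose ways are split into LP and HP regions."""

    def __init__(self, lp_ways: int, hp_ways: int):
        self.lp_ways = lp_ways
        self.hp_ways = hp_ways
        self.lp = []  # tags in the LP region, most recently used first
        self.hp = []  # tags in the HP region, most recently used first

    def access(self, tag) -> str:
        """Return where the tag was found: 'LP', 'HP', or 'MISS'."""
        if tag in self.lp:
            # Low-power hit: just refresh the LRU order.
            self.lp.remove(tag)
            self.lp.insert(0, tag)
            return "LP"
        if tag in self.hp:
            self.hp.remove(tag)
            found = "HP"
        else:
            found = "MISS"
        # Bring the line into the LP region so its next use is low-power.
        self._promote(tag)
        return found

    def _promote(self, tag) -> None:
        self.lp.insert(0, tag)
        if len(self.lp) > self.lp_ways:
            victim = self.lp.pop()       # LRU block of the LP region
            self.hp.insert(0, victim)    # moves to the HP region
            if len(self.hp) > self.hp_ways:
                self.hp.pop()            # leaves the set entirely


s = PartitionedSet(lp_ways=2, hp_ways=2)
trace = [s.access(t) for t in ["A", "A", "B", "C", "A", "A"]]
```

In this trace the second access to A is a low-power hit, A is later pushed to the HP region by B and C, and one high-power hit brings it back so the final access is low-power again.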


Swap scheme

• Bring block into low-power region on first-touch

• The block currently in LRU position in that set is swapped out into the high-power region

– Most recently used (MRU) ways of every set are in the LP region

• Every low-power fetch incurs a swap which costs two low-power and two high-power accesses

• For Swap to consume less energy than baseline over N accesses to a block (H = energy of a high-power access, L = energy of a low-power access)

– N * H > 2 * H + (N+1) * L

– N > 2.5
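Rearranging the inequality gives N > (2H + L) / (H - L), which the sketch below evaluates. The slides do not state H and L; the ratio H = 7L used here is an inference, chosen only because it reproduces the N > 2.5 threshold, and the function names are hypothetical.

```python
def swap_break_even(e_high: float, e_low: float) -> float:
    """Reuse count N above which N*H > 2*H + (N+1)*L holds."""
    if e_high <= e_low:
        raise ValueError("Swap can only pay off when e_high > e_low")
    # N*H > 2*H + (N+1)*L  <=>  N*(H - L) > 2*H + L
    return (2.0 * e_high + e_low) / (e_high - e_low)


def swap_saves_energy(n: int, e_high: float, e_low: float) -> bool:
    """Direct check of the inequality for a block accessed n times."""
    baseline = n * e_high
    with_swap = 2.0 * e_high + (n + 1) * e_low
    return baseline > with_swap


# Assumed ratio H = 7L (inferred, not given in the slides).
threshold = swap_break_even(e_high=7.0, e_low=1.0)
```

Under this assumed ratio a block must be touched at least three times for Swap to pay off: two accesses cost more than the baseline, three cost less.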


Duplicate scheme

• Bring block into low-power and high-power region on first touch

• Block currently in LRU position in low-power region is

– Simply dropped if clean – better than Swap

– Written back to high-power region if dirty – same as Swap

• Every L2 miss results in one additional HP access initially

• Forming equations similar to Swap

– Nclean > 1.16

– Ndirty > 2.6


Dynamic Reconfiguration

• Good energy savings given a modestly high hit rate in the low-power region

• Below a certain threshold, extra energy required to move blocks between LP and HP region overshadows savings

• Track average reuse count and turn off architectural mechanisms in bad phases, operate like default cache

– Single five bit saturating counter for entire cache

– Increment counter on hit in LP region, decrement on miss
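A minimal sketch of such a counter follows. The five-bit width and the increment/decrement rule come from the slides; the initial value, the enable threshold, and the class name are assumptions made for illustration.

```python
class ReuseCounter:
    """Saturating counter tracking LP-region hits for the whole cache."""

    def __init__(self, bits: int = 5, threshold: int = 16):
        self.max_value = (1 << bits) - 1   # 31 for five bits
        self.threshold = threshold         # assumed enable threshold
        self.value = threshold             # assumed starting point

    def record(self, lp_hit: bool) -> None:
        """Increment on a hit in the LP region, decrement on a miss."""
        if lp_hit:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)

    @property
    def mechanisms_enabled(self) -> bool:
        # Below the threshold, fall back to default cache behaviour.
        return self.value >= self.threshold


counter = ReuseCounter()
for _ in range(40):
    counter.record(lp_hit=False)   # a phase with poor reuse
low_phase = (counter.value, counter.mechanisms_enabled)
for _ in range(40):
    counter.record(lp_hit=True)    # a phase with good reuse
high_phase = (counter.value, counter.mechanisms_enabled)
```

The counter saturates at 0 and 31 rather than wrapping, so a long bad phase cannot overflow into an apparently good one.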


Comparison to L2/L3 or Filter Cache

• Data placement and mapping schemes do bear resemblance to L2/L3 hierarchy or filter cache

– our approach is orthogonal to the hierarchy and can continue to be used for the largest last-level cache

– need for interconnects between multiple physical cache structures eliminated

– Non-uniform access model 25% more efficient than a filter cache model with similar capacities






Methodology

• SimpleScalar 3.0 OOO-simulator

• CACTI 6.0 for cache energy/delay computation

• 32nm process, 5GHz clock

• 32K each I- and D-L1, 2-way

• Unified 4MB L2 cache, 16-way

• 300 cycle main memory latency

• SPEC2k benchmark suite

Low-swing design points - Energy


Low-swing design points - IPC


Low-swing design points

• Clearly a trade-off between energy savings and performance drops

• ED² metric

– Non-uniform model gives 5% improvement over baseline

– Pipelined low-swing model is next best, with a 3% improvement over baseline

– These are the two most compelling design points
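For readers unfamiliar with the metric, ED2 is energy multiplied by the square of delay, so it penalizes slowdowns more heavily than it rewards energy savings. The sketch below shows how an improvement over baseline would be computed; the function names and the example numbers are hypothetical and are not the results reported here.

```python
def ed2(energy: float, delay: float) -> float:
    """Energy-delay-squared product."""
    return energy * delay ** 2


def ed2_improvement(base_energy: float, base_delay: float,
                    new_energy: float, new_delay: float) -> float:
    """Fractional ED2 reduction relative to the baseline (positive is better)."""
    return 1.0 - ed2(new_energy, new_delay) / ed2(base_energy, base_delay)


# Hypothetical design point: 20% less energy, 10% longer execution time.
example = ed2_improvement(1.0, 1.0, new_energy=0.8, new_delay=1.1)
```

In this made-up case a 20% energy saving shrinks to an ED2 improvement of roughly 3% once the 10% slowdown is squared, which is why a design with small performance loss can beat one with larger raw energy savings.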


Architectural mechanisms


Dynamic reconfiguration


Sensitivity to cache size






Related Work

• Low-swing wires

– “Smart memories” project, CACTI 6.0

• Cache access energy

– Drowsy cache, gated-ground cache, L0 instruction cache, non-uniformity in number of ways per set

• Ours is the first work to optimize the internal structure of the cache, and propose non-uniform power access within a cache bank

Key Contributions

• Study of the internal organization of large cache banks, identification of bottleneck

• Exploration of the design space of low-swing wiring within large caches

• Introduction of the notion of Non-Uniform Power Access

– Definition of the architectural mechanisms required to maximize the energy-saving potential of low-swing wires



Thank you.

• Questions?
