Evaluating the Reuse Cache for mobile...

Evaluating the Reuse Cache for mobile processors

Lokesh Jindal, Urmish Thakker, Swapnil HariaCS-752 Fall 2014

University of Wisconsin-Madison

Executive SummaryProblem : Mobile SOCs – Area is money!Cache Area = 20-30% of SOC die area

Solutions?Technology Scaling

Expensive, diminishing returnsReuse Caches

Reduced cache size, comparable performance

Results=>50% Area reductions for 5% performance loss

Outline

• Motivation• Implementation• Methodology• Performance Evaluation• Future Directions and Conclusion

Outline

Caches in mobile SOCs

Area-Performance Tradeoffs*Pe

Cache Size

Conventional Cache

* Representative Graph

Area-Performance Tradeoffs*Pe

Cache Size

Conventional CacheReuse Cache

* Representative Graph

Motivation

Jorge Albericio, Pablo Ibáñez, Víctor Viñals, and José M. Llabería. 2013. The reuse cache: downsizing the shared last-level cache. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46).

Reuse Cache : Original idea

Jorge Albericio, Pablo Ibáñez, Víctor Viñals, and José M. Llabería. 2013. The reuse cache: downsizing the shared last-level cache. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46).

Reuse Cache : good idea for desktop processors (seemingly)

but what about for mobile processors?

Revisiting Reuse Caches for the Mobile Environment

• Questions– Memory characteristics of mobile workloads?– Spatial/Temporal/Reuse locality at L2 level?– Reuse Cache performance for smaller, simpler caches?

Low temporal locality

% Dead Lines (on Fetch) =

% Dead Lines (on Eviction) =

90.38 90.89 90.15 90.34 90.25 90.7785.83 86.45 85.58 85.78 85.72 86.26

100.00

baidumap bbench frozenbubble k9mail kingsoftoffice netease

AsimBench Benchmarks

Dead Lines (On Load)

Dead Lines (On Eviction)

Fetched Lines #Reused Lines # - 1

Evicted Lines # Lines Evicted Unused#

Low temporal locality

% Dead Lines (on Fetch) =

% Dead Lines (on Eviction) =

90.38 90.89 90.15 90.34 90.25 90.7785.83 86.45 85.58 85.78 85.72 86.26

100.00

AsimBench Benchmarks

Dead Lines (On Load)

Dead Lines (On Eviction)

Fetched Lines #Reused Lines # - 1

Evicted Lines # Lines Evicted Unused#

85-90% dead lines!

Outline

Organization

• Tag Array– Inclusive (aids coherence)– Set associative– (Forward) Pointer to data array

8 ways….

Organization

• Tag Array– Inclusive (aids coherence)– Set associative– (Forward) Pointer to data array

• Data Array– Set associative– Stores reused lines– (Reverse) Pointer to tag array

8 ways….

Coherence

• TO-MOESI protocol– Small extension to the MOESI protocol

• New Tag-Only state• Tag+Data states : Modified, Shared, Owned, Exclusive• Transitions triggered by data insertions/evictions• Read the paper!

Replacement

• Independent replacement policies for Tag and Data arrays

• Tag Array– Not Recently Reused (NRR)

• Data Array– Not Recently Used (NRU) + Shadow directory

Shadow Directory• Pomerene et al. proposed shadow directory

– Transient lines that must be flushed quickly – Lines that become active after long periods of inactivity.

Prefetching mechanism for a high speed buffer store, James Herbert Pomerene, Thomas Roberts Puzak, Rudolph Nathan Rechtschaffen, Frank John Sparacio, 1984, Patent EP 0157175 A2

Tag+Data

No Tag

Shadow Directory• Pomerene et al. proposed shadow directory

– Transient lines that must be flushed quickly – Lines that become active after long periods of inactivity.

• Shadow bit added in reuse cache to break inefficient cycles

Shadow bit set here

Least priority for eviction

Tag+Data

No Tag

Outline

Methodology

Baseline• 32KB/4way L1 I$ + 32KB/4way L1 D$ • 4MB/8way L2$ (Shared LLC)• 8 wide OOO pipeline• 1.4GHZ | 32nm | Low power device (Based on ARM A15 specifications)

Workloads• AsimBench• SPEC CPU 2006• Self-written microbenchmarks(functional verification)

Workload – AsimBench*

• Popular Android applications• 11 diverse applications

• 6 most memory-intensive apps selected for performance evaluation– BBench (Web Browser): Load web pages– K9Mail (Email): Load/Show emails– NeteaseNews (News): Check and load news– KingsoftOffice (Document Process): Open doc/xls/ppt file– BaiduMap (Map): Load map information of a specific area– FrozenBubble (Game): Load game

* Yongbing Huang, Zhongbin Zha, Mingyu Chen, Lixin Zhang. AsimBench: A Mobile Benchmark Suite for Architectural Simulators. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Monterey, CA, March 2014.

Outline

Performance - AsimBench

Simulation length of 5 Billion instructions after boot in Full System mode

Reuse Cache (4M+1M)

Conventional Cache (4M + 4M)

Average performance loss of 4.16% (not bad!)

Performance – SPEC CPU2006

(Simulation length of 10B instructions in System Emulation mode)

Reuse Cache(4M+1M)

Conventional Cache (4M+4M)

Average performance drop of 7.17 %

Configuration Exploration (WIP)

1.101.00

1.050.98

Reuse 4M + 2M

Multi-core Performance

Reuse Cache (4M + 1M)

Conventional Cache (4M + 4M)

SPEC benchmarks simulated on 4 cores, running 10 Billion instructions each(Full system + multi-core + AsimBench == Too slow)

~15% Degradation

Area Improvements

Conventional (4M + 4M) Reuse Cache (4M + 1M) Reuse Cache (4M + 2M)

Area (in mm2) 2.37 1.18 1.5

Area savings of 50% for the 4+1 reuse cache, 37% for the 4+2 cache(Data generated using Cacti v6.5)

% Improvement in Power

% Improvement in Power~10 % Improvement

+ Reduced Leakage power- Increased DRAM power

Live-ness Analysis - AsimBench

• % Live lines increased from 9.54 % to 47.95%

Conventional Cache (4M+4M)

Reuse Cache (4M+1M)

Pictures!

• AsimBench apps running on gem5 …

Outline

Future Directions

• Replacement Policies (NRR + Shadow) for Fully Associative data array

• Explore allotment policies• Better uses for extra tag entries• Longer/Better simulations using simpoints• Multi-core simulations using AsimBench to validate coherence

protocols• Comprehensive energy analysis

Summary

For mobile workloads, reuse cache promises + Significant area reduction (~50%)+ Decent power savings (~10%)- Marginal performance loss (~6%)

Questions?

Thank You!!

BACKUP

Configuration Exploration

1 2 3 4 5

Increased Power (mW)

L2 => 1 MB CacheAnimate this to cross out 1 MB and say 50% of data area

L2 => 4 MB CacheCross out and say Performance of 16 MB Cache

Results

• Mapping exploration -> IPC , ASIM + SPEC• 4+4 vs (4+1) spec , asimbench• Shadow vs NRR • Boot times• Bigger is better• CACti graph• Energy graph• Dead lines

Baseline

• 32KB/4way L1 I$ + 32KB/4way L1 D$ • 4MB/8way L2$ (Shared LLC)• 8 wide OOO pipeline• 1.4GHZ | 32nm | Low power device

• Based on ARM A15 specifications

Challenges

• ARM aarch32 incompatible with gem5’s Ruby memory model

• Simulation infrastructure for ARM in full system mode

Contributions

Evaluating the Reuse Cache for mobile...

Documents

Trace-based Compilationckrintz/classes/s20/cs... · Code Cache •Translated code blocks (traces) are stored in a code cache (in native code) for reuse nOld, unused blocks are replaced

Prime-Cache / Pro-Cache / Power-Cache Archive Appliance User

Modeling and Verifying Cache-Coherent Protocols, VIP, and Designs · 2020-06-13 · Cache Memory Agent – A2 Cache Figure 1: Cache coherence management One way to manage cache coherence

Managing Cache Coherency on Cortex-M7 Based MCUsww1.microchip.com/downloads/en/DeviceDoc/Managing-Cache-Cohe… · The cache hits only update the cache memory. Cache misses on a write,

Say Goodbye to Off-heap Caches! On-heap Caches Using ...Spark: Caching Impacts Performance 4 •Jobs cache intermediate data in memory •Subsequent jobs reuse cached data •Caching

On heap cache vs off-heap cache

FRD: A Filtering based Buffer Cache Algorithm that ......reuse distance (FRD). The proposed algorithm concentrates on the manner in which to utilize both the access frequency and reuse

thaicom.dkthaicom.dk/assets/files/Thanulux Company Profile.pdfladieswear, childrenswear and leather goods under brand Lollipop , CACHE CACHE , ERAWON Cadeau , Lollipop , CACHE: CACHE

2013/01/14 Yun-Chung Yang Energy-Efficient Trace Reuse Cache for Embedded Processors Yi-Ying Tsai and Chung-Ho Chen 2010 IEEE Transactions On Very Large

Cache memory and cache

Oracle Cache Fusion Cache Fusion Concepts, Data Block Shipping, and Recovery with Cache Fusion

Cache Memory - UFRGSflavio/ensino/cmp502/memoria.pdf · 2002. 11. 7. · cache next-level memory/cache CSE 141 Carro Cache Fundamentals, cont. • cache block size or cache line size

ECI-Cache: A High-Endurance and Cost-EfficientI/O Caching ... · Unlike traditional I/O caching schemes which allocate cache size only based on reuse distance of accesses, we propose

InputsMetricsCode MAIN MEMORY core Interconnection network Private data (LI) cache Cache controller core Cache controller Private data (LI) cache MULTICORE

Advanced cache optimizationsstrukov/ece154bSpring2013/week...Advanced Cache Optimization 1) Way prediction 2) Victim cache 3) Pipelined cache 4) Nonblockingcache 5) Multibankedcache

A Detailed GPU Cache Model Based on Reuse Distance Theory

Revisiting The Vertex Cache€¦ · Reuse in Triangle Meshes •Exploit high vertex valence •Ex.: Triangle strips, fans… •Index list can cause ≪1.0shaded vertex per triangle

INF3380: Parallel Programming for Natural Sciences · Cache Cache Core Core Core Core Cache Cache Bus Compute Node Memory Core Core Core Core Cache Cache Core Core Core Core ... Compute

Memory Hierarchy Design Memory Hierarchy Design. 2 Outline Introduction Cache Basics Cache Performance Reducing Cache Miss Penalty Reducing Cache Miss

Section 4. Prefetch Cache · 2011. 2. 22. · Prefetch Cache 4 Section 4. Prefetch Cache ... Program Flash Memory (PFM) cache and the Prefet ch Cache module increase performance for