RECAP: REGION-AWARE CACHE PARTITIONING
- Raj Muchhala
- Susmit Joshi
OVERVIEW
Motivation
Introduction (RECAP)
RECAP Architecture
Experimental Methodology
Evaluation
Conclusion
MOTIVATION
High-performance computing systems now contain many cores that share a last-level cache (LLC).
Many partitioning schemes proposed over the last ten years have shown that LLC partitioning can increase performance.
These schemes partition based on the requirements of each core: the idea is to give each core a share of the cache resources.
However, they did not consider LLC energy savings, and they required the number of cache ways to exceed the number of sharing cores to be effective.
MOTIVATION (CONTD.)
The majority (80%) of accesses request shared data.
But 93% of the data in the cache is actually private.
INTRODUCTION (RECAP)
Identify and separate private and shared data: private data goes on the left-hand side of the cache, shared data on the right-hand side.
For private requests, only ways on the left of the cache are activated.
For shared requests, only ways on the right of the cache are activated.
This yields significant dynamic energy savings and supports a core-to-cache-way ratio of 1:1.
EXAMPLE DATA LAYOUT IN RECAP
The cache is made up of sets that span many cache ways.
Cache ways are equal sections of the cache, each with a size equal to the cache page size.
OVERVIEW OF RECAP ARCHITECTURE
RECAP ARCHITECTURE
Usage Monitoring
Cache Partitioning Control
Cache Reconfiguration
Separating Private and Shared Data
Data Category Transitioning
Private Data Expansion
USAGE MONITORING
We need to determine the optimal number of ways for each core to achieve its highest utilization.
Utility monitors track the accesses made by each core to characterize each thread's use of the cache.
With equal-priority threads, each core gets a number of ways based on the performance benefit it can realize.
With differing priorities, high-priority threads get more ways and low-priority threads get fewer.
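The allocation idea above can be sketched as a greedy, marginal-utility loop, a common way to consume utility-monitor output. This is an illustrative sketch, not the paper's exact algorithm; the hit curves, the priority weights, and all names are assumptions.

```python
# Hypothetical sketch of utility-based way allocation (not the paper's
# exact scheme): each core's utility monitor reports estimated hits for
# every possible way count, and ways are handed out greedily by
# priority-weighted marginal benefit.

def allocate_ways(hit_curves, total_ways, priority=None):
    """hit_curves[c][w] = estimated hits for core c when given w ways."""
    n = len(hit_curves)
    priority = priority or [1.0] * n     # equal priority by default
    alloc = [1] * n                      # every core starts with one way
    remaining = total_ways - n
    while remaining > 0:
        # Priority-weighted benefit of one extra way, per core.
        gains = [
            (hit_curves[c][alloc[c] + 1] - hit_curves[c][alloc[c]]) * priority[c]
            if alloc[c] + 1 < len(hit_curves[c]) else 0.0
            for c in range(n)
        ]
        best = max(range(n), key=lambda c: gains[c])
        if gains[best] <= 0:
            break                        # no core benefits from more ways
        alloc[best] += 1
        remaining -= 1
    return alloc

# Two cores sharing 8 ways: core 0 saturates after 2 ways,
# core 1 keeps gaining, so core 1 receives most of the ways.
curves = [
    [0, 50, 55, 56, 56, 56, 56, 56, 56],      # core 0: flat after 2 ways
    [0, 20, 40, 60, 80, 100, 120, 140, 160],  # core 1: linear benefit
]
print(allocate_ways(curves, 8))  # [1, 7]
```

Raising a core's priority weight simply scales its marginal gains, which reproduces the slide's "high-priority gets more ways" behaviour without a separate code path.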
CACHE PARTITIONING CONTROL
An Access Permission Register (APR) controls each core's accesses to the LLC.
Each way has 1 R/W bit per core, plus a Shared bit and a Flush bit.
The Shared bit indicates a way shared by all cores; the Flush bit is used when contracting a core's cache allocation.
The APR restricts cores to accessing only certain ways, giving both dynamic and static energy savings.

APR layout: [ C0 | C1 | …. | FL | S ]
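A minimal software model of one APR entry, matching the per-way layout above, might look like this; the class and method names are illustrative, not from the paper.

```python
# Illustrative model of one Access Permission Register entry per cache
# way: one R/W bit per core, plus a Shared bit and a Flush bit,
# mirroring the [ C0 | C1 | ... | FL | S ] layout on the slide.

NUM_CORES = 8

class APREntry:
    def __init__(self):
        self.core_bits = 0   # bit c set => core c may access this way
        self.shared = False  # way holds shared data, visible to all cores
        self.flush = False   # way is being contracted; flush dirty lines

    def grant(self, core):
        self.core_bits |= 1 << core

    def revoke(self, core):
        self.core_bits &= ~(1 << core)

    def may_access(self, core):
        # A core may access a way it owns, or any shared way.
        return self.shared or bool(self.core_bits & (1 << core))

apr = APREntry()
apr.grant(3)
print(apr.may_access(3), apr.may_access(5))  # True False
apr.shared = True
print(apr.may_access(5))                     # True
```

Because `may_access` is a pure bit test, an access to a way whose bit is clear can be suppressed before the way is even probed, which is where the dynamic energy saving comes from.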
CACHE RECONFIGURATION
Initially, all cores share one cache way: all core bits in the APR are set, giving full access to all threads.
If a core requires more cache ways, it enters Expansion mode.
If a core requires fewer ways than it currently has, it enters Contraction mode.
EXPANSION MODE
Expands the number of ways that a core can access.
This is done by setting the appropriate bit in the APR for each way the core requires.
If any of those ways were previously turned off, they are re-enabled.
CONTRACTION MODE
Reduces the number of ways that a core can access.
Dirty data must be flushed back to main memory before the core's APR bit for those ways is reset.
A contraction bit vector per way tracks which sets still need to be flushed.
While the Flush bit in the APR is set, every core accessing the way flushes dirty data back from the set it is consulting and sets the relevant bit in the contraction vector.
Spreading the work across cores speeds up the contraction process.
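The expansion and contraction steps above can be sketched together; this is a simplified model under the assumption of one permission word and one contraction bit vector per way, with illustrative names throughout.

```python
# Hedged sketch of Expansion and Contraction over a per-way permission
# table (structure and names are illustrative). Expansion sets the
# core's bit and powers the way back on; contraction sets the Flush bit
# so dirty sets are written back before the core's bit is cleared.

class Way:
    def __init__(self, num_sets):
        self.core_bits = 0
        self.powered = False
        self.flush = False
        self.dirty_sets = set()            # sets holding dirty data
        self.flushed = [False] * num_sets  # contraction bit vector

def expand(way, core):
    way.core_bits |= 1 << core
    way.powered = True                     # re-enable if previously off

def begin_contraction(way):
    way.flush = True                       # all cores now help flush

def flush_set(way, set_idx, writeback):
    # Any core touching this set during contraction writes dirty data
    # back and records progress in the contraction bit vector.
    if way.flush and not way.flushed[set_idx]:
        if set_idx in way.dirty_sets:
            writeback(set_idx)
            way.dirty_sets.discard(set_idx)
        way.flushed[set_idx] = True

def finish_contraction(way, core):
    if all(way.flushed):
        way.core_bits &= ~(1 << core)
        way.flush = False
        if way.core_bits == 0:
            way.powered = False            # static energy saving

w = Way(num_sets=4)
expand(w, core=0)
w.dirty_sets = {1, 3}
begin_contraction(w)
written = []
for s in range(4):
    flush_set(w, s, written.append)
finish_contraction(w, core=0)
print(written, w.powered)  # [1, 3] False
```

The bit vector is what lets many cores cooperate: each set is flushed at most once, regardless of which core reaches it first.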
SEPARATING PRIVATE & SHARED DATA
Accesses to the LLC pass through the directory, which determines the type of the requested data: data in the exclusive or modified state is private; data in the shared state is shared.
When looking for private data, a core searches only the private ways it has access to; for shared data, it searches only the shared ways.
When there is no directory entry for a particular data item, the core searches its ways in both the private and shared regions.
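The routing rule above reduces to a small lookup function; the directory is modelled here as a plain mapping from address to MESI-style state, which is an assumption for illustration.

```python
# Illustrative lookup routing: the directory's coherence state decides
# which half of the cache to probe ('M'/'E' => private, 'S' => shared,
# no entry => both). The dict-based directory is a stand-in.

def regions_to_search(directory, addr):
    state = directory.get(addr)            # 'M', 'E', 'S', or None
    if state in ('M', 'E'):
        return ['private']                 # exclusive/modified => private ways
    if state == 'S':
        return ['shared']                  # shared state => shared ways
    return ['private', 'shared']           # no entry: probe both regions

directory = {0x1000: 'E', 0x2000: 'S'}
print(regions_to_search(directory, 0x1000))  # ['private']
print(regions_to_search(directory, 0x3000))  # ['private', 'shared']
```

Probing one region instead of the whole cache on the common case is exactly where the dynamic energy saving of the scheme comes from.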
DATA CATEGORY TRANSITIONING
Situation I: data is in the shared region but a core wants to write to it.
The data is invalidated in the LLC, the core evicts it from its L1 cache, and the data is written into the private region of the LLC.
When another core requests to read the same data, it is written back into the LLC's shared side.
No reads from or writes to main memory are required.
DATA CATEGORY TRANSITIONING
Situation II: data is in the private region but another core wants to read it.
The data must be placed in the shared region.
For data in the modified state: invalidate the data in the LLC, obtain the updated copy from the L1 that owns it, and place it in the shared region.
For data in the exclusive state: first send the data back to the requesting core, then invalidate the existing line and refetch it from main memory, writing it into the shared region.
This incurs overhead due to the involvement of DRAM.
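The two transition cases can be summarised as a small decision function; coherence actions are heavily simplified here, and the action names are purely illustrative labels, not hardware signals from the paper.

```python
# Hedged sketch of the two data-category transitions. Each case returns
# the list of (illustrative) actions taken and the data's new
# (region, state) pair; 'refetch_dram' models the DRAM involvement the
# slide mentions for exclusive-state data.

def handle_request(state, region, op):
    """state: MESI-style 'M'/'E'/'S'; region: 'private'/'shared'."""
    if region == 'shared' and op == 'write':
        # Situation I: invalidate in LLC, write into the private region.
        return ['invalidate_llc', 'write_private'], ('private', 'M')
    if region == 'private' and op == 'remote_read':
        if state == 'M':
            # Situation II, modified: pull the updated copy from the
            # owning L1 and place it in the shared region.
            return (['invalidate_llc', 'fetch_from_l1', 'place_shared'],
                    ('shared', 'S'))
        if state == 'E':
            # Situation II, exclusive: forward data first, then refetch
            # from main memory into the shared region (DRAM overhead).
            return (['forward_to_requester', 'invalidate_llc',
                     'refetch_dram', 'place_shared'],
                    ('shared', 'S'))
    return [], (region, state)          # no transition needed
```

Note that only the exclusive-state path touches DRAM, which matches the slide's claim that Situation I needs no main-memory traffic.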
PRIVATE DATA EXPANSION
After the partitioning decision, the private data portion for each core is set.
Each partition is expanded by a single way after one interval of execution.
It is expanded again if the larger partition achieves at least 10% more performance for its core.
This repeats until no ways are left to expand into or the performance increase falls below 10%.
This allows overlapping partitions without impacting performance.
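The growth loop above can be written down directly; the `perf` oracle here is a stand-in for measuring one interval of execution, so this is a sketch of the policy, not the mechanism.

```python
# Sketch of the private-partition growth loop: grow a core's partition
# one way at a time while each step buys at least 10% performance.
# perf(ways) is an assumed oracle returning measured IPC for that size.

def grow_partition(perf, start_ways, max_ways, threshold=0.10):
    ways = start_ways
    baseline = perf(ways)
    while ways < max_ways:
        candidate = perf(ways + 1)         # measure one interval larger
        if candidate < baseline * (1 + threshold):
            break                          # gain under 10%: stop expanding
        ways += 1
        baseline = candidate
    return ways

# Diminishing returns: a big jump at 3 ways, little after that.
ipc = {2: 1.00, 3: 1.20, 4: 1.25, 5: 1.26}
print(grow_partition(ipc.get, 2, 5))  # 3
```

Stopping at the 10% knee is what lets partitions overlap safely: a core only claims extra ways while they still pay for themselves.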
EXPERIMENTAL METHODOLOGY
Simulator: the partitioned cache architecture was implemented in gem5, simulating an x86-based in-order processor with 8 cores. CACTI at 45 nm was used for energy information.
Workloads: PARSEC benchmarks for multi-threaded applications, and random mixes of benchmarks from SPEC CPU2006 for multi-programmed workloads.
SPEC2006 COMBINATIONS & PARSEC WORKLOADS
SELECTION OF SPEC BENCHMARKS
Benchmarks are grouped based on their misses per kilo-instructions (MPKI): a value above 5 indicates a thrashing application.
Eight benchmarks are selected at random, containing 1 through 7 thrashing applications.
Example: 3T-1 refers to the first group containing 3 thrashing applications.
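The grouping rule above is just a threshold test; the MPKI values below are illustrative placeholders, not measured numbers, and the category labels are ours.

```python
# Simple sketch of the benchmark grouping rule from the slide:
# an application with MPKI above 5 is treated as thrashing.
# MPKI values here are illustrative, not measured.

def classify(mpki_table, threshold=5.0):
    return {name: ('thrashing' if mpki > threshold else 'cache-friendly')
            for name, mpki in mpki_table.items()}

mpki = {'mcf': 60.0, 'lbm': 30.0, 'povray': 0.1, 'gcc': 6.0}
print(classify(mpki))
```

An "nT" mix is then built by sampling n names from the thrashing bucket and 8 − n from the rest.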
EXPERIMENTAL METHODOLOGY
Comparison Approaches:
Thrasher Caging (TC) – a policy for containing thrashing workloads.
TA-DRRIP – a state-of-the-art cache replacement policy targeting high performance.
EVALUATION
Evaluate performance and energy consumption for the multi-threaded and multi-programmed workloads on 8- and 16-way caches.
Evaluate the effectiveness of the Flush bit in the APR by comparing three different approaches to contraction.
Evaluate the number of data blocks transitioning from private to shared and from shared to private.
PERFORMANCE & ENERGY RESULTS
RECAP achieves 15% better performance than the LRU scheme for the SPEC application mixes using an 8-way cache.
For those mixes, RECAP consumes 83% of the dynamic energy and 87% of the static energy of the baseline.
For PARSEC workloads, RECAP consumes on average 83% of the dynamic and 59% of the static energy.
CYCLES REQUIRED FOR CONTRACTION
Results (SPEC application mixes): only 10M cycles are required for the All scheme, versus 65M cycles for the Contracting-only scheme and 59M for the Contracting-and-Cohabiting scheme. The All scheme is 85% faster than the Contracting-only scheme.
Results (PARSEC workloads): 13M, 77M, and 69M cycles respectively for the same schemes.
DATA CATEGORY TRANSITIONING
On average, 8M transitions from private to shared occur.
This is small compared to the overall number of LLC accesses.
The additional power overhead of refetching data is subsumed by the large energy savings.
CONCLUSION
PROS:
Achieves 17% dynamic and 13% static energy savings with no performance slowdown across 21 mixes of SPEC CPU2006 applications on an 8-core system with an 8-way LLC.
LLC partitioning is possible even when the way-to-core ratio is low.
CONS:
Requires extra bits for the APR and the contraction bit vectors.
The RECAP cache consumes more power than a standard LLC.
Extra latency on cache accesses due to the directory lookup.