
Page 1:

RECAP: REGION-AWARE CACHE PARTITIONING

- Raj Muchhala
- Susmit Joshi

Page 2:

OVERVIEW

Motivation
Introduction
RECAP Architecture
Experimental Methodology
Evaluation
Conclusion

Page 3:

MOTIVATION

High-performance computing systems now have many cores that share a last-level cache (LLC).

Many LLC partitioning schemes proposed over the last ten years have shown the potential to increase performance.

These schemes partition based on the requirements of each core: the idea is to give each core its own share of the cache resources.

However, they did not consider LLC energy savings, and they required the number of cache ways to exceed the number of sharing cores to be effective.

Page 4:

MOTIVATION (CONTD.)

The majority (80%) of accesses request shared data.

But 93% of the data in the cache is actually private data.

Page 5:

INTRODUCTION (RECAP)

Identify and separate private and shared data: private data is placed on the left-hand side of the cache, shared data on the right-hand side.

For private requests, only ways on the left of the cache are activated; for shared requests, only ways on the right are activated.

This yields significant dynamic energy savings and supports a core-to-cache-way ratio of 1:1.
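As a minimal sketch of this region-aware lookup, the snippet below assumes a hypothetical 8-way set in which the left half of the ways holds private data and the right half holds shared data; only the ways of the matching region are probed. Names and sizes are illustrative, not taken from the paper.

```python
# Illustrative sketch: probe only the ways of the region that matches the
# request type, so the other region's data arrays stay inactive.
# All names and sizes here are hypothetical, not taken from the paper.

NUM_WAYS = 8
PRIVATE_WAYS = range(0, NUM_WAYS // 2)        # left-hand side of the cache
SHARED_WAYS = range(NUM_WAYS // 2, NUM_WAYS)  # right-hand side


def lookup(cache_set, tag, is_shared_request):
    """Search only the ways belonging to the request's region."""
    ways = SHARED_WAYS if is_shared_request else PRIVATE_WAYS
    for way in ways:
        line = cache_set[way]
        if line is not None and line["tag"] == tag:
            return line  # hit: only half of the ways were activated
    return None          # miss within the matching region


# Example: an 8-way set with one private and one shared line.
cache_set = [None] * NUM_WAYS
cache_set[1] = {"tag": 0x1A, "data": "private block"}
cache_set[6] = {"tag": 0x2B, "data": "shared block"}

print(lookup(cache_set, 0x1A, is_shared_request=False))  # hits in private region
print(lookup(cache_set, 0x2B, is_shared_request=True))   # hits in shared region
```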

Page 6:

EXAMPLE DATA LAYOUT IN RECAP

The cache is made up of sets, each of which spans multiple cache ways.

Cache ways are equal-sized sections of the cache, each with a size equal to the cache page size.

Page 7:

OVERVIEW OF RECAP ARCHITECTURE

Page 8:

RECAP ARCHITECTURE

Usage Monitoring
Cache Partitioning Control
Cache Reconfiguration
Separating Private and Shared Data
Data Category Transitioning
Private Data Expansion

Page 9:

USAGE MONITORING

Need to determine the optimal number of ways for each core to achieve the highest utilization.

Utility monitors track the accesses made by each core to characterize each thread's use of the cache.

With equal-priority threads, each core gets a number of ways based on the performance benefit it can realize.

With differing-priority threads, high-priority threads get more ways and low-priority threads get fewer.
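A minimal sketch of a utility monitor in this spirit, assuming a hypothetical per-core shadow LRU stack whose per-position hit counters estimate how many hits the core would capture with 1, 2, ..., N ways (illustrative only, not the paper's implementation):

```python
# Illustrative utility monitor: a per-core shadow LRU stack whose hit
# counters estimate the benefit of giving that core 1, 2, ... N ways.
# Names and parameters are hypothetical.

NUM_WAYS = 8


class UtilityMonitor:
    def __init__(self, num_ways=NUM_WAYS):
        self.stack = []                        # shadow tags, MRU at index 0
        self.hits_at_position = [0] * num_ways
        self.num_ways = num_ways

    def access(self, tag):
        if tag in self.stack:
            pos = self.stack.index(tag)        # LRU stack position of the hit
            self.hits_at_position[pos] += 1
            self.stack.remove(tag)
        elif len(self.stack) == self.num_ways:
            self.stack.pop()                   # evict the shadow LRU entry
        self.stack.insert(0, tag)              # move/insert tag at MRU

    def marginal_utility(self):
        """Cumulative hit counts: hits captured with 1, 2, ..., N ways."""
        total, curve = 0, []
        for h in self.hits_at_position:
            total += h
            curve.append(total)
        return curve


mon = UtilityMonitor()
for tag in [1, 2, 3, 1, 2, 4, 1, 5, 1, 2]:
    mon.access(tag)
print(mon.marginal_utility())   # e.g. [0, 1, 4, 5, 5, 5, 5, 5]
```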

Page 10:

CACHE PARTITIONING CONTROL

The Access Permission Register (APR) controls each core's accesses to the LLC.

Each way's APR holds one R/W bit per core, plus a shared bit and a flush bit.

The shared bit marks a way as shared by all cores. The flush bit is used when contracting a core's cache allocation.

The APR restricts each core to accessing only certain ways, yielding both dynamic and static energy savings.

APR layout per way: C0 | C1 | ... | FL | S
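As a minimal sketch, the APR for one way can be modeled as a small bit field with one access bit per core plus the FL and S bits; checking whether a core may touch the way is then a single bit test. The encoding below is hypothetical (illustrative only):

```python
# Illustrative Access Permission Register (APR) for one cache way:
# one access bit per core, plus a flush (FL) bit and a shared (S) bit.
# The exact encoding is hypothetical.

NUM_CORES = 8
FLUSH_BIT = NUM_CORES        # bit position of FL (used during contraction)
SHARED_BIT = NUM_CORES + 1   # bit position of S


def grant(apr, core):
    return apr | (1 << core)    # allow this core to access the way


def revoke(apr, core):
    return apr & ~(1 << core)   # remove this core's access


def may_access(apr, core):
    # A core may use the way if its own bit or the shared bit is set.
    return bool(apr & (1 << core)) or bool(apr & (1 << SHARED_BIT))


apr = 0
apr = grant(apr, 0)                              # core 0 gets this way
print(may_access(apr, 0), may_access(apr, 3))    # True False
apr |= 1 << SHARED_BIT                           # mark the way shared by all cores
print(may_access(apr, 3))                        # True via the shared bit
apr = revoke(apr, 0)                             # core 0 loses its private access
print(may_access(apr, 0))                        # still True while the shared bit is set
```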

Page 11:

CACHE RECONFIGURATION

Initially all cores share one cache way.

All core bits in the APR are set, giving full access to all threads.

If a core requires more cache ways, it enters Expansion mode.

If a core requires fewer ways than it currently has, it enters Contraction mode.

Page 12:

EXPANSION MODE

Expands the number of ways that a core can access.

Done by setting the appropriate bit in the APR for each way that the core requires.

If any of the ways were previously turned off, they are enabled once more.

Page 13:

CONTRACTION MODE

Reduces the number of ways that a core can access.

Dirty data must be flushed back to main memory before the core's APR bit for those ways is reset.

A contraction bit vector for each way tracks the sets that still need to be flushed.

When the flush bit in the APR is set, every core that accesses the corresponding way flushes data back from the set it is consulting and sets the relevant bit in the contraction vector, which speeds up the contraction process.
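A minimal sketch of this bookkeeping, assuming a hypothetical one-bit-per-set contraction vector: normal accesses opportunistically flush and mark sets while the flush bit is set, and a final sweep (an assumption here) handles any sets that no access happened to touch (illustrative only):

```python
# Illustrative contraction bookkeeping for one way being reclaimed.
# One bit per set records whether that set has already been flushed.
# Data structures and names are hypothetical.

NUM_SETS = 16


class ContractingWay:
    def __init__(self, num_sets=NUM_SETS):
        self.flush_bit = True                  # FL bit set in the APR
        self.flushed = [False] * num_sets      # contraction bit vector
        self.dirty = {3, 7, 11}                # sets holding dirty lines (example)

    def on_access(self, set_index):
        """Called whenever any core consults this way while FL is set."""
        if self.flush_bit and not self.flushed[set_index]:
            if set_index in self.dirty:
                self.dirty.discard(set_index)  # model the write-back of the dirty line
            self.flushed[set_index] = True     # mark the set as handled

    def finish(self):
        """Final sweep over the sets that no access happened to touch."""
        for s in range(len(self.flushed)):
            self.on_access(s)
        self.flush_bit = False                 # contraction complete


way = ContractingWay()
for s in [0, 3, 7, 3]:                         # normal traffic does most of the work
    way.on_access(s)
way.finish()
print(way.dirty, all(way.flushed))             # set() True
```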

Page 14:

SEPARATING PRIVATE & SHARED DATA

Accesses to the LLC pass through the directory, which is used to determine the requested data's type: data in the exclusive or modified state is private, and data in the shared state is shared.

When looking for private data, a core searches only the private ways it has access to; for shared data, it searches only the shared ways.

When there is no directory entry for a particular data item, the core searches its ways in both the private and shared regions.
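A minimal sketch of the directory-guided search, assuming hypothetical MESI-style directory states; the state selects the region to probe, and a missing entry falls back to probing both regions (illustrative only):

```python
# Illustrative directory-guided region selection. The directory state of a
# block decides which LLC region a core searches. Names are hypothetical.

directory = {
    0x100: "M",   # modified  -> private data
    0x200: "E",   # exclusive -> private data
    0x300: "S",   # shared    -> shared data
}


def regions_to_search(block_addr):
    state = directory.get(block_addr)
    if state in ("M", "E"):
        return ["private"]            # probe only the private ways
    if state == "S":
        return ["shared"]             # probe only the shared ways
    return ["private", "shared"]      # no entry: probe both regions


print(regions_to_search(0x100))   # ['private']
print(regions_to_search(0x300))   # ['shared']
print(regions_to_search(0x999))   # ['private', 'shared']
```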

Page 15:

DATA CATEGORY TRANSITIONING

Situation I: data is in the shared region, but a core wants to write to it.

The data is invalidated in the LLC, the core then evicts it from its L1 cache, and the data is written into the private region of the LLC.

If another core later requests to read the same data, it is written back into the LLC's shared side. No reads or writes to main memory are needed.

Page 16:

DATA CATEGORY TRANSITIONING

Situation II: data is in the private region, but another core wants to read it.

The data must be placed in the shared region.

Modified-state data: invalidate the data in the LLC, obtain the updated copy from the L1 that owns it, and place it in the shared region.

Exclusive-state data: first send the data back to the requesting core, then invalidate the existing line and refetch it from main memory, writing it into the shared region.

The exclusive-state case incurs overhead because DRAM is involved.
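A minimal sketch of the two transitions, modeling only where the block ends up and whether DRAM is touched rather than the actual coherence messages; states, region labels, and return values are hypothetical (illustrative only):

```python
# Illustrative model of data-category transitions between the private and
# shared LLC regions. States, regions, and return values are hypothetical.

def write_to_shared_block(block):
    """Situation I: a core writes data currently in the shared region."""
    block["region"] = "private"        # invalidated in the LLC, rewritten privately
    block["state"] = "M"
    return {"dram_accesses": 0}        # no main-memory traffic needed


def read_private_block(block):
    """Situation II: another core reads data currently in the private region."""
    if block["state"] == "M":
        # Get the up-to-date copy from the owning L1, place it in the shared region.
        block["region"], block["state"] = "shared", "S"
        return {"dram_accesses": 0}
    # Exclusive: serve the requester, invalidate, refetch from main memory.
    block["region"], block["state"] = "shared", "S"
    return {"dram_accesses": 1}        # overhead: DRAM is involved


blk = {"region": "shared", "state": "S"}
print(write_to_shared_block(blk), blk)     # moves to private, no DRAM
blk["state"] = "E"
print(read_private_block(blk), blk)        # exclusive case touches DRAM
```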

Page 17:

PRIVATE DATA EXPANSION

After the partitioning decision, the private-data portion for each core is set.

Each partition is expanded by a single way after executing for one interval.

The partition expands again if the larger size achieves at least 10% more performance for its core.

This repeats until there are no more ways left to expand or the performance increase falls below 10%.

This allows partitions to overlap without impacting performance.
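A minimal sketch of this expansion loop, assuming a hypothetical measure_performance(ways) callback that stands in for running the core for one interval; the diminishing-returns numbers are made up, and only the 10% threshold comes from the slide:

```python
# Illustrative private-data expansion loop: keep adding one way per interval
# while the performance gain stays at or above 10%. The performance model
# below is a made-up diminishing-returns curve.

MAX_WAYS = 8
THRESHOLD = 0.10


def measure_performance(ways):
    # Hypothetical stand-in for running one interval with this many ways.
    curve = {1: 1.00, 2: 1.20, 3: 1.35, 4: 1.42,
             5: 1.44, 6: 1.45, 7: 1.45, 8: 1.45}
    return curve[ways]


def expand_private_partition(initial_ways=1):
    ways = initial_ways
    perf = measure_performance(ways)
    while ways < MAX_WAYS:
        trial_perf = measure_performance(ways + 1)   # expand by a single way
        if (trial_perf - perf) / perf < THRESHOLD:
            break                                    # gain below 10%: stop expanding
        ways, perf = ways + 1, trial_perf
    return ways


print(expand_private_partition())   # settles on 3 ways with this example curve
```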

Page 18:

EXPERIMENTAL METHODOLOGY

Simulator: the partitioned cache architecture was implemented in gem5, simulating an x86-based in-order processor with 8 cores. For energy information, CACTI was used at 45 nm.

Workloads: PARSEC benchmarks for multi-threaded applications, and random mixes of SPEC CPU2006 benchmarks for multi-programmed workloads.

Page 19:

SPEC2006 COMBINATIONS & PARSEC WORKLOADS

Page 20:

SELECTION OF SPEC BENCHMARKS

Benchmarks are grouped based on their misses per kilo-instructions (MPKI).

Benchmarks with an MPKI above 5 are considered thrashing applications. Each mix is a random selection of 8 benchmarks containing 1 through 7 thrashing applications.

Example: 3T-1 refers to the first mix containing 3 thrashing applications.
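A minimal sketch of this classification and mix-naming convention; the MPKI values below are made up for the example (illustrative only):

```python
# Illustrative classification and labeling of SPEC mixes by thrashing count.
# MPKI values below are hypothetical, chosen just for the example.

MPKI_THRESHOLD = 5


def label_mix(mix, mix_index):
    thrashing = sum(1 for mpki in mix.values() if mpki > MPKI_THRESHOLD)
    return f"{thrashing}T-{mix_index}"


mix = {  # hypothetical 8-benchmark mix with MPKI values
    "mcf": 18.2, "lbm": 22.5, "soplex": 9.1,     # thrashing (MPKI > 5)
    "gcc": 2.3, "bzip2": 1.1, "namd": 0.2,
    "povray": 0.1, "h264ref": 0.9,
}
print(label_mix(mix, 1))   # "3T-1": first mix with 3 thrashing applications
```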

Page 21:

EXPERIMENTAL METHODOLOGY

Comparison Approaches:

Two comparison techniques:

Thrasher Caging (TC): a policy for containing thrashing workloads.

TA-DRRIP: a state-of-the-art cache replacement policy targeting high performance.

Page 22:

EVALUATION

Evaluate the performance and energy consumption of the multi-threaded and multi-programmed workloads for 8-way and 16-way caches.

Evaluate the effectiveness of the flush bit within APR by comparing three different approaches to contraction.

Evaluate the number of data blocks transitioning from private to shared and from shared to private.

Page 23:

Page 24:

PERFORMANCE & ENERGY

Page 25:

PERFORMANCE & ENERGY RESULTS

RECAP achieves 15% better performance for the SPEC application mixes using an 8-way cache compared to an LRU scheme.

For the SPEC application mixes using an 8-way cache, RECAP consumes 83% of the dynamic energy and 87% of the static energy of the baseline (i.e., 17% dynamic and 13% static energy savings).

For the PARSEC workloads, RECAP consumes on average 83% of the dynamic and 59% of the static energy.

Page 26:

CYCLES REQUIRED FOR CONTRACTION

Page 27:

CYCLES REQUIRED FOR CONTRACTION

Results (SPEC application mixes): only 10M cycles are required for the All scheme, versus 65M cycles for the Contracting-only scheme and 59M for the Contracting-and-Cohabiting scheme. The All scheme is therefore about 85% faster than the Contracting-only scheme ((65M - 10M) / 65M ≈ 85%).

Results (PARSEC workloads): 13M, 77M, and 69M cycles are required for the same three schemes, respectively.

Page 28:

DATA CATEGORY TRANSITIONING

On average, 8M transitions occur from private to shared.

This is small compared to the overall number of LLC accesses.

The additional power overhead from refetching data is subsumed by the large energy savings.

Page 29:

CONCLUSION

PROS:

Achieves 17% dynamic and 13% static energy savings with no slowdown in performance across 21 mixes of SPEC CPU2006 applications on an 8-core system with an 8-way LLC.

LLC partitioning is possible even when the way-to-core ratio is low.

CONS:

Requires extra bits for the APR and contraction bit vectors.

The RECAP cache consumes more power than a standard LLC.

Cache accesses incur extra latency due to the directory lookup.