
Page 1:

RECAP: REGION-AWARE CACHE PARTITIONING

- Raj Muchhala
- Susmit Joshi

Page 2:

OVERVIEW

Motivation
Introduction
RECAP Architecture
Experimental Methodology
Evaluation
Conclusion

Page 3:

MOTIVATION

High-performance computing systems now have many cores that share a last-level cache (LLC).

Many LLC partitioning schemes proposed over the last ten years have shown the potential to increase performance.

These schemes partition based on the requirements of each core: the idea is to give each core its own share of the cache resources.

However, they did not consider LLC energy savings, and they required the number of cache ways to exceed the number of sharing cores to be effective.

Page 4:

MOTIVATION (CONTD.)

The majority (80%) of accesses request shared data.

But 93% of the data in the cache is actually private data.

Page 5:

INTRODUCTION (RECAP)

Identify and separate private and shared data: private data is placed on the left-hand side of the cache, shared data on the right-hand side.

For private requests, only ways on the left of the cache are activated; for shared requests, only ways on the right are activated.

This yields significant dynamic energy savings and supports a core-to-cache-way ratio of 1:1.
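As a minimal sketch of this region-aware lookup, the snippet below assumes a hypothetical 8-way set in which the left half of the ways holds private data and the right half holds shared data; only the ways of the matching region are probed. Names and sizes are illustrative, not taken from the paper.

```python
# Illustrative sketch: probe only the ways of the region that matches the
# request type, so the other region's data arrays stay inactive.
# All names and sizes here are hypothetical, not taken from the paper.

NUM_WAYS = 8
PRIVATE_WAYS = range(0, NUM_WAYS // 2)        # left-hand side of the cache
SHARED_WAYS = range(NUM_WAYS // 2, NUM_WAYS)  # right-hand side


def lookup(cache_set, tag, is_shared_request):
    """Search only the ways belonging to the request's region."""
    ways = SHARED_WAYS if is_shared_request else PRIVATE_WAYS
    for way in ways:
        line = cache_set[way]
        if line is not None and line["tag"] == tag:
            return line  # hit: only half of the ways were activated
    return None          # miss within the matching region


# Example: an 8-way set with one private and one shared line.
cache_set = [None] * NUM_WAYS
cache_set[1] = {"tag": 0x1A, "data": "private block"}
cache_set[6] = {"tag": 0x2B, "data": "shared block"}

print(lookup(cache_set, 0x1A, is_shared_request=False))  # hits in private region
print(lookup(cache_set, 0x2B, is_shared_request=True))   # hits in shared region
```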

Page 6:

EXAMPLE DATA LAYOUT IN RECAP

The cache is made up of sets, each of which spans multiple cache ways.

Cache ways are equal-sized sections of the cache, each with a size equal to the cache page size.

Page 7:

OVERVIEW OF RECAP ARCHITECTURE

Page 8:

RECAP ARCHITECTURE

Usage Monitoring
Cache Partitioning Control
Cache Reconfiguration
Separating Private and Shared Data
Data Category Transitioning
Private Data Expansion

Page 9:

USAGE MONITORING

Need to determine the optimal number of ways for each core to achieve the highest utilization.

Utility monitors track the accesses made by each core to characterize each thread's use of the cache.

With equal-priority threads, each core gets a number of ways based on the performance benefit it can realize.

With differing-priority threads, high-priority threads get more ways and low-priority threads get fewer.
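A minimal sketch of a utility monitor in this spirit, assuming a hypothetical per-core shadow LRU stack whose per-position hit counters estimate how many hits the core would capture with 1, 2, ..., N ways (illustrative only, not the paper's implementation):

```python
# Illustrative utility monitor: a per-core shadow LRU stack whose hit
# counters estimate the benefit of giving that core 1, 2, ... N ways.
# Names and parameters are hypothetical.

NUM_WAYS = 8


class UtilityMonitor:
    def __init__(self, num_ways=NUM_WAYS):
        self.stack = []                        # shadow tags, MRU at index 0
        self.hits_at_position = [0] * num_ways
        self.num_ways = num_ways

    def access(self, tag):
        if tag in self.stack:
            pos = self.stack.index(tag)        # LRU stack position of the hit
            self.hits_at_position[pos] += 1
            self.stack.remove(tag)
        elif len(self.stack) == self.num_ways:
            self.stack.pop()                   # evict the shadow LRU entry
        self.stack.insert(0, tag)              # move/insert tag at MRU

    def marginal_utility(self):
        """Cumulative hit counts: hits captured with 1, 2, ..., N ways."""
        total, curve = 0, []
        for h in self.hits_at_position:
            total += h
            curve.append(total)
        return curve


mon = UtilityMonitor()
for tag in [1, 2, 3, 1, 2, 4, 1, 5, 1, 2]:
    mon.access(tag)
print(mon.marginal_utility())   # e.g. [0, 1, 4, 5, 5, 5, 5, 5]
```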

Page 10:

CACHE PARTITIONING CONTROL

The Access Permission Register (APR) controls each core's accesses to the LLC.

Each way's APR holds one R/W bit per core, plus a shared bit and a flush bit.

The shared bit marks a way as shared by all cores. The flush bit is used when contracting a core's cache allocation.

The APR restricts each core to accessing only certain ways, yielding both dynamic and static energy savings.

APR layout per way: C0 | C1 | ... | FL | S
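As a minimal sketch, the APR for one way can be modeled as a small bit field with one access bit per core plus the FL and S bits; checking whether a core may touch the way is then a single bit test. The encoding below is hypothetical (illustrative only):

```python
# Illustrative Access Permission Register (APR) for one cache way:
# one access bit per core, plus a flush (FL) bit and a shared (S) bit.
# The exact encoding is hypothetical.

NUM_CORES = 8
FLUSH_BIT = NUM_CORES        # bit position of FL (used during contraction)
SHARED_BIT = NUM_CORES + 1   # bit position of S


def grant(apr, core):
    return apr | (1 << core)    # allow this core to access the way


def revoke(apr, core):
    return apr & ~(1 << core)   # remove this core's access


def may_access(apr, core):
    # A core may use the way if its own bit or the shared bit is set.
    return bool(apr & (1 << core)) or bool(apr & (1 << SHARED_BIT))


apr = 0
apr = grant(apr, 0)                              # core 0 gets this way
print(may_access(apr, 0), may_access(apr, 3))    # True False
apr |= 1 << SHARED_BIT                           # mark the way shared by all cores
print(may_access(apr, 3))                        # True via the shared bit
apr = revoke(apr, 0)                             # core 0 loses its private access
print(may_access(apr, 0))                        # still True while the shared bit is set
```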

Page 11:

CACHE RECONFIGURATION

Initially all cores share one cache way.

All core bits in the APR are set, giving full access to all threads.

If a core requires more cache ways, it enters Expansion mode.

If a core requires fewer ways than it currently has, it enters Contraction mode.

Page 12:

EXPANSION MODE

Expands the number of ways that a core can access.

Done by setting the appropriate bit in the APR for each way that the core requires.

If any of the ways were previously turned off, they are enabled once more.

Page 13:

CONTRACTION MODE

Reduces the number of ways that a core can access.

Dirty data must be flushed back to main memory before the core's APR bit for those ways is reset.

A contraction bit vector for each way tracks the sets that still need to be flushed.

When the flush bit in the APR is set, every core that accesses the corresponding way flushes data back from the set it is consulting and sets the relevant bit in the contraction vector, which speeds up the contraction process.
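A minimal sketch of this bookkeeping, assuming a hypothetical one-bit-per-set contraction vector: normal accesses opportunistically flush and mark sets while the flush bit is set, and a final sweep (an assumption here) handles any sets that no access happened to touch (illustrative only):

```python
# Illustrative contraction bookkeeping for one way being reclaimed.
# One bit per set records whether that set has already been flushed.
# Data structures and names are hypothetical.

NUM_SETS = 16


class ContractingWay:
    def __init__(self, num_sets=NUM_SETS):
        self.flush_bit = True                  # FL bit set in the APR
        self.flushed = [False] * num_sets      # contraction bit vector
        self.dirty = {3, 7, 11}                # sets holding dirty lines (example)

    def on_access(self, set_index):
        """Called whenever any core consults this way while FL is set."""
        if self.flush_bit and not self.flushed[set_index]:
            if set_index in self.dirty:
                self.dirty.discard(set_index)  # model the write-back of the dirty line
            self.flushed[set_index] = True     # mark the set as handled

    def finish(self):
        """Final sweep over the sets that no access happened to touch."""
        for s in range(len(self.flushed)):
            self.on_access(s)
        self.flush_bit = False                 # contraction complete


way = ContractingWay()
for s in [0, 3, 7, 3]:                         # normal traffic does most of the work
    way.on_access(s)
way.finish()
print(way.dirty, all(way.flushed))             # set() True
```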

Page 14:

SEPARATING PRIVATE & SHARED DATA

Accesses to the LLC pass through the directory, which is used to determine the requested data's type: data in the exclusive or modified state is private, and data in the shared state is shared.

When looking for private data, a core searches only the private ways it has access to; for shared data, it searches only the shared ways.

When there is no directory entry for a particular data item, the core searches its ways in both the private and shared regions.
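A minimal sketch of the directory-guided search, assuming hypothetical MESI-style directory states; the state selects the region to probe, and a missing entry falls back to probing both regions (illustrative only):

```python
# Illustrative directory-guided region selection. The directory state of a
# block decides which LLC region a core searches. Names are hypothetical.

directory = {
    0x100: "M",   # modified  -> private data
    0x200: "E",   # exclusive -> private data
    0x300: "S",   # shared    -> shared data
}


def regions_to_search(block_addr):
    state = directory.get(block_addr)
    if state in ("M", "E"):
        return ["private"]            # probe only the private ways
    if state == "S":
        return ["shared"]             # probe only the shared ways
    return ["private", "shared"]      # no entry: probe both regions


print(regions_to_search(0x100))   # ['private']
print(regions_to_search(0x300))   # ['shared']
print(regions_to_search(0x999))   # ['private', 'shared']
```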

Page 15:

DATA CATEGORY TRANSITIONING

Situation I: data is in the shared region, but a core wants to write to it.

The data is invalidated in the LLC, the core then evicts it from its L1 cache, and the data is written into the private region of the LLC.

If another core later requests to read the same data, it is written back into the LLC's shared side. No reads or writes to main memory are needed.

Page 16:

DATA CATEGORY TRANSITIONING

Situation II: data is in the private region, but another core wants to read it.

The data must be placed in the shared region.

Modified-state data: invalidate the data in the LLC, obtain the updated copy from the L1 that owns it, and place it in the shared region.

Exclusive-state data: first send the data back to the requesting core, then invalidate the existing line and refetch it from main memory, writing it into the shared region.

The exclusive-state case incurs overhead because DRAM is involved.
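A minimal sketch of the two transitions, modeling only where the block ends up and whether DRAM is touched rather than the actual coherence messages; states, region labels, and return values are hypothetical (illustrative only):

```python
# Illustrative model of data-category transitions between the private and
# shared LLC regions. States, regions, and return values are hypothetical.

def write_to_shared_block(block):
    """Situation I: a core writes data currently in the shared region."""
    block["region"] = "private"        # invalidated in the LLC, rewritten privately
    block["state"] = "M"
    return {"dram_accesses": 0}        # no main-memory traffic needed


def read_private_block(block):
    """Situation II: another core reads data currently in the private region."""
    if block["state"] == "M":
        # Get the up-to-date copy from the owning L1, place it in the shared region.
        block["region"], block["state"] = "shared", "S"
        return {"dram_accesses": 0}
    # Exclusive: serve the requester, invalidate, refetch from main memory.
    block["region"], block["state"] = "shared", "S"
    return {"dram_accesses": 1}        # overhead: DRAM is involved


blk = {"region": "shared", "state": "S"}
print(write_to_shared_block(blk), blk)     # moves to private, no DRAM
blk["state"] = "E"
print(read_private_block(blk), blk)        # exclusive case touches DRAM
```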

Page 17:

PRIVATE DATA EXPANSION

After the partitioning decision, the private-data portion for each core is set.

Each partition is expanded by a single way after executing for one interval.

The partition expands again if the larger size achieves at least 10% more performance for its core.

This repeats until there are no more ways left to expand or the performance increase falls below 10%.

This allows partitions to overlap without impacting performance.
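A minimal sketch of this expansion loop, assuming a hypothetical measure_performance(ways) callback that stands in for running the core for one interval; the diminishing-returns numbers are made up, and only the 10% threshold comes from the slide:

```python
# Illustrative private-data expansion loop: keep adding one way per interval
# while the performance gain stays at or above 10%. The performance model
# below is a made-up diminishing-returns curve.

MAX_WAYS = 8
THRESHOLD = 0.10


def measure_performance(ways):
    # Hypothetical stand-in for running one interval with this many ways.
    curve = {1: 1.00, 2: 1.20, 3: 1.35, 4: 1.42,
             5: 1.44, 6: 1.45, 7: 1.45, 8: 1.45}
    return curve[ways]


def expand_private_partition(initial_ways=1):
    ways = initial_ways
    perf = measure_performance(ways)
    while ways < MAX_WAYS:
        trial_perf = measure_performance(ways + 1)   # expand by a single way
        if (trial_perf - perf) / perf < THRESHOLD:
            break                                    # gain below 10%: stop expanding
        ways, perf = ways + 1, trial_perf
    return ways


print(expand_private_partition())   # settles on 3 ways with this example curve
```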

Page 18:

EXPERIMENTAL METHODOLOGY

Simulator: the partitioned cache architecture was implemented in gem5, simulating an x86-based in-order processor with 8 cores. For energy information, CACTI was used at 45 nm.

Workloads: PARSEC benchmarks for multi-threaded applications, and random mixes of SPEC CPU2006 benchmarks for multi-programmed workloads.

Page 19:

SPEC2006 COMBINATIONS & PARSEC WORKLOADS

Page 20:

SELECTION OF SPEC BENCHMARKS

Benchmarks are grouped based on their misses per kilo-instructions (MPKI).

Benchmarks with an MPKI above 5 are considered thrashing applications. Each mix is a random selection of 8 benchmarks containing 1 through 7 thrashing applications.

Example: 3T-1 refers to the first mix containing 3 thrashing applications.
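A minimal sketch of this classification and mix-naming convention; the MPKI values below are made up for the example (illustrative only):

```python
# Illustrative classification and labeling of SPEC mixes by thrashing count.
# MPKI values below are hypothetical, chosen just for the example.

MPKI_THRESHOLD = 5


def label_mix(mix, mix_index):
    thrashing = sum(1 for mpki in mix.values() if mpki > MPKI_THRESHOLD)
    return f"{thrashing}T-{mix_index}"


mix = {  # hypothetical 8-benchmark mix with MPKI values
    "mcf": 18.2, "lbm": 22.5, "soplex": 9.1,     # thrashing (MPKI > 5)
    "gcc": 2.3, "bzip2": 1.1, "namd": 0.2,
    "povray": 0.1, "h264ref": 0.9,
}
print(label_mix(mix, 1))   # "3T-1": first mix with 3 thrashing applications
```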

Page 21:

EXPERIMENTAL METHODOLOGY

Comparison Approaches:

Two comparison techniques:

Thrasher Caging (TC): a policy for containing thrashing workloads.

TA-DRRIP: a state-of-the-art cache replacement policy targeting high performance.

Page 22:

EVALUATION

Evaluate the performance and energy consumption of the multi-threaded and multi-programmed workloads for 8-way and 16-way caches.

Evaluate the effectiveness of the flush bit within APR by comparing three different approaches to contraction.

Evaluate the number of data blocks transitioning from private to shared and from shared to private.

Page 23:

Page 24:

PERFORMANCE & ENERGY

Page 25:

PERFORMANCE & ENERGY RESULTS

RECAP achieves 15% better performance for the SPEC application mixes using an 8-way cache compared to an LRU scheme.

For the SPEC application mixes using an 8-way cache, RECAP consumes 83% of the dynamic energy and 87% of the static energy of the baseline (i.e., 17% dynamic and 13% static energy savings).

For the PARSEC workloads, RECAP consumes on average 83% of the dynamic and 59% of the static energy.

Page 26:

CYCLES REQUIRED FOR CONTRACTION

Page 27:

CYCLES REQUIRED FOR CONTRACTION

Results (SPEC application mixes): only 10M cycles are required for the All scheme, versus 65M cycles for the Contracting-only scheme and 59M for the Contracting-and-Cohabiting scheme. The All scheme is therefore about 85% faster than the Contracting-only scheme ((65M - 10M) / 65M ≈ 85%).

Results (PARSEC workloads): 13M, 77M, and 69M cycles are required for the same three schemes, respectively.

Page 28:

DATA CATEGORY TRANSITIONING

On average, 8M transitions occur from private to shared.

This is small compared to the overall number of LLC accesses.

The additional power overhead from refetching data is subsumed by the large energy savings.

Page 29:

CONCLUSION

PROS:

Achieves 17% dynamic and 13% static energy savings with no slowdown in performance across 21 mixes of SPEC CPU2006 applications on an 8-core system with an 8-way LLC.

LLC partitioning is possible even when the way-to-core ratio is low.

CONS:

Requires extra bits for the APR and contraction bit vectors.

The RECAP cache consumes more power than a standard LLC.

Cache accesses incur extra latency due to the directory lookup.