Cooperative Caching for Chip Multiprocessors
Jichuan Chang†, Enric Herrero‡, Ramon Canal‡ and Gurindar S. Sohi*
HP Labs†
Universitat Politècnica de Catalunya‡
University of Wisconsin-Madison*
M. S. Obaidat and S. Misra (Editors), Chapter 13, Cooperative Networking (Wiley)
2
Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
1) Latency Reduction
2) Adaptive Repartitioning
3) Performance Isolation
4. Conclusions
3
Motivation – Background
• Chip multiprocessors (CMPs) both require and enable innovative on-chip cache designs
• Critical for CMPs
– Processor/memory gap
– Limited pin-bandwidth
• Current designs
– Shared cache: sharing can lead to contention
– Private caches: isolation can waste resources
[Figure: a 4-core CMP connected to memory through a slow, narrow off-chip interface; a shared cache suffers capacity contention, while private caches leave capacity wasted.]
4
Motivation – Challenges
• Key challenges
– Growing on-chip wire delay
– Expensive off-chip accesses
– Destructive inter-thread interference
– Diverse workload characteristics
• Three important demands for CMP caching
– Capacity: reduce off-chip accesses
– Latency: reduce remote on-chip references
– Isolation: reduce inter-thread interference
• Need to combine the strengths of both private and shared cache designs
5
Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
1) Latency Reduction
2) Adaptive Repartitioning
3) Performance Isolation
4. Conclusions
6
CMP Cooperative Caching
• Form an aggregate global cache via cooperative private caches
– Use private caches to attract data for fast reuse
– Share capacity through cooperative policies
– Throttle cooperation to find an optimal sharing point
• Inspired by cooperative file/web caches
– Similar latency tradeoff
– Similar algorithms
[Figure: four cores, each with L1I/L1D caches and a private L2, connected by an on-chip network.]
7
CMP Cooperative Caching
• "Private" L2 caches reduce access latency.
• A central coherence engine (CCE) with duplicated tags maintains on-chip coherence.
• Spilling: evicted blocks are forwarded to other caches for more efficient use of cache space (N-chance forwarding mechanism).
[Figure: per-core L1s and private L2s connected through a bus/interconnect to the CCE and main memory.]
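To make the spilling mechanism concrete, here is a minimal Python sketch of N-chance forwarding, assuming a toy LRU cache model; the class and field names (Block, Cache, recirculation_count) are illustrative, not from the chapter:

```python
import random

N_CHANCE = 2  # a block may be spilled at most N times before leaving the chip

class Block:
    def __init__(self, tag):
        self.tag = tag
        self.recirculation_count = 0

class Cache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = []  # LRU order: index 0 is least recently used

    def insert(self, block):
        victim = self.blocks.pop(0) if len(self.blocks) >= self.capacity else None
        self.blocks.append(block)  # new block enters at MRU
        return victim  # caller decides whether the victim is spilled or dropped

def spill(block, home, peers):
    """N-chance forwarding: forward an evicted block to a random peer cache."""
    if block.recirculation_count >= N_CHANCE:
        return  # no chances left: the block leaves the chip (write back if dirty)
    block.recirculation_count += 1
    recipient = random.choice([c for c in peers if c is not home])
    recipient.insert(block)
```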
8
Distributed Cooperative Caching
• Objective: keep the benefits of Cooperative Caching while improving scalability and reducing energy consumption.
• Distributed directory with a different tag-allocation mechanism.
[Figure: per-core L1s and private L2s connected through a bus/interconnect to distributed coherence engines (DCEs) and main memory.]
9
Tag Structure Comparison
10
Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
1) Latency Reduction
2) Adaptive Repartitioning
3) Performance Isolation
4. Conclusions
11
3. Applications of CMP Cooperative Caching
Several techniques have been proposed that take advantage of Cooperative Caching for CMPs:
1. For Latency Reduction – Cooperation Throttling
2. For Adaptive Repartitioning – Elastic Cooperative Caching
3. For Performance Isolation – Cooperative Cache Partitioning
12
Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
1) Latency Reduction
2) Adaptive Repartitioning
3) Performance Isolation
4. Conclusions
13
Policies to Reduce Off-chip Accesses
• Cooperation policies for capacity sharing
– (1) Cache-to-cache transfers of clean data
– (2) Replication-aware replacement
– (3) Global replacement of inactive data
• Implemented by two unified techniques
– Policies enforced by cache replacement/placement
– Information/data exchange supported by modifying the coherence protocol
14
Policy (1) - Make use of all on-chip data
• Don’t go off-chip if on-chip (clean) data exist
• Beneficial and practical for CMPs
– A peer cache is much closer than next-level storage
– Affordable implementations of "clean ownership"
• Important for all workloads
– Multi-threaded: (mostly) read-only shared data
– Single-threaded: spill into peer caches for later reuse
15
Policy (2) – Control replication
• Intuition – increase the number of unique on-chip data blocks
• Latency/capacity tradeoff
– Evict "singlets" (blocks with no on-chip replica) only when no replicas exist
– Modify the default cache replacement policy
• "Spill" an evicted singlet into a peer cache
– Can further reduce on-chip replication
– Randomly choose a recipient cache for spilling
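A sketch of replication-aware victim selection consistent with the tradeoff above; in a real design the singlet/replica status would come from the coherence engine, and the is_singlet flag here is an illustrative assumption:

```python
def choose_victim(cache_set):
    """Replication-aware replacement: prefer evicting a replicated block
    (another on-chip copy survives) over a singlet (the only on-chip copy).
    cache_set is ordered LRU-first."""
    replicas = [b for b in cache_set if not b.is_singlet]
    if replicas:
        return replicas[0]  # least recently used replica
    return cache_set[0]     # no replicas in this set: evict the LRU singlet
```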
16
Policy (3) - Global cache management
• Approximate global-LRU replacement
– Combine global spill/reuse history with local LRU
• Identify and replace globally inactive data
– A block first becomes the LRU entry in its local cache
– It is set as MRU if spilled into a peer cache
– If it later becomes the LRU entry again, it is evicted globally
• 1-chance forwarding (1-Fwd)
– Blocks can only be spilled once if not reused
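A minimal sketch of how 1-Fwd could fold into the eviction path, assuming each block carries a spilled_once flag and a reuse bit; all names here are illustrative:

```python
import random

def evict_globally(block):
    pass  # placeholder: write back to memory if dirty, then drop

def spill_to_peer(block, home, peers):
    recipient = random.choice([c for c in peers if c is not home])
    recipient.insert(block)  # spilled blocks enter the recipient at MRU

def on_local_lru_eviction(block, home, peers):
    """1-chance forwarding: a block evicted from its home cache is spilled
    once; if it reaches LRU again without reuse, it is evicted globally."""
    if block.spilled_once and not block.reused_since_spill:
        evict_globally(block)  # globally inactive data leaves the chip
    else:
        block.spilled_once = True
        block.reused_since_spill = False
        spill_to_peer(block, home, peers)
```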
17
Cooperation Throttling
• Why throttling?
– Further tradeoff between capacity and latency
• Two probabilities to help make decisions
– Cooperation probability: controls replication
– Spill probability: throttles spilling
[Figure: a design spectrum from shared to private caches; cooperative caching at 100% cooperation (CC 100%) sits toward the shared end, while CC 0% (policy (1) only) approaches a private design.]
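A sketch of how the two probabilities could gate the cooperative actions, reusing the choose_victim and spill_to_peer helpers from the earlier sketches; the hook names are illustrative:

```python
import random

def on_replacement(cache_set, cooperation_prob):
    """With probability cooperation_prob, use replication-aware replacement;
    otherwise fall back to plain local LRU (private-cache behavior)."""
    if random.random() < cooperation_prob:
        return choose_victim(cache_set)  # replication-aware (policy 2)
    return cache_set[0]                  # plain LRU victim

def on_singlet_eviction(block, home, peers, spill_prob):
    """With probability spill_prob, spill the evicted singlet to a peer."""
    if random.random() < spill_prob:
        spill_to_peer(block, home, peers)

# Both probabilities at 1.0 push CC toward shared-cache behavior;
# both at 0.0 degenerate to private caches with policy (1) only.
```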
18
Performance Evaluation
[Figure: normalized performance of Private, Shared, CC, and VR configurations. Left chart (60%–160% scale): SPECOMP and single-threaded workloads – ammp, applu, apsi, art, equake, fma3d, mgrid, swim, wupwise, gcc, mcf, perlbmk, vpr – and their harmonic mean. Right chart (60%–120% scale): OLTP, Apache, JBB, Zeus, Mix1–Mix4, Rate1–Rate2, and their harmonic mean.]
19
Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
1) Latency Reduction
2) Adaptive Repartitioning
3) Performance Isolation
4. Conclusions
20
CC for Adaptive Repartitioning
• Tradeoff between minimizing off-chip misses and avoiding inter-thread interference.
• Elastic Cooperative Caching adapts caches according to application requirements.
– High cache requirements -> big local private cache
– Low data reuse -> cache space is reassigned to increase the size of the global shared cache for spilled blocks
21
Elastic Cooperative Caching structure
• Each node has a local cache split into shared and private partitions, with an independent repartitioning unit; only the local core can allocate into the private partition.
• Distributed Coherence Engines (DCEs) maintain coherence.
• The shared partition allocates evicted blocks from all private regions.
• A spilled-block allocator distributes blocks evicted from the private partition among nodes.
• Every N cycles, the repartitioning unit resizes the partitions based on LRU hits in the shared and private partitions.
22
Repartitioning Unit Working Example
• Independent partitioning per node, distributed structure.
• Request hits in the shared and private partitions update a counter.
• At each repartition cycle:
– Counter > HT (high threshold): private ways++
– Counter < LT (low threshold): shared ways++
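A sketch of the counter-and-threshold mechanism above; the interpretation that private-partition hits increment the counter and shared-partition hits decrement it, plus the HT/LT values, are assumptions for illustration:

```python
class RepartitioningUnit:
    """Per-node unit that periodically rebalances private vs. shared ways."""
    HT = 12   # high threshold (illustrative value)
    LT = 4    # low threshold (illustrative value)

    def __init__(self, total_ways):
        self.total_ways = total_ways
        self.private_ways = total_ways // 2
        self.counter = 0

    def on_hit(self, in_private_partition):
        # Count hits that argue for a larger private partition.
        self.counter += 1 if in_private_partition else -1

    def on_repartition_cycle(self):
        if self.counter > self.HT and self.private_ways < self.total_ways:
            self.private_ways += 1   # grow private, shrink shared
        elif self.counter < self.LT and self.private_ways > 0:
            self.private_ways -= 1   # grow shared, shrink private
        self.counter = 0             # start the next measurement interval
```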
23
Spilled Allocator Working Example
• Only data from the private region can be spilled.
• No need for perfectly updated information; allocation is out of the critical path.
[Figure: nodes broadcast their private/shared way counts, which the spilled allocator uses to pick recipients.]
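One plausible allocation rule consistent with the slide, sketched in Python: weight spill targets by how many shared ways each node currently exposes. The weighting heuristic and the node fields are assumptions for illustration:

```python
import random

def choose_spill_target(nodes, home):
    """Pick a recipient for a spilled block, favoring nodes that currently
    devote more ways to their shared region (assumed heuristic)."""
    candidates = [n for n in nodes if n is not home]
    weights = [n.total_ways - n.private_ways for n in candidates]  # shared ways
    if sum(weights) == 0:
        return None  # nobody exposes shared capacity right now: drop the block
    return random.choices(candidates, weights=weights, k=1)[0]
```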
24
Performance and energy-efficiency evaluation
[Figure: Elastic Cooperative Caching achieves +12% (performance) and +24% (energy-efficiency) over ASR.]
25
Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
1) Latency Reduction
2) Adaptive Repartitioning
3) Performance Isolation
4. Conclusions
26
Time-share Based Partitioning
• Throughput-fairness dilemma
– Cooperation: taking turns to speed up
– Multiple time-sharing partitions (MTP)
• QoS guarantee
– Cooperatively shrink/expand across MTPs
– Bound average slowdown over the long term
[Figure: four cores (P1–P4) time-sharing cache partitions over time. Both schemes achieve the same IPC (0.52) and WS (2.42), while QoS improves from -0.52 to 0 and FS from 1.22 to 1.97; the fairness improvement and QoS guarantee are reflected by the higher FS and bounded QoS values.]
27
MTP Benefits
• Better than a single spatial partition (SSP)
– MTP/long-termQoS performs almost the same as MTP/noQoS
• Offline analysis based on profile info, 210 workloads (job mixes)
[Figure: percentage of workloads (0%–100%) achieving various Fair Speedup values (1.00–2.00), comparing MTP/noQoS, MTP/long-termQoS, SSP/noQoS, MTP/real-timeQoS, and SSP/real-timeQoS.]
28
Better than MTP
• MTP issues
– Not needed if LRU performs better (LRU is often near-optimal [Stone et al., IEEE TOC '92])
– Partitioning is more complex than SSP
• Cooperative Cache Partitioning (CCP)
– Integration with Cooperative Caching (CC)
– Exploits CC's latency and LRU-based sharing benefits
– Simplifies the partitioning algorithm
– Total execution time = Epochs(CC) + Epochs(MTP), weighted by the number of threads benefiting from CC vs. MTP
29
Partitioning Heuristic
• When is MTP better than CC?
– QoS: ∑speedup > ∑slowdown (over N partitions)
– The speedup should be large; CC is already good at fine-grained tuning
• Thrashing test: Speedup > (N-1) × Slowdown
[Figure: normalized throughput vs. allocated cache ways (16-way total, 4-core) for a thrashing workload; shrinking from the baseline to C_shrink costs a slowdown, expanding to C_expand yields a speedup, and the thrashing test passes when Speedup > (N-1) × Slowdown.]
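The thrashing test transcribed directly into code; n_partitions is the N above, and the speedup/slowdown inputs would be estimated from the throughput-vs-ways profile:

```python
def passes_thrashing_test(speedup, slowdown, n_partitions):
    """MTP lets one thread expand (speedup) while the other N-1 partitions
    shrink (slowdown each); time-sharing pays off only when the gain
    outweighs the combined losses."""
    return speedup > (n_partitions - 1) * slowdown
```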
30
Partitioning Algorithm
1. S = all threads minus supplier threads (e.g., gcc, swim)
• Allocate suppliers their gPar (guaranteed partition, i.e., the minimum capacity needed for QoS) [Yeh/Reinman CASES '05]
• For threads in S, initialize their C_expand and C_shrink
2. Run the thrashing_test iteratively for each thread in S
• If thread t fails, allocate t its gPar and remove t from S
• Update C_expand and C_shrink for the other threads in S
3. Repeat until S is empty or all threads in S pass the test
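A schematic of the three steps above, reusing passes_thrashing_test from the previous sketch; guaranteed_partition, estimate_expand_shrink, and the per-thread speedup_at/slowdown_at estimators are hypothetical helpers standing in for the profile-derived quantities:

```python
def guaranteed_partition(thread):
    return thread.gpar  # assumed attribute: minimum capacity needed for QoS

def estimate_expand_shrink(thread):
    return thread.c_expand, thread.c_shrink  # assumed profile-derived values

def partition(threads, n_partitions):
    """Sketch of the slide's three steps: suppliers get their gPar up front;
    the rest (S) are tested iteratively and demoted to gPar on failure."""
    allocation = {}
    S = [t for t in threads if not t.is_supplier]            # step 1
    for t in threads:
        if t.is_supplier:
            allocation[t] = guaranteed_partition(t)
    expand_shrink = {t: estimate_expand_shrink(t) for t in S}

    changed = True
    while changed and S:                                     # steps 2-3
        changed = False
        for t in list(S):
            c_expand, c_shrink = expand_shrink[t]
            speedup = t.speedup_at(c_expand)                 # assumed estimator
            slowdown = t.slowdown_at(c_shrink)               # assumed estimator
            if not passes_thrashing_test(speedup, slowdown, n_partitions):
                allocation[t] = guaranteed_partition(t)      # t fails: gPar
                S.remove(t)
                expand_shrink = {u: estimate_expand_shrink(u) for u in S}
                changed = True
    return allocation, S  # threads left in S time-share via MTP
```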
31
Fair Speedup Results
• Two groups of workloads
– PAR (67 of 210 workloads): MTP better than CC (partitioning helps)
– LRU (143 of 210 workloads): CC better than MTP (partitioning hurts)
[Figure: percentage of workloads achieving various Fair Speedup values, shown separately for the PAR and LRU groups.]
32
Performance and QoS evaluation
33
Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
1) Latency Reduction
2) Adaptive Repartitioning
3) Performance Isolation
4. Conclusions
34
4. Conclusions
• The on-chip cache hierarchy plays an important role in CMPs and must provide fast and fair data access for multiple, competing processor cores.
• Cooperative Caching for CMPs is an effective framework for managing caches in such environments.
• Cooperative sharing mechanisms and the philosophy of using cooperation for conflict resolution can be applied to many other resource management problems.