Cooperative Caching for Chip Multiprocessors
Jichuan Chang†, Enric Herrero‡, Ramon Canal‡ and Gurindar S. Sohi*
HP Labs†
Universitat Politècnica de Catalunya‡
University of Wisconsin-Madison*
M. S. Obaidat and S. Misra (Editors), Chapter 13, Cooperative Networking (Wiley)
2
Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
1) Latency Reduction
2) Adaptive Repartitioning
3) Performance Isolation
4. Conclusions
3
Motivation – Background
• Chip multiprocessors (CMPs) both require and enable innovative on-chip cache designs
• Critical for CMPs
– Processor/memory gap
– Limited pin-bandwidth
• Current designs
– Shared cache: sharing can lead to contention
– Private caches: isolation can waste resources
[Figure: a 4-core CMP connected to memory through a slow, narrow off-chip interface; a shared cache suffers capacity contention, while private caches leave capacity wasted.]
4
Motivation – Challenges
• Key challenges
– Growing on-chip wire delay
– Expensive off-chip accesses
– Destructive inter-thread interference
– Diverse workload characteristics
• Three important demands for CMP caching
– Capacity: reduce off-chip accesses
– Latency: reduce remote on-chip references
– Isolation: reduce inter-thread interference
• Need to combine the strengths of both private and shared cache designs
5
Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
1) Latency Reduction
2) Adaptive Repartitioning
3) Performance Isolation
4. Conclusions
6
CMP Cooperative Caching
• Form an aggregate global cache via cooperative private caches
– Use private caches to attract data for fast reuse
– Share capacity through cooperative policies
– Throttle cooperation to find an optimal sharing point
• Inspired by cooperative file/web caches
– Similar latency tradeoff
– Similar algorithms
[Figure: four cores, each with L1I/L1D caches and a private L2, connected by an on-chip network.]
7
CMP Cooperative Caching
• "Private" L2 caches reduce access latency.
• A central coherence engine (CCE) with duplicated tags maintains on-chip coherence.
• Spilling: evicted blocks are forwarded to other caches for more efficient use of cache space (N-chance forwarding mechanism).
[Figure: per-core L1s and private L2s connected through a bus/interconnect to the CCE and main memory.]
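To make the spilling mechanism concrete, here is a minimal Python sketch of N-chance forwarding, assuming a toy LRU cache model; the class and field names (Block, Cache, recirculation_count) are illustrative, not from the chapter:

```python
import random

N_CHANCE = 2  # a block may be spilled at most N times before leaving the chip

class Block:
    def __init__(self, tag):
        self.tag = tag
        self.recirculation_count = 0

class Cache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = []  # LRU order: index 0 is least recently used

    def insert(self, block):
        victim = self.blocks.pop(0) if len(self.blocks) >= self.capacity else None
        self.blocks.append(block)  # new block enters at MRU
        return victim  # caller decides whether the victim is spilled or dropped

def spill(block, home, peers):
    """N-chance forwarding: forward an evicted block to a random peer cache."""
    if block.recirculation_count >= N_CHANCE:
        return  # no chances left: the block leaves the chip (write back if dirty)
    block.recirculation_count += 1
    recipient = random.choice([c for c in peers if c is not home])
    recipient.insert(block)
```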
8
Distributed Cooperative Caching
• Objective: keep the benefits of Cooperative Caching while improving scalability and reducing energy consumption.
• Distributed directory with a different tag-allocation mechanism.
[Figure: per-core L1s and private L2s connected through a bus/interconnect to distributed coherence engines (DCEs) and main memory.]
9
Tag Structure Comparison
10
Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
1) Latency Reduction
2) Adaptive Repartitioning
3) Performance Isolation
4. Conclusions
11
3. Applications of CMP Cooperative Caching
Several techniques have been proposed that take advantage of Cooperative Caching for CMPs:
1. For Latency Reduction – Cooperation Throttling
2. For Adaptive Repartitioning – Elastic Cooperative Caching
3. For Performance Isolation – Cooperative Cache Partitioning
12
Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
1) Latency Reduction
2) Adaptive Repartitioning
3) Performance Isolation
4. Conclusions
13
Policies to Reduce Off-chip Accesses
• Cooperation policies for capacity sharing
– (1) Cache-to-cache transfers of clean data
– (2) Replication-aware replacement
– (3) Global replacement of inactive data
• Implemented by two unified techniques
– Policies enforced by cache replacement/placement
– Information/data exchange supported by modifying the coherence protocol
14
Policy (1) - Make use of all on-chip data
• Don’t go off-chip if on-chip (clean) data exist
• Beneficial and practical for CMPs
– A peer cache is much closer than next-level storage
– Affordable implementations of "clean ownership"
• Important for all workloads
– Multi-threaded: (mostly) read-only shared data
– Single-threaded: spill into peer caches for later reuse
15
Policy (2) – Control replication
• Intuition – increase the number of unique on-chip data blocks
• Latency/capacity tradeoff
– Evict "singlets" (blocks with no on-chip replica) only when no replicas exist
– Modify the default cache replacement policy
• "Spill" an evicted singlet into a peer cache
– Can further reduce on-chip replication
– Randomly choose a recipient cache for spilling
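A sketch of replication-aware victim selection consistent with the tradeoff above; in a real design the singlet/replica status would come from the coherence engine, and the is_singlet flag here is an illustrative assumption:

```python
def choose_victim(cache_set):
    """Replication-aware replacement: prefer evicting a replicated block
    (another on-chip copy survives) over a singlet (the only on-chip copy).
    cache_set is ordered LRU-first."""
    replicas = [b for b in cache_set if not b.is_singlet]
    if replicas:
        return replicas[0]  # least recently used replica
    return cache_set[0]     # no replicas in this set: evict the LRU singlet
```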
16
Policy (3) - Global cache management
• Approximate global-LRU replacement
– Combine global spill/reuse history with local LRU
• Identify and replace globally inactive data
– A block first becomes the LRU entry in its local cache
– It is set as MRU if spilled into a peer cache
– If it later becomes the LRU entry again, it is evicted globally
• 1-chance forwarding (1-Fwd)
– Blocks can only be spilled once if not reused
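A minimal sketch of how 1-Fwd could fold into the eviction path, assuming each block carries a spilled_once flag and a reuse bit; all names here are illustrative:

```python
import random

def evict_globally(block):
    pass  # placeholder: write back to memory if dirty, then drop

def spill_to_peer(block, home, peers):
    recipient = random.choice([c for c in peers if c is not home])
    recipient.insert(block)  # spilled blocks enter the recipient at MRU

def on_local_lru_eviction(block, home, peers):
    """1-chance forwarding: a block evicted from its home cache is spilled
    once; if it reaches LRU again without reuse, it is evicted globally."""
    if block.spilled_once and not block.reused_since_spill:
        evict_globally(block)  # globally inactive data leaves the chip
    else:
        block.spilled_once = True
        block.reused_since_spill = False
        spill_to_peer(block, home, peers)
```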
17
Cooperation Throttling
• Why throttling?
– Further tradeoff between capacity and latency
• Two probabilities to help make decisions
– Cooperation probability: controls replication
– Spill probability: throttles spilling
[Figure: a design spectrum from shared to private caches; cooperative caching at 100% cooperation (CC 100%) sits toward the shared end, while CC 0% (policy (1) only) approaches a private design.]
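A sketch of how the two probabilities could gate the cooperative actions, reusing the choose_victim and spill_to_peer helpers from the earlier sketches; the hook names are illustrative:

```python
import random

def on_replacement(cache_set, cooperation_prob):
    """With probability cooperation_prob, use replication-aware replacement;
    otherwise fall back to plain local LRU (private-cache behavior)."""
    if random.random() < cooperation_prob:
        return choose_victim(cache_set)  # replication-aware (policy 2)
    return cache_set[0]                  # plain LRU victim

def on_singlet_eviction(block, home, peers, spill_prob):
    """With probability spill_prob, spill the evicted singlet to a peer."""
    if random.random() < spill_prob:
        spill_to_peer(block, home, peers)

# Both probabilities at 1.0 push CC toward shared-cache behavior;
# both at 0.0 degenerate to private caches with policy (1) only.
```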
18
Performance Evaluation
[Figure: normalized performance of Private, Shared, CC, and VR configurations. Left chart (60%–160% scale): SPECOMP and single-threaded workloads – ammp, applu, apsi, art, equake, fma3d, mgrid, swim, wupwise, gcc, mcf, perlbmk, vpr – and their harmonic mean. Right chart (60%–120% scale): OLTP, Apache, JBB, Zeus, Mix1–Mix4, Rate1–Rate2, and their harmonic mean.]
19
Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
1) Latency Reduction
2) Adaptive Repartitioning
3) Performance Isolation
4. Conclusions
20
CC for Adaptive Repartitioning
• Tradeoff between minimizing off-chip misses and avoiding inter-thread interference.
• Elastic Cooperative Caching adapts caches according to application requirements.
– High cache requirements -> big local private cache
– Low data reuse -> cache space is reassigned to increase the size of the global shared cache for spilled blocks
21
Elastic Cooperative Caching structure
• Each node has a local cache split into shared and private partitions, with an independent repartitioning unit; only the local core can allocate into the private partition.
• Distributed Coherence Engines (DCEs) maintain coherence.
• The shared partition allocates evicted blocks from all private regions.
• A spilled-block allocator distributes blocks evicted from the private partition among nodes.
• Every N cycles, the repartitioning unit resizes the partitions based on LRU hits in the shared and private partitions.
22
Repartitioning Unit Working Example
• Independent partitioning per node, distributed structure.
• Request hits in the shared and private partitions update a counter.
• At each repartition cycle:
– Counter > HT (high threshold): private ways++
– Counter < LT (low threshold): shared ways++
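A sketch of the counter-and-threshold mechanism above; the interpretation that private-partition hits increment the counter and shared-partition hits decrement it, plus the HT/LT values, are assumptions for illustration:

```python
class RepartitioningUnit:
    """Per-node unit that periodically rebalances private vs. shared ways."""
    HT = 12   # high threshold (illustrative value)
    LT = 4    # low threshold (illustrative value)

    def __init__(self, total_ways):
        self.total_ways = total_ways
        self.private_ways = total_ways // 2
        self.counter = 0

    def on_hit(self, in_private_partition):
        # Count hits that argue for a larger private partition.
        self.counter += 1 if in_private_partition else -1

    def on_repartition_cycle(self):
        if self.counter > self.HT and self.private_ways < self.total_ways:
            self.private_ways += 1   # grow private, shrink shared
        elif self.counter < self.LT and self.private_ways > 0:
            self.private_ways -= 1   # grow shared, shrink private
        self.counter = 0             # start the next measurement interval
```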
23
Spilled Allocator Working Example
• Only data from the private region can be spilled.
• No need for perfectly updated information; allocation is out of the critical path.
[Figure: nodes broadcast their private/shared way counts, which the spilled allocator uses to pick recipients.]
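One plausible allocation rule consistent with the slide, sketched in Python: weight spill targets by how many shared ways each node currently exposes. The weighting heuristic and the node fields are assumptions for illustration:

```python
import random

def choose_spill_target(nodes, home):
    """Pick a recipient for a spilled block, favoring nodes that currently
    devote more ways to their shared region (assumed heuristic)."""
    candidates = [n for n in nodes if n is not home]
    weights = [n.total_ways - n.private_ways for n in candidates]  # shared ways
    if sum(weights) == 0:
        return None  # nobody exposes shared capacity right now: drop the block
    return random.choices(candidates, weights=weights, k=1)[0]
```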
24
Performance and energy-efficiency evaluation
[Figure: Elastic Cooperative Caching achieves +12% (performance) and +24% (energy-efficiency) over ASR.]
25
Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
1) Latency Reduction
2) Adaptive Repartitioning
3) Performance Isolation
4. Conclusions
26
Time-share Based Partitioning
• Throughput-fairness dilemma
– Cooperation: taking turns to speed up
– Multiple time-sharing partitions (MTP)
• QoS guarantee
– Cooperatively shrink/expand across MTPs
– Bound average slowdown over the long term
[Figure: four cores (P1–P4) time-sharing cache partitions over time. Both schemes achieve the same IPC (0.52) and WS (2.42), while QoS improves from -0.52 to 0 and FS from 1.22 to 1.97; the fairness improvement and QoS guarantee are reflected by the higher FS and bounded QoS values.]
27
MTP Benefits
• Better than a single spatial partition (SSP)
– MTP/long-termQoS performs almost the same as MTP/noQoS
• Offline analysis based on profile info, 210 workloads (job mixes)
[Figure: percentage of workloads (0%–100%) achieving various Fair Speedup values (1.00–2.00), comparing MTP/noQoS, MTP/long-termQoS, SSP/noQoS, MTP/real-timeQoS, and SSP/real-timeQoS.]
28
Better than MTP
• MTP issues
– Not needed if LRU performs better (LRU is often near-optimal [Stone et al., IEEE TOC '92])
– Partitioning is more complex than SSP
• Cooperative Cache Partitioning (CCP)
– Integration with Cooperative Caching (CC)
– Exploits CC's latency and LRU-based sharing benefits
– Simplifies the partitioning algorithm
– Total execution time = Epochs(CC) + Epochs(MTP), weighted by the number of threads benefiting from CC vs. MTP
29
Partitioning Heuristic
• When is MTP better than CC?
– QoS: ∑speedup > ∑slowdown (over N partitions)
– The speedup should be large; CC is already good at fine-grained tuning
• Thrashing test: Speedup > (N-1) × Slowdown
[Figure: normalized throughput vs. allocated cache ways (16-way total, 4-core) for a thrashing workload; shrinking from the baseline to C_shrink costs a slowdown, expanding to C_expand yields a speedup, and the thrashing test passes when Speedup > (N-1) × Slowdown.]
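The thrashing test transcribed directly into code; n_partitions is the N above, and the speedup/slowdown inputs would be estimated from the throughput-vs-ways profile:

```python
def passes_thrashing_test(speedup, slowdown, n_partitions):
    """MTP lets one thread expand (speedup) while the other N-1 partitions
    shrink (slowdown each); time-sharing pays off only when the gain
    outweighs the combined losses."""
    return speedup > (n_partitions - 1) * slowdown
```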
30
Partitioning Algorithm
1. S = all threads minus supplier threads (e.g., gcc, swim)
• Allocate suppliers their gPar (guaranteed partition, i.e., the minimum capacity needed for QoS) [Yeh/Reinman CASES '05]
• For threads in S, initialize their C_expand and C_shrink
2. Run the thrashing_test iteratively for each thread in S
• If thread t fails, allocate t its gPar and remove t from S
• Update C_expand and C_shrink for the other threads in S
3. Repeat until S is empty or all threads in S pass the test
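A schematic of the three steps above, reusing passes_thrashing_test from the previous sketch; guaranteed_partition, estimate_expand_shrink, and the per-thread speedup_at/slowdown_at estimators are hypothetical helpers standing in for the profile-derived quantities:

```python
def guaranteed_partition(thread):
    return thread.gpar  # assumed attribute: minimum capacity needed for QoS

def estimate_expand_shrink(thread):
    return thread.c_expand, thread.c_shrink  # assumed profile-derived values

def partition(threads, n_partitions):
    """Sketch of the slide's three steps: suppliers get their gPar up front;
    the rest (S) are tested iteratively and demoted to gPar on failure."""
    allocation = {}
    S = [t for t in threads if not t.is_supplier]            # step 1
    for t in threads:
        if t.is_supplier:
            allocation[t] = guaranteed_partition(t)
    expand_shrink = {t: estimate_expand_shrink(t) for t in S}

    changed = True
    while changed and S:                                     # steps 2-3
        changed = False
        for t in list(S):
            c_expand, c_shrink = expand_shrink[t]
            speedup = t.speedup_at(c_expand)                 # assumed estimator
            slowdown = t.slowdown_at(c_shrink)               # assumed estimator
            if not passes_thrashing_test(speedup, slowdown, n_partitions):
                allocation[t] = guaranteed_partition(t)      # t fails: gPar
                S.remove(t)
                expand_shrink = {u: estimate_expand_shrink(u) for u in S}
                changed = True
    return allocation, S  # threads left in S time-share via MTP
```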
31
Fair Speedup Results
• Two groups of workloads
– PAR (67 of 210 workloads): MTP better than CC (partitioning helps)
– LRU (143 of 210 workloads): CC better than MTP (partitioning hurts)
[Figure: percentage of workloads achieving various Fair Speedup values, shown separately for the PAR and LRU groups.]
32
Performance and QoS evaluation
33
Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
1) Latency Reduction
2) Adaptive Repartitioning
3) Performance Isolation
4. Conclusions
34
4. Conclusions
• The on-chip cache hierarchy plays an important role in CMPs and must provide fast and fair data access for multiple, competing processor cores.
• Cooperative Caching for CMPs is an effective framework for managing caches in such environments.
• Cooperative sharing mechanisms and the philosophy of using cooperation for conflict resolution can be applied to many other resource management problems.