Better than the Two: Exceeding Private and Shared Caches
via Two-Dimensional Page Coloring
Lei Jin and Sangyeun Cho
Dept. of Computer Science, University of Pittsburgh
CMPMSI’07 02/11/07
Multicore distributed L2 caches
L2 caches typically sub-banked and distributed
• IBM Power4/5: 3 banks
• Sun Microsystems T1: 4 banks
• Intel Itanium2 (L3): many “sub-arrays”
(Distributed L2 caches + switched NoC) → NUCA
Hardware-based management schemes
• Private caching
• Shared caching
• Hybrid caching
[Figure: a tile containing a processor core, a local L2 cache, and a router]
Private and shared caching
Private caching:
short hit latency (always local)
high on-chip miss rate
long miss resolution time
complex coherence enforcement
Shared caching:
low on-chip miss rate
straightforward data location
simple coherence (no replication)
long average hit latency
Other approaches
Hybrid/flexible schemes
• “Core clustering” [Speight et al., ISCA2005]
• “Flexible CMP cache sharing” [Huh et al., ICS2004]
• “Flexible bank mapping” [Liu et al., HPCA2004]
Improving shared caching
• “Victim replication” [Zhang and Asanovic, ISCA2005]
Improving private caching
• “Cooperative caching” [Chang and Sohi, ISCA2006]
• “CMP-NuRAPID” [Chishti et al., ISCA2005]
Motivation
Miss rate
Hit latency
What is the optimal balance between miss rate and hit latency?
Talk roadmap
Data mapping, a key property [Cho and Jin, MICRO2006]
Two-dimensional (2D) page coloring algorithm
Evaluation and results
Conclusion and future work
Data mapping
Data mapping
• Memory data location in L2 cache
Private caching
• Data mapping determined by program location
• Mapping created at miss time
• No explicit control
Shared caching
• Data mapping determined by address
  slice number = (block address) % (N slice)
• Mapping is static
• No explicit control
Change mapping granularity
Block granularity: slice number = (block address) % (N slice)
Page granularity: slice number = (page address) % (N slice)
[Figure: pages mapped to L2 slices]
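The two mapping formulas can be compared with a minimal sketch. The block size, page size, and slice count below are assumptions chosen for illustration (the slides do not fix the first two), and the function names are hypothetical.

```python
# Assumed parameters for illustration only.
BLOCK_SIZE = 64    # bytes per cache block (assumption)
PAGE_SIZE = 4096   # bytes per page (assumption)
N_SLICE = 16       # number of L2 slices (assumption)


def slice_by_block(addr: int) -> int:
    """Block granularity: slice number = (block address) % (N slice)."""
    return (addr // BLOCK_SIZE) % N_SLICE


def slice_by_page(addr: int) -> int:
    """Page granularity: slice number = (page address) % (N slice)."""
    return (addr // PAGE_SIZE) % N_SLICE


# Two consecutive blocks that belong to the same page.
a, b = 0x12340, 0x12380
print(slice_by_block(a), slice_by_block(b))  # 13 14 (neighbouring blocks scatter)
print(slice_by_page(a), slice_by_page(b))    # 2 2   (the whole page stays together)
```

At page granularity every block of a page lands in one slice, which is what lets a page-level allocator decide where data lives.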
OS controlled page mapping
[Figure: memory pages of Program 1 and Program 2 in the virtual address space are mapped into the physical address space by OS page allocation]
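A minimal sketch of how an OS allocator could steer a page to a chosen slice, assuming the page-granularity mapping (slice number = physical page number % N slice). The class `ColoredAllocator`, the pool size, and the slice count are hypothetical and not taken from the talk.

```python
from collections import defaultdict, deque

N_SLICE = 16  # number of L2 slices, i.e. page colors (assumption)


class ColoredAllocator:
    """Toy free-page pool bucketed by color = physical page number % N_SLICE."""

    def __init__(self, free_pages):
        self.free = defaultdict(deque)
        for ppn in free_pages:
            self.free[ppn % N_SLICE].append(ppn)

    def alloc(self, color: int) -> int:
        """Return a free physical page that maps to the wanted slice."""
        if not self.free[color]:
            raise MemoryError(f"no free page of color {color}")
        return self.free[color].popleft()


pool = ColoredAllocator(range(64))   # 64 free physical pages (assumption)
page_table = {}                      # virtual page number -> physical page number
page_table[0x40] = pool.alloc(5)     # place virtual page 0x40 on slice 5
page_table[0x41] = pool.alloc(5)     # a second virtual page on the same slice
print(page_table)                    # {64: 5, 65: 21}
```

The virtual-to-physical choice is the only control point used here: no hardware change is needed to move a page to a different slice.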
2D page coloring: the problem
Network latency / hop = 3 cycles
Memory latency = 300 cycles
Cost(color #) = (# access x # hop x 3 cycles) + (# miss x 300 cycles)
[Figure: candidate placements of a page around core P, each labeled with # access, # miss, and cost]
# access   # miss   cost
500        30       9000
500        3        6900
500        10       9000
500        7        8100
500        12       9600
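The cost figures on this slide can be reproduced with a short calculation. The hop counts are not readable from the extracted slide; the values below (0 for the first candidate, 4 for the others) are an assumption, chosen because they yield exactly the listed costs when access/miss pairs are matched to costs in order.

```python
HOP_LATENCY = 3    # cycles per network hop (from the slide)
MEM_LATENCY = 300  # cycles per memory access (from the slide)


def cost(n_access: int, n_hop: int, n_miss: int) -> int:
    """Cost(color #) = (# access x # hop x 3 cycles) + (# miss x 300 cycles)."""
    return n_access * n_hop * HOP_LATENCY + n_miss * MEM_LATENCY


# (# access, # hop, # miss); hop counts are inferred, not read off the slide.
candidates = [(500, 0, 30), (500, 4, 3), (500, 4, 10), (500, 4, 7), (500, 4, 12)]
costs = [cost(*c) for c in candidates]
print(costs)       # [9000, 6900, 9000, 8100, 9600]
print(min(costs))  # 6900
```

Under these assumptions the cheapest placement is a remote one: paying 4 hops on every access is offset by cutting misses from 30 to 3.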
2D coloring algorithm
Collect L2 reference trace
Derive conflict information [Sherwood et al., ICS1999]
Reference 1: Page A
Reference 2: Page B
Reference 3: Page B
Reference 4: Page C
2D coloring algorithm (cont’d)
Derive conflict information
Reference 1: Page A
Reference Matrix
  A B C
A 0 0 0
B 0 0 0
C 0 0 0
Conflict Matrix
  A B C
A 0 0 0
B 0 0 0
C 0 0 0
[Animation: rows B and C of the reference matrix get a 1 in column A]
2D coloring algorithm (cont’d)
Derive conflict information
Reference 1: Page A
Reference Matrix
  A B C
A 0 0 0
B 1 0 0
C 1 0 0
Conflict Matrix
  A B C
A 0 0 0
B 0 0 0
C 0 0 0
2D coloring algorithm (cont’d)
Derive conflict information
Reference 1: Page A
Reference 2: Page B
Reference Matrix
  A B C
A 0 0 0
B 1 0 0
C 1 0 0
Conflict Matrix
  A B C
A 0 0 0
B 0 0 0
C 0 0 0
[Animation: rows A and C of the reference matrix get a 1 in column B]
2D coloring algorithm (cont’d)
Derive conflict information
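Reference 1: Page A
Reference 2: Page B
Reference Matrix
  A B C
A 0 1 0
B 1 0 0
C 1 1 0
Conflict Matrix
  A B C
A 0 0 0
B 0 0 0
C 0 0 0
[Animation: row B of the reference matrix holds a 1 in column A, so the conflict matrix entry (B, A) gets +1 and the reference matrix entry is reset to 0]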
2D coloring algorithm (cont’d)
Derive conflict information
Reference 1: Page A
Reference 2: Page B
Reference 3: Page B
Reference Matrix
  A B C
A 0 1 0
B 0 0 0
C 1 1 0
Conflict Matrix
  A B C
A 0 0 0
B 1 0 0
C 0 0 0
2D coloring algorithm (cont’d)
Derive conflict information
Reference 1: Page A
Reference 2: Page B
Reference 3: Page B
Reference 4: Page C
Reference Matrix
  A B C
A 0 1 0
B 0 0 0
C 1 1 0
Conflict Matrix
  A B C
A 0 0 0
B 1 0 0
C 0 0 0
[Animation: rows A and B of the reference matrix get a 1 in column C]
2D coloring algorithm (cont’d)
Derive conflict information
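Reference 1: Page A
Reference 2: Page B
Reference 3: Page B
Reference 4: Page C
Reference Matrix
  A B C
A 0 1 1
B 0 0 1
C 1 1 0
Conflict Matrix
  A B C
A 0 0 0
B 1 0 0
C 0 0 0
[Animation: row C of the reference matrix holds 1s in columns A and B, so the conflict matrix entries (C, A) and (C, B) each get +1 and the reference matrix entries are reset to 0]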
2D coloring algorithm (cont’d)
2D Page coloring
Reference 1: Page A
Reference 2: Page B
Reference 3: Page B
Reference 4: Page C
Reference Matrix
  A B C
A 0 1 1
B 0 0 1
C 0 0 0
Conflict Matrix
  A B C
A 0 0 0
B 1 0 0
C 1 1 0
Access Counter
A B C
1 2 1
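The matrix updates walked through on the preceding slides can be sketched as follows. This is a reading of the slide animation, not the authors' code: on each reference to page p, the 1s in row p of the reference matrix are added to row p of the conflict matrix, row p is cleared, column p is set to 1 for every other page, and the access counter of p is incremented. The function name `derive_conflicts` is hypothetical.

```python
def derive_conflicts(trace, pages):
    """Build reference matrix, conflict matrix, and access counters from a trace."""
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    ref = [[0] * n for _ in range(n)]
    conflict = [[0] * n for _ in range(n)]
    access = [0] * n
    for page in trace:
        p = idx[page]
        access[p] += 1
        # Pages referenced since the last access to p count as conflicts with p.
        for q in range(n):
            conflict[p][q] += ref[p][q]
            ref[p][q] = 0
        # Record, for every other page, that p has now been referenced.
        for q in range(n):
            if q != p:
                ref[q][p] = 1
    return ref, conflict, access


ref, conflict, access = derive_conflicts(["A", "B", "B", "C"], ["A", "B", "C"])
print(ref)       # [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
print(conflict)  # [[0, 0, 0], [1, 0, 0], [1, 1, 0]]
print(access)    # [1, 2, 1]
```

The output matches the final matrices and access counter on the slide for the trace A, B, B, C.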
2D coloring algorithm (cont’d)
2D Page coloring
Conflict Matrix
  A B C
A 0 0 0
B 1 0 0
C 1 1 0
Access Counter
A B C
1 2 1
Cost(color, page#) = (α x #Conflict(color) x mem latency) + ((1-α) x #Access x #hop(color) x hop delay)
Optimal color(page#) = {C | Cost(C) = MIN[Cost(color, page#)] for all colors}
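A minimal sketch of the color selection, assuming α weights the conflict term and (1-α) the network term, and using the 300-cycle and 3-cycle latencies quoted earlier in the talk. The slides do not say how #Conflict(color) is accumulated, so per-color conflict counts are passed in as given; the inputs below are hypothetical.

```python
MEM_LATENCY = 300  # cycles (value quoted earlier in the talk)
HOP_DELAY = 3      # cycles per hop (value quoted earlier in the talk)


def cost(alpha, n_conflict, n_access, n_hop):
    """Weighted sum of the conflict (memory) term and the network (hop) term."""
    return (alpha * n_conflict * MEM_LATENCY
            + (1 - alpha) * n_access * n_hop * HOP_DELAY)


def optimal_color(alpha, n_access, conflicts_by_color, hops_by_color):
    """Return the color with the minimum cost for one page."""
    colors = range(len(hops_by_color))
    return min(colors, key=lambda c: cost(alpha, conflicts_by_color[c],
                                          n_access, hops_by_color[c]))


# Hypothetical page: 500 accesses, three candidate colors.
conflicts = [30, 3, 10]  # #Conflict(color), assumed
hops = [0, 4, 4]         # #hop(color), assumed
print(optimal_color(0.5, 500, conflicts, hops))     # 1 (fewer conflicts wins)
print(optimal_color(1 / 64, 500, conflicts, hops))  # 0 (the local slice wins)
```

Lowering α shifts the choice toward nearby slices, which is the knob the evaluation tunes.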
Experiments setup
Experiments were carried out using a simulator derived from the SimpleScalar toolset.
The simulator models a 16-core tile-based CMP.
Each core has private 32KB I/D L1 caches and a 256KB slice of the globally shared L2 (4MB total).
[Figure: experiment flow: Profiling → (Trace) → 2D coloring → (Page mapping) → Timing simulation, with a Tuning α step]
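For the #hop term in such a tiled setup, a rough sketch of the hop count between tiles is given below. It assumes the 16 tiles form a 4 x 4 mesh with row-major numbering and that hop count is the Manhattan distance; neither detail is stated on the slide, and the helper names are hypothetical.

```python
N_TILES = 16
MESH_WIDTH = 4       # 4 x 4 arrangement (assumption)
L2_SLICE_KB = 256    # per-tile L2 slice size (from the slide)


def tile_xy(tile: int):
    """Row-major tile number -> (x, y) mesh coordinates (assumed numbering)."""
    return tile % MESH_WIDTH, tile // MESH_WIDTH


def hops(src: int, dst: int) -> int:
    """Manhattan distance between two tiles (assumed routing metric)."""
    sx, sy = tile_xy(src)
    dx, dy = tile_xy(dst)
    return abs(sx - dx) + abs(sy - dy)


print(N_TILES * L2_SLICE_KB // 1024)  # 4 (MB of L2 in total)
print(hops(5, 5))                     # 0 (local slice)
print(hops(0, 15))                    # 6 (corner to corner)
```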
Optimal page mapping
[Figure: optimal page mapping for gcc; two plots of # of pages per tile position (x, y), one for α = 1/64 and one for α = 1/256]
Access distribution
[Figure: access distribution; two plots on a 0% to 100% axis, one splitting accesses into miss and hit, the other into remote and local, for α from 1/32 to 1/2048]
Conclusions
With careful data placement, there is substantial room for performance improvement.
Dynamic mapping schemes that use hardware-assisted information may achieve similar performance improvement.
This method can also be applied to other optimization targets.
Current and future work
Dynamic mapping schemes
• Performance
• Power
Multiprogrammed and parallel workloads
Private caching
1. L1 miss
2. L2 access
• Hit
• Miss
3. Access directory
• A copy on chip
• Global miss
[Figure labels: L1 miss, Local L2 access]
short hit latency (always local)
high on-chip miss rate
long miss resolution time
complex coherence enforcement
Shared caching
1. L1 miss
2. L2 access
• Hit
• Miss
[Figure label: L1 miss]
low on-chip miss rate
straightforward data location
simple coherence (no replication)
long average hit latency
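The private and shared lookup sequences can be contrasted in a toy model. This is an illustrative sketch only: caches are modeled as Python sets of block numbers, the directory as a dict from block number to the set of tiles holding a copy, and all names are hypothetical.

```python
def private_lookup(block, local_l2, directory):
    """Private caching: local L2 first, then the directory on a miss."""
    if block in local_l2:
        return "local L2 hit"
    if directory.get(block):      # some other tile holds a copy
        return "copy on chip"
    return "global miss"


def shared_lookup(block, slices):
    """Shared caching: the address alone selects the home slice."""
    home = block % len(slices)
    if block in slices[home]:
        return f"hit in slice {home}"
    return "miss"


local_l2 = {10, 11}
directory = {12: {3}}                  # block 12 is cached by tile 3
print(private_lookup(10, local_l2, directory))  # local L2 hit
print(private_lookup(12, local_l2, directory))  # copy on chip
print(private_lookup(99, local_l2, directory))  # global miss

slices = [set() for _ in range(16)]
slices[7].add(23)                      # 23 % 16 == 7
print(shared_lookup(23, slices))       # hit in slice 7
print(shared_lookup(24, slices))       # miss
```

The extra directory step in the private path is what the slide summarizes as long miss resolution time, while the shared path has one fixed location to check but that location is usually remote.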