HPCA 2011
ACCESS: Smart Scheduling for Asymmetric Cache CMPs
Xiaowei Jiang†, Asit Mishra‡, Li Zhao†, Ravi Iyer†, Zhen Fang†, Sadagopan Srinivasan†, Srihari Makineni†, Paul Brett†, Chita Das‡
Intel Labs (Oregon)† Penn State‡
Agenda
• Motivation
• Related Work
• ACCESS Architecture
• ACS Scheduler
• Evaluation Results
• Conclusions
Motivation
Applications tend to have non-uniform cache capacity requirements.

[Figure: CPI normalized to a 4096 KB cache for cache sizes of 4096, 2048, 1024, and 512 KB; for many apps, shrinking the cache barely raises CPI, so uniformly large caches are energy-inefficient.]

[Diagram: virtual asymmetry (Core0 and Core1 share one cache, partitioned unevenly) vs. physical asymmetry (Core0 backed by a large cache, Core1 by a small cache).]
Benefit of Physically Asymmetric Caches
• Fits the asymmetry in apps' working set sizes (WSS)
– Apps with a small WSS, and streaming apps, go on the small cache
– Apps with a large WSS go on the large cache
• Helps improve energy per instruction
– The large cache can be power-gated when not in use
– The smaller cache enables a lower operating voltage
• Fits the needs of heterogeneous-core architectures
– Asymmetric cores naturally call for asymmetric caches
512 KB: 0.8 V; 4 MB: 1.0 V
Challenges in Asymmetric Caches
• What hardware support is needed?
– Hardware must expose certain cache statistics to the OS
• What OS scheduler changes are needed?
– The scheduler must be aware of the underlying cache asymmetry
– A new scheduling policy is needed to exploit the asymmetry
Contribution of ACCESS
• ACCESS Architecture
– Enables asymmetric caches
– ACCESS Prediction Engine (APE): runtime online measurement of cache statistics, exposed to the OS
• Asymmetric Cache Scheduler (ACS)
– Finds the best-performing schedule with one-time training
– Handles both private caches and a shared cache
– Estimates shared-cache contention effects with simple heuristics
– O(1) complexity
• Real-machine measurements show >20% performance improvement over the default Linux scheduler
Related Work
• OS schedulers for heterogeneous-core architectures
– Li et al., HPCA'10
– Kumar et al., MICRO'03, ISCA'04
• OS scheduler or hardware approaches for mitigating cache contention effects
– Chandra et al., HPCA'05
– Kim et al., PACT'05
ACCESS Architecture
• Tasks run with one-time training
• APE measures/predicts each task's cache statistics on the big and small caches
• The OS builds the schedule from the cache statistics reported by APE

[Diagram: the OS schedules Task1-Task4 onto cores with private L1s; a big LLC and a small LLC each carry an APE that reports a task's performance on the small cache and on the big cache.]
Access Prediction Engine
• Shadow tags
– Use set sampling to reduce size and the number of accesses

[Diagram: a 4 MB, 16-way LLC (4096-set tag array) shared by App1 and App2; a hit/miss controller feeds set-sampled shadow tags that model, for each app, a 4 MB cache and a 512 KB cache.]

• A shadow tag is a cache without the data array
• With multiple shadow tags, even though App1 and App2 share the cache, we can still measure each app's cache statistics as if it ran alone on the 4 MB (L) and 512 KB (S) caches
• Provides cache statistics for each app on each cache (running alone)
Asymmetric Cache Scheduler
• Goal of the scheduler: improve overall thread performance
– Perform as little training as possible to minimize training overhead
• Thread statistics available to the scheduler
– Instruction counts, etc.
– Cache misses of each thread running alone on each cache
• In practice, we find that the schedule with the minimal overall MPI yields the best overall performance
MPI_sum = Σ_{Thread_i on L} MPI_i + Σ_{Thread_i on S} MPI_i
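The MPI_sum metric can be sketched in a few lines; the helper name `mpi_sum`, the dict-based schedule, and the sample values are illustrative, with each thread's standalone (large-cache, small-cache) MPI pair coming from APE-style profiling.

```python
# Minimal sketch of the MPI_sum metric: sum each thread's standalone MPI on
# whichever cache ('L' or 'S') the candidate schedule assigns it to.

def mpi_sum(schedule, mpi):
    """schedule: thread -> 'L' or 'S'; mpi: thread -> (mpi_on_L, mpi_on_S)."""
    total = 0.0
    for thread, cache in schedule.items():
        mpi_l, mpi_s = mpi[thread]
        total += mpi_l if cache == 'L' else mpi_s
    return total
```

The scheduler then prefers the assignment with the smallest `mpi_sum`.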
ACS Examples
• Private caches, e.g. the 2T case
– Calculate <MPI_T1_L, MPI_T1_S> and <MPI_T2_L, MPI_T2_S>
– Compute MPI_sum for all possible schedules:
  MPI_sum1 = MPI_T1_L + MPI_T2_S
  MPI_sum2 = MPI_T1_S + MPI_T2_L
– Pick min(MPI_sum1, MPI_sum2)
• Shared caches, e.g. the 4T case
– Calculate <MPI_Ti_L, MPI_Ti_S> for each Ti
– Compute MPI_sum for all possible schedules:
  MPI_sum = MPI_TiTj_L + MPI_TxTy_S
– MPI_TiTj_L and MPI_TxTy_S are estimated (contention estimation, next slides)
– Pick the schedule with the minimal MPI_sum

[Diagram: 2T case: T1 and T2 on Core0/Core1 with a private large cache and a private small cache; 4T case: T1-T4 on Core0-Core3, where one pair of cores shares the large cache and the other pair shares the small cache.]
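The private-cache search above can be sketched as a brute-force enumeration; `best_schedule` and its arguments are hypothetical names, and the shared-cache (4T) variant would additionally replace the standalone MPIs of co-scheduled threads with contention-adjusted estimates.

```python
from itertools import combinations

# Enumerate every way to place `cores_per_cache` threads on the large cache
# and the rest on the small cache, scoring each split by summed standalone MPI.

def best_schedule(threads, mpi, cores_per_cache):
    """threads: list of names; mpi: thread -> (mpi_on_L, mpi_on_S).
    Returns (min MPI_sum, threads on L, threads on S)."""
    best = None
    for on_large in combinations(threads, cores_per_cache):
        on_small = [t for t in threads if t not in on_large]
        total = (sum(mpi[t][0] for t in on_large) +
                 sum(mpi[t][1] for t in on_small))
        if best is None or total < best[0]:
            best = (total, set(on_large), set(on_small))
    return best
```

This is the naive exhaustive approach whose overhead the O(1) ACS is designed to avoid.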
Estimating Cache Contention Effect
• Task: given MPI_Ti_L/S and MPI_Tj_L/S, estimate MPI_TiTj_L/S
• Cache power law (Hartstein et al.):
  MR_new = MR_old * (C_new / C_old)^(-α)
  MPI_new = MPI_old * (C_new / C_old)^(-α)

[Figure: SPEC 2006 average miss rate vs. number of cache ways (8 KB/way) on a log-log scale, with a fitted power-law curve.]

• α measures how sensitive an app is to cache capacity:
  α = -log_{C_L/C_S}(MPI_Ti_L / MPI_Ti_S)
• We can compute α for each thread
• MPI_TiTj_L/S = MPI_Ti_L/S * (Occp_TiTj)^(-α), i.e. the power law applied to Ti's occupancy-scaled effective capacity
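The power-law fit can be sketched as follows; the function names are illustrative, and capacities can be in any consistent unit (KB here).

```python
import math

# Fit the per-thread exponent alpha from the two standalone measurements,
# then extrapolate: MPI_new = MPI_old * (C_new / C_old)^(-alpha).

def fit_alpha(mpi_large, mpi_small, c_large, c_small):
    # alpha = -log_{C_L/C_S}(MPI_L / MPI_S)
    return -math.log(mpi_large / mpi_small) / math.log(c_large / c_small)

def mpi_at(mpi_old, c_old, c_new, alpha):
    # Power-law extrapolation of MPI to a new capacity.
    return mpi_old * (c_new / c_old) ** (-alpha)
```

Fitting on the two measured points (4 MB and 512 KB) makes the extrapolation exact at those capacities, so `mpi_at` round-trips the small-cache measurement.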
Estimating Cache Contention Effect (cont.)
• Estimating the cache occupancy of Ti when Ti and Tj share a cache
– Occp_TiTj combines each thread's solo miss count (miss_i, miss_j) with its replacement probability (prob_repl_i, prob_repl_j)
– MPI_TiTj_L/S then follows from MPI_Ti_L/S and the miss share miss_i / (miss_i + miss_j)
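A deliberately simplified stand-in for the contention estimate: it approximates Ti's occupancy by its share of solo misses alone (dropping the replacement-probability weighting the slide uses), then applies the power law with the effective capacity Occp * C.

```python
# Hedged sketch: occupancy ~ solo-miss share, then power-law scaling of MPI.

def shared_mpi(mpi_solo, alpha, miss_self, miss_other):
    occ = miss_self / (miss_self + miss_other)  # fraction of the cache Ti is assumed to hold
    return mpi_solo * occ ** (-alpha)           # power law with C_new = occ * C_old
```

A thread holding a quarter of the cache with alpha = 0.5 thus sees its standalone MPI doubled.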
Scheduler Compute Overhead
• Computing and sorting all possible schedules has O(n²) complexity
• The number of thread migrations needed to arrive at the best schedule may be unbounded

[Figure: naive scheduler computation overhead vs. number of threads (2 to 22) on a log scale from 1e3 to 1e9.]
O(1) ACS
• Goals of O(1) ACS
– O(1) complexity
– A limited number of thread migrations to arrive at the best schedule
• O(1) ACS algorithm
– On each thread (Ti) arrival, compare the MPI_sum of 6 cases:
– 1. Ti on L
– 2. Ti on L, migration candidate on L -> S
– 3. Ti on L, migration candidate on L <-> migration candidate on S
– 4. Ti on S
– 5. Ti on S, migration candidate on S -> L
– 6. Ti on S, migration candidate on S <-> migration candidate on L
– Pick the best schedule among cases 1-6
– Update the migration candidates based on the 2nd-best schedule
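The six-case comparison can be sketched as below, assuming per-cache MPI loads add linearly and a migration candidate exists on each side; the names and bookkeeping are illustrative, not the paper's implementation.

```python
# On arrival of `new`, score the six candidate moves from the current state:
# load_l/load_s are the summed MPIs on each cache, cand_l/cand_s the
# migration candidates currently on the large/small cache.

def o1_schedule(new, mpi, load_l, load_s, cand_l, cand_s):
    """mpi: thread -> (mpi_on_L, mpi_on_S). Returns (best MPI_sum, case 1..6)."""
    new_l, new_s = mpi[new]
    cl_l, cl_s = mpi[cand_l]
    cs_l, cs_s = mpi[cand_s]
    cases = [
        (load_l + new_l) + load_s,                                # 1: new on L
        (load_l - cl_l + new_l) + (load_s + cl_s),                # 2: new on L, cand_l -> S
        (load_l - cl_l + new_l + cs_l) + (load_s - cs_s + cl_s),  # 3: new on L, swap candidates
        load_l + (load_s + new_s),                                # 4: new on S
        (load_l + cs_l) + (load_s - cs_s + new_s),                # 5: new on S, cand_s -> L
        (load_l - cl_l + cs_l) + (load_s - cs_s + new_s + cl_s),  # 6: new on S, swap candidates
    ]
    best = min(range(6), key=lambda i: cases[i])
    return cases[best], best + 1
```

With the numbers from the example slide (T1: 0.40/0.50, T2: 0.45/0.90, T3: 0.60/0.75; T2 on L, T1 on S), case 1 wins with MPI_sum 1.55.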
O(1) ACS Example

Per-thread MPI (from APE):
Thread  MPI on L  MPI on S
T1      0.40      0.50
T2      0.45      0.90
T3      0.60      0.75

State at t0:
Thread on L  Thread on S  Candidate on L  Candidate on S  MPI on L  MPI on S  MPIsum
T2           T1           T2              T1              0.45      0.50      0.95

ACS computation at t1 (T3 arrives):
Case  Schedule                     MPI_L  MPI_S  MPIsum
1     T3 on L                      1.05   0.50   1.55
2     T3 on L, T2 -> S             0.60   1.40   2.00
3     T3 on L, T2 -> S, T1 -> L    1.00   0.90   1.90
4     T3 on S                      0.45   1.25   1.70
5     T3 on S, T1 -> L             0.85   0.75   1.60
6     T3 on S, T1 -> L, T2 -> S    0.40   1.65   2.05

State after t1 (case 1 picked):
Thread on L  Thread on S  Candidate on L  Candidate on S  MPI on L  MPI on S  MPIsum
T2, T3       T1           T3              T1              1.05      0.50      1.55
O(1) ACS Efficacy
• Constant computation overhead
• Always achieves at least 97% of the best schedule's performance

[Figure: scheduler computation overhead vs. number of threads (2 to 22) on a log scale from 1e3 to 1e9, naive vs. ACS; and ACS's error rate relative to the best schedule (0% to 5% scale) over the same thread counts.]
Evaluation Setup
• Real-machine measurements on a Xeon 5160
– 4 cores at 3 GHz
– 32 KB split L1 caches
– Two L2 caches, 4 MB and 512 KB, each shared by 2 cores
• ACS scheduler
– Implemented in Linux 2.6.32
– Fast thread migration enabled
– Since no APE hardware is available, MPIs were profiled offline with 2% error applied (to account for the effects of set sampling)
• Benchmarks
– 17 C/C++ SPEC2006 benchmarks
– 2T and 4T workloads covering both cache-sensitive (S) and cache-insensitive (I) benchmarks
– Run until the first thread exits
Evaluation Results of ACS (2T)
• Performance improvement in all 70 cases
• 20% average speedup
• Demonstrates the efficacy of ACS

[Figure: weighted speedup of ACS vs. the default Linux scheduler on 2T SPEC2006 workload pairs, grouped as S2I0 (avg over 20), S1I1 (avg over 36), and S0I2 (avg over 14), plus the overall average over all 70.]
Evaluation Results of ACS (4T)
• Performance improvement in all 30 cases
• 31% average speedup
• Demonstrates the efficacy of ACS and of the cache contention estimation

[Figure: weighted speedup of ACS vs. the default Linux scheduler on 4T SPEC2006 workloads, grouped as S0I4, S1I3, S2I2, S3I1, and S4I0, with per-group averages and an overall average.]
Conclusions
• We proposed the ACCESS architecture
– Enables physically asymmetric caches
– The ACCESS Prediction Engine uses shadow tags to conduct online cache simulation
• We also proposed the ACS scheduler
– One-time training, using the MPI_sum metric to derive the best-performing schedule
– A practical approach to estimating shared-cache contention effects
– The O(1) ACS scheduler minimizes computation overhead and limits thread migrations
– Real-platform measurements show >20% speedup over the Linux scheduler
Thanks!