HPCA 2011
ACCESS: Smart Scheduling for Asymmetric Cache CMPs
Xiaowei Jiang†, Asit Mishra‡, Li Zhao†, Ravi Iyer†, Zhen Fang†, Sadagopan Srinivasan†, Srihari Makineni†, Paul Brett†, Chita Das‡
Intel Labs (Oregon)† Penn State‡
Agenda
• Motivation
• Related Work
• ACCESS Architecture
• ACS Scheduler
• Evaluation Results
• Conclusions
Motivation
Applications tend to have non-uniform cache capacity requirements.

[Figure: CPI normalized to a 4096 KB cache for cache sizes of 4096, 2048, 1024, and 512 KB; for many apps, shrinking the cache barely raises CPI, so uniformly large caches are energy-inefficient.]

[Diagram: virtual asymmetry (Core0 and Core1 share one cache, partitioned unevenly) vs. physical asymmetry (Core0 backed by a large cache, Core1 by a small cache).]
Benefit of Physically Asymmetric Caches
• Fits the asymmetry in apps' working set sizes (WSS)
– Apps with a small WSS, and streaming apps, go on the small cache
– Apps with a large WSS go on the large cache
• Helps improve energy per instruction
– The large cache can be power-gated when not in use
– The smaller cache enables a lower operating voltage
• Fits the needs of heterogeneous-core architectures
– Asymmetric cores naturally call for asymmetric caches
512 KB: 0.8 V; 4 MB: 1.0 V
Challenges in Asymmetric Caches
• What hardware support is needed?
– Hardware must expose certain cache statistics to the OS
• What OS scheduler changes are needed?
– The scheduler must be aware of the underlying cache asymmetry
– A new scheduling policy is needed to exploit the asymmetry
Contribution of ACCESS
• ACCESS Architecture
– Enables asymmetric caches
– ACCESS Prediction Engine (APE): runtime online measurement of cache statistics, exposed to the OS
• Asymmetric Cache Scheduler (ACS)
– Finds the best-performing schedule with one-time training
– Handles both private caches and a shared cache
– Estimates shared-cache contention effects with simple heuristics
– O(1) complexity
• Real-machine measurements show >20% performance improvement over the default Linux scheduler
Related Work
• OS schedulers for heterogeneous-core architectures
– Li et al., HPCA'10
– Kumar et al., MICRO'03, ISCA'04
• OS scheduler or hardware approaches for mitigating cache contention effects
– Chandra et al., HPCA'05
– Kim et al., PACT'05
ACCESS Architecture
• Tasks run with one-time training
• APE measures/predicts each task's cache statistics on the big and small caches
• The OS builds the schedule from the cache statistics reported by APE

[Diagram: the OS schedules Task1-Task4 onto cores with private L1s; a big LLC and a small LLC each carry an APE that reports a task's performance on the small cache and on the big cache.]
Access Prediction Engine
• Shadow tags
– Use set sampling to reduce size and the number of accesses

[Diagram: a 4 MB, 16-way LLC (4096-set tag array) shared by App1 and App2; a hit/miss controller feeds set-sampled shadow tags that model, for each app, a 4 MB cache and a 512 KB cache.]

• A shadow tag is a cache without the data array
• With multiple shadow tags, even though App1 and App2 share the cache, we can still measure each app's cache statistics as if it ran alone on the 4 MB (L) and 512 KB (S) caches
• Provides cache statistics for each app on each cache (running alone)
Asymmetric Cache Scheduler
• Goal of the scheduler: improve overall thread performance
– Perform as little training as possible to minimize training overhead
• Thread statistics available to the scheduler
– Instruction counts, etc.
– Cache misses of each thread running alone on each cache
• In practice, we find that the schedule with the minimal overall MPI yields the best overall performance
MPI_sum = Σ_{Thread_i on L} MPI_i + Σ_{Thread_i on S} MPI_i
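The MPI_sum metric can be sketched in a few lines; the helper name `mpi_sum`, the dict-based schedule, and the sample values are illustrative, with each thread's standalone (large-cache, small-cache) MPI pair coming from APE-style profiling.

```python
# Minimal sketch of the MPI_sum metric: sum each thread's standalone MPI on
# whichever cache ('L' or 'S') the candidate schedule assigns it to.

def mpi_sum(schedule, mpi):
    """schedule: thread -> 'L' or 'S'; mpi: thread -> (mpi_on_L, mpi_on_S)."""
    total = 0.0
    for thread, cache in schedule.items():
        mpi_l, mpi_s = mpi[thread]
        total += mpi_l if cache == 'L' else mpi_s
    return total
```

The scheduler then prefers the assignment with the smallest `mpi_sum`.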
ACS Examples
• Private caches, e.g. the 2T case
– Calculate <MPI_T1_L, MPI_T1_S> and <MPI_T2_L, MPI_T2_S>
– Compute MPI_sum for all possible schedules:
  MPI_sum1 = MPI_T1_L + MPI_T2_S
  MPI_sum2 = MPI_T1_S + MPI_T2_L
– Pick min(MPI_sum1, MPI_sum2)
• Shared caches, e.g. the 4T case
– Calculate <MPI_Ti_L, MPI_Ti_S> for each Ti
– Compute MPI_sum for all possible schedules:
  MPI_sum = MPI_TiTj_L + MPI_TxTy_S
– MPI_TiTj_L and MPI_TxTy_S are estimated (contention estimation, next slides)
– Pick the schedule with the minimal MPI_sum

[Diagram: 2T case: T1 and T2 on Core0/Core1 with a private large cache and a private small cache; 4T case: T1-T4 on Core0-Core3, where one pair of cores shares the large cache and the other pair shares the small cache.]
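The private-cache search above can be sketched as a brute-force enumeration; `best_schedule` and its arguments are hypothetical names, and the shared-cache (4T) variant would additionally replace the standalone MPIs of co-scheduled threads with contention-adjusted estimates.

```python
from itertools import combinations

# Enumerate every way to place `cores_per_cache` threads on the large cache
# and the rest on the small cache, scoring each split by summed standalone MPI.

def best_schedule(threads, mpi, cores_per_cache):
    """threads: list of names; mpi: thread -> (mpi_on_L, mpi_on_S).
    Returns (min MPI_sum, threads on L, threads on S)."""
    best = None
    for on_large in combinations(threads, cores_per_cache):
        on_small = [t for t in threads if t not in on_large]
        total = (sum(mpi[t][0] for t in on_large) +
                 sum(mpi[t][1] for t in on_small))
        if best is None or total < best[0]:
            best = (total, set(on_large), set(on_small))
    return best
```

This is the naive exhaustive approach whose overhead the O(1) ACS is designed to avoid.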
Estimating Cache Contention Effect
• Task: given MPI_Ti_L/S and MPI_Tj_L/S, estimate MPI_TiTj_L/S
• Cache power law (Hartstein et al.):
  MR_new = MR_old * (C_new / C_old)^(-α)
  MPI_new = MPI_old * (C_new / C_old)^(-α)

[Figure: SPEC 2006 average miss rate vs. number of cache ways (8 KB/way) on a log-log scale, with a fitted power-law curve.]

• α measures how sensitive an app is to cache capacity:
  α = -log_{C_L/C_S}(MPI_Ti_L / MPI_Ti_S)
• We can compute α for each thread
• MPI_TiTj_L/S = MPI_Ti_L/S * (Occp_TiTj)^(-α), i.e. the power law applied to Ti's occupancy-scaled effective capacity
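The power-law fit can be sketched as follows; the function names are illustrative, and capacities can be in any consistent unit (KB here).

```python
import math

# Fit the per-thread exponent alpha from the two standalone measurements,
# then extrapolate: MPI_new = MPI_old * (C_new / C_old)^(-alpha).

def fit_alpha(mpi_large, mpi_small, c_large, c_small):
    # alpha = -log_{C_L/C_S}(MPI_L / MPI_S)
    return -math.log(mpi_large / mpi_small) / math.log(c_large / c_small)

def mpi_at(mpi_old, c_old, c_new, alpha):
    # Power-law extrapolation of MPI to a new capacity.
    return mpi_old * (c_new / c_old) ** (-alpha)
```

Fitting on the two measured points (4 MB and 512 KB) makes the extrapolation exact at those capacities, so `mpi_at` round-trips the small-cache measurement.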
Estimating Cache Contention Effect (cont.)
• Estimating the cache occupancy of Ti when Ti and Tj share a cache
– Occp_TiTj combines each thread's solo miss count (miss_i, miss_j) with its replacement probability (prob_repl_i, prob_repl_j)
– MPI_TiTj_L/S then follows from MPI_Ti_L/S and the miss share miss_i / (miss_i + miss_j)
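A deliberately simplified stand-in for the contention estimate: it approximates Ti's occupancy by its share of solo misses alone (dropping the replacement-probability weighting the slide uses), then applies the power law with the effective capacity Occp * C.

```python
# Hedged sketch: occupancy ~ solo-miss share, then power-law scaling of MPI.

def shared_mpi(mpi_solo, alpha, miss_self, miss_other):
    occ = miss_self / (miss_self + miss_other)  # fraction of the cache Ti is assumed to hold
    return mpi_solo * occ ** (-alpha)           # power law with C_new = occ * C_old
```

A thread holding a quarter of the cache with alpha = 0.5 thus sees its standalone MPI doubled.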
Scheduler Compute Overhead
• Computing and sorting all possible schedules has O(n²) complexity
• The number of thread migrations needed to arrive at the best schedule may be unbounded

[Figure: naive scheduler computation overhead vs. number of threads (2 to 22) on a log scale from 1e3 to 1e9.]
O(1) ACS
• Goals of O(1) ACS
– O(1) complexity
– A limited number of thread migrations to arrive at the best schedule
• O(1) ACS algorithm
– On each thread (Ti) arrival, compare the MPI_sum of 6 cases:
– 1. Ti on L
– 2. Ti on L, migration candidate on L -> S
– 3. Ti on L, migration candidate on L <-> migration candidate on S
– 4. Ti on S
– 5. Ti on S, migration candidate on S -> L
– 6. Ti on S, migration candidate on S <-> migration candidate on L
– Pick the best schedule among cases 1-6
– Update the migration candidates based on the 2nd-best schedule
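The six-case comparison can be sketched as below, assuming per-cache MPI loads add linearly and a migration candidate exists on each side; the names and bookkeeping are illustrative, not the paper's implementation.

```python
# On arrival of `new`, score the six candidate moves from the current state:
# load_l/load_s are the summed MPIs on each cache, cand_l/cand_s the
# migration candidates currently on the large/small cache.

def o1_schedule(new, mpi, load_l, load_s, cand_l, cand_s):
    """mpi: thread -> (mpi_on_L, mpi_on_S). Returns (best MPI_sum, case 1..6)."""
    new_l, new_s = mpi[new]
    cl_l, cl_s = mpi[cand_l]
    cs_l, cs_s = mpi[cand_s]
    cases = [
        (load_l + new_l) + load_s,                                # 1: new on L
        (load_l - cl_l + new_l) + (load_s + cl_s),                # 2: new on L, cand_l -> S
        (load_l - cl_l + new_l + cs_l) + (load_s - cs_s + cl_s),  # 3: new on L, swap candidates
        load_l + (load_s + new_s),                                # 4: new on S
        (load_l + cs_l) + (load_s - cs_s + new_s),                # 5: new on S, cand_s -> L
        (load_l - cl_l + cs_l) + (load_s - cs_s + new_s + cl_s),  # 6: new on S, swap candidates
    ]
    best = min(range(6), key=lambda i: cases[i])
    return cases[best], best + 1
```

With the numbers from the example slide (T1: 0.40/0.50, T2: 0.45/0.90, T3: 0.60/0.75; T2 on L, T1 on S), case 1 wins with MPI_sum 1.55.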
O(1) ACS Example

Per-thread MPI (from APE):
Thread  MPI on L  MPI on S
T1      0.40      0.50
T2      0.45      0.90
T3      0.60      0.75

State at t0:
Thread on L  Thread on S  Candidate on L  Candidate on S  MPI on L  MPI on S  MPIsum
T2           T1           T2              T1              0.45      0.50      0.95

ACS computation at t1 (T3 arrives):
Case  Schedule                     MPI_L  MPI_S  MPIsum
1     T3 on L                      1.05   0.50   1.55
2     T3 on L, T2 -> S             0.60   1.40   2.00
3     T3 on L, T2 -> S, T1 -> L    1.00   0.90   1.90
4     T3 on S                      0.45   1.25   1.70
5     T3 on S, T1 -> L             0.85   0.75   1.60
6     T3 on S, T1 -> L, T2 -> S    0.40   1.65   2.05

State after t1 (case 1 picked):
Thread on L  Thread on S  Candidate on L  Candidate on S  MPI on L  MPI on S  MPIsum
T2, T3       T1           T3              T1              1.05      0.50      1.55
O(1) ACS Efficacy
• Constant computation overhead
• Always achieves at least 97% of the best schedule's performance

[Figure: scheduler computation overhead vs. number of threads (2 to 22) on a log scale from 1e3 to 1e9, naive vs. ACS; and ACS's error rate relative to the best schedule (0% to 5% scale) over the same thread counts.]
Evaluation Setup
• Real-machine measurements on a Xeon 5160
– 4 cores at 3 GHz
– 32 KB split L1 caches
– Two L2 caches, 4 MB and 512 KB, each shared by 2 cores
• ACS scheduler
– Implemented in Linux 2.6.32
– Fast thread migration enabled
– Since no APE hardware is available, MPIs were profiled offline with 2% error applied (to account for the effects of set sampling)
• Benchmarks
– 17 C/C++ SPEC2006 benchmarks
– 2T and 4T workloads covering both cache-sensitive (S) and cache-insensitive (I) benchmarks
– Run until the first thread exits
Evaluation Results of ACS (2T)
• Performance improvement in all 70 cases
• 20% average speedup
• Demonstrates the efficacy of ACS

[Figure: weighted speedup of ACS vs. the default Linux scheduler on 2T SPEC2006 workload pairs, grouped as S2I0 (avg over 20), S1I1 (avg over 36), and S0I2 (avg over 14), plus the overall average over all 70.]
Evaluation Results of ACS (4T)
• Performance improvement in all 30 cases
• 31% average speedup
• Demonstrates the efficacy of ACS and of the cache contention estimation

[Figure: weighted speedup of ACS vs. the default Linux scheduler on 4T SPEC2006 workloads, grouped as S0I4, S1I3, S2I2, S3I1, and S4I0, with per-group averages and an overall average.]
Conclusions
• We proposed the ACCESS architecture
– Enables physically asymmetric caches
– The ACCESS Prediction Engine uses shadow tags to conduct online cache simulation
• We also proposed the ACS scheduler
– One-time training, using the MPI_sum metric to derive the best-performing schedule
– A practical approach to estimating shared-cache contention effects
– The O(1) ACS scheduler minimizes computation overhead and limits thread migrations
– Real-platform measurements show >20% speedup over the Linux scheduler
Thanks!