
Providing High and Predictable Performance in Multicore Systems

Through Shared Resource Management

Thesis Defense

Lavanya Subramanian

1

Committee:
Advisor: Onur Mutlu
Greg Ganger
James Hoe

Ravi Iyer (Intel)

The Multicore Era

2

(Diagram: a single core with a cache and main memory)

The Multicore Era

3

(Diagram: multiple cores connected through an interconnect to a shared cache and main memory)

Multiple applications execute in parallel → high throughput and efficiency

Challenge: Interference at Shared Resources

4

(Diagram: multiple cores contending at the interconnect, shared cache, and main memory)

Impact of Shared Resource Interference

(Charts: slowdown of leslie3d (core 0) running with gcc (core 1), and of leslie3d (core 0) running with mcf (core 1); y-axis: slowdown, 0 to 6)

1. High application slowdowns
2. Unpredictable application slowdowns

5

Why Predictable Performance?

• There is a need for predictable performance
  – When multiple applications share resources
  – Especially if some applications require performance guarantees

• Example 1: In server systems
  – Different users’ jobs consolidated onto the same server
  – Need to provide bounded slowdowns to critical jobs

• Example 2: In mobile systems
  – Interactive applications run with non-interactive applications
  – Need to guarantee performance for interactive applications

6

Thesis Statement

High and predictable performance can be achieved in multicore systems through simple/implementable mechanisms to mitigate and quantify shared resource interference

7

Goals

Approaches

Goals and Approaches

8

Goals:
1. High Performance
2. Predictable Performance

Approaches:
1. Mitigate Interference
2. Quantify Interference

Focus: Shared Resources in This Thesis

9

(Diagram: cores connected through an interconnect to the two shared resources in focus: shared cache capacity and main memory bandwidth)

Related Prior Work

10

Mitigate Interference
  – Cache Capacity: CQoS (ICS '04), UCP (MICRO '06), DIP (ISCA '07), DRRIP (ISCA '10), EAF (PACT '12); much explored, not our focus
  – Memory Bandwidth: STFM (MICRO '07), PARBS (ISCA '08), ATLAS (HPCA '10), TCM (MICRO '10), Criticality-aware (ISCA '13); challenge: high complexity

Quantify Interference
  – Cache Capacity: PCASA (DATE '12), Ubik (ASPLOS '14); goal: meet a resource allocation target, not our focus
  – Memory Bandwidth: STFM (MICRO '07), FST (ASPLOS '10), PTCA (TACO '13); challenge: high inaccuracy

Outline

11

Mitigate Interference
  – Cache Capacity: much explored; not our focus
  – Memory Bandwidth: Blacklisting Memory Scheduler

Quantify Interference
  – Cache Capacity alone: not our focus
  – Memory Bandwidth: Memory Interference-induced Slowdown Estimation (MISE) Model and its uses
  – Shared Cache Capacity and Memory Bandwidth: Application Slowdown Model and its uses

Outline

12

Mitigate Interference
  – Cache Capacity: much explored; not our focus
  – Memory Bandwidth: Blacklisting Memory Scheduler

Quantify Interference
  – Cache Capacity alone: not our focus
  – Memory Bandwidth: Memory Interference-induced Slowdown Estimation (MISE) Model and its uses
  – Shared Cache Capacity and Memory Bandwidth: Application Slowdown Model and its uses

Background: Main Memory

• FR-FCFS Memory Scheduler [Zuravleff and Robinson, US Patent ‘97; Rixner et al., ISCA ‘00]

– Row-buffer hit first

– Older request first

• Unaware of inter-application interference
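To make the FR-FCFS policy concrete, below is a minimal, hypothetical Python sketch of how a controller might pick the next request. The Request fields and the open_rows bookkeeping are illustrative assumptions for this sketch, not the hardware design discussed in the thesis.

```python
from dataclasses import dataclass

@dataclass
class Request:
    arrival_time: int   # cycle the request entered the request buffer
    bank: int
    row: int

def frfcfs_pick(request_buffer, open_rows):
    """FR-FCFS: prefer row-buffer hits; break ties by age (older first).

    open_rows maps bank -> currently open row (the row held in that bank's row buffer).
    Note: this policy is unaware of which application issued each request.
    """
    def priority(req):
        row_hit = (open_rows.get(req.bank) == req.row)
        # Row-buffer hits first (False < True, so negate), then older requests first.
        return (not row_hit, req.arrival_time)

    return min(request_buffer, key=priority) if request_buffer else None

# Example: bank 0 has row 7 open; the newer row-buffer hit wins over the older miss.
reqs = [Request(arrival_time=10, bank=0, row=3),
        Request(arrival_time=20, bank=0, row=7)]
print(frfcfs_pick(reqs, {0: 7}))  # -> the row-7 (row-buffer hit) request
```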

(Diagram: the memory controller drives a channel to banks 0–3; each bank is an array of rows and columns with a row buffer. A request to the currently open row is a row-buffer hit; a request to a different row is a row-buffer miss)

13

Tackling Inter-Application Interference: Application-aware Memory Scheduling

14

(Diagram: the controller monitors applications and computes a full ranking, then enforces the ranks by comparing each buffered request's application ID (AID) against the highest-ranked AID)

Full ranking is performed to improve performance and fairness, but it increases critical path latency and area significantly

Performance vs. Fairness vs. Simplicity

15

(Radar chart over three axes: performance, fairness, and simplicity, comparing FRFCFS, PARBS, ATLAS, TCM, Blacklisting, and an ideal scheduler. The app-unaware FRFCFS is very simple but has low performance and fairness; app-aware ranking schedulers (PARBS, ATLAS, TCM) are complex; our solution uses no ranking)

Is it essential to give up simplicity to optimize for performance and/or fairness?

Our solution achieves all three goals

Problems with Previous Application-aware Memory Schedulers

1. Full ranking increases hardware complexity

2. Full ranking causes unfair slowdowns

16

Our Goal: Design a memory scheduler with low complexity, high performance, and fairness

Key Observation 1: Group Rather Than Rank

Observation 1: Sufficient to separate applications into two groups, rather than do full ranking

17

Benefit 1: Low complexity compared to ranking

(Diagram: instead of a monitor-and-rank stage producing a full rank order, applications are separated into two groups, with the vulnerable group prioritized over the interference-causing group)

Benefit 2: Lower slowdowns than ranking

Key Observation 1: Group Rather Than Rank

Observation 1: Sufficient to separate applications into two groups, rather than do full ranking

18

(Diagram repeated: two groups, with the vulnerable group prioritized over the interference-causing group)

How to classify applications into groups?

Key Observation 2

Observation 2: Serving a large number of consecutive requests from an application causes interference

Basic Idea:
• Group applications with a large number of consecutive requests as interference-causing → blacklisting
• Deprioritize blacklisted applications
• Clear the blacklist periodically (every 1000s of cycles)

Benefits:
• Lower complexity
• Finer-grained grouping decisions → lower unfairness

(A small sketch of this grouping logic follows below.)

19
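The sketch below is a simplified, hypothetical Python model of the blacklisting idea just described; the threshold and clearing-interval values are placeholder parameters, not the exact configuration evaluated in the thesis.

```python
class BlacklistMonitor:
    """Tracks consecutive requests per application and blacklists heavy streaks."""

    def __init__(self, blacklist_threshold=4, clearing_interval=10_000):
        self.threshold = blacklist_threshold        # consecutive requests before blacklisting
        self.clearing_interval = clearing_interval  # cycles between blacklist clears
        self.last_aid = None
        self.streak = 0
        self.blacklist = set()

    def on_request_served(self, aid):
        # Count consecutive requests served from the same application (AID).
        self.streak = self.streak + 1 if aid == self.last_aid else 1
        self.last_aid = aid
        if self.streak >= self.threshold:
            self.blacklist.add(aid)   # mark as interference-causing

    def on_cycle(self, cycle):
        # Periodically clear the blacklist so grouping decisions stay fine-grained.
        if cycle % self.clearing_interval == 0:
            self.blacklist.clear()

    def is_deprioritized(self, aid):
        return aid in self.blacklist
```

A scheduler could then order requests by (not blacklisted, row-buffer hit, age), which realizes the two-group prioritization described above.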

The Blacklisting Memory Scheduler (ICCD ‘14)

20

(Diagram: the memory controller (1) monitors the application ID (AID) of the last request served and a counter of consecutive requests from that AID, (2) blacklists an AID once the count crosses a threshold, (3) prioritizes non-blacklisted applications' requests in the request buffer, and (4) clears the blacklist periodically)

Simple and scalable design

Methodology

• Configuration of our simulated baseline system
  – 24 cores
  – 4 channels, 8 banks/channel
  – DDR3 1066 DRAM
  – 512 KB private cache/core

• Workloads
  – SPEC CPU2006, TPC-C, Matlab, NAS
  – 80 multiprogrammed workloads

21

Performance and Fairness

22

(Scatter plot: unfairness (lower is better) vs. performance (higher is better) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting, annotated with 5% and 21% improvements)

1. Blacklisting achieves the highest performance 2. Blacklisting balances performance and fairness

Complexity

23

(Scatter plot: scheduler area (sq. um) vs. critical path latency (ns) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting, annotated with 43% and 70% reductions)

Blacklisting reduces complexity significantly

Outline

24

Mitigate Interference
  – Cache Capacity: much explored; not our focus
  – Memory Bandwidth: Blacklisting Memory Scheduler

Quantify Interference
  – Cache Capacity alone: not our focus
  – Memory Bandwidth: Memory Interference-induced Slowdown Estimation (MISE) Model and its uses
  – Shared Cache Capacity and Memory Bandwidth: Application Slowdown Model and its uses

Impact of Interference on Performance

25

(Diagram: execution timeline of an application running alone (no interference) vs. running shared (with interference); the extra execution time is the impact of interference)

Slowdown: Definition

Slowdown = Performance_Alone / Performance_Shared

26
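As a concrete example of this definition (illustrative numbers): if an application achieves 2 instructions per cycle running alone but only 0.5 instructions per cycle when sharing resources, its slowdown is 2 / 0.5 = 4.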

Impact of Interference on Performance

27

(Diagram: execution timeline of an application running alone (no interference) vs. running shared (with interference); the extra execution time is the impact of interference)

Previous Approach: Estimate impact of interference at a per-request granularity

Difficult to estimate due to request overlap

Outline

28

Mitigate Interference
  – Cache Capacity: much explored; not our focus
  – Memory Bandwidth: Blacklisting Memory Scheduler

Quantify Interference
  – Cache Capacity alone: not our focus
  – Memory Bandwidth: Memory Interference-induced Slowdown Estimation (MISE) Model and its uses
  – Shared Cache Capacity and Memory Bandwidth: Application Slowdown Model and its uses

Observation: Request Service Rate is a Proxy for Performance

For a memory-bound application, Performance ∝ Memory request service rate

(Plot: normalized performance vs. normalized request service rate for omnetpp, mcf, and astar, measured on a real system (Intel Core i7, 4 cores); the two track each other closely)

Slowdown = Performance_Alone / Performance_Shared   (difficult to measure)

Slowdown = Request Service Rate_Alone / Request Service Rate_Shared   (easy to measure)

29

Observation: Highest Priority Enables Request Service Rate Alone Estimation

Request Service Rate Alone (RSR_Alone) of an application can be estimated by giving the application highest priority at the memory controller

Highest priority → little interference (almost as if the application were run alone)

30

Observation: Highest Priority Enables Request Service Rate Alone Estimation

(Diagrams: request buffer state and service order in three scenarios: (1) the application runs alone, (2) it runs with another application, and (3) it runs with another application but is given highest priority; with highest priority, its requests are served in nearly the same number of time units as when run alone)

31

Memory Interference-induced Slowdown Estimation (MISE) model for memory bound applications

Slowdown = Request Service Rate_Alone (RSR_Alone) / Request Service Rate_Shared (RSR_Shared)

32

Observation: Memory Bound vs. Non-Memory Bound

• Memory-bound application

(Diagram: execution timeline with alternating compute and memory phases, without and with interference; for a memory-bound application, the memory phase slowdown dominates the overall slowdown)

33

Observation: Memory Bound vs. Non-Memory Bound

• Non-memory-bound application

(Diagram: execution timeline of a non-memory-bound application with compute and memory phases, without and with interference)

Only the memory fraction (α) slows down with interference

Slowdown = (1 - α) + α × (RSR_Alone / RSR_Shared)

Memory Interference-induced Slowdown Estimation (MISE) model for non-memory bound applications

34
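The two MISE equations above can be summarized in a small sketch; the inputs (the RSR values and the memory phase fraction α) are assumed to come from the counters described later in the talk.

```python
def mise_slowdown(rsr_alone, rsr_shared, alpha):
    """MISE slowdown estimate.

    rsr_alone, rsr_shared : estimated and measured request service rates
    alpha                 : memory phase fraction (fraction of time stalled on memory)

    For a fully memory-bound application (alpha == 1) this reduces to
    rsr_alone / rsr_shared; otherwise only the memory fraction slows down.
    """
    return (1 - alpha) + alpha * (rsr_alone / rsr_shared)

# Example: alpha = 0.8 and service rate halved under sharing -> slowdown of 1.8
print(mise_slowdown(rsr_alone=1.0, rsr_shared=0.5, alpha=0.8))
```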

Interval Based Operation

(Timeline: execution is divided into intervals; in each interval, RSR_Shared is measured and RSR_Alone is estimated, and the slowdown is estimated at the end of the interval)

35

Previous Work on Slowdown Estimation

• Previous work on slowdown estimation
  – STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO ’07]
  – FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS ’10]
  – Per-thread Cycle Accounting [Du Bois et al., HiPEAC ’13]

• Basic Idea:

Slowdown = Stall Time_Shared / Stall Time_Alone

Count the number of cycles an application receives interference

36

Methodology

• Configuration of our simulated system
  – 4 cores
  – 1 channel, 8 banks/channel
  – DDR3 1066 DRAM
  – 512 KB private cache/core

• Workloads
  – SPEC CPU2006
  – 300 multiprogrammed workloads

37

Quantitative Comparison

(Plot: actual slowdown of the SPEC CPU2006 application leslie3d over time (million cycles), compared against the STFM and MISE estimates)

38

Comparison to STFM

(Plots: actual vs. estimated slowdown over time for cactusADM, GemsFDTD, soplex, wrf, calculix, and povray)

Average error of MISE: 8.2%
Average error of STFM: 29.4%

(across 300 workloads)

39

Possible Use Cases of the MISE Model

• Bounding application slowdowns [HPCA ’13]

• Achieving high system fairness and performance [HPCA ’13]

• VM migration and admission control schemes [VEE ’15]

• Fair billing schemes in a commodity cloud

40

MISE-QoS: Providing “Soft” Slowdown Guarantees

• Goal

1. Ensure QoS-critical applications meet a prescribed slowdown bound

2. Maximize system performance for other applications

• Basic Idea

– Allocate just enough bandwidth to QoS-critical application

– Assign remaining bandwidth to other applications

41

Methodology

• Each application (25 applications in total) considered as the QoS-critical application

• Run with 12 sets of co-runners of different memory intensities

• Total of 300 multiprogrammed workloads

• Each workload run with 10 slowdown bound values

• Baseline memory scheduling mechanism
  – Always prioritize the QoS-critical application [Iyer et al., SIGMETRICS 2007]
  – Other applications’ requests scheduled in FR-FCFS order [Zuravleff and Robinson, US Patent 1997; Rixner et al., ISCA 2000]

42

A Look at One Workload

(Chart: slowdowns of leslie3d (QoS-critical) and hmmer, lbm, and omnetpp (non-QoS-critical) under AlwaysPrioritize and MISE-QoS with bound settings 10/1, 10/3, 10/5, 10/7, and 10/9, i.e., slowdown bounds of 10, 3.33, 2, and so on)

43

MISE is effective in
1. meeting the slowdown bound for the QoS-critical application
2. improving performance of non-QoS-critical applications

Effectiveness of MISE in Enforcing QoS

                    Predicted Met    Predicted Not Met
QoS Bound Met       78.8%            2.1%
QoS Bound Not Met   2.2%             16.9%

Across 3000 data points

MISE-QoS meets the bound for 80.9% of workloads

AlwaysPrioritize meets the bound for 83% of workloads

MISE-QoS correctly predicts whether or not the bound is met for 95.7% of workloads

44

Performance of Non-QoS-Critical Applications

Higher performance when the bound is loose
When the slowdown bound is 10/3, MISE-QoS improves system performance by 10%

45

(Chart: system performance under AlwaysPrioritize and MISE-QoS with bound settings 10/1, 10/3, 10/5, 10/7, and 10/9)

Outline

46

Mitigate Interference
  – Cache Capacity: much explored; not our focus
  – Memory Bandwidth: Blacklisting Memory Scheduler

Quantify Interference
  – Cache Capacity alone: not our focus
  – Memory Bandwidth: Memory Interference-induced Slowdown Estimation (MISE) Model and its uses
  – Shared Cache Capacity and Memory Bandwidth: Application Slowdown Model and its uses

Shared Cache Capacity Contention

47

(Diagram: cores contending for shared cache capacity and main memory)

Cache Capacity Contention

48

(Diagram: two cores with different cache access rates sharing a cache and main memory; one core can be given priority)

Applications evict each other’s blocks from the shared cache

Outline

49

Mitigate Interference
  – Cache Capacity: much explored; not our focus
  – Memory Bandwidth: Blacklisting Memory Scheduler

Quantify Interference
  – Cache Capacity alone: not our focus
  – Memory Bandwidth: Memory Interference-induced Slowdown Estimation (MISE) Model and its uses
  – Shared Cache Capacity and Memory Bandwidth: Application Slowdown Model and its uses

Estimating Cache and Memory Slowdowns

50

(Diagram: a 16-core system with a shared cache and main memory; each application has a cache service rate and a memory service rate)

Service Rates vs. Access Rates

51

Request service and access rates are tightly coupled

(Diagram: the same 16-core system; each application's cache access rate mirrors its cache service rate)

The Application Slowdown Model

52

(Diagram: 16-core system with a shared cache and main memory; each application's cache access rate is measured)

Slowdown = Cache Access Rate_Alone / Cache Access Rate_Shared

Real System Studies: Cache Access Rate vs. Slowdown

53

(Plot: slowdown vs. cache access rate ratio for astar, lbm, and bzip2, measured on a real system; the two track each other closely)

Challenge

How to estimate alone cache access rate?

54

(Diagram: 16-core system; the cache access rate is measured while an auxiliary tag store, together with priority at the memory controller, is used for the estimation)

Auxiliary Tag Store

55

(Diagram: a block evicted by another application's request is still present in the auxiliary tag store; the auxiliary tag store thus tracks such contention misses)

Accounting for Contention Misses

• Revisiting alone memory request service rate

Cycles serving contention misses should not count as high-priority cycles

56

Alone Request Service Rate of an Application = (# Requests During High Priority Epochs) / (# High Priority Cycles)

Alone Cache Access Rate Estimation

57

Alone Cache Access Rate of an Application = (# Requests During High Priority Epochs) / (# High Priority Cycles - # Cache Contention Cycles)

Cache Contention Cycles: cycles spent serving contention misses

Cache Contention Cycles = # Contention Misses × Average Memory Service Time

(# Contention Misses comes from the auxiliary tag store when the application is given high priority; Average Memory Service Time is measured when the application is given high priority)
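A minimal sketch of the estimation above; the counter names are illustrative, and in hardware these would be per-application registers sampled at the end of each high-priority epoch.

```python
def alone_cache_access_rate(requests_high_priority_epochs,
                            high_priority_cycles,
                            contention_misses,
                            avg_memory_service_time):
    """Estimate an application's cache access rate as if it ran alone (ASM idea).

    Cycles spent serving contention misses (misses that would have hit had the
    application run alone, per the auxiliary tag store) are removed from the
    high-priority cycle count before computing the rate.
    """
    cache_contention_cycles = contention_misses * avg_memory_service_time
    effective_cycles = high_priority_cycles - cache_contention_cycles
    return requests_high_priority_epochs / effective_cycles

def asm_slowdown(car_alone, car_shared):
    # Application Slowdown Model: Slowdown = CAR_Alone / CAR_Shared
    return car_alone / car_shared
```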

Application Slowdown Model (ASM)

58

(Diagram: 16-core system with a shared cache and main memory; each application's cache access rate is measured)

Slowdown = Cache Access Rate_Alone / Cache Access Rate_Shared

Previous Work on Slowdown Estimation

• Previous work on slowdown estimation
  – STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO ’07]
  – FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS ’10]
  – Per-thread Cycle Accounting [Du Bois et al., HiPEAC ’13]

• Basic Idea:

Slowdown = Execution Time_Shared / Execution Time_Alone

Count the interference experienced by each request

59

Model Accuracy Results

Average error of ASM’s slowdown estimates: 10%

60

Select applications

(Chart: slowdown estimation error (%) of FST, PTCA, and ASM for calculix, povray, tonto, namd, dealII, sjeng, perlbench, gobmk, xalancbmk, sphinx3, GemsFDTD, omnetpp, lbm, leslie3d, soplex, milc, libquantum, mcf, NPBbt, NPBft, NPBis, NPBua, and the average)

Leveraging ASM’s Slowdown Estimates

• Slowdown-aware resource allocation for high performance and fairness

• Slowdown-aware resource allocation to bound application slowdowns

• VM migration and admission control schemes [VEE ’15]

• Fair billing schemes in a commodity cloud

61

Cache Capacity Partitioning

62

(Diagram: two cores with different cache access rates sharing a cache and main memory)

Goal: Partition the shared cache among applications to mitigate contention

Cache Capacity Partitioning

63

(Diagram: a set-associative shared cache with ways 0–3 and sets 0 to N-1, partitioned by ways between two cores, in front of main memory)

Previous partitioning schemes optimize for miss count
Problem: Not aware of performance and slowdowns

ASM-Cache: Slowdown-aware Cache Way Partitioning

• Key Requirement: Slowdown estimates for all possible way partitions

• Extend ASM to estimate slowdown for all possible cache way allocations

• Key Idea: Allocate each way to the application whose slowdown reduces the most

64
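The greedy way-allocation idea can be sketched as follows; estimate_slowdown is assumed to be provided by the ASM extension that evaluates a candidate way count per application, and the function is an illustration of the allocation rule, not the exact mechanism evaluated in the thesis.

```python
def asm_cache_partition(apps, total_ways, estimate_slowdown):
    """Greedy slowdown-aware way partitioning (sketch of the ASM-Cache idea).

    estimate_slowdown(app, ways) -> predicted slowdown of `app` if given `ways` ways.
    Each way is handed to the application whose slowdown drops the most by receiving it.
    """
    allocation = {app: 0 for app in apps}
    for _ in range(total_ways):
        best_app = max(
            apps,
            key=lambda a: estimate_slowdown(a, allocation[a])
                          - estimate_slowdown(a, allocation[a] + 1),
        )
        allocation[best_app] += 1
    return allocation
```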

Memory Bandwidth Partitioning

65

(Diagram: two cores with different cache access rates sharing a cache and main memory bandwidth)

Goal: Partition the main memory bandwidth among applications to mitigate contention

ASM-Mem: Slowdown-aware Memory Bandwidth Partitioning

• Key Idea: Allocate high priority proportional to an application’s slowdown

• Application i’s requests given highest priority at the memory controller for its fraction

66

High Priority Fraction_i = Slowdown_i / Σ_j Slowdown_j
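A one-function sketch of the allocation rule above, with illustrative application names.

```python
def asm_mem_priority_fractions(slowdowns):
    """ASM-Mem: give application i highest priority for a fraction of time
    proportional to its estimated slowdown."""
    total = sum(slowdowns.values())
    return {app: s / total for app, s in slowdowns.items()}

# Example: an application slowed down 3x gets twice the high-priority time
# of one slowed down 1.5x.
print(asm_mem_priority_fractions({"mcf": 3.0, "gcc": 1.5}))  # {'mcf': 0.66..., 'gcc': 0.33...}
```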

Coordinated Resource Allocation Schemes

67

(Diagram: 16-core system with a shared cache and main memory; bandwidth allocation is made cache-capacity-aware)

1. Employ ASM-Cache to partition cache capacity 2. Drive ASM-Mem with slowdowns from ASM-Cache

Fairness and Performance Results

68

16-core system, 100 workloads

Significant fairness benefits across different channel counts

(Charts: fairness (lower is better) and performance vs. number of channels (1 and 2) for FRFCFS-NoPart, FRFCFS+UCP, TCM+UCP, PARBS+UCP, and ASM-Cache-Mem)

Outline

69

Mitigate Interference
  – Cache Capacity: much explored; not our focus
  – Memory Bandwidth: Blacklisting Memory Scheduler

Quantify Interference
  – Cache Capacity alone: not our focus
  – Memory Bandwidth: Memory Interference-induced Slowdown Estimation (MISE) Model and its uses
  – Shared Cache Capacity and Memory Bandwidth: Application Slowdown Model and its uses

Thesis Contributions

• Principles behind our scheduler and models
  – Simple two-level prioritization is sufficient to mitigate interference
  – Request service rate is a proxy for performance

• Simple and high-performance memory scheduler design

• Accurate slowdown estimation models

• Mechanisms that leverage our slowdown estimates

70

Summary

• Problem: Shared resource interference causes high and unpredictable application slowdowns

• Goals: High and predictable performance

• Approaches: Mitigate and quantify interference

• Thesis Contributions:
  1. Principles behind our scheduler and models
  2. Simple and high-performance memory scheduler
  3. Accurate slowdown estimation models
  4. Mechanisms that leverage our slowdown estimates

71

Future Work

• Leveraging slowdown estimates at the system and cluster level

• Interference estimation and performance predictability for multithreaded applications

• Performance predictability in heterogeneous systems

• Coordinating the management of main memory and storage

72

Research Summary

• Predictable performance in multicore systems [HPCA ’13, SuperFri ’14, KIISE ’15]

• High and predictable performance in heterogeneous systems [ISCA ’12, SAFARI Tech Report ’15]

• Low-complexity memory scheduling [ICCD ’14]

• Memory channel partitioning [MICRO ’11]

• Architecture-aware cluster management [VEE ’15]

• Low-latency DRAM architectures [HPCA ’13]

73

Backup Slides

74

Blacklisting

75

Problems with Previous Application-aware Memory Schedulers

1. Full ranking increases hardware complexity

2. Full ranking causes unfair slowdowns

76

Ranking Increases Hardware Complexity

77

(Diagram: the request buffer with per-request application IDs (AIDs) compared against the highest-ranked and next-highest-ranked AIDs; the monitor-and-rank logic and rank-enforcement comparators grow with the number of applications)

Hardware complexity increases with application/core count

78

(Charts: critical path latency (in ns) and scheduler area (in square um) for app-unaware vs. app-aware schedulers, annotated with 8x and 1.8x increases)

Ranking-based application-aware schedulers incur high hardware cost

From synthesis of RTL implementations using a 32nm library

Problems with Previous Application-aware Memory Schedulers

1. Full ranking increases hardware complexity

2. Full ranking causes unfair slowdowns

79

Ranking Causes Unfair Slowdowns

80

(Charts: number of requests served over execution time (in 1000s of cycles) for GemsFDTD (high memory intensity) and sjeng (low memory intensity), under an app-unaware scheduler and a ranking scheduler)

Full ordered ranking of applications → GemsFDTD denied request service

Ranking Causes Unfair Slowdowns

81

(Charts: slowdowns of GemsFDTD (high memory intensity) and sjeng (low memory intensity) under app-unaware and ranking schedulers)

Ranking-based application-aware schedulers cause unfair slowdowns

Key Observation 1: Group Rather Than Rank

82

(Charts: number of requests served over execution time for GemsFDTD (high memory intensity) and sjeng (low memory intensity), under app-unaware, ranking, and grouping schedulers)

No unfairness due to denial of request service

83

Key Observation 1: Group Rather Than Rank

(Charts: slowdowns of GemsFDTD (high memory intensity) and sjeng (low memory intensity) under app-unaware, ranking, and grouping schedulers)

Benefit 2: Lower slowdowns than ranking

Previous Memory Schedulers

• FRFCFS [Zuravleff and Robinson, US Patent 1997; Rixner et al., ISCA 2000]
  – Prioritizes row-buffer hits and older requests

• FRFCFS-Cap [Mutlu and Moscibroda, MICRO 2007]
  – Caps the number of consecutive row-buffer hits

• PARBS [Mutlu and Moscibroda, ISCA 2008]
  – Batches oldest requests from each application; prioritizes the batch
  – Employs ranking within a batch

• ATLAS [Kim et al., HPCA 2010]
  – Prioritizes applications with low memory intensity

• TCM [Kim et al., MICRO 2010]
  – Always prioritizes low memory-intensity applications
  – Shuffles thread ranks of high memory-intensity applications

84

Application-unaware: + low complexity; - low performance and fairness
Application-aware: + high performance and fairness; - high complexity

Performance and Fairness

85

(Scatter plot: unfairness vs. performance for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting, annotated with 5% and 21% improvements)

1. Blacklisting achieves the highest performance 2. Blacklisting balances performance and fairness

Performance vs. Fairness vs. Simplicity

86

(Radar chart: performance, fairness, and simplicity for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, Blacklisting, and Ideal; Blacklisting has the highest performance, is close to the simplest, and is close to the fairest)

Blacklisting is the closest scheduler to ideal

Summary

• Applications’ requests interfere at main memory

• Prevalent solution approach
  – Application-aware memory request scheduling

• Key shortcoming of previous schedulers: full ranking
  – High hardware complexity
  – Unfair application slowdowns

• Our Solution: Blacklisting memory scheduler
  – Sufficient to group applications rather than rank
  – Group by tracking the number of consecutive requests

• Much simpler than application-aware schedulers, at higher performance and fairness

87

Performance and Fairness

(Charts: weighted speedup and maximum slowdown for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting)

5% higher system performance and 21% lower maximum slowdown than TCM

88

Complexity Results

Blacklisting achieves 70% lower latency than TCM

89

Blacklisting achieves 43% lower area than TCM

(Charts: critical path latency (in ns) and area (in square um) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting)

Understanding Why Blacklisting Works

(Charts: fraction of requests vs. streak length under FRFCFS, PARBS, TCM, and Blacklisting)

libquantum (high memory-intensity application): Blacklisting shifts the request distribution towards the left

calculix (low memory-intensity application): Blacklisting shifts the request distribution towards the right

90

Harmonic Speedup

91

(Chart: harmonic speedup for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting)

Effect of Workload Memory Intensity

92

(Charts: normalized weighted speedup and normalized maximum slowdown vs. workload memory intensity (25, 50, 75, 100, and average) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting)

Combining FRFCFS-Cap and Blacklisting

93

(Charts: weighted speedup and maximum slowdown for FRFCFS, FRFCFS-Cap, Blacklisting, and FRFCFS-Cap-Blacklisting)

Sensitivity to Blacklisting Threshold

94

(Charts: weighted speedup and maximum slowdown for FRFCFS and Blacklisting with blacklisting thresholds of 1, 2, 4, 8, and 16)

Sensitivity to Clearing Interval

95

(Charts: weighted speedup and maximum slowdown for FRFCFS and Blacklisting with clearing intervals of 1000, 10000, and 100000 cycles)

Sensitivity to Core Count

96

(Charts: performance and unfairness at 16, 24, 32, and 64 cores for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting)

Sensitivity to Channel Count

97

(Charts: performance and unfairness at 1, 2, 4, and 8 channels for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting)

Sensitivity to Cache Size

98

(Charts: performance and unfairness with 512KB, 1MB, and 2MB caches for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting)

Performance and Fairness with Shared Cache

99

(Charts: performance and unfairness with a shared cache for FRFCFS, FRFCFS-CAP, PARBS, ATLAS, TCM, and Blacklisting)

Breakdown of Benefits

100

(Charts: weighted speedup and maximum slowdown for FRFCFS, TCM, Grouping, and Blacklisting)

BLISS vs. Criticality-aware Scheduling

101

(Charts: weighted speedup and maximum slowdown for FRFCFS, PARBS, TCM, Crit-MaxStall, Crit-TotalStall, and Blacklisting)

Sub-row Interleaving

102

(Charts: maximum slowdown and weighted speedup with sub-row interleaving for FRFCFS-Row, FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting)

MISE

103

Measuring RSRShared and α

• Request Service Rate Shared (RSR_Shared)
  – Per-core counter to track the number of requests serviced
  – At the end of each interval, measure:
    RSR_Shared = Number of Requests Served / Interval Length

• Memory Phase Fraction (α)
  – Count the number of stall cycles at the core
  – Compute the fraction of cycles stalled for memory

104

Estimating Request Service Rate Alone (RSRAlone)

• Divide each interval into shorter epochs

• At the beginning of each epoch
  – Randomly pick an application as the highest priority application

• At the end of an interval, for each application, estimate:

RSR_Alone = Number of Requests During High Priority Epochs / Number of Cycles Application Given High Priority

105

Goal: Estimate RSRAlone

How: Periodically give each application highest priority in accessing memory

Inaccuracy in Estimating RSRAlone

106

• When an application has highest priority, it still experiences some interference

(Diagrams: request buffer states and service orders showing cycles in which the highest-priority application's request waits while another application's previously issued request is served; these are interference cycles)

Accounting for Interference in RSRAlone Estimation

• Solution: Determine and remove interference cycles from the RSR_Alone calculation

• A cycle is an interference cycle if
  – a request from the highest priority application is waiting in the request buffer, and
  – another application’s request was issued previously

107

ARSR = Number of Requests During High Priority Epochs / (Number of Cycles Application Given High Priority - Interference Cycles)
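The RSR_Shared, RSR_Alone, and interference-cycle equations above combine into a small sketch; the counter names are illustrative placeholders for the per-core hardware counters.

```python
def rsr_shared(requests_served, interval_length):
    # Measured every interval from a per-core counter.
    return requests_served / interval_length

def rsr_alone(requests_in_high_priority_epochs,
              high_priority_cycles,
              interference_cycles):
    """Estimate of the request service rate the application would see alone.

    Cycles in which the highest-priority application's request waits behind a
    previously issued request from another application are interference cycles
    and are excluded from the denominator.
    """
    return requests_in_high_priority_epochs / (high_priority_cycles - interference_cycles)
```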

MISE Operation: Putting it All Together

(Timeline: intervals in which RSR_Shared is measured and RSR_Alone is estimated, with a slowdown estimate produced at the end of each interval)

108

MISE-QoS: Mechanism to Provide Soft QoS

• Assign an initial bandwidth allocation to QoS-critical application

• Estimate slowdown of QoS-critical application using the MISE model

• After every N intervals

– If slowdown > bound B + ε, increase the bandwidth allocation

– If slowdown < bound B - ε, decrease the bandwidth allocation

• When the slowdown bound is not met for N intervals

– Notify the OS so it can migrate/de-schedule jobs (see the sketch below)

109
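A simplified sketch of this control loop; the step size is a placeholder, and the N-interval check is collapsed into a single-interval check for brevity, so this is an illustration of the idea rather than the exact mechanism.

```python
def mise_qos_step(current_allocation, estimated_slowdown, bound, epsilon=0.05, step=0.05):
    """One interval of a MISE-QoS-style bandwidth controller for the QoS-critical app.

    Returns the new bandwidth allocation (fraction of high-priority time) and a
    flag telling the OS the bound looks unachievable even at full allocation.
    """
    if estimated_slowdown > bound + epsilon:
        current_allocation = min(1.0, current_allocation + step)   # give it more bandwidth
    elif estimated_slowdown < bound - epsilon:
        current_allocation = max(0.0, current_allocation - step)   # reclaim bandwidth for others
    notify_os = (estimated_slowdown > bound + epsilon and current_allocation >= 1.0)
    return current_allocation, notify_os
```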

Performance of Non-QoS-Critical Applications

(Chart: harmonic speedup vs. number of memory-intensive applications (0, 1, 2, 3, and average) for AlwaysPrioritize and MISE-QoS with bound settings 10/1, 10/3, 10/5, 10/7, and 10/9)

Higher performance when the bound is loose
When the slowdown bound is 10/3, MISE-QoS improves system performance by 10%

110

Case Study with Two QoS-Critical Applications

• Two comparison points

– Always prioritize both applications

– Prioritize each application 50% of time

(Chart: slowdowns of the QoS-critical pairs astar+mcf and leslie3d+mcf under AlwaysPrioritize, EqualBandwidth, and MISE-QoS with bound settings 10/1 through 10/5)

MISE-QoS can achieve a lower slowdown bound for both applications

MISE-QoS provides much lower slowdowns for non-QoS-critical applications

111

Minimizing Maximum Slowdown

• Goal
  – Minimize the maximum slowdown experienced by any application

• Basic Idea
  – Assign more memory bandwidth to the more slowed-down application

112

Mechanism

• Memory controller tracks
  – Slowdown bound B
  – Bandwidth allocation of all applications

• Different components of the mechanism
  – Bandwidth redistribution policy
  – Modifying the target bound
  – Communicating the target bound to the OS periodically

113

Bandwidth Redistribution

• At the end of each interval,

– Group applications into two clusters

– Cluster 1: applications that meet bound

– Cluster 2: applications that don’t meet bound

– Steal a small amount of bandwidth from each application in cluster 1 and allocate it to applications in cluster 2 (see the sketch below)

114
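A sketch of the redistribution step described above; the steal amount is a placeholder parameter, and the dictionaries stand in for the memory controller's per-application registers.

```python
def redistribute_bandwidth(allocations, slowdowns, bound, steal=0.01):
    """One interval of slowdown-bound-driven bandwidth redistribution.

    Applications meeting the bound (cluster 1) each give up a small amount of
    bandwidth, which is split evenly among applications missing the bound (cluster 2).
    """
    meets = [a for a in allocations if slowdowns[a] <= bound]
    misses = [a for a in allocations if slowdowns[a] > bound]
    if not meets or not misses:
        return allocations
    pool = 0.0
    for a in meets:
        taken = min(steal, allocations[a])
        allocations[a] -= taken
        pool += taken
    share = pool / len(misses)
    for a in misses:
        allocations[a] += share
    return allocations
```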

Modifying Target Bound

• If bound B is met for the past N intervals
  – Bound can be made more aggressive
  – Set the bound higher than the slowdown of the most slowed-down application

• If bound B is not met for the past N intervals by more than half the applications
  – Bound should be made more relaxed
  – Set the bound to the slowdown of the most slowed-down application

115

Results: Harmonic Speedup

(Chart: harmonic speedup at 4, 8, and 16 cores for FRFCFS, ATLAS, TCM, STFM, and MISE-Fair)

116

Results: Maximum Slowdown

(Chart: maximum slowdown at 4, 8, and 16 cores for FRFCFS, ATLAS, TCM, STFM, and MISE-Fair)

117

Sensitivity to Memory Intensity (16 cores)

(Chart: maximum slowdown vs. workload memory intensity (0, 25, 50, 75, 100, and average) for FRFCFS, ATLAS, TCM, STFM, and MISE-Fair)

118

MISE: Per-Application Error

Benchmark        STFM   MISE      Benchmark        STFM   MISE
453.povray       56.3   0.1       473.astar        12.3   8.1
454.calculix     43.5   1.3       456.hmmer        17.9   8.1
400.perlbench    26.8   1.6       464.h264ref      13.7   8.3
447.dealII       37.5   2.4       401.bzip2        28.3   8.5
436.cactusADM    18.4   2.6       458.sjeng        21.3   8.8
450.soplex       29.8   3.5       433.milc         26.4   9.5
444.namd         43.6   3.7       481.wrf          33.6   11.1
437.leslie3d     26.4   4.3       429.mcf          83.74  11.5
403.gcc          25.4   4.5       445.gobmk        23.1   12.5
462.libquantum   48.9   5.3       483.xalancbmk    18     13.6
459.GemsFDTD     21.6   5.5       435.gromacs      31.4   15.6
470.lbm          6.9    6.3       482.sphinx3      21     16.8
473.astar        12.3   8.1       471.omnetpp      26.2   17.5
456.hmmer        17.9   8.1       465.tonto        32.7   19.5

119

Sensitivity to Epoch and Interval Lengths

Epoch Length \ Interval Length    1 mil.   5 mil.   10 mil.   25 mil.   50 mil.
1000                              65.1%    9.1%     11.5%     10.7%     8.2%
10000                             64.1%    8.1%     9.6%      8.6%      8.5%
100000                            64.3%    11.2%    9.1%      8.9%      9%
1000000                           64.5%    31.3%    14.8%     14.9%     11.7%

120

Workload Mixes

Mix No.   Benchmark 1   Benchmark 2   Benchmark 3
1         sphinx3       leslie3d      milc
2         sjeng         gcc           perlbench
3         tonto         povray        wrf
4         perlbench     gcc           povray
5         gcc           povray        leslie3d
6         perlbench     namd          lbm
7         h264ref       bzip2         libquantum
8         hmmer         lbm           omnetpp
9         sjeng         libquantum    cactusADM
10        namd          libquantum    mcf
11        xalancbmk     mcf           astar
12        mcf           libquantum    leslie3d

121

STFM’s Effectiveness in Enforcing QoS

                    Predicted Met    Predicted Not Met
QoS Bound Met       63.7%            16%
QoS Bound Not Met   2.4%             17.9%

Across 3000 data points

122

STFM vs. MISE’s System Performance

(Chart: system performance of MISE and STFM under QoS bound settings 10/1, 10/3, 10/5, 10/7, and 10/9)

123

MISE’s Implementation Cost

1. Per-core counters worth 20 bytes
   • Request Service Rate Shared
   • Request Service Rate Alone
     – 1 counter for the number of high priority epoch requests
     – 1 counter for the number of high priority epoch cycles
     – 1 counter for interference cycles
   • Memory phase fraction (α)

2. Register for the current bandwidth allocation: 4 bytes

3. Logic for prioritizing an application in each epoch

124

MISE Accuracy w/o Interference Cycles

• Average error – 23%

125

MISE Average Error by Workload Category

Workload Category (number of memory-intensive applications)   Average Error
0                                                              4.3%
1                                                              8.9%
2                                                              21.2%
3                                                              18.4%

126

ASM

127

Impact of Cache Capacity Contention

128

Cache capacity interference causes high application slowdowns

(Charts: slowdowns of bzip2 (core 0) and soplex (core 1) when only main memory is shared vs. when both main memory and caches are shared)

Error with Sampling

129

Error Distribution

130

Impact of Prefetching

131

Sensitivity to Epoch and Quantum Lengths

132

Sensitivity to Core Count

133

Sensitivity to Cache Capacity

134

Sensitivity to Auxiliary Tag Store Sampling

135

ASM-Cache: Fairness and Performance Results

136

Significant fairness benefits across different systems

(Charts: fairness (lower is better) and performance vs. number of cores (4, 8, and 16) for NoPart, UCP, and ASM-Cache)

ASM-Mem: Fairness and Performance Results

137

(Charts: fairness (lower is better) and performance vs. number of cores (4, 8, and 16) for FRFCFS, TCM, PARBS, and ASM-Mem)

Significant fairness benefits across different systems

ASM-QoS: Meeting Slowdown Bounds

138

(Chart: slowdowns of h264ref, mcf, sphinx3, and soplex under Naive-QoS, ASM-QoS-2.5, ASM-QoS-3, ASM-QoS-3.5, and ASM-QoS-4)

Previous Approach: Estimate Interference Experienced Per-Request

139

(Diagram: shared execution timeline with requests A, B, and C in flight)

Request overlap makes per-request interference estimation difficult

Estimating Performance_Alone

140

(Diagram: shared execution timeline showing when requests A, B, and C are queued and served; the requests overlap)

Difficult to estimate the impact of interference per-request due to request overlap

Impact of Interference on Performance

141

(Diagram: execution timeline of an application running alone (no interference) vs. running shared (with interference); the extra execution time is the impact of interference)

Previous Approach: Estimate impact of interference at a per-request granularity

Difficult to estimate due to request overlap

Application-aware Memory Channel Partitioning

142

Goal: Mitigate Inter-Application Interference

Previous Approach: Application-Aware Memory Request Scheduling

Our First Approach: Application-Aware Memory Channel Partitioning

Our Second Approach: Integrated Memory Partitioning and Scheduling

Observation: Modern Systems Have Multiple Channels

A new degree of freedom

Mapping data across multiple channels

143

(Diagram: two cores running a red app and a blue app, with two memory controllers each driving its own channel to memory)

Data Mapping in Current Systems

144

(Diagram: pages of the red and blue applications interleaved across both channels)

Causes interference between applications’ requests

Partitioning Channels Between Applications

145

(Diagram: the red application's pages mapped to channel 0 and the blue application's pages mapped to channel 1)

Eliminates interference between applications’ requests

Integrated Memory Partitioning and Scheduling

146

Goal: Mitigate Inter-Application Interference

Previous Approach: Application-Aware Memory Request Scheduling

Our First Approach: Application-Aware Memory Channel Partitioning

Our Second Approach: Integrated Memory Partitioning and Scheduling

Slowdown/Interference Estimation in Existing Systems

147

(Diagram: a 16-core system with a shared cache and main memory)

How do we detect/mitigate the impact of interference on a real system using existing performance counters?

Our Approach: Mitigating Interference in a Cluster

1. Detect memory bandwidth contention at each host

2. Estimate impact of moving each VM to a non-contended host (cost-benefit analysis)

3. Execute the migrations that provide the most benefit

148

Architecture-aware DRM – ADRM (VEE 2015)

149

(Diagram: physical machines PM1 through PMM, each running VMs on KVM + QEMU with a per-host profiler; a central manager with a profiling engine, contention detector, recommendation engine, and actuation engine drives migrations)

ADRM: Key Ideas and Results

• Key Ideas:

– Memory bandwidth captures impact of shared cache and memory bandwidth interference

– Model degradation in performance as linearly proportional to bandwidth increase/decrease

• Key Results:

– Average performance improvement of 9.67% on a 4-node cluster

150

QoS in Heterogeneous Systems

• Staged memory scheduling
  – In collaboration with Rachata Ausavarungnirun, Kevin Chang, and Gabriel Loh
  – Goal: High performance in CPU-GPU systems

• Memory scheduling in heterogeneous systems
  – In collaboration with Hiroyuki Usui
  – Goal: Meet deadlines for accelerators while improving performance

151

Performance Predictability in Heterogeneous Systems

(Diagram: CPU cores and hardware accelerators sharing a cache and main memory)

Goal of our Scheduler (SQUASH)

• Goal: Design a memory scheduler that
  – Meets accelerators’ deadlines, and
  – Achieves high CPU performance

• Basic Idea:
  – Different CPU applications and hardware accelerators have different memory requirements
  – Track the progress of different agents and prioritize accordingly (see the sketch below)

153
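A simplified, hypothetical sketch of the progress-tracking idea; the fields and the urgency rule are illustrative assumptions, not the exact SQUASH scheduler evaluated in the thesis.

```python
from dataclasses import dataclass

@dataclass
class Accelerator:
    work_done: int        # memory requests (or work units) completed so far
    work_total: int       # work needed before the deadline
    cycles_elapsed: int
    deadline_cycles: int

def accelerator_is_urgent(hwa, slack=1.0):
    """Prioritize a hardware accelerator only when it is not on track for its deadline.

    "On track" here means the fraction of work done is at least the fraction of
    the deadline period that has elapsed (scaled by a slack factor).
    """
    expected_progress = (hwa.cycles_elapsed / hwa.deadline_cycles) * slack
    actual_progress = hwa.work_done / hwa.work_total
    return actual_progress < expected_progress
```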

Key Observation: Distribute Priority for Accelerators

• Accelerators need priority to meet deadlines

• Worst case prioritization not always the best

• Prioritize accelerators when they are not on track to meet a deadline

154

Distributing priority mitigates impact of accelerators on CPU cores’ requests

Key Observation: Not All Accelerators are Equal

• Long-deadline accelerators are more likely to meet their deadlines

• Short-deadline accelerators are more likely to miss their deadlines

155

Schedule short-deadline accelerators based on worst-case memory access time

Key Observation: Not All CPU cores are Equal

• Memory-intensive cores are much less vulnerable to interference

• Memory non-intensive cores are much more vulnerable to interference

156

Prioritize accelerators over memory-intensive cores to ensure accelerators do not become urgent

SQUASH: Key Ideas and Results

• Distribute priority for HWAs

• Prioritize HWAs over memory-intensive CPU cores even when not urgent

• Prioritize short-deadline-period HWAs based on worst case estimates

157

Improves CPU performance by 7-21%
Meets 99.9% of deadlines for HWAs