PerfIso: Performance Isolation for Commercial Latency-Sensitive Services

Călin Iorgulescu (EPFL), Reza Azimi (Brown University), Youngjin Kwon (University of Texas)

Sameh Elnikety, Manoj Syamala, Vivek Narasayya (Microsoft Research), Herodotus Herodotou (Cyprus University of Technology)

Paulo Tomita, Alex Chen, Jack Zhang, Junhua Wang (Microsoft Bing)

Interactive services must feel instantaneous

≤ 0.1 s

2 / 29 · 7/12/18

A single query involves hundreds of machines!

≤ 0.1 s

Web Index

Embarrassingly parallel search

Slowest response must be << 0.1 s

Multiple layers of aggregation! Just one service out of many!

3 / 29

Machines are provisioned for peak load

[Figure: Request-rate variation for a Microsoft Bing sub-cluster over 1 week in 2017. x-axis: day of week (Wednesday to Tuesday); y-axis: Query Arrival Rate. Annotation: Peak load >> Average load.]

Datacenters have spare resources

How can we leverage this?

4 / 29

Solution: colocate batch jobs with online services

• Get spare resources to do useful work

• Primary tenant – guaranteed performance

  • e.g., Bing IndexServe

• Secondary tenant – best-effort performance

  • e.g., Apache Spark

[Figure: a machine without colocation (Primary + Idle) vs. with colocation + PerfIso (Primary + Batch Job)]

5 / 29

PerfIso: performance isolation for online services

• Maintains P99 of response-times (10s of ms) under colocation

Provides performance isolation of Primary

• 45% of the CPU is used to do useful batch work

Increases system efficiency

• Many different interactive services and hardware setups

Deployed on over 90,000 servers

6 / 29

Many papers published on performance isolation

Quasar [ASPLOS '14], Heracles [ISCA '15], Elfen [USENIX ATC '16]

Existing solutions do not fit our requirements

7 / 29

PerfIso: Requirements

1. “Black-box”: Fewest assumptions about tenants (wider applicability)

2. “Standalone”: Primary acts like it runs alone (negligible interference)

3. “Integrability”: Minimize software-stack changes (easy deployment)

8 / 29

Why is Performance Isolation hard?

Interactive services – highly sensitive to interference!

Leaf-servers keep 99th percentile low

• Over 10 years of optimization work!

  • e.g., compression, adaptive parallelism, etc.

How often does the 99th percentile occur?

• For 10,000 queries / s → 100 times / s

What happens in a 100-node fanout?

• Every query runs at the 99th percentile!

10 / 29
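The two numbers on this slide can be checked with a short calculation. The sketch below assumes that leaf latencies are independent, which the slides do not state, and the function names are illustrative. Under that assumption, a 100-node fanout leaves roughly 63% of queries waiting on at least one leaf that is slower than its own 99th percentile, so the tail becomes the common case.

```python
def tail_events_per_second(query_rate: float, percentile: float = 0.99) -> float:
    """How many requests per second fall beyond the given percentile."""
    return query_rate * (1.0 - percentile)


def prob_query_hits_tail(fanout: int, percentile: float = 0.99) -> float:
    """Probability that at least one of `fanout` leaves is slower than its
    own percentile latency, assuming independent leaves (an assumption)."""
    return 1.0 - percentile ** fanout


if __name__ == "__main__":
    print(f"tail events/s at 10,000 Q/s: {tail_events_per_second(10_000):.0f}")
    print(f"P(query hits a leaf tail), fanout 100: {prob_query_hits_tail(100):.3f}")
```

Because the aggregator must wait for the slowest leaf, each leaf has to keep its own P99 well below the end-to-end budget.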

The Primary demands many resources quickly

• Bing IndexServe: multi-threaded web-index server

➢ Up to 15 threads wake up in 5 μs¹

• Burstiness due to query-processing optimizations!

  • some queries will spawn many workers

• Workload arrives in bursts – exacerbates problem

¹ Constant query rate 4,000 Q/s, 500k queries experiment

11 / 29

The Primary must behave as if it were standalone

• Primary’s resource demands must be fulfilled instantly.

• Any delays → performance penalties incurred

• Any resource can become a performance bottleneck.

If a query is delayed, it is already too late!

12 / 29

PerfIso

PerfIso: Implemented as a user-mode service

[Diagram: Primary, PerfIso, and Secondary on top of the OS]

• Only keeps track of Secondary's PID

14 / 29

PerfIso: Managed resources

[Diagram: Primary, PerfIso, and Secondary on top of the OS; PerfIso manages four resources]

• CPU: Blind Isolation

• DISK: I/O throttling

• MEMORY: Restrict footprint

• NETWORK: Throttle egress packets

15 / 29
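The slides say only that the Secondary's disk I/O and egress traffic are throttled, not how. A token bucket is one common way to build such a limiter, and the sketch below uses it purely for illustration; it is not claimed to be PerfIso's mechanism. The class name, the rate and burst parameters, and the explicit `now` argument (used to keep the example deterministic) are all assumptions.

```python
class TokenBucket:
    """Illustrative byte-rate limiter for a best-effort (Secondary) tenant."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float, now: float = 0.0):
        self.rate = rate_bytes_per_s   # sustained budget
        self.capacity = burst_bytes    # largest burst allowed at once
        self.tokens = burst_bytes      # start with a full bucket
        self.last = now                # time of the last refill

    def try_send(self, nbytes: int, now: float) -> bool:
        """Return True and consume tokens if `nbytes` fits the budget."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # caller should delay or drop the request


if __name__ == "__main__":
    bucket = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
    for t in (0.0, 0.5, 1.5):
        print(t, bucket.try_send(1500, now=t))
```

A limiter of this shape needs to know only about the Secondary's requests, which matches the black-box requirement stated earlier in the deck.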

PerfIso: CPU is the most important resource

• CPU: Blind Isolation

15 / 29

CPU sharing without PerfIso

• Primary and Secondary compete for cores.

• Secondary is aggressive: no idle cores exist.

[Figure: machine with 12 cores, shared by Primary and Secondary]

16 / 29

CPU Blind Isolation: Keep a “buffer” of idle cores

• PerfIso only knows the Secondary.

• Restrict Secondary by changing core affinities.

• Primary is unrestricted. Secondary is restricted.

Restrict Secondary to create a buffer of idle cores.

Primary can expand into the buffer!

[Figure: machine with 12 cores, divided among Primary, Restricted Secondary, and a buffer of idle cores]

17 / 29

CPU Blind Isolation: React to bursts from Primary

• Continuously read idle core status.

• Adjust Secondary “slice” to maintain buffer.

[Figure: machine with 12 cores; as Primary grows, the Restricted Secondary slice shrinks so that the buffer of idle cores is preserved]

18 / 29

CPU Blind Isolation: Secondary gets spare cores

• Allow Secondary to use spare idle cores.

• Release spare cores incrementally.

[Figure: machine with 12 cores; when more cores are idle than the buffer needs, the Restricted Secondary slice grows step by step]

19 / 29
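The buffer-keeping behaviour described on the last three slides can be sketched as a small control step. This is a minimal illustration under stated assumptions, not PerfIso's implementation: the class and field names are hypothetical, the buffer size is treated as a tunable parameter, shrinking by the whole deficit in one step is an assumption (the slides only say the slice is adjusted to maintain the buffer), and pinning the Secondary to the highest-numbered cores is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlindIsolationController:
    total_cores: int
    buffer_cores: int          # idle cores kept as headroom for the Primary
    secondary_cores: int = 0   # cores currently granted to the Secondary

    def step(self, idle_cores: int) -> int:
        """One polling step, driven only by the observed idle-core count."""
        if idle_cores < self.buffer_cores:
            # Primary burst consumed part of the buffer: shrink the Secondary
            # by the whole deficit at once (assumption).
            deficit = self.buffer_cores - idle_cores
            self.secondary_cores = max(0, self.secondary_cores - deficit)
        elif idle_cores > self.buffer_cores:
            # Spare idle cores exist: release them one at a time.
            limit = self.total_cores - self.buffer_cores
            self.secondary_cores = min(limit, self.secondary_cores + 1)
        return self.secondary_cores

    def affinity(self) -> List[int]:
        """Core IDs the Secondary would be restricted to."""
        start = self.total_cores - self.secondary_cores
        return list(range(start, self.total_cores))


if __name__ == "__main__":
    ctl = BlindIsolationController(total_cores=12, buffer_cores=3)
    for primary_busy in [4] * 6 + [7] * 2:
        idle = ctl.total_cores - primary_busy - ctl.secondary_cores
        print(primary_busy, idle, ctl.step(idle), ctl.affinity())
```

The controller never inspects the Primary; it reacts to the idle-core count alone, which is what makes the isolation "blind".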

CPU Blind Isolation: We dedicate 1 core to PerfIso

• PerfIso does continuous polling → we affinitize it to 1 core.

[Figure: machine with 12 cores; one core runs PerfIso, the others are divided among Primary, Restricted Secondary, and the buffer of idle cores]

20 / 29

Evaluation

Experiment testbed

Hardware

• Intel Xeon E5 – 24 cores (48 w/ HT)

• 128 GB RAM

Primary: Bing IndexServe

• 569 GB index-slice

• Open-loop client

• 500,000 queries @ 2,000 Q / s

Secondary: CPU micro-benchmark

22 / 29

Single server: PerfIso protects tail-latency

Secondary: CPU-intensive micro-benchmark

[Figure: P99 latency (ms), log axis from 1 to 1000, with the SLO marked]

              Standalone   Colocated
No isolation  11.65        349.08
PerfIso       11.65        12.07

One order of magnitude worse!

23 / 29

Single server: CPU utilization 3x higher!

Secondary: CPU-intensive micro-benchmark

[Figure: CPU utilization % (Primary + Secondary): No colocation 21%, PerfIso 67%. Inset: P99 latency (ms), Standalone 11.65 vs. Colocated 12.07.]

46% of CPU time → useful work

24 / 29

Restricting CPU cycles does not work

Secondary: CPU-intensive micro-benchmark

[Figure: P99 latency (ms), log axis from 1 to 1000, with the SLO marked: Standalone 11.65, No isolation 349.08, PerfIso 12.07, Restrict cycles 33.74]

Secondary → 5% of CPU cycles

P99 latency – 3x higher than SLO!

25 / 29

Restricting CPU cores does not work

Secondary: CPU-intensive micro-benchmark

[Figure: P99 latency (ms), log axis from 1 to 1000, with the SLO marked: Standalone 11.65, No isolation 349.08, PerfIso 12.07, Restrict cores 11.63]

[Figure: CPU utilization % (Primary + Secondary): Standalone 21%, PerfIso 67%, Restrict cores 38%]

Provisioned for peak load → CPU utilization ~30% lower!

26 / 29
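The headline claims on the single-server slides can be re-derived from the plotted values. The sketch below copies the numbers from the slides; the variable names are illustrative. Slowdowns are computed against the standalone P99 rather than the SLO, because the slides do not give the SLO's value, and the Secondary's share is estimated as the difference between colocated and standalone utilization, which assumes the Primary's own share is unchanged.

```python
STANDALONE_P99_MS = 11.65

# P99 latency of the Primary (ms) under each colocation policy, from the slides.
p99_ms = {
    "No isolation": 349.08,
    "PerfIso": 12.07,
    "Restrict cycles": 33.74,
    "Restrict cores": 11.63,
}

# Total CPU utilization (%), from the slides.
cpu_util_pct = {"Standalone": 21, "PerfIso": 67, "Restrict cores": 38}

# Slowdown of each policy relative to the Primary running alone.
slowdown = {name: value / STANDALONE_P99_MS for name, value in p99_ms.items()}

# Utilization gained by colocating under PerfIso, as a ratio and in points.
utilization_gain = cpu_util_pct["PerfIso"] / cpu_util_pct["Standalone"]
secondary_share = cpu_util_pct["PerfIso"] - cpu_util_pct["Standalone"]

# Utilization given up by statically restricting cores instead of PerfIso.
static_cores_loss = cpu_util_pct["PerfIso"] - cpu_util_pct["Restrict cores"]

if __name__ == "__main__":
    for name, factor in slowdown.items():
        print(f"{name:16s} {factor:6.2f}x standalone P99")
    print(f"utilization gain: {utilization_gain:.1f}x, "
          f"secondary share: {secondary_share} points, "
          f"static-cores loss: {static_cores_loss} points")
```

On these numbers, running with no isolation is about 30x slower than standalone, PerfIso stays within about 4% of standalone, and restricting cores statically protects latency but leaves about 29 points of utilization unused.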

1-hour run of 650-machine cluster

Secondary: Machine-Learning computation

[Figure: two panels over Time (minutes, 0 to 60). Top: Avg CPU Utilization %. Bottom: Top-Level Aggregator P99 latency (ms) and Queries / s.]

Average CPU utilization is 50% - 80%!

27 / 29

Interesting details in the paper

• Effectiveness of static CPU isolation methods

• Restricting CPU cycles

• Restricting CPU cores

• Comparison of state-of-the-art techniques

• Managing disk, memory, and network

28 / 29

PerfIso: colocate batch jobs with online services

• Black-box: do not tailor to one specific service

• Robustness: favor user-mode over kernel implementation

• Headroom: some core-slack makes Primary behave like standalone

• CPU Blind Isolation → colocation without impacting service performance

29 / 29
