44
Christina Delimitrou and Christos Kozyrakis Stanford University http://mast.stanford.edu ASPLOS – March 3 rd 2014 QUASAR: RESOURCE-EFFICIENT AND QOS-AWARE CLUSTER MANAGEMENT

Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

Christina Delimitrou and Christos Kozyrakis

Stanford University

http://mast.stanford.edu

ASPLOS – March 3rd 2014

QUASAR: RESOURCE-EFFICIENT AND

QOS-AWARE CLUSTER MANAGEMENT

Page 2: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

2

Problem: low datacenter utilization

Overprovisioned reservations by users

Problem: high jitter on application performance

Interference, HW heterogeneity

Quasar: resource-efficient cluster management

User provides resource reservations performance goals

Online analysis of resource needs using info from past apps

Automatic selection of number & type of resources

High utilization and low performance jitter

Executive Summary

Page 3: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

3

Datacenter Underutilization

A few thousand server cluster at Twitter managed by Mesos

Running mostly latency-critical, user-facing apps

80% of servers @ < 20% utilization

Servers are 65% of TCO

Page 4: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

4

Datacenter Underutilization

Goal: raise utilization without introducing performance jitter

1 L. A. Barroso, U. Holzle. The Datacenter as a Computer, 2009.

Twitter Google1

Page 5: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

5

Reserved vs. Used Resources

Twitter: up to 5x CPU & up to 2x memory overprovisioning

3-5x

1.5-2x

Page 6: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

6

Reserved vs. Used Resources

20% of job under-sized, ~70% of jobs over-sized

Page 7: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

7

Rightsizing Applications is Hard

Page 8: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

8

Performance Depends on

Cores

Perf

orm

anc

e Scale-up

Page 9: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

9

Performance Depends on

Cores

Perf

orm

anc

e Heterogeneity

Page 10: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

10

Performance Depends on

Cores

Perf

orm

anc

e Heterogeneity

Servers

Perf

orm

anc

e Scale-out

Page 11: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

11

Performance Depends on

Cores

Perf

orm

anc

e Heterogeneity

Servers

Perf

orm

anc

e Scale-out

Input size

Perf

orm

anc

e Input load

Page 12: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

12

Performance Depends on

Cores

Perf

orm

anc

e Heterogeneity

Cores

Perf

orm

anc

e Interference

Input size

Perf

orm

anc

e Input load

Servers

Perf

orm

anc

e Scale-out

When sw changes,

when platforms change, etc.

Page 13: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

13

Performance Depends on

Cores

Perf

orm

anc

e Heterogeneity

Cores

Perf

orm

anc

e Interference

Input size

Perf

orm

anc

e Input load

Servers

Perf

orm

anc

e Scale-out

When sw changes,

when platforms change, etc.

Page 14: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

14

Rethinking Cluster Management

User provides resource reservations performance goals

Joint allocation and assignment of resources

Right amount depends on quality of available resources

Monitor and adjust dynamically as needed

But wait…

The manager must know the resource/performance tradeoffs

Page 15: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

15

Big cluster

data

Combine:

Small signal from short run of new app

Large signal from previously-run apps

Generate:

Detailed insights for resource management

Performance vs scale-up/out,

heterogeneity, …

Looks like a classification problem

Resource/performance

tradeoffs

Small app

signal

Understanding Resource/Performance Tradeoffs

Page 16: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

16

Something familiar…

Collaborative filtering – similar to Netflix Challenge system

Predict preferences of new users given preferences of other users

Singular Value Decomposition (SVD) + PQ reconstruction (SGD)

High accuracy, low complexity, relaxed density constraints

Sparse

utility

matrix

Initial

decomposition

SVD PQ

SGD

Reconstructed

utility matrix

Final

decomposition

SVD

movies

users

Page 17: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

17

Application Analysis with Classification

4 parallel classifications

Lower overheads & similar accuracy to exhaustive classification

Rows Columns Recommendation

Netflix Users Movies Movie ratings

Heterogeneity

Interference

Scale-up

Scale-out

Page 18: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

18

Heterogeneity Classification

Profiling on two randomly selected server types

Predict performance on each server type

Rows Columns Recommendation

Netflix Users Movies Movie ratings

Heterogeneity Apps Platforms Server type

Interference

Scale-up

Scale-out

Page 19: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

19

Interference Classification

Predict sensitivity to interference

Interference intensity that leads to >5% performance loss

Profiling by injecting increasing interference

Rows Columns Recommendation

Netflix Users Movies Movie ratings

Heterogeneity Apps Platforms Server type

Interference Apps Sources of interference Interference sensitivity

Scale-up

Scale-out

Page 20: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

20

Scale-Up Classification

Predict speedup from scale-up

Profiling with two allocations (cores & memory)

Rows Columns Recommendation

Netflix Users Movies Movie ratings

Heterogeneity Apps Platforms Server type

Interference Apps Sources of interference Interference sensitivity

Scale-up Apps Resource vectors Resources/node

Scale-out

Page 21: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

21

Scale-Out Classification

Predict speedup from scale-out

Profiling with two allocations (1 & N>1 nodes)

Rows Columns Recommendation

Netflix Users Movies Movie ratings

Heterogeneity Apps Platforms Server type

Interference Apps Sources of interference Interference sensitivity

Scale-up Apps Resource vectors Resources/node

Scale-out Apps Nodes Number of nodes

Page 22: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

22

Classification Validation

Heterogeneity Interference Scale-up Scale-out

avg max avg max avg max avg max

Single-node 4% 8% 5% 10% 4% 9% - -

Batch distributed 4% 5% 2% 6% 5% 11% 5% 17%

Latency-critical 5% 6% 7% 10% 6% 11% 6% 12%

Page 23: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

23

Quasar Overview

QoS

Page 24: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

24

Quasar Overview

QoS

Page 25: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

25

Quasar Overview

QoS

H <a..b...> I <..cd...> SU <e..f..> SO <kl....>

Page 26: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

26

Quasar Overview

H <a..b...> I <..cd...> SU <e..f..> SO <kl....>

UΣVT

mnmm

n

n

uuu

uuu

uuu

...

...

...

21

22221

11211

H <abcdefghi> I <qwertyuio> SU <esdfghjkl> SO <kljhgfdsa>

QoS

Page 27: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

27

Quasar Overview

H <a..b...> I <..cd...> SU <e..f..> SO <kl....>

UΣVT

mnmm

n

n

uuu

uuu

uuu

...

...

...

21

22221

11211

H <abcdefghi> I <qwertyuio> SU <esdfghjkl> SO <kljhgfdsa>

QoS

Page 28: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

28

Quasar Overview

H <a..b...> I <..cd...> SU <e..f..> SO <kl....>

UΣVT

mnmm

n

n

uuu

uuu

uuu

...

...

...

21

22221

11211

H <abcdefghi> I <qwertyuio> SU <esdfghjkl> SO <kljhgfdsa>

Profiling

[10-60sec]

Sparse

input

signal

Classification

[20msec]

Dense

output

signal

Resource

selection

[50msec-2sec]

QoS

Page 29: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

29

Greedy Resource Selection

Goals

Allocate least needed resources to meet QoS target

Pack together non-interfering applications

Overview

Start with most appropriate server types

Look for servers with interference below critical intensity

Depends on which applications are running on these servers

First scale-up, next scale-out

Page 30: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

30

Quasar Implementation

6,000 loc of C++ and Python

Runs on Linux and OS X

Supports frameworks in C/C++, Java and Python

~100-600 loc for framework-specific code

Side-effect free profiling using Linux containers with chroot

Page 31: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

31

Evaluation: Cloud Scenario

Cluster

200 EC2 servers, 14 different server types

Workloads: 1,200 apps with 1sec inter-arrival rate

Analytics: Hadoop, Spark, Storm

Latency-critical: Memcached, HotCrp, Cassandra

Single-threaded: SPEC CPU2006

Multi-threaded: PARSEC, SPLASH-2, BioParallel, Specjbb

Multiprogrammed: 4-app mixes of SPEC CPU2006

Objectives: high cluster utilization and good app QoS

Page 32: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

32

Demo In

stance

Siz

e

memcached

Cassandra

Storm Hadoop Single-node Spark

Core Allocation Map

Quasar

Instance Core

Core Allocation Map

Reservation & LL

Performance histogram

Quasar

Performance histogram

Reservation & LL

Cluster

Utilization

Progress bar Progress bar

100%

0%

100%

0%

Quasar Reservation + LL

Page 33: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

33

Page 34: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

34

Quasar achieves:

88% of applications get >95% performance

~10% overprovisioning as opposed to up to 5x

Up to 70% cluster utilization at steady-state

23% shorter scenario completion time

Cloud Scenario Summary

Page 35: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

35

Conclusions

Quasar: high utilization, high app performance

From reservation to performance-centric cluster management

Uses info from previous apps for accurate & online app analysis

Joint resource allocation and resource assignment

See paper for:

Utilization analysis of Twitter cluster

Detailed validation & sensitivity analysis of classification

Further evaluation scenarios and features

E.g., setting framework parameters for Hadoop

Page 36: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

36

Quasar: high utilization, high app performance

From reservation to performance-centric cluster management

Uses info from previous apps for accurate & online app analysis

Joint resource allocation and resource assignment

See paper for:

Utilization analysis of Twitter cluster

Detailed validation & sensitivity analysis of classification

Further evaluation scenarios and features

E.g., setting framework parameters for Hadoop

Questions??

Page 37: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

37

Thank you

Questions??

Page 38: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

38

Cloud Provider: Performance

Page 39: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

39

Cloud Provider: Performance

Most applications violate their QoS constraints

Page 40: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

40

Cloud Provider: Performance

83% of performance target when only assignment is heterogeneity &

interference aware

Page 41: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

41

Cloud Provider: Performance

98% of performance target on average

Page 42: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

42

Cluster Utilization

Baseline (Reservation+LL):

Imbalance in server utilization

Per-app QoS violations + higher execution time

Quasar increases server utilization by 47%

High performance for user

Better utilization for DC operator resource efficiency

Quasar Least-Loaded (LL)

Page 43: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

43

Reducing Overprovisioning

~10% overprovisioning, compared to 40%-5x for Reservation+LL

Page 44: Stanford University ://delimitrou/slides/2014.asplos.quasar... · Collaborative filtering – similar to Netflix Challenge system ... Evaluation: Cloud Scenario Cluster 200 EC2 servers,

44

Scheduling Overheads

4.1% of execution time on average, up to 15% for short-lived

workloads – mostly from profiling

Distributed

decisions

Cold-start

solutions